
FastGRNN: Efficient Tiny RNN Architecture

Updated 22 January 2026
  • FastGRNN is an efficient recurrent network that employs a vector-valued, input-dependent gate with shared weights to enhance stability and reduce computational complexity.
  • The architecture integrates compression techniques such as low-rank decomposition, sparsity, and quantization, achieving models as small as 1 KB without sacrificing accuracy.
  • Empirical evaluations show FastGRNN offers 2–4× weight reduction and significantly lower latency, making it ideal for IoT and embedded deployments.

FastGRNNs (Fast, Accurate, Stable, and Tiny Gated Recurrent Neural Networks) are a class of efficient recurrent architectures designed to address the limitations of standard RNNs, GRUs, and LSTMs in stability, model size, computational complexity, and deployment on resource-constrained devices. Through weight sharing and minimal parameterization, FastGRNN achieves gated, expressive temporal modeling with kilobyte-scale models that match or exceed the predictive performance of conventional gated architectures, while supporting deployment on microcontrollers and embedded systems that lack hardware floating-point support (Kusupati et al., 2019, Larraza et al., 21 Jan 2026).

1. FastGRNN Architecture and Gating Mechanism

FastGRNN extends FastRNN, which incorporates a scalar, learned residual connection to stabilize standard RNNs. The key innovation of FastGRNN is the replacement of the scalar residual with a vector-valued, input- and state-dependent gate, while reusing the same input and hidden weight matrices for both gating and state-updating operations.

Let $x_t \in \mathbb{R}^d$ denote the input and $h_{t-1} \in \mathbb{R}^h$ the previous hidden state, with shared weights $W \in \mathbb{R}^{h \times d}$, $U \in \mathbb{R}^{h \times h}$ and biases $b_z, b_h \in \mathbb{R}^h$. The FastGRNN cell update is:

$$z_t = \sigma(W x_t + U h_{t-1} + b_z)$$
$$\tilde{h}_t = \tanh(W x_t + U h_{t-1} + b_h)$$
$$h_t = [\zeta(1 - z_t) + \nu] \odot \tilde{h}_t + z_t \odot h_{t-1}$$

where $\sigma$ denotes the sigmoid function, $\odot$ is elementwise multiplication, and the scalars $\zeta, \nu \in [0, 1]$ are learned (Kusupati et al., 2019, Larraza et al., 21 Jan 2026). This design contrasts sharply with GRUs, which perform three independent affine transformations per time step (for the update gate, reset gate, and candidate state), each with its own weight matrices.
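The cell update can be sketched in a few lines of NumPy. This is a minimal illustration, not the reference implementation; the toy dimensions, random initialization, and scalar values below are arbitrary choices.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def fastgrnn_step(x_t, h_prev, W, U, b_z, b_h, zeta, nu):
    # Shared pre-activation: the same W and U serve both gate and candidate.
    pre = W @ x_t + U @ h_prev
    z = sigmoid(pre + b_z)          # vector-valued gate
    h_tilde = np.tanh(pre + b_h)    # candidate state
    # zeta and nu are the two learned scalars in [0, 1]
    return (zeta * (1.0 - z) + nu) * h_tilde + z * h_prev

rng = np.random.default_rng(0)
d, h = 3, 4                          # toy input and hidden sizes
W = 0.1 * rng.standard_normal((h, d))
U = 0.1 * rng.standard_normal((h, h))
b_z, b_h = np.zeros(h), np.zeros(h)

h_t = np.zeros(h)
for x_t in rng.standard_normal((10, d)):
    h_t = fastgrnn_step(x_t, h_t, W, U, b_z, b_h, zeta=1.0, nu=0.0)
print(h_t.shape)  # (4,)
```

With $\zeta = 1$ and $\nu = 0$ the update is a convex combination of candidate and previous state, so the hidden state stays bounded by the $\tanh$ range.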

| Cell Type | Weight Matrices | Multiplications per Step | Learned Scalars |
|-----------|-----------------|--------------------------|-----------------|
| GRU | $W_z$, $W_r$, $W_h$, $U_z$, $U_r$, $U_h$ | $3(dh + h^2)$ | none |
| FastGRNN | $W$, $U$ (shared) | $2(dh + h^2)$ | $\zeta$, $\nu$ |

By leveraging shared weights and two learned scalars, FastGRNN reduces the parameter count to roughly one-third of a GRU, achieving 2–4× reductions in weights and computation per time-step with comparable accuracy.

2. Model Compression: Low-Rank, Sparsity, and Quantization

FastGRNN supports aggressive compression via:

  • Low-rank decomposition: $W = W^1 (W^2)^\top$, $U = U^1 (U^2)^\top$ with $W^1 \in \mathbb{R}^{h \times r_w}$, $W^2 \in \mathbb{R}^{d \times r_w}$, and similarly for $U$ with rank $r_u$, where $r_w, r_u$ are selected for the desired trade-off between accuracy and size.
  • Sparsity: hard thresholding of the factors $W^i$, $U^i$ retains only the largest-magnitude entries.
  • Quantization: non-zero weights are quantized to 8-bit integers; the $\tanh$ and sigmoid nonlinearities are replaced by piecewise-linear approximations, allowing fast, integer-only inference.

The complete compression pipeline is staged into (1) unconstrained low-rank optimization, (2) iterative hard thresholding, and (3) support-freeze fine-tuning (Kusupati et al., 2019). This realizes models as small as 1 KB, suitable for microcontrollers with just kilobytes of RAM and flash.
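The three stages can be sketched on a single weight factor. The sizes, rank, and sparsity level below are illustrative assumptions, not the settings used in the papers:

```python
import numpy as np

rng = np.random.default_rng(1)
h, d, r_w = 64, 32, 8        # illustrative hidden size, input dim, and rank

# (1) Low-rank: W = W1 @ W2.T with W1 (h x r_w) and W2 (d x r_w),
# storing r_w*(h + d) values instead of h*d.
W1 = rng.standard_normal((h, r_w))
W2 = rng.standard_normal((d, r_w))

# (2) Sparsity: keep only the s largest-magnitude entries of a factor.
def hard_threshold(M, s):
    out = np.zeros(M.size)
    top = np.argsort(np.abs(M).ravel())[-s:]  # indices of the s largest entries
    out[top] = M.ravel()[top]
    return out.reshape(M.shape)

W1_sparse = hard_threshold(W1, s=256)         # 256 of 512 entries survive

# (3) Quantization: map the surviving weights to signed 8-bit integers.
scale = np.abs(W1_sparse).max() / 127.0
W1_q = np.round(W1_sparse / scale).astype(np.int8)

print(np.count_nonzero(W1_sparse), W1_q.dtype)  # 256 int8
```

In the staged pipeline, the thresholding step is applied iteratively during training and the support is then frozen for fine-tuning; the one-shot version above only shows the operations themselves.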

3. Computational Complexity and Latency

The shared-weight, single-gate structure of FastGRNN yields superior computational efficiency and reduced memory footprint relative to gated RNNs:

  • Parameter comparison: for hidden size $h$ and input dimension $d$, a GRU requires $3(dh + h^2 + h)$ parameters, while FastGRNN requires $(dh + h^2) + 2h$ parameters (shared weights plus the biases for $z_t$ and $\tilde{h}_t$) and two scalars.
  • Operational complexity: per time step, FastGRNN executes $2(dh + h^2)$ multiplications, compared to $3(dh + h^2)$ for a GRU.
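Plugging illustrative sizes into these formulas makes the ratios concrete (the choice of $d$ and $h$ here is arbitrary, not taken from the papers):

```python
# Illustrative sizes:
d, h = 64, 128

gru_params = 3 * (d * h + h * h + h)            # three gates: W, U, bias each
fastgrnn_params = (d * h + h * h) + 2 * h + 2   # shared W, U; b_z, b_h; zeta, nu

gru_mults = 3 * (d * h + h * h)
fastgrnn_mults = 2 * (d * h + h * h)

print(round(gru_params / fastgrnn_params, 2))  # ≈ 2.98: roughly one-third the parameters
print(gru_mults / fastgrnn_mults)              # 1.5
```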

Empirical evidence from the Fast-ULCNet study demonstrates:

| Model | Params (M) | MACs (M) | RTF @ Pi 3 | RTF @ ARM |
|-------|------------|----------|------------|-----------|
| ULCNet | 0.685 | 2.057 | 0.976 | 0.927 |
| Fast-ULCNet | 0.338 | 1.691 | 0.657 | 0.604 |

A reduction of ≈51% in parameters and ≈33–35% in real-time factor (RTF) is observed on embedded CPUs (Larraza et al., 21 Jan 2026). FastGRNN-LSQ models can be up to 35× smaller than dense RNN baselines (Kusupati et al., 2019).

4. Internal-State Drift and Long-Horizon Stability

In long unrolled sequences, FastGRNN hidden states may drift as time progresses, manifesting as increasing hidden-state norms and degraded task metrics (e.g., PESQ, SI-SDR in speech processing). This arises because the coefficients of the update equation are not constrained to form a contraction ($\alpha_t + z_t = 1$ is not enforced), allowing accumulation:

$$h_t = \alpha_t \odot \tilde{h}_t + z_t \odot h_{t-1}, \quad \alpha_t = \zeta(1 - z_t) + \nu$$

Empirical evaluation on 90 s of concatenated speech input reveals a rapid increase in the $\ell_1$ norm of $h_t$ and a loss of up to 0.4 PESQ points unless a correction is applied (Larraza et al., 21 Jan 2026). This drift is not evident at short training horizons (e.g., 10 s), but becomes critical in deployment.
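A scalar caricature makes the mechanism concrete: freeze the gate at one value and iterate the update. The gate value, scalars, and horizon below are illustrative, not fitted to the paper's models.

```python
def run(zeta, nu, z=0.9, h_tilde=1.0, steps=500):
    """Iterate h_t = alpha*h_tilde + z*h_{t-1} with a frozen gate value z."""
    alpha = zeta * (1.0 - z) + nu
    h = 1.0
    for _ in range(steps):
        h = alpha * h_tilde + z * h
    return h

# alpha + z = 1: convex combination, the state is stable.
print(run(zeta=1.0, nu=0.0))   # stays at 1.0
# nu pushes alpha + z to 1.15 > 1: the state drifts upward over long horizons.
print(run(zeta=1.0, nu=0.15))  # drifts toward alpha / (1 - z) = 2.5
```

Because $\tilde{h}_t$ is bounded by $\tanh$, the drift saturates at an elevated level rather than diverging, but that level can sit well outside the range seen during short-horizon training.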

5. Trainable Complementary Filter for Drift Mitigation

Fast-ULCNet introduces a trainable one-pole complementary filter (“Comfi-FastGRNN”) to stabilize the FastGRNN hidden state on long inputs. The update is:

$$h_{t,\mathrm{comfi}} = \gamma h_t + (1 - \gamma) \lambda$$

where $\gamma$ and $\lambda$ are learned scalars. The filter pulls the hidden state toward the learned reference $\lambda$, preventing unbounded growth. Integration requires only two additional trainable parameters per FastGRNN layer, optimized jointly via backpropagation.
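A minimal sketch of the filter in a scalar toy setting, applied after each state update. The FastGRNN coefficients and the values of $\gamma$ and $\lambda$ below are illustrative stand-ins for the learned parameters:

```python
def comfi(h, gamma, lam):
    """One-pole complementary filter: pull h toward the reference lam."""
    return gamma * h + (1.0 - gamma) * lam

# Scalar toy update with alpha + z = 1.15 > 1, which would otherwise drift.
alpha, z, h_tilde = 0.25, 0.9, 1.0
gamma, lam = 0.95, 0.0        # illustrative filter parameters (learned in practice)
h = 1.0
for _ in range(500):
    h = alpha * h_tilde + z * h   # FastGRNN-style update
    h = comfi(h, gamma, lam)      # correction applied every step

# Without the filter h converges to alpha/(1-z) = 2.5; with it, to
# gamma*alpha / (1 - gamma*z) ≈ 1.64.
print(round(h, 2))
```

The filter shrinks the effective recurrence factor from $z$ to $\gamma z$, which is why a $\gamma$ slightly below 1 is enough to cap the drift.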

Empirical ablation shows that this correction fully restores the long-horizon performance (SI-SDR and quality metrics) of FastGRNN models to GRU or original ULCNet levels at negligible computational cost. This suggests that state drift is a tractable artifact of the unconstrained FastGRNN update and can be compensated with minimal architectural change (Larraza et al., 21 Jan 2026).

6. Empirical Performance and Evaluations

FastGRNN cells, both stand-alone and in Fast-ULCNet, match or surpass GRU/LSTM accuracy across a variety of tasks with substantially fewer parameters:

| Model | OVRL MOS | SIG MOS | BAK MOS | PESQ | SI-SDR (dB) |
|-------|----------|---------|---------|------|-------------|
| ULCNet | 3.10 | 3.39 | 3.96 | 2.62 | 16.24 |
| Fast-ULCNet | 3.09 | 3.39 | 3.95 | 2.51 | 15.99 |
| Fast-ULCNet_comfi | 3.09 | 3.39 | 3.97 | 2.50 | 16.01 |

On extended 90 s audio, drift-induced degradation in Fast-ULCNet is fully eliminated by the complementary filter extension (Larraza et al., 21 Jan 2026). For IoT tasks, FastGRNN-LSQ achieves comparable accuracy to GRU/LSTM with 2–4× model-size reduction and 10–100× lower latency (Kusupati et al., 2019).

7. Applications, Deployment, and Training Considerations

FastGRNN is particularly suited to resource-constrained endpoints (IoT, microcontrollers). Demonstrated deployments include:

  • Wake-word detection (“Hey Cortana”) with 1 KB flash models achieving F1 = 98.19 % on an Arduino Uno.
  • Speech enhancement (Fast-ULCNet) with 0.338 M parameters and 33 % real-time latency reduction compared to GRU-based ULCNet on ARM targets.
  • Integer-only inference on MCUs without FPUs, using quantized weights and piecewise-linear nonlinearities (Kusupati et al., 2019, Larraza et al., 21 Jan 2026).
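For integer-only targets, the saturating nonlinearities are swapped for piecewise-linear surrogates. The breakpoints below are the common hard-sigmoid/hard-tanh choice, shown as an assumption rather than the paper's exact approximation:

```python
import numpy as np

def hard_sigmoid(x):
    # Piecewise-linear surrogate: 0 below -2, 1 above 2, linear in between.
    return np.clip((x + 2.0) / 4.0, 0.0, 1.0)

def hard_tanh(x):
    # Piecewise-linear surrogate for tanh: clip to [-1, 1].
    return np.clip(x, -1.0, 1.0)

x = np.linspace(-4.0, 4.0, 9)
true_sig = 1.0 / (1.0 + np.exp(-x))
max_err = np.max(np.abs(hard_sigmoid(x) - true_sig))
print(max_err < 0.15)  # True: within ~0.12 of the exact sigmoid on this grid
```

Both surrogates need only shifts, multiplies, and clamps, all of which map directly onto integer arithmetic on FPU-less microcontrollers.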

Training follows standard batch SGD/Adam protocols, with the staged compression pipeline applied when small models are desired. No training instabilities are reported, including in variants with the state-correction filter. Drift is not observable on standard-length training sequences and manifests only on long-horizon inference, where it is mitigated by the complementary filter.
