
FastGRNN: Efficient Tiny RNN Architecture

Updated 22 January 2026
  • FastGRNN is an efficient recurrent network that employs a vector-valued, input-dependent gate with shared weights to enhance stability and reduce computational complexity.
  • The architecture integrates compression techniques such as low-rank decomposition, sparsity, and quantization, achieving models as small as 1 KB without sacrificing accuracy.
  • Empirical evaluations show FastGRNN offers 2–4× weight reduction and significantly lower latency, making it ideal for IoT and embedded deployments.

FastGRNNs (Fast, Accurate, Stable, and Tiny Gated Recurrent Neural Networks) are a class of efficient recurrent architectures designed to address the limitations of standard RNNs, GRUs, and LSTMs in stability, model size, computational complexity, and deployment on resource-constrained devices. Through weight sharing and minimal parameterization, FastGRNN achieves gated, expressive temporal modeling with kilobyte-scale models that match or exceed the predictive performance of conventional gated architectures, while supporting deployment on microcontrollers and embedded systems that lack hardware floating-point support (Kusupati et al., 2019, Larraza et al., 21 Jan 2026).

1. FastGRNN Architecture and Gating Mechanism

FastGRNN extends FastRNN, which incorporates a scalar, learned residual connection to stabilize standard RNNs. The key innovation of FastGRNN is the replacement of the scalar residual with a vector-valued, input- and state-dependent gate, while reusing the same input and hidden weight matrices for both gating and state-updating operations.

Let $x_t \in \mathbb{R}^d$ denote the input and $h_{t-1} \in \mathbb{R}^h$ the previous hidden state, with shared weights $W \in \mathbb{R}^{h \times d}$, $U \in \mathbb{R}^{h \times h}$ and biases $b_z, b_h \in \mathbb{R}^h$. The FastGRNN cell update is:

$$z_t = \sigma(W x_t + U h_{t-1} + b_z)$$
$$\tilde{h}_t = \tanh(W x_t + U h_{t-1} + b_h)$$
$$h_t = [\zeta(1 - z_t) + \nu] \odot \tilde{h}_t + z_t \odot h_{t-1}$$

where $\sigma$ denotes the sigmoid function, $\odot$ is elementwise multiplication, and the scalars $\zeta, \nu \in [0, 1]$ are learned (Kusupati et al., 2019, Larraza et al., 21 Jan 2026). This design contrasts sharply with GRUs, which perform three independent affine transformations per time step (for the update gate, reset gate, and candidate state), each with its own weight matrices.
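The cell update can be sketched in a few lines of NumPy. This is a minimal illustration, not the reference implementation; the toy dimensions, random initialization, and scalar values below are arbitrary choices.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def fastgrnn_step(x_t, h_prev, W, U, b_z, b_h, zeta, nu):
    # Shared pre-activation: the same W and U serve both gate and candidate.
    pre = W @ x_t + U @ h_prev
    z = sigmoid(pre + b_z)          # vector-valued gate
    h_tilde = np.tanh(pre + b_h)    # candidate state
    # zeta and nu are the two learned scalars in [0, 1]
    return (zeta * (1.0 - z) + nu) * h_tilde + z * h_prev

rng = np.random.default_rng(0)
d, h = 3, 4                          # toy input and hidden sizes
W = 0.1 * rng.standard_normal((h, d))
U = 0.1 * rng.standard_normal((h, h))
b_z, b_h = np.zeros(h), np.zeros(h)

h_t = np.zeros(h)
for x_t in rng.standard_normal((10, d)):
    h_t = fastgrnn_step(x_t, h_t, W, U, b_z, b_h, zeta=1.0, nu=0.0)
print(h_t.shape)  # (4,)
```

With $\zeta = 1$ and $\nu = 0$ the update is a convex combination of candidate and previous state, so the hidden state stays bounded by the $\tanh$ range.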

| Cell Type | Weight Matrices | Multiplications per Step | Learned Scalars |
|-----------|-----------------|--------------------------|-----------------|
| GRU | $W_z$, $W_r$, $W_h$, $U_z$, $U_r$, $U_h$ | $3(dh + h^2)$ | none |
| FastGRNN | $W$, $U$ (shared) | $2(dh + h^2)$ | $\zeta$, $\nu$ |

By leveraging shared weights and two learned scalars, FastGRNN reduces the parameter count to roughly one-third of a GRU, achieving 2–4× reductions in weights and computation per time-step with comparable accuracy.

2. Model Compression: Low-Rank, Sparsity, and Quantization

FastGRNN supports aggressive compression via:

  • Low-rank decomposition: $W = W^1 (W^2)^\top$, $U = U^1 (U^2)^\top$ with $W^1 \in \mathbb{R}^{h \times r_w}$, $W^2 \in \mathbb{R}^{d \times r_w}$, and similarly for $U$ with rank $r_u$, where $r_w, r_u$ are selected for the desired trade-off between accuracy and size.
  • Sparsity: hard thresholding of the factors $W^i$, $U^i$ retains only the largest-magnitude entries.
  • Quantization: non-zero weights are quantized to 8-bit integers; the $\tanh$ and sigmoid nonlinearities are replaced by piecewise-linear approximations, allowing fast, integer-only inference.

The complete compression pipeline is staged into (1) unconstrained low-rank optimization, (2) iterative hard thresholding, and (3) support-freeze fine-tuning (Kusupati et al., 2019). This realizes models as small as 1 KB, suitable for microcontrollers with just kilobytes of RAM and flash.
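The three stages can be sketched on a single weight factor. The sizes, rank, and sparsity level below are illustrative assumptions, not the settings used in the papers:

```python
import numpy as np

rng = np.random.default_rng(1)
h, d, r_w = 64, 32, 8        # illustrative hidden size, input dim, and rank

# (1) Low-rank: W = W1 @ W2.T with W1 (h x r_w) and W2 (d x r_w),
# storing r_w*(h + d) values instead of h*d.
W1 = rng.standard_normal((h, r_w))
W2 = rng.standard_normal((d, r_w))

# (2) Sparsity: keep only the s largest-magnitude entries of a factor.
def hard_threshold(M, s):
    out = np.zeros(M.size)
    top = np.argsort(np.abs(M).ravel())[-s:]  # indices of the s largest entries
    out[top] = M.ravel()[top]
    return out.reshape(M.shape)

W1_sparse = hard_threshold(W1, s=256)         # 256 of 512 entries survive

# (3) Quantization: map the surviving weights to signed 8-bit integers.
scale = np.abs(W1_sparse).max() / 127.0
W1_q = np.round(W1_sparse / scale).astype(np.int8)

print(np.count_nonzero(W1_sparse), W1_q.dtype)  # 256 int8
```

In the staged pipeline, the thresholding step is applied iteratively during training and the support is then frozen for fine-tuning; the one-shot version above only shows the operations themselves.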

3. Computational Complexity and Latency

The shared-weight, single-gate structure of FastGRNN yields superior computational efficiency and reduced memory footprint relative to gated RNNs:

  • Parameter comparison: for hidden size $h$ and input dimension $d$, a GRU requires $3(dh + h^2 + h)$ parameters, while FastGRNN requires $(dh + h^2) + 2h$ parameters (shared weights plus the biases for $z_t$ and $\tilde{h}_t$) and two scalars.
  • Operational complexity: per time step, FastGRNN executes $2(dh + h^2)$ multiplications, compared to $3(dh + h^2)$ for a GRU.
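Plugging illustrative sizes into these formulas makes the ratios concrete (the choice of $d$ and $h$ here is arbitrary, not taken from the papers):

```python
# Illustrative sizes:
d, h = 64, 128

gru_params = 3 * (d * h + h * h + h)            # three gates: W, U, bias each
fastgrnn_params = (d * h + h * h) + 2 * h + 2   # shared W, U; b_z, b_h; zeta, nu

gru_mults = 3 * (d * h + h * h)
fastgrnn_mults = 2 * (d * h + h * h)

print(round(gru_params / fastgrnn_params, 2))  # ≈ 2.98: roughly one-third the parameters
print(gru_mults / fastgrnn_mults)              # 1.5
```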

Empirical evidence from the Fast-ULCNet study demonstrates:

| Model | Params (M) | MACs (M) | RTF @ Pi 3 | RTF @ ARM |
|-------|------------|----------|------------|-----------|
| ULCNet | 0.685 | 2.057 | 0.976 | 0.927 |
| Fast-ULCNet | 0.338 | 1.691 | 0.657 | 0.604 |

A reduction of ≈51% in parameters and ≈33–35% in real-time factor (RTF) is observed on embedded CPUs (Larraza et al., 21 Jan 2026). FastGRNN-LSQ models can be up to 35× smaller than dense RNN baselines (Kusupati et al., 2019).

4. Internal-State Drift and Long-Horizon Stability

In long unrolled sequences, FastGRNN hidden states may drift as time progresses, manifesting as increasing hidden-state norms and degraded task metrics (e.g., PESQ, SI-SDR in speech processing). This arises because the coefficients of the update equation are not constrained to form a contraction ($\alpha_t + z_t = 1$ is not enforced), allowing accumulation:

$$h_t = \alpha_t \odot \tilde{h}_t + z_t \odot h_{t-1}, \quad \alpha_t = \zeta(1 - z_t) + \nu$$

Empirical evaluation on 90 s of concatenated speech input reveals a rapid increase in the $\ell_1$ norm of $h_t$ and a loss of up to 0.4 PESQ points unless a correction is applied (Larraza et al., 21 Jan 2026). This drift is not evident at short training horizons (e.g., 10 s), but becomes critical in deployment.
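A scalar caricature makes the mechanism concrete: freeze the gate at one value and iterate the update. The gate value, scalars, and horizon below are illustrative, not fitted to the paper's models.

```python
def run(zeta, nu, z=0.9, h_tilde=1.0, steps=500):
    """Iterate h_t = alpha*h_tilde + z*h_{t-1} with a frozen gate value z."""
    alpha = zeta * (1.0 - z) + nu
    h = 1.0
    for _ in range(steps):
        h = alpha * h_tilde + z * h
    return h

# alpha + z = 1: convex combination, the state is stable.
print(run(zeta=1.0, nu=0.0))   # stays at 1.0
# nu pushes alpha + z to 1.15 > 1: the state drifts upward over long horizons.
print(run(zeta=1.0, nu=0.15))  # drifts toward alpha / (1 - z) = 2.5
```

Because $\tilde{h}_t$ is bounded by $\tanh$, the drift saturates at an elevated level rather than diverging, but that level can sit well outside the range seen during short-horizon training.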

5. Trainable Complementary Filter for Drift Mitigation

Fast-ULCNet introduces a trainable one-pole complementary filter (“Comfi-FastGRNN”) to stabilize the FastGRNN hidden state on long inputs. The update is:

$$h_{t,\mathrm{comfi}} = \gamma h_t + (1 - \gamma) \lambda$$

where $\gamma$ and $\lambda$ are learned scalars. The filter pulls the hidden state toward the learned reference $\lambda$, preventing unbounded growth. Integration requires only two additional trainable parameters per FastGRNN layer, optimized jointly via backpropagation.
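A minimal sketch of the filter in a scalar toy setting, applied after each state update. The FastGRNN coefficients and the values of $\gamma$ and $\lambda$ below are illustrative stand-ins for the learned parameters:

```python
def comfi(h, gamma, lam):
    """One-pole complementary filter: pull h toward the reference lam."""
    return gamma * h + (1.0 - gamma) * lam

# Scalar toy update with alpha + z = 1.15 > 1, which would otherwise drift.
alpha, z, h_tilde = 0.25, 0.9, 1.0
gamma, lam = 0.95, 0.0        # illustrative filter parameters (learned in practice)
h = 1.0
for _ in range(500):
    h = alpha * h_tilde + z * h   # FastGRNN-style update
    h = comfi(h, gamma, lam)      # correction applied every step

# Without the filter h converges to alpha/(1-z) = 2.5; with it, to
# gamma*alpha / (1 - gamma*z) ≈ 1.64.
print(round(h, 2))
```

The filter shrinks the effective recurrence factor from $z$ to $\gamma z$, which is why a $\gamma$ slightly below 1 is enough to cap the drift.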

Empirical ablation shows that this correction fully restores the long-horizon performance (SI-SDR and quality metrics) of FastGRNN models to GRU or original ULCNet levels at negligible computational cost. This suggests that state drift is a tractable artifact of the unconstrained FastGRNN update and can be compensated with minimal architectural change (Larraza et al., 21 Jan 2026).

6. Empirical Performance and Evaluations

FastGRNN cells, both stand-alone and in Fast-ULCNet, match or surpass GRU/LSTM accuracy across a variety of tasks with substantially fewer parameters:

| Model | OVRL MOS | SIG MOS | BAK MOS | PESQ | SI-SDR (dB) |
|-------|----------|---------|---------|------|-------------|
| ULCNet | 3.10 | 3.39 | 3.96 | 2.62 | 16.24 |
| Fast-ULCNet | 3.09 | 3.39 | 3.95 | 2.51 | 15.99 |
| Fast-ULCNet_comfi | 3.09 | 3.39 | 3.97 | 2.50 | 16.01 |

On extended 90 s audio, drift-induced degradation in Fast-ULCNet is fully eliminated by the complementary filter extension (Larraza et al., 21 Jan 2026). For IoT tasks, FastGRNN-LSQ achieves comparable accuracy to GRU/LSTM with 2–4× model-size reduction and 10–100× lower latency (Kusupati et al., 2019).

7. Applications, Deployment, and Training Considerations

FastGRNN is particularly suited to resource-constrained endpoints (IoT, microcontrollers). Demonstrated deployments include:

  • Wake-word detection (“Hey Cortana”) with 1 KB flash models achieving F1 = 98.19 % on an Arduino Uno.
  • Speech enhancement (Fast-ULCNet) with 0.338 M parameters and 33 % real-time latency reduction compared to GRU-based ULCNet on ARM targets.
  • Integer-only inference on MCUs without FPUs, using quantized weights and piecewise-linear nonlinearities (Kusupati et al., 2019, Larraza et al., 21 Jan 2026).
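For integer-only targets, the saturating nonlinearities are swapped for piecewise-linear surrogates. The breakpoints below are the common hard-sigmoid/hard-tanh choice, shown as an assumption rather than the paper's exact approximation:

```python
import numpy as np

def hard_sigmoid(x):
    # Piecewise-linear surrogate: 0 below -2, 1 above 2, linear in between.
    return np.clip((x + 2.0) / 4.0, 0.0, 1.0)

def hard_tanh(x):
    # Piecewise-linear surrogate for tanh: clip to [-1, 1].
    return np.clip(x, -1.0, 1.0)

x = np.linspace(-4.0, 4.0, 9)
true_sig = 1.0 / (1.0 + np.exp(-x))
max_err = np.max(np.abs(hard_sigmoid(x) - true_sig))
print(max_err < 0.15)  # True: within ~0.12 of the exact sigmoid on this grid
```

Both surrogates need only shifts, multiplies, and clamps, all of which map directly onto integer arithmetic on FPU-less microcontrollers.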

Training follows standard batch SGD/Adam protocols, with the staged compression pipeline applied when small models are desired. No training instabilities are reported, including in variants with the state-correction filter. Drift is not observable on standard-length training sequences and manifests only on long-horizon inference, where it is mitigated by the complementary filter.
