Fast-ULCNet: Ultra Low-Complexity Speech Enhancement
- The paper introduces Fast-ULCNet, an ultra low-complexity architecture that replaces traditional GRU layers with FastGRNN cells and integrates a trainable complementary filter to maintain stability over long sequences.
- Empirical results show that Fast-ULCNet achieves nearly identical PESQ and SI-SDR scores to ULCNet on 10-second signals while reducing parameters by 50% and inference latency by over 30% on embedded platforms.
- The study highlights that explicit state regularization through the complementary filter is essential for mitigating state drift in fast recurrent units during extended audio processing.
Fast-ULCNet denotes an ultra low-complexity deep learning architecture for single-channel speech enhancement tailored to resource-constrained real-time environments. Derived from the state-of-the-art ULCNet pipeline, Fast-ULCNet adapts the recurrent layers by substituting traditional GRUs with FastGRNN units to dramatically reduce both parameter count and inference latency. The system further innovates with a trainable complementary filter ("Comfi-FastGRNN") to address stability issues in sequential processing of long audio signals. This architecture achieves competitive enhancement performance with substantially reduced computational and memory demands, making it suitable for embedded and edge applications (Larraza et al., 21 Jan 2026).
1. Architectural Composition
The original ULCNet is a pipeline combining depthwise separable convolutions, power-law compression, and recurrent modeling to achieve high-quality speech enhancement with ultra-low complexity. The input representation consists of real and imaginary STFT components subjected to power-law compression (exponent 0.3). Channel-wise feature reorientation operates with overlapping rectangular windows (1.5 kHz bandwidth, 0.33 overlap), reducing 513 frequency bins to approximately 342 channels.
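The power-law compression step can be sketched as follows. Compressing the magnitude while preserving phase (and then feeding the compressed real and imaginary parts to the network) is one common reading of this operation; the function name is illustrative:

```python
import numpy as np

def power_law_compress(stft, c=0.3):
    """Raise STFT magnitudes to the power c while preserving phase.
    The compressed real and imaginary parts then form the network input."""
    mag = np.abs(stft)
    phase = np.angle(stft)
    return (mag ** c) * np.exp(1j * phase)

# A bin with magnitude 8 is compressed to magnitude 8**0.3 ~ 1.87
x = np.array([8.0 + 0.0j])
y = power_law_compress(x)
```

Compression of this kind reduces the dynamic range of the spectrogram, which eases optimization for small networks.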
The main convolution block applies four depthwise-separable 1×3 convolutions along the frequency axis, with a progressive increase in filters (32, 64, 96, 128) and max-pooling (factor 2) in the last three layers. A bidirectional GRU ("Freq-GRU," 64 units), followed by a 1×1 convolution, operates along frequency to provide inter-channel modeling. Subband temporal modeling is performed by two sequential blocks, each with two GRU layers (128 units each), processing temporal dependencies within each subband.
The network splits into two heads. Stage 1—the mask estimator—comprises two fully connected layers (257 units) yielding a real-valued magnitude mask. Stage 2 performs phase refinement via two 2D convolutions (32 filters of 1×3) and a pointwise convolution, producing real and imaginary CRM outputs to reconstruct the enhanced complex spectrogram.
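The two-stage output can be illustrated with a minimal sketch. Applying the CRM multiplicatively to the magnitude-masked spectrogram is an assumption about how the two stages compose, and all names below are hypothetical:

```python
import numpy as np

def apply_two_stage(noisy_stft, mag_mask, crm_real, crm_imag):
    """Stage 1: scale magnitudes with a real-valued mask, keeping the
    noisy phase. Stage 2: refine with a complex ratio mask (CRM).
    Illustrative composition only, not the paper's exact wiring."""
    stage1 = mag_mask * np.abs(noisy_stft) * np.exp(1j * np.angle(noisy_stft))
    return (crm_real + 1j * crm_imag) * stage1

# With a unit mask and identity CRM, the input passes through unchanged
X = np.array([1.0 + 2.0j, -0.5 + 0.3j])
out = apply_two_stage(X, 1.0, 1.0, 0.0)
```

Stage 1 alone can only rescale magnitudes; the complex Stage 2 output is what allows phase correction.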
Fast-ULCNet introduces two modifications to this topology:
- All GRU layers (the frequency-GRU and the four 128-unit temporal GRUs) are replaced with FastGRNN cells configured to the same hidden dimensionality.
- Optionally, a trainable complementary filter ("Comfi-FastGRNN") is affixed to each FastGRNN, regularizing hidden state dynamics during long-sequence inference.
All other architectural elements are retained without alteration (Larraza et al., 21 Jan 2026).
2. FastGRNN: Formulation and Justification
FastGRNN (Fast, Accurate, Stable and Tiny Gated Recurrent Neural Network; Kusupati et al., NeurIPS 2018) employs a streamlined mechanism for recurrent sequence modeling. It uses shared weight matrices for both gating and state update, and introduces two trainable scalars, $\zeta$ and $\nu$, to modulate the hidden state evolution.
Let $x_t$ and $h_{t-1}$ denote the input and previous hidden state at time $t$, respectively. FastGRNN computes:

$$z_t = \sigma(W x_t + U h_{t-1} + b_z)$$
$$\tilde{h}_t = \tanh(W x_t + U h_{t-1} + b_h)$$
$$h_t = \left(\zeta (1 - z_t) + \nu\right) \odot \tilde{h}_t + z_t \odot h_{t-1}$$

where $W$ and $U$ are the shared input and recurrent weight matrices, $b_z$ and $b_h$ are biases, $\sigma$ denotes the sigmoid function, and $\odot$ indicates elementwise multiplication.
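The FastGRNN update can be sketched in NumPy (a minimal single-step implementation; the toy dimensions and random weights are illustrative):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def fastgrnn_step(x_t, h_prev, W, U, b_z, b_h, zeta, nu):
    """One FastGRNN step: gate and candidate share the matrices W and U;
    zeta and nu are the trainable scalars modulating the update."""
    pre = W @ x_t + U @ h_prev          # shared pre-activation
    z = sigmoid(pre + b_z)              # update gate
    h_tilde = np.tanh(pre + b_h)        # candidate state
    return (zeta * (1.0 - z) + nu) * h_tilde + z * h_prev

# Toy rollout: input size 3, hidden size 2
rng = np.random.default_rng(0)
W = 0.1 * rng.standard_normal((2, 3))
U = 0.1 * rng.standard_normal((2, 2))
h = np.zeros(2)
for _ in range(5):
    h = fastgrnn_step(rng.standard_normal(3), h, W, U,
                      np.zeros(2), np.zeros(2), zeta=1.0, nu=0.0)
```

With $\zeta = 1$ and $\nu = 0$ the update reduces to a convex combination of the candidate and previous state, so the state stays within $[-1, 1]$.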
FastGRNN achieves substantial parameter and computation savings (shared weights, scalar gates), directly leading to a 50% reduction in the total parameter count when replacing GRUs, and an 18% reduction in MACs across the full network. In Fast-ULCNet, this reduces the model footprint from 0.685 M to 0.338 M parameters and lowers MACs from 2.057 M to 1.691 M (Larraza et al., 21 Jan 2026).
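The per-layer saving behind these figures can be checked with back-of-the-envelope counts (bias conventions vary between implementations; the formulas below assume one bias vector per GRU gate):

```python
def gru_params(n, m):
    """GRU with hidden size n, input size m: three blocks (reset, update,
    candidate), each with input weights, recurrent weights, and a bias."""
    return 3 * (n * m + n * n + n)

def fastgrnn_params(n, m):
    """FastGRNN: one shared W and U, two bias vectors, and the two
    scalars zeta and nu."""
    return n * m + n * n + 2 * n + 2

# The 128-unit temporal layers shrink to roughly a third per layer; the
# network-wide reduction is ~50% because the convolutional layers, which
# account for the remaining parameters, are unchanged.
print(gru_params(128, 128), fastgrnn_params(128, 128))
```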
3. State Drift and the Comfi-FastGRNN Filter
Despite formal stability guarantees during training, FastGRNN exhibits an internal state drift problem during inference on long input sequences (e.g., the 90 s signals in the extended test set). Empirical observations show that the mean magnitude of the hidden state $h_t$ can drift upward, resulting in perceptible degradation of speech enhancement quality, particularly in metrics such as PESQ and SI-SDR on longer signals.
To counteract this, a trainable complementary filter is introduced, which passes fast variations of the hidden state while pulling its slow component toward a learned anchor:

$$\hat{h}_t = \alpha \left( \hat{h}_{t-1} + h_t - h_{t-1} \right) + (1 - \alpha)\, b$$

where $0 < \alpha < 1$ determines the memory decay and $b$ is a learnable steady-state bias. Both $\alpha$ and $b$ are optimized jointly with all other model parameters. This formulation is a first-order trainable IIR filter, ensuring that the latent state remains bounded around $b$ for arbitrarily long sequences (Larraza et al., 21 Jan 2026). A plausible implication is that similar state-regularization filters could generalize to other RNN architectures employed in long audio sequence modeling.
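The drift-suppressing behavior can be demonstrated on a synthetic signal. This is an illustrative NumPy sketch of a first-order complementary filter consistent with the description above, not the paper's exact implementation:

```python
import numpy as np

def comfi_filter(h, alpha=0.99, b=0.0):
    """First-order complementary filter (illustrative reconstruction):
    fast variations of h pass through, while its slow drift is pulled
    toward the steady-state bias b."""
    out = np.empty_like(h)
    prev_h, prev_out = 0.0, b
    for t, h_t in enumerate(h):
        prev_out = alpha * (prev_out + h_t - prev_h) + (1 - alpha) * b
        out[t] = prev_out
        prev_h = h_t
    return out

# A state with slow linear drift plus a fast oscillation
t = np.arange(20000)
h = 1e-3 * t + 0.1 * np.sin(0.5 * t)
filtered = comfi_filter(h)
```

The raw state grows without bound, while the filtered state stays within a small neighborhood of the bias $b$; only the fast oscillation survives the filtering.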
4. Training Methodology and Hyperparameters
Training employs the Interspeech 2020 DNS Challenge synthetic non-reverberant dataset, comprising 1000 hours of noisy 10 s speech mixtures sampled at 16 kHz with SNR drawn uniformly from [–10, 30] dB. An 85/15 train/validation split is used, with evaluation on both canonical 10 s and synthetically concatenated 90 s test sets.
Inputs undergo a 32 ms STFT (16 ms hop, 512-point FFT) and power-law compression (exponent 0.3). The loss function jointly penalizes per-bin compressed-magnitude and compressed-complex differences:

$$\mathcal{L} = (1-\gamma) \sum_{t,f} \left( |S|^{c} - |\hat{S}|^{c} \right)^{2} + \gamma \sum_{t,f} \left| |S|^{c} e^{j\varphi_{S}} - |\hat{S}|^{c} e^{j\varphi_{\hat{S}}} \right|^{2}$$

where $S$ and $\hat{S}$ denote the clean and estimated spectrograms, $\varphi$ their phases, $c = 0.3$ is the compression exponent, and $\gamma$ weights the complex term.
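A NumPy sketch of a compressed-spectral loss of this kind follows; the weighting `gamma` and its default value here are assumptions, not taken from the paper:

```python
import numpy as np

def compressed_spectral_loss(S, S_hat, c=0.3, gamma=0.3):
    """Combined compressed-magnitude and compressed-complex MSE loss.
    gamma=0.3 is a common choice for this family of losses, not a value
    reported in the paper."""
    mag, mag_hat = np.abs(S), np.abs(S_hat)
    eps = 1e-12
    Sc = (mag ** c) * S / (mag + eps)            # |S|^c * e^{j*phase}
    Sc_hat = (mag_hat ** c) * S_hat / (mag_hat + eps)
    mag_term = np.mean((mag ** c - mag_hat ** c) ** 2)
    cplx_term = np.mean(np.abs(Sc - Sc_hat) ** 2)
    return (1 - gamma) * mag_term + gamma * cplx_term

# Identical spectrograms give (numerically) zero loss
S = np.array([[1 + 1j, 2 - 1j]])
loss = compressed_spectral_loss(S, S)
```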
Optimization uses Adam with gradient clipping at norm 3.0 and an adaptive learning-rate schedule: the rate is halved after 3 epochs without validation improvement, with early stopping after 5 such epochs. No explicit data augmentation beyond randomized SNR mixing is performed. Each training epoch encompasses approximately 4000 steps (batch size 32), and validation 1000 steps (Larraza et al., 21 Jan 2026).
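The described schedule can be sketched as follows (`lr0` is a placeholder default; the paper's initial learning rate is not reproduced here):

```python
def plateau_schedule(val_losses, lr0=1e-3, patience=3, stop_patience=5):
    """Sketch of the described schedule: halve the learning rate after
    `patience` consecutive epochs with no validation improvement, and
    stop training after `stop_patience` such epochs."""
    lr, best, stale = lr0, float("inf"), 0
    for loss in val_losses:
        if loss < best:
            best, stale = loss, 0
        else:
            stale += 1
            if stale == patience:
                lr *= 0.5
            if stale >= stop_patience:
                return lr, True    # early stop triggered
    return lr, False

# Validation loss improves twice, then stalls for five epochs
lr, stopped = plateau_schedule([1.0, 0.9, 0.95, 0.96, 0.97, 0.98, 0.99])
```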
5. Empirical Performance and Complexity Metrics
Objective enhancement metrics were evaluated on both 10 s and 90 s signals. The comparison of ULCNet, plain Fast-ULCNet, and Fast-ULCNet with the Comfi-FastGRNN filter is summarized as follows:

| Test Length | Model | OVRL MOS | SIG MOS | BAK MOS | PESQ | SI-SDR (dB) |
|---|---|---|---|---|---|---|
| 10 s | ULCNet | 3.10 | 3.39 | 3.96 | 2.62 | 16.24 |
| 10 s | Fast-ULCNet | 3.09 | 3.39 | 3.95 | 2.51 | 15.99 |
| 10 s | Fast-ULCNet (Comfi) | 3.09 | 3.39 | 3.97 | 2.50 | 16.01 |
| 90 s | ULCNet | 3.09 | 3.39 | 3.95 | 2.66 | 16.89 |
| 90 s | Fast-ULCNet | 2.93 | 3.39 | 3.62 | 2.24 | 13.58 |
| 90 s | Fast-ULCNet (Comfi) | 3.10 | 3.39 | 3.99 | 2.51 | 16.48 |
On 10 s segments, Fast-ULCNet sacrifices only ~0.11 PESQ points and 0.25 dB SI-SDR versus the original ULCNet. On 90 s segments, unfiltered Fast-ULCNet exhibits severe performance degradation due to state drift (PESQ ↓0.42, SI-SDR ↓3.31 dB). The Comfi-FastGRNN filter restores performance to within 0.15 PESQ points and 0.4 dB SI-SDR of the baseline, matching DNSMOS submetrics (Larraza et al., 21 Jan 2026).
Computational statistics on two embedded platforms are:

| Model | Params (M) | MACs (M) | RTF (platform A) | RTF (platform B) |
|---|---|---|---|---|
| ULCNet | 0.685 | 2.057 | 0.976 | 0.927 |
| Fast-ULCNet | 0.338 | 1.691 | 0.657 | 0.604 |

The real-time factor (RTF, processing time divided by audio duration; values below 1 indicate real-time capability) shows that Fast-ULCNet is 33–35% faster on both platforms, confirming suitability for resource-constrained hardware.
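The reported speedup follows directly from the RTF values:

```python
def rtf(processing_time_s, audio_duration_s):
    """Real-time factor: values below 1.0 mean faster than real time.
    E.g. processing 10 s of audio in 5 s gives rtf = 0.5."""
    return processing_time_s / audio_duration_s

# Relative latency reduction of Fast-ULCNet over ULCNet,
# from the reported RTF pairs
speedup_a = 1 - 0.657 / 0.976    # ~33%
speedup_b = 1 - 0.604 / 0.927    # ~35%
```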
6. Trade-Offs, Limitations, and Future Directions
The main trade-off of Fast-ULCNet lies in lower complexity versus a small but observable reduction in enhancement quality on short signals. For long sequences, stability collapses without explicit state control, making the complementary filter indispensable for parity with ULCNet. The key insight is that aggressive parameter reduction through FastGRNN is only feasible if coupled with explicit state regularization; architectural compactness alone does not guarantee inference stability.
Limitations include marginally inferior PESQ and SI-SDR relative to the original model, even with the complementary filter. Future directions highlighted include exploration of more sophisticated filters or adaptive gating, richer perceptual evaluation (e.g., listening tests, STOI), and integration of Comfi-FastGRNN into other lightweight speech-processing tasks such as dereverberation and separation, to assess generality (Larraza et al., 21 Jan 2026).
7. Position Within Speech Enhancement Models
ULCNet [Shetu et al., ICASSP 2024] established depthwise separable convolution and compact recurrent layers as a standard for ultra-low complexity speech enhancement. Fast-ULCNet demonstrates that FastGRNN-based recurrent modeling, especially with state filtering mechanisms, can further halve parameter count and reduce latency without significant performance loss, provided explicit drift countermeasures are present. A plausible implication is that the filter-based methodology may generalize to other domains where lightweight, long-sequence RNNs are used, provided the recurrent cell lacks strong contraction properties. Fast-ULCNet sets a baseline for embedded speech enhancement under severe resource constraints, balancing complexity, stability, and empirical enhancement quality (Larraza et al., 21 Jan 2026).