SuDoRM-RF Network: Efficient USS & RF Fronthaul
- The name refers chiefly to a deep learning-based universal sound separation (USS) model that performs time-domain separation with U-ConvBlocks; it is also used for a RAN fronthaul system that multiplexes RF, clock, and data signals.
- The separation model employs multi-resolution convolutions, optional sampling-frequency-independent (SFI) layers, and a permutation-invariant loss to separate diverse signal sources robustly.
- In RAN fronthaul, SuDoRM-RF delivers a 25 GHz RF carrier with sub-100 fs jitter alongside 2.5 Gb/s clock-synchronised data, supporting precise synchronization and low-latency transmission.
SuDoRM-RF Network is a designation for multiple high-performance, resource-efficient architectures spanning universal audio source separation and, in recent systems, synchronous clock and RF carrier distribution for radio access network (RAN) fronthaul. The most widely studied application is deep learning-based universal sound separation (USS) in the time domain; the same name has more recently been applied to an analog-multiplexed RF/clock/data transmission system. In separation networks, SuDoRM-RF (SUccessive DOwnsampling and Resampling of Multi-Resolution Features) combines U-ConvBlocks, depthwise convolutions, and a streamlined encoder–separator–decoder pipeline. The network is notable for computational efficiency, scalability to arbitrary numbers and types of sources, and recent extensions enabling sampling-frequency independence via continuous-time kernel parameterization (Nakamura et al., 2023, Tzinis et al., 2021, Tzinis et al., 2020). In RAN fronthaul, SuDoRM-RF denotes a fiber-optic system that carries data, clock, and a synchronous RF carrier using optical comb transmission and clock-phase caching (Clark et al., 6 Jun 2025).
1. Architectural Principles and Operator Pipeline
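SuDoRM-RF for universal sound separation is a fully convolutional, purely time-domain, end-to-end model based on an encoder–mask predictor–decoder design. The input is a mono mixture $\mathbf{x} \in \mathbb{R}^{T}$ of $T$ samples. The encoder applies a Conv1D (kernel size $K$, stride $S$) followed by a ReLU, producing a non-negative pseudo time–frequency representation $\mathbf{v} = \mathrm{ReLU}(\mathrm{Conv1D}(\mathbf{x})) \in \mathbb{R}^{C \times L}$, where $C$ is the number of encoder channels and $L$ the number of time frames.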
The mask predictor stacks U-ConvBlocks, each structured as a five-level U-Net over time resolution: successive depthwise convolutional downsamplings (stride 2) followed by nearest-neighbor upsamplings, with skip connections at each level. Its output is a set of $M$ non-negative masks $\hat{\mathbf{m}}_1, \dots, \hat{\mathbf{m}}_M \in \mathbb{R}^{C \times L}$, one per source.
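Separation multiplies each mask element-wise with the encoded mixture, $\hat{\mathbf{v}}_i = \hat{\mathbf{m}}_i \odot \mathbf{v}$, stacks the $M$ masked representations along the channel axis, and passes them to a transposed Conv1D ($CM \to M$ channels, with kernel size and stride matching the encoder) that synthesizes the estimated waveforms $\hat{\mathbf{s}}_1, \dots, \hat{\mathbf{s}}_M$.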
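The encoder–mask–decoder data flow can be sketched in NumPy as below. This is a minimal, hypothetical illustration rather than the published implementation: the function names `encode`, `decode`, and `separate` are invented here, the weights are random, the masks are random non-negative placeholders for the U-ConvBlock mask predictor, and applying one shared set of decoder filters to every source is an assumption about how the $CM \to M$ transposed convolution is arranged.

```python
import numpy as np


def encode(x, W, stride):
    """Strided Conv1D + ReLU: waveform x of shape (T,) -> v of shape (C, L).

    W has shape (C, K): C encoder filters of K taps each (no padding).
    """
    C, K = W.shape
    L = (len(x) - K) // stride + 1
    # Gather the L overlapping frames into a (K, L) matrix, then filter them.
    frames = np.stack([x[l * stride: l * stride + K] for l in range(L)], axis=1)
    return np.maximum(W @ frames, 0.0)


def decode(v_masked, D, stride):
    """Transposed Conv1D as overlap-add: (C, L) -> (L - 1) * stride + K samples.

    D has shape (C, K); each frame's C coefficients are mapped back to K samples.
    """
    C, L = v_masked.shape
    K = D.shape[1]
    out = np.zeros((L - 1) * stride + K)
    for l in range(L):
        out[l * stride: l * stride + K] += v_masked[:, l] @ D
    return out


def separate(x, W, D, masks, stride):
    """Mask the shared encoding once per source and decode each (shared decoder filters)."""
    v = encode(x, W, stride)
    return np.stack([decode(m * v, D, stride) for m in masks])


# Toy usage with random weights; `masks` stands in for the U-ConvBlock mask predictor.
rng = np.random.default_rng(0)
C, K, S, M = 16, 8, 4, 2
T = 63 * S + K                      # chosen so that L = 64 frames cover x exactly
x = rng.standard_normal(T)
W, D = rng.standard_normal((C, K)), rng.standard_normal((C, K))
L = (T - K) // S + 1
masks = rng.random((M, C, L))       # non-negative, one (C, L) mask per source
estimates = separate(x, W, D, masks, S)
print(estimates.shape)              # (2, 260)
```

Because the encoder is unpadded, choosing $T = (L-1)S + K$ makes the decoder output exactly as long as the input; other lengths would need padding or trimming.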
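Because the order of the estimated sources is arbitrary, training uses a permutation-invariant loss, the negative SI-SDR maximized over source assignments:
$$\mathcal{L} = -\max_{\pi \in \mathcal{P}_M} \frac{1}{M} \sum_{i=1}^{M} \mathrm{SI\text{-}SDR}\big(\hat{\mathbf{s}}_{\pi(i)}, \mathbf{s}_i\big),$$
with
$$\mathrm{SI\text{-}SDR}(\hat{\mathbf{s}}, \mathbf{s}) = 10 \log_{10} \frac{\lVert \alpha \mathbf{s} \rVert^2}{\lVert \alpha \mathbf{s} - \hat{\mathbf{s}} \rVert^2}, \qquad \alpha = \frac{\hat{\mathbf{s}}^{\top} \mathbf{s}}{\lVert \mathbf{s} \rVert^2},$$
where $\mathcal{P}_M$ is the set of permutations of $\{1, \dots, M\}$ and $\mathbf{s}_i$ are the reference sources.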
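A direct sketch of the negative permutation-invariant SI-SDR objective follows, assuming NumPy and a brute-force search over all $M!$ assignments (adequate for a handful of sources; larger $M$ would call for Hungarian matching instead). The helpers `si_sdr` and `pit_neg_si_sdr` and the small `eps` stabilizer are illustrative choices, not taken from the SuDoRM-RF code.

```python
import itertools

import numpy as np


def si_sdr(est, ref, eps=1e-8):
    """Scale-invariant SDR (dB) of one estimate against one reference signal."""
    alpha = np.dot(est, ref) / (np.dot(ref, ref) + eps)   # optimal scaling of ref
    target = alpha * ref
    residual = target - est
    return 10 * np.log10((np.dot(target, target) + eps) / (np.dot(residual, residual) + eps))


def pit_neg_si_sdr(estimates, references):
    """Negative permutation-invariant SI-SDR for arrays of shape (M, T).

    Exhaustively scores all M! assignments, so it is only practical for small M.
    Returns (loss, perm) where estimates[perm[i]] is matched to references[i].
    """
    M = len(references)
    best_score, best_perm = -np.inf, None
    for perm in itertools.permutations(range(M)):
        score = np.mean([si_sdr(estimates[perm[i]], references[i]) for i in range(M)])
        if score > best_score:
            best_score, best_perm = score, perm
    return -best_score, best_perm


rng = np.random.default_rng(1)
refs = rng.standard_normal((3, 1000))
# Estimates are the references in shuffled order plus a little noise.
ests = refs[[2, 0, 1]] + 0.01 * rng.standard_normal((3, 1000))
loss, perm = pit_neg_si_sdr(ests, refs)
print(perm)   # (1, 2, 0)
```

The recovered permutation undoes the shuffle, and the loss sits near $-40$ dB because the added noise is about 40 dB below the signals.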
2. U-ConvBlock Multi-Resolution Feature Mechanisms
Each U-ConvBlock is designed to expand the temporal receptive field and aggregate features while keeping parameter and FLOP counts low. Its operations are channel expansion (1×1 Conv1D + PReLU + layer normalization), an initial depthwise Conv1D, $Q$ successive depthwise strided Conv1D downsamplings (stride 2 each), and sequential nearest-neighbor upsampling with additive skip connections. The lowest-resolution features integrate long-term context; at each scale, upsampled coarser features are added to the finer features, back up to the original resolution.
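The aggregated output is projected back to the block's input channel count by a pointwise (1×1) convolution $\mathcal{P}$ and added residually to the block input:
$$\mathbf{h}_{\text{out}} = \mathbf{h}_{\text{in}} + \mathcal{P}(\mathbf{z}),$$
where $\mathbf{z}$ denotes the multi-scale features after the final upsampling stage.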
Aggregated features retain fine temporal information while integrating multi-scale context efficiently (Tzinis et al., 2021, Tzinis et al., 2020).
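The multi-resolution flow of a U-ConvBlock can be sketched as below. This is a simplified, hypothetical version for shape intuition only: ReLU stands in for PReLU, layer normalization is dropped, the zero "same" padding and odd kernel size are assumptions, and all weights are random.

```python
import numpy as np


def depthwise_conv(h, k, stride=1):
    """Per-channel 1D convolution with zero 'same' padding. h: (C, L), k: (C, K), K odd."""
    C, L = h.shape
    K = k.shape[1]
    pad = K // 2
    hp = np.pad(h, ((0, 0), (pad, pad)))
    L_out = (L + 2 * pad - K) // stride + 1
    out = np.empty((C, L_out))
    for t in range(L_out):
        out[:, t] = np.sum(hp[:, t * stride: t * stride + K] * k, axis=1)
    return out


def upsample_nearest(h, length):
    """Nearest-neighbour upsampling by 2, cropped to the finer level's length."""
    return np.repeat(h, 2, axis=1)[:, :length]


def u_conv_block(h_in, W_in, W_out, kernels):
    """Simplified U-ConvBlock: h_in (C, L) -> (C, L).

    W_in: (C_U, C) 1x1 expansion; W_out: (C, C_U) 1x1 projection;
    kernels: Q + 1 depthwise kernels of shape (C_U, K).
    """
    z = np.maximum(W_in @ h_in, 0.0)                    # expansion; ReLU replaces PReLU
    levels = [depthwise_conv(z, kernels[0])]            # full resolution
    for k in kernels[1:]:                               # Q stride-2 downsamplings
        levels.append(depthwise_conv(levels[-1], k, stride=2))
    up = levels[-1]                                     # coarsest level, longest context
    for finer in reversed(levels[:-1]):                 # add upsampled coarse to finer
        up = finer + upsample_nearest(up, finer.shape[1])
    return h_in + W_out @ up                            # pointwise projection + residual


rng = np.random.default_rng(0)
C, C_U, L, Q, K = 8, 16, 100, 4, 5
h = rng.standard_normal((C, L))
W_in, W_out = rng.standard_normal((C_U, C)), rng.standard_normal((C, C_U))
kernels = [rng.standard_normal((C_U, K)) for _ in range(Q + 1)]
print(u_conv_block(h, W_in, W_out, kernels).shape)   # (8, 100)
```

With $L = 100$ and $Q = 4$ the levels have 100, 50, 25, 13, and 7 frames; the cropping in `upsample_nearest` handles the odd lengths so every skip addition lines up.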
3. Sampling-Frequency-Independence via SFI Convolutional Layers
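Standard SuDoRM-RF kernels are tied to the sampling rate used in training, which complicates deployment across heterogeneous datasets and downstream tasks. The Sampling-Frequency-Independent (SFI) extension addresses this by parameterizing each Conv1D kernel as a continuous-time filter prototype, e.g., a modulated Gaussian frequency response for channel $c$:
$$H_c(f) = e^{j\phi_c} \exp\!\left(-\frac{(f - \mu_c)^2}{2\sigma_c^2}\right) + e^{-j\phi_c} \exp\!\left(-\frac{(f + \mu_c)^2}{2\sigma_c^2}\right),$$
where $\mu_c$ (center frequency), $\sigma_c$ (bandwidth), and $\phi_c$ (phase) are learned parameters.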
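At runtime, digital FIR weights are synthesized by a least-squares fit to the analog prototype over sampled frequencies in the Nyquist band $[0, F_s/2]$. For a target sampling rate $F_s'$, the encoder and decoder weights are resynthesized and the Conv1D kernel size and stride are rescaled so that the frame duration stays fixed:
$$K' \approx K \, \frac{F_s'}{F_s}, \qquad S' \approx S \, \frac{F_s'}{F_s},$$
where $K$ and $S$ are the kernel size and stride at the training rate $F_s$.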
This allows the mask predictor and all internal network logic to remain invariant in time resolution across disparate sampling rates (Nakamura et al., 2023).
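To make the SFI idea concrete, the sketch below fits real FIR taps to an analog prototype by least squares and rescales kernel size and stride for a new rate. It assumes a modulated Gaussian prototype (Gaussian responses centred at $\pm\mu$ with phase $\pm\phi$), taps centred on $n = 0$, a 1024-point frequency grid, and hypothetical training values $K = 32$, $S = 16$ at 16 kHz; none of these are claimed to be the settings of Nakamura et al. (2023), and all helper names are illustrative.

```python
import numpy as np


def modulated_gaussian(f, mu, sigma, phi):
    """Analog prototype H(f), f in Hz: Gaussian bumps at +/-mu with phase +/-phi."""
    return (np.exp(1j * phi) * np.exp(-(f - mu) ** 2 / (2 * sigma ** 2))
            + np.exp(-1j * phi) * np.exp(-(f + mu) ** 2 / (2 * sigma ** 2)))


def synthesize_fir(fs, K, mu, sigma, phi, n_freqs=1024):
    """Real K-tap FIR (K odd, centred taps) fitted to H(f) on [0, fs/2] by least squares."""
    n = np.arange(K) - (K - 1) // 2
    f = np.linspace(0.0, fs / 2, n_freqs)
    A = np.exp(-2j * np.pi * np.outer(f, n) / fs)      # DTFT matrix on the grid
    H = modulated_gaussian(f, mu, sigma, phi)
    # Stack real and imaginary parts so the least-squares solution is real-valued.
    h, *_ = np.linalg.lstsq(np.vstack([A.real, A.imag]),
                            np.concatenate([H.real, H.imag]), rcond=None)
    return h


def response(h, fs, f):
    """Frequency response of a centred FIR h at frequency f (Hz)."""
    n = np.arange(len(h)) - (len(h) - 1) // 2
    return np.sum(h * np.exp(-2j * np.pi * f * n / fs))


def rescale_conv(K, S, fs_train, fs_target):
    """Rescale kernel size and stride so window and hop durations (s) stay fixed."""
    ratio = fs_target / fs_train
    return max(1, round(K * ratio)), max(1, round(S * ratio))


def n_frames(n_samples, K, S):
    """Number of frames produced by an unpadded strided Conv1D."""
    return (n_samples - K) // S + 1


# 1) The same (mu, sigma, phi) yields filters at 16 kHz and 48 kHz; tap count scales with fs.
mu, sigma, phi = 1000.0, 200.0, 0.3
h16 = synthesize_fir(16000, 101, mu, sigma, phi)
h48 = synthesize_fir(48000, 303, mu, sigma, phi)
print(abs(response(h16, 16000, mu)), abs(response(h48, 48000, mu)))

# 2) Hypothetical training setup K = 32, S = 16 at 16 kHz (2 ms window, 1 ms hop).
for fs in (8000, 16000, 48000):
    K_t, S_t = rescale_conv(32, 16, 16000, fs)
    print(fs, K_t, S_t, n_frames(fs, K_t, S_t))
```

The second loop prints 999 frames per second of audio at all three rates, which is the invariance that lets the mask predictor run unchanged. When $F_s'/F_s$ does not give integer sizes, the rounding keeps the durations only approximately equal.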
4. Quantitative Performance and Resource Analysis
SuDoRM-RF reaches SI-SDRi close to state-of-the-art models at a much lower computational cost than ConvTasNet, DPRNN, Two-Step TDCN, and Demucs across standard datasets; in the table below it uses about half the GFLOPs of ConvTasNet and roughly one twentieth of those of DPRNN. The 1.0× configuration (2.7 M parameters, 2.5 GFLOPs, 0.8 GB RAM) attains 17.0 dB SI-SDRi on speech and 8.4 dB on non-speech mixtures, while the smallest 0.25× variant (0.8 M parameters, 1.0 GFLOPs) retains 13.4 dB speech SI-SDRi (Tzinis et al., 2021, Tzinis et al., 2020).
In the SFI evaluation on FUSS48k mixtures, SFI-SuDoRM-RF matches or exceeds signal-resampling baselines for sampling rates $F_s \in [8, 48]$ kHz, and its SI-SDR-based scores stay stable as $F_s$ moves away from the training rate. At $F_s = 8$ kHz, SFI-SuDoRM-RF surpasses the best resampling baseline by up to 0.8 dB. Signal resampling degrades separation (up to 1.5 dB loss at the lowest $F_s$), whereas the SFI model maintains its performance (Nakamura et al., 2023).
| Model | SI-SDRi, speech (dB) | GFLOPs | Params (M) | Mem (GB) |
|---|---|---|---|---|
| SuDoRM-RF 1.0× | 17.0 | 2.45 | 2.7 | 0.79 |
| ConvTasNet | 15.3 | 5.16 | 5.0 | 0.61 |
| DPRNN | 18.8 | 48.81 | 2.6 | 2.27 |
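The resource gaps in the table can be checked with a few lines of arithmetic. The sketch only re-expresses the three rows above as ratios; the helper names are hypothetical, and the dB-per-GFLOP score is a crude illustrative metric, not one used in the cited papers.

```python
# Figures copied from the table above: (SI-SDRi dB, GFLOPs, params in M, memory in GB).
TABLE = {
    "SuDoRM-RF 1.0x": (17.0, 2.45, 2.7, 0.79),
    "ConvTasNet": (15.3, 5.16, 5.0, 0.61),
    "DPRNN": (18.8, 48.81, 2.6, 2.27),
}


def relative_cost(table, baseline="SuDoRM-RF 1.0x"):
    """Each model's GFLOPs, params and memory as multiples of the baseline's."""
    _, g0, p0, m0 = table[baseline]
    return {name: (round(g / g0, 1), round(p / p0, 1), round(m / m0, 1))
            for name, (_, g, p, m) in table.items()}


def db_per_gflop(table):
    """Speech SI-SDRi obtained per GFLOP, a crude efficiency score."""
    return {name: round(sdr / g, 2) for name, (sdr, g, _, _) in table.items()}


print(relative_cost(TABLE))
print(db_per_gflop(TABLE))
```

On these rows, DPRNN needs about 19.9× the GFLOPs and 2.9× the memory of SuDoRM-RF 1.0× for 1.8 dB more SI-SDRi, while ConvTasNet needs about 2.1× the GFLOPs but only 0.8× the memory.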
5. Causality and Real-Time Variants
Causal operation for real-time applications (C-SuDoRM-RF++) swaps standard Conv1D and DWConv1D for causal variants (left-padding only), eliminates normalization to minimize buffering, and re-sizes internal channel widths and kernel lengths for depth compensation. The real-time causal model (B=8, =5) achieves 10.1 dB SI-SDRi in 88 ms per 1 s audio snippet, running >10× faster than real time on conventional CPUs (Tzinis et al., 2021).
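Two mechanics from this paragraph are easy to check in code: left-only padding makes a convolution causal, and 88 ms of processing per second of audio is a real-time factor of 0.088. The sketch below is a plain NumPy illustration with a hypothetical `causal_conv1d` helper, not the C-SuDoRM-RF++ layer itself.

```python
import numpy as np


def causal_conv1d(x, k):
    """Causal 1D convolution: left-pad by K-1 zeros so y[t] uses only x[0..t]."""
    K = len(k)
    xp = np.concatenate([np.zeros(K - 1), x])
    # Flip k so this matches y[t] = sum_j k[j] * x[t - j].
    return np.array([np.dot(xp[t:t + K], k[::-1]) for t in range(len(x))])


rng = np.random.default_rng(0)
x, k = rng.standard_normal(200), rng.standard_normal(9)
y = causal_conv1d(x, k)

# Perturbing the "future" (samples from t = 100 on) leaves earlier outputs unchanged.
x_future = x.copy()
x_future[100:] += 1.0
y_future = causal_conv1d(x_future, k)
print(np.allclose(y[:100], y_future[:100]))   # True

# Real-time factor for the quoted figure: 88 ms to process 1 s of audio.
rtf = 0.088 / 1.0
print(f"RTF = {rtf:.3f}, i.e. {1 / rtf:.1f}x faster than real time")
```

The last line reports about 11.4× faster than real time, consistent with the ">10×" figure; removing normalization matters for the same reason, since statistics over a whole segment would reintroduce a dependence on future samples.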
6. SuDoRM-RF in Radio Access Network Fronthaul
In RAN contexts, SuDoRM-RF denotes the synchronous clock and RF carrier transmission system integrating clock synchronisation, RF carrier generation (25 GHz), and clock-synchronised data (2.5 Gb/s) on a single fiber, realized via optical frequency combs and clock-phase caching. The system features:
- Menhir optical frequency comb (repetition rate $f_{\text{rep}} = 2.5$ GHz) to generate RF tones.
- WDM (200 GHz channel spacing) and analog filtering (BPF/LPF) to isolate the services spectrally.
- Clock-phase caching feedback holding RMS wander below 6.7 ps over 16 h on bidirectional links.
- Sub-100 fs jitter on 25 GHz RF carrier; BER below for 2.5 Gb/s data (Clark et al., 6 Jun 2025).
This architecture is designed to meet stringent RAN fronthaul specifications simultaneously: <100 μs latency, <100 ps synchronization, multi-10 Gb/s bandwidth, and cm-level positioning.
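The headline fronthaul figures can be related with back-of-the-envelope arithmetic. This sketch assumes the 25 GHz carrier arises as a beat note between two lines of the 2.5 GHz comb (one common way to derive an RF tone from a comb; the text above does not specify the mechanism), and uses the standard conversion from timing jitter $\sigma_t$ to carrier phase jitter $2\pi f \sigma_t$.

```python
import math

# Figures from the bullet list and requirements above.
F_REP_HZ = 2.5e9         # comb line spacing (repetition rate)
F_RF_HZ = 25e9           # RF carrier frequency
JITTER_S = 100e-15       # RF-carrier timing jitter (upper bound)
WANDER_S = 6.7e-12       # RMS clock wander with clock-phase caching (upper bound)
SYNC_BUDGET_S = 100e-12  # fronthaul synchronization requirement

# Assumption: the carrier is the beat between two comb lines this many spacings apart.
n_lines_apart = F_RF_HZ / F_REP_HZ

# Timing jitter sigma_t at carrier frequency f corresponds to phase jitter 2*pi*f*sigma_t.
phase_rms_rad = 2 * math.pi * F_RF_HZ * JITTER_S
wander_share = WANDER_S / SYNC_BUDGET_S

print(f"comb lines {n_lines_apart:.0f} spacings apart")
print(f"phase jitter < {phase_rms_rad * 1e3:.1f} mrad ({math.degrees(phase_rms_rad):.2f} deg)")
print(f"wander uses < {wander_share:.1%} of the 100 ps budget")
```

With these numbers the carrier corresponds to lines 10 spacings apart, 100 fs of jitter is about 15.7 mrad (0.90°) of carrier phase, and the 6.7 ps wander bound uses under 7% of the 100 ps synchronization budget.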
7. Impact and Significance
SuDoRM-RF advances USS deployment by combining efficient receptive field scaling (via multi-resolution convolutional blocks), near-minimal parameter/FLOP complexity, sampling-rate agnosticism through SFI layers, and high-quality permutation-invariant separation losses. Resource analysis demonstrates threefold parameter and computational reduction versus nearest state-of-the-art, with robust generalization to variable source types and count. In fronthaul transmission, SuDoRM-RF consolidates clock, RF, and data paths, offering stable synchronization and carrier delivery with minimal hardware at distributed radio units, directly fulfilling next-generation requirements for 5G/6G convergence.
A plausible implication is that SFI layer design and multi-resolution convolutional blocks may become baseline methodologies for future USS architectures targeting edge deployment and cross-application source separation. Similarly, in RAN contexts, fiber-multiplexed SuDoRM-RF may define clock/RF/data infrastructure standards for ultra-low-latency wireless positioning and sensing (Nakamura et al., 2023, Tzinis et al., 2021, Tzinis et al., 2020, Clark et al., 6 Jun 2025).