URSA-GAN: Unified Domain Adaptation for ASR/SE
- URSA-GAN is a unified, domain-aware generative framework that adapts speech systems to unseen noise and channel conditions using dual noise and channel encoders.
- It leverages a dual-embedding architecture and GAN-based synthesis with dynamic stochastic perturbation to effectively generate target-style speech from minimal unlabeled data.
- Empirical results demonstrate significant performance gains, including up to a 20% CER reduction and enhanced PESQ scores across diverse noisy and channel-degraded benchmarks.
URSA-GAN is a unified, domain-aware generative framework designed to address cross-domain degradation in automatic speech recognition (ASR) and speech enhancement (SE) systems under mismatched noise and channel conditions. It leverages a dual-embedding architecture, comprising distinct noise and channel encoders, whose representations condition a generative adversarial network (GAN) to synthesize speech aligned with a target domain, even when only minimal unlabeled target data is available. URSA-GAN also introduces a dynamic stochastic perturbation regularization technique to enhance generalization to unseen domains, and demonstrates substantial improvements over prior adaptation baselines in both ASR and SE metrics across challenging compound channel+noise scenarios (Wang et al., 4 Feb 2026).
1. Motivation and Problem Context
URSA-GAN is motivated by the empirical observation that pre-trained ASR and SE models (e.g., Whisper, DEMUCS) deteriorate significantly in the presence of unseen noise types or recording-channel variations. Existing adaptation techniques often require large volumes of labeled target-domain data or focus exclusively on either noise or channel shift, limiting their robustness in real-world heterogeneous settings. URSA-GAN addresses these deficiencies by unifying adaptation to noise and channel variations, equipping an instance-level encoder for each, trained on minimal unlabeled data per target domain. These encodings then drive a GAN-based speech generator capable of producing target-style data suitable for fine-tuning downstream ASR and SE models, reducing the reliance on labeled or paired data from the target domain.
2. Dual-Embedding Architecture
The core of URSA-GAN is the dual-embedding architecture, with two specialized encoders:
- Noise Encoder: The noise encoder, built on the BEATs backbone pre-trained on diverse acoustic events, is fine-tuned on the ESC-50 dataset to stabilize general non-speech noise representations. Each target utterance is treated as a distinct “noise class,” allowing embeddings to specialize without degrading the pre-trained representations. For a target spectrogram X, the noise embedding is e_n = E_n(X). A dedicated noise reconstruction loss, which penalizes the distance between the noise embeddings of generated and target samples, enforces preservation of noise details in generated samples.
- Channel Encoder: The channel encoder leverages the MFA-Conformer, pre-trained on the HAT corpus of multi-microphone recordings, which learns representations invariant to phonetic content and specific to microphone/channel characteristics. The channel embedding for a target spectrogram X is e_c = E_c(X). Channel consistency is promoted via an analogous channel reconstruction loss that matches the channel embeddings of generated and target samples.
- Embedding Functions: Formally, the two encoders define mappings e_n = E_n(X) and e_c = E_c(X) from a spectrogram to its noise and channel embeddings.
Ablation studies indicate that both encoders and their corresponding losses are essential to the observed robustness gains.
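The dual-embedding interface can be sketched as follows. The toy functions below are illustrative stand-ins for the BEATs and MFA-Conformer encoders, and the L1 distance is an assumed form for the reconstruction losses (the paper's exact norm is not reproduced here):

```python
import numpy as np

def noise_embed(spec: np.ndarray) -> np.ndarray:
    """Toy stand-in for the BEATs-based noise encoder E_n.
    Maps a (freq, time) spectrogram to a fixed-size embedding."""
    # Illustrative: per-band energy statistics as a crude noise signature.
    return np.concatenate([spec.mean(axis=1), spec.std(axis=1)])

def channel_embed(spec: np.ndarray) -> np.ndarray:
    """Toy stand-in for the MFA-Conformer channel encoder E_c.
    Channel effects are roughly multiplicative per frequency band,
    so a log-domain band average is a crude channel signature."""
    return np.log(spec.mean(axis=1) + 1e-8)

def embedding_recon_loss(gen_spec, tgt_spec, embed_fn):
    """Distance between embeddings of generated and target spectrograms
    (L1, as an assumption; the paper's exact norm is not reproduced)."""
    return float(np.abs(embed_fn(gen_spec) - embed_fn(tgt_spec)).mean())

rng = np.random.default_rng(0)
tgt = rng.uniform(0.1, 1.0, size=(80, 100))        # target-domain spectrogram
gen = tgt + 0.01 * rng.standard_normal(tgt.shape)  # near-perfect generation
print(embedding_recon_loss(gen, tgt, noise_embed))  # small loss for a close match
```

A perfect reconstruction drives both losses to zero, which is what pushes the generator to carry the target's noise and channel signatures into its output.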
3. GAN Generator, Conditioning, and Training Objectives
Generator and FiLM Conditioning
The generator follows an encoder–decoder structure with two down-sampling convolutional layers, nine ResNet blocks (with instance normalization, ReLU, and dropout), and two transposed convolutional up-sampling layers. Both noise and channel embeddings are fused into every ResNet block via feature-wise linear modulation (FiLM), with eight independent parameter sets for fine-grained adaptation. The affine FiLM parameters γ and β, computed from the conditioning embeddings, modulate a feature map h as FiLM(h) = γ ⊙ h + β.
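FiLM conditioning amounts to a per-channel scale and shift predicted from the embeddings. A minimal numpy sketch (projection matrices and dimensions are illustrative; in URSA-GAN each ResNet block would carry its own independent parameter set):

```python
import numpy as np

def film(feature_map, noise_emb, channel_emb, W_gamma, W_beta):
    """Feature-wise linear modulation (FiLM) conditioning, as a sketch.
    gamma and beta are affine functions of the concatenated noise and
    channel embeddings; the projection matrices here are placeholders."""
    cond = np.concatenate([noise_emb, channel_emb])  # (d_n + d_c,)
    gamma = W_gamma @ cond                           # per-channel scale
    beta = W_beta @ cond                             # per-channel shift
    # Broadcast the scale and shift over the spatial (freq, time) axes.
    return gamma[:, None, None] * feature_map + beta[:, None, None]

rng = np.random.default_rng(0)
h = rng.standard_normal((64, 20, 25))       # (channels, freq, time) features
e_n, e_c = rng.standard_normal(128), rng.standard_normal(192)
W_g = rng.standard_normal((64, 320)) * 0.01
W_b = rng.standard_normal((64, 320)) * 0.01
out = film(h, e_n, e_c, W_g, W_b)
print(out.shape)  # (64, 20, 25)
```

Because γ and β act per channel, the same conditioning mechanism works at every resolution inside the generator without changing the feature-map shapes.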
Discriminator
The discriminator comprises five convolutional layers with LeakyReLU activations and varying strides, uses no batch normalization, and outputs a scalar score for real-vs-fake discrimination. Spectral normalization and a gradient penalty are used for training stability.
Objective Functions
- Adversarial Loss: A standard GAN objective in which the discriminator scores generated spectrograms against real target-domain spectrograms.
- Patch-wise Contrastive Loss (PCL): For selected intermediate generator layers, query patches are sampled from the generated output, positive patches from the corresponding locations of the input, and patches from other locations serve as negatives. A layer-specific head projects each patch feature, and an InfoNCE-style contrastive loss pulls each query toward its positive while pushing it away from the negatives.
- Noise and Channel Reconstruction Losses: Defined as above, matching the noise and channel embeddings of generated and target samples.
- Overall Loss: The total objective is a weighted sum of the adversarial, patch-wise contrastive, noise reconstruction, and channel reconstruction losses, with scalar coefficients balancing the four terms.
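The patch-wise contrastive term can be sketched as an InfoNCE loss over projected patch features. This is a generic reconstruction under assumed conventions (cosine similarity, in-batch negatives, illustrative temperature), not the paper's exact formulation:

```python
import numpy as np

def patch_nce_loss(queries, positives, temperature=0.07):
    """InfoNCE-style patch-wise contrastive loss (sketch).
    queries[i] should match positives[i]; all other positives in the
    batch act as negatives. The temperature value is illustrative."""
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = (q @ p.T) / temperature             # (N, N) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))    # cross-entropy to the diagonal

rng = np.random.default_rng(0)
feats = rng.standard_normal((16, 32))
# Aligned patch pairs yield a low loss; mismatched pairing a higher one.
aligned = patch_nce_loss(feats, feats + 0.01 * rng.standard_normal((16, 32)))
shuffled = patch_nce_loss(feats, rng.permutation(feats))
print(aligned < shuffled)  # True
```

Keeping corresponding patches close in feature space is what preserves speech content while the adversarial and reconstruction terms reshape noise and channel characteristics.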
4. Dynamic Stochastic Perturbation and Regularization
URSA-GAN introduces dynamic stochastic perturbation by injecting Gaussian noise into the target embeddings during training, regularizing the generator and promoting robustness to distributional shifts. For each target embedding e, a perturbed embedding is formed as ẽ = e + ε, with ε ~ N(0, σ²I) and the noise scale σ tuned empirically. This strategy exposes the generator to a continuum of nearby domain variations, preventing overfitting to any single embedding realization. Ablation results confirm that omitting this perturbation causes systematic degradation across both ASR and SE metrics.
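The perturbation itself is a one-liner; a minimal sketch (the empirically optimal σ is a hyperparameter of the paper and is not reproduced here):

```python
import numpy as np

def perturb_embedding(e, sigma, rng):
    """Dynamic stochastic perturbation: add isotropic Gaussian noise to a
    target-domain embedding during training. sigma is tuned empirically."""
    return e + rng.normal(0.0, sigma, size=e.shape)

rng = np.random.default_rng(0)
e = rng.standard_normal(256)
# Each training step draws a fresh perturbation, exposing the generator
# to a slightly different nearby domain every time.
e_tilde = perturb_embedding(e, sigma=0.1, rng=rng)
print(np.abs(e_tilde - e).mean() < 0.5)  # perturbation stays small: True
```

Setting σ too high would blur the domain identity encoded in e, while σ = 0 recovers the unregularized case the ablations show to be weaker.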
5. Training Procedure
URSA-GAN operates in an unpaired, domain-adaptation setting with the following protocol:
- Data Regime
- Source domain data comprises clean or multi-source speech, typically using condenser-microphone recordings.
- Target domain data is limited to 40 unlabeled utterances per domain (e.g., HAT, VBD test sets).
- Each epoch uses an equal ratio of source and target samples.
- Optimization
- 400 training epochs are conducted with the Adam optimizer.
- Discriminator: no batch normalization. Generator: instance normalization.
- Patch-wise sampling uses a fixed number of query patches per layer.
- The loss-weighting coefficients are fixed hyperparameters.
- Data Simulation: After convergence, the trained generator synthesizes large datasets from the full source data with randomly perturbed target embeddings, providing paired data for downstream ASR/SE fine-tuning.
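The simulation stage amounts to a loop over source utterances, each paired with freshly perturbed target embeddings. All names below are illustrative; the toy generator merely stands in for the trained, embedding-conditioned URSA-GAN generator:

```python
import numpy as np

def simulate_dataset(source_specs, e_noise, e_channel, generator, sigma, seed=0):
    """Generate paired (clean, target-style) data for downstream ASR/SE
    fine-tuning. Each sample uses a random perturbation of the fixed
    target embeddings, mirroring URSA-GAN's simulation stage."""
    rng = np.random.default_rng(seed)
    pairs = []
    for spec in source_specs:
        en = e_noise + rng.normal(0.0, sigma, e_noise.shape)
        ec = e_channel + rng.normal(0.0, sigma, e_channel.shape)
        pairs.append((spec, generator(spec, en, ec)))
    return pairs

# Toy generator stand-in: scale the input by a gain derived from embeddings.
toy_gen = lambda spec, en, ec: spec * (1.0 + 0.01 * float(en.mean() + ec.mean()))
rng = np.random.default_rng(1)
src = [rng.uniform(0.1, 1.0, (80, 50)) for _ in range(4)]
data = simulate_dataset(src, np.zeros(128), np.zeros(128), toy_gen, sigma=0.1)
print(len(data), data[0][0].shape == data[0][1].shape)  # 4 True
```

Because each pass draws new perturbations, repeated sweeps over the same source corpus yield distinct target-style renditions, which is how 40 unlabeled target utterances can seed a much larger fine-tuning set.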
6. Evaluation and Empirical Results
URSA-GAN was rigorously evaluated on compound and isolated domain shift scenarios across multiple benchmarks:
| Scenario | Baseline | URSA-GAN | Improvement |
|---|---|---|---|
| HAT-ESC (noise+channel, Whisper/DEMUCS) | CER=32.43%, PESQ=1.99 | CER=27.19%, PESQ=2.30 | -16.16% / +15.58% |
| HAT (channel, Whisper Tiny) | CER=10.24% | CER=8.14% | -20.51% |
| TAT (channel, Whisper Tiny) | CER=12.76% | CER=11.50% | -9.87% |
| VBD (noise, DEMUCS) | PESQ=3.05, STOI=95.2% | PESQ=3.16, STOI=95.3% | +0.11 / +0.1% |
| MOS (simulated speech HAT/TAT/VBD) | 2.90/2.55/1.49 | 4.06/3.09/2.51 | +1.16 / +0.54 / +1.02 |
Furthermore, all Whisper variants benefit from 10–25% relative CER reductions after URSA-GAN adaptation. Statistical significance was assessed with Friedman and Nemenyi tests.
- Ablation Studies
- Removing channel or noise embeddings causes substantial performance regressions.
- Omitting stochastic perturbation, noise reconstruction, or using alternative fusion (concat/add/shared FiLM) reduces performance.
- BEATs outperforms WavLM and Whisper as noise encoder for both CER and PESQ.
- Embedding Visualization: UMAP projections show well-separated clusters for novel noise and channel types, evidencing the discriminative utility of the fine-tuned BEATs and MFA-Conformer embeddings.
7. Limitations and Future Directions
URSA-GAN requires careful GAN training (hyperparameter tuning, mode collapse mitigation). The use of large pre-trained encoders (BEATs, MFA-Conformer) imposes non-trivial training costs. Ablations confirm that both encoders and reconstruction losses are indispensable, with stochastic perturbation conferring additional robustness.
Anticipated research directions include:
- Domain-aware or self-supervised pre-training for encoders,
- Replacing the GAN backbone with diffusion or score-based models,
- End-to-end joint optimization alongside ASR/SE objectives,
- Extension to dynamically varying noise/channel and real-time deployment scenarios.
These avenues suggest that domain-aware generative adaptation, as operationalized by URSA-GAN, presents a comprehensive solution to mismatched environment robustness in speech systems while opening pathways to further improvements in generalization under real-world conditions (Wang et al., 4 Feb 2026).