
URSA-GAN: Unified Domain Adaptation for ASR/SE

Updated 5 February 2026
  • URSA-GAN is a unified, domain-aware generative framework that adapts speech systems to unseen noise and channel conditions using dual noise and channel encoders.
  • It leverages a dual-embedding architecture and GAN-based synthesis with dynamic stochastic perturbation to effectively generate target-style speech from minimal unlabeled data.
  • Empirical results demonstrate significant performance gains, including up to a 20% CER reduction and enhanced PESQ scores across diverse noisy and channel-degraded benchmarks.

URSA-GAN is a unified, domain-aware generative framework designed to address cross-domain degradation in automatic speech recognition (ASR) and speech enhancement (SE) systems under mismatched noise and channel conditions. It leverages a dual-embedding architecture—comprising distinct noise and channel encoders—whose representations condition a generative adversarial network (GAN) to synthesize speech aligned with a target domain, even when only minimal unlabeled target data is available. URSA-GAN also introduces a dynamic stochastic perturbation regularization technique to enhance generalization to unseen domains, and demonstrates substantial improvements over prior adaptation baselines in both ASR and SE metrics across challenging compound channel+noise scenarios (Wang et al., 4 Feb 2026).

1. Motivation and Problem Context

URSA-GAN is motivated by the empirical observation that pre-trained ASR and SE models (e.g., Whisper, DEMUCS) deteriorate significantly in the presence of unseen noise types or recording-channel variations. Existing adaptation techniques often require large volumes of labeled target-domain data or focus exclusively on either noise or channel shift, limiting their robustness in real-world heterogeneous settings. URSA-GAN addresses these deficiencies by unifying adaptation to noise and channel variations, equipping each with an instance-level encoder trained on minimal unlabeled data per target domain. These encodings subsequently drive a GAN-based speech generator capable of producing target-style data suitable for fine-tuning downstream ASR and SE models, reducing the reliance on labeled or paired data from the target domain.

2. Dual-Embedding Architecture

The core of URSA-GAN is the dual-embedding architecture, with two specialized encoders:

  • Noise Encoder. The noise encoder, built on the BEATs backbone pre-trained on diverse acoustic events, is fine-tuned on the ESC-50 dataset to stabilize general non-speech noise representations. Each target utterance is treated as a distinct “noise class,” allowing embeddings to specialize without degrading the pre-trained representations. For a target spectrogram $X^T$, the noise embedding is $N^T = E_{\text{noise}}(X^T)$. A dedicated noise reconstruction loss,

$$\mathcal{L}_{\mathrm{NR}} = \mathbb{E}_{X^S,N^T,C^T}\bigl[\|\,N^T - E_{\text{noise}}\bigl(G(X^S,N^T,C^T)\bigr)\|_1\bigr],$$

enforces preservation of noise details in generated samples.

  • Channel Encoder. The channel encoder leverages the MFA-Conformer, pre-trained on the HAT corpus with multi-microphone recordings, which learns representations invariant to phonetic content and specific to microphone/channel characteristics. The channel embedding for a target spectrogram is $C^T = E_{\text{channel}}(X^T)$. Channel consistency is promoted via:

$$\mathcal{L}_{\mathrm{CC}} = \mathbb{E}_{X^S,N^T,C^T}\bigl[\|\,C^T - E_{\text{channel}}\bigl(G(X^S,N^T,C^T)\bigr)\|_1\bigr].$$

  • Embedding Functions. Formally, $N^T = E_{\text{noise}}(X^T)$ and $C^T = E_{\text{channel}}(X^T)$.

Ablation studies indicate that both encoders and their corresponding losses are essential to the observed robustness gains.
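Both reconstruction losses are plain L1 distances in embedding space between the target embedding and the embedding re-extracted from the generated sample. A minimal numpy sketch with toy linear stand-ins for the encoders and generator (the real system uses BEATs, MFA-Conformer, and the GAN generator of Section 3):

```python
import numpy as np

def l1_embedding_loss(target_emb, generated_emb):
    """Mean L1 distance between a target embedding and the embedding
    re-extracted from a generated sample (form of both L_NR and L_CC)."""
    return np.mean(np.abs(target_emb - generated_emb))

# Hypothetical fixed linear "encoders" standing in for BEATs / MFA-Conformer.
rng = np.random.default_rng(0)
def toy_encoder(x, proj):
    return x.mean(axis=0) @ proj  # pool over frames, then project

d_feat, d_emb = 80, 16
P_noise = rng.normal(size=(d_feat, d_emb))
P_chan = rng.normal(size=(d_feat, d_emb))

x_target = rng.normal(size=(100, d_feat))   # target spectrogram X^T
n_t = toy_encoder(x_target, P_noise)        # N^T = E_noise(X^T)
c_t = toy_encoder(x_target, P_chan)         # C^T = E_channel(X^T)

# Pretend generator output: close to the target, so drift should be small.
x_gen = x_target + 0.01 * rng.normal(size=x_target.shape)
loss_nr = l1_embedding_loss(n_t, toy_encoder(x_gen, P_noise))
loss_cc = l1_embedding_loss(c_t, toy_encoder(x_gen, P_chan))
```

In training, the gradient of these losses flows into the generator, pushing it to reproduce the target's noise and channel signatures.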

3. GAN Generator, Conditioning, and Training Objectives

Generator and FiLM Conditioning

The generator $G$ follows an encoder–decoder structure with two down-sampling convolutional layers, nine ResNet blocks (with instance normalization, ReLU, and dropout), and two transposed convolutional up-sampling layers. Both noise and channel embeddings are fused into every ResNet block via feature-wise linear modulation (FiLM), with eight independent parameter sets for fine-grained adaptation. The affine FiLM parameters,

$$(W, b) = \text{Linear}(N^T + C^T),$$

modulate a feature map $F$ as $F' = W \odot F + b$.
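The FiLM fusion step above can be sketched in a few lines of numpy (dimensions are hypothetical; the actual model carries eight independent parameter sets across the ResNet blocks):

```python
import numpy as np

rng = np.random.default_rng(0)
d_emb, channels = 16, 32

# Hypothetical linear layer mapping the summed embeddings to (W, b).
W_proj = rng.normal(size=(d_emb, 2 * channels)) * 0.1

def film(feature_map, noise_emb, chan_emb):
    """FiLM: (W, b) = Linear(N^T + C^T); F' = W * F + b, per channel."""
    params = (noise_emb + chan_emb) @ W_proj        # shape (2 * channels,)
    W, b = params[:channels], params[channels:]
    # Broadcast the per-channel scale/shift over the spatial axes.
    return W[None, None, :] * feature_map + b[None, None, :]

n_t = rng.normal(size=d_emb)                        # noise embedding N^T
c_t = rng.normal(size=d_emb)                        # channel embedding C^T
F = rng.normal(size=(8, 8, channels))               # toy feature map
F_mod = film(F, n_t, c_t)
```

Summing the two embeddings before the linear projection is one simple fusion choice; the paper's ablations compare it against concatenation, addition, and shared-FiLM variants.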

Discriminator

The discriminator $D$ comprises five convolutional layers (kernel $4\times 4$) with LeakyReLU activations and varying strides, uses no batch normalization, and outputs a scalar for “real vs. fake” discrimination. Spectral normalization and a gradient penalty are used for training stability.

Objective Functions

  • Adversarial Loss

$$\mathcal{L}_{A}(G,D) = \mathbb{E}_{X^T}[\log D(X^T)] + \mathbb{E}_{X^S,N^T,C^T}[\log(1 - D(G(X^S,N^T,C^T)))]$$

  • Patch-wise Contrastive Loss (PCL). For $L=4$ intermediate layers, $I$ query patches $\hat z_l^i$ from $G_l(X^G)$, positives $z_l^i$ from $G_l(X^S)$, and $J$ negatives are sampled. Each layer-specific head $F_l$ projects the features, and the contrastive loss is:

$$\mathcal{L}_{\mathrm{PCL}} = \sum_{l=1}^L \sum_{i=1}^I -\log \frac{e^{(\hat z_l^i \cdot z_l^i)/\tau}} {e^{(\hat z_l^i \cdot z_l^i)/\tau} + \sum_{j=1}^J e^{(\hat z_l^i \cdot z_l^j)/\tau}},$$

applied to both $(X^G,X^S)$ and $(X^G,X^T)$ pairs.

  • Noise and Channel Reconstruction Losses. $\mathcal{L}_{\mathrm{NR}}$ and $\mathcal{L}_{\mathrm{CC}}$, as defined above.
  • Overall Loss

$$\mathcal{L}_{\mathrm{Overall}} = \mathcal{L}_A + \mathcal{L}_{\mathrm{PCL}}(X^S) + \mathcal{L}_{\mathrm{PCL}}(X^T) + \lambda_{\mathrm{NR}} \mathcal{L}_{\mathrm{NR}} + \lambda_{\mathrm{CC}} \mathcal{L}_{\mathrm{CC}},$$

with $\lambda_{\mathrm{NR}} = \lambda_{\mathrm{CC}} = 0.5$.
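The PCL term is a standard InfoNCE objective over patch features. A single-layer numpy sketch with toy unit-normalized features (the temperature value here is an assumption, not taken from the paper):

```python
import numpy as np

def patch_nce(queries, positives, negatives, tau=0.07):
    """InfoNCE over I query patches: each query z_hat^i is pulled toward
    its positive z^i and pushed from J negatives z^j."""
    loss = 0.0
    for q, p, negs in zip(queries, positives, negatives):
        pos = np.exp(q @ p / tau)
        neg = np.exp(negs @ q / tau).sum()    # negs: (J, d)
        loss += -np.log(pos / (pos + neg))
    return loss / len(queries)

rng = np.random.default_rng(0)
def unit(v):
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

I, J, d = 4, 8, 32
z_hat = unit(rng.normal(size=(I, d)))                  # queries from G_l(X^G)
z_pos = unit(z_hat + 0.05 * rng.normal(size=(I, d)))   # well-aligned positives
z_neg = unit(rng.normal(size=(I, J, d)))               # random negatives
loss = patch_nce(z_hat, z_pos, z_neg)
```

The loss is small when each query is far more similar to its positive than to any negative, which is exactly the patch-level correspondence the PCL term enforces.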

4. Dynamic Stochastic Perturbation and Regularization

URSA-GAN introduces dynamic stochastic perturbation by injecting Gaussian noise into the target embeddings during training, regularizing the generator and promoting robustness to distributional shift. For each target embedding $e\in\{N^T,C^T\}$, a perturbed embedding is formed as $\tilde e = e + \epsilon$, with $\epsilon \sim \mathcal{N}(0, \sigma^2 I)$ and $\sigma \in [0.1, 0.4]$ (empirically optimal at $\sigma\approx 0.2$). This strategy exposes the generator to a continuum of nearby domain variations, preventing overfitting to any single embedding realization. Ablation results confirm that omitting this perturbation causes systematic degradation across both ASR and SE metrics.
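The perturbation itself is a one-liner; a numpy sketch using the paper's empirically optimal $\sigma = 0.2$:

```python
import numpy as np

def perturb_embedding(e, sigma=0.2, rng=None):
    """Dynamic stochastic perturbation: e_tilde = e + eps, eps ~ N(0, sigma^2 I).
    Applied independently to N^T and C^T at each training step."""
    if rng is None:
        rng = np.random.default_rng()
    return e + rng.normal(scale=sigma, size=e.shape)

rng = np.random.default_rng(0)
n_t = rng.normal(size=16)  # a target noise embedding N^T
samples = np.stack([perturb_embedding(n_t, 0.2, rng) for _ in range(2000)])
mean_drift = np.abs(samples.mean(axis=0) - n_t).max()  # stays near n_t
spread = samples.std(axis=0).mean()                    # empirically close to sigma
```

Because a fresh $\epsilon$ is drawn per step, the generator never sees the same conditioning vector twice, which is what turns 40 target utterances into a usable continuum of domain variants.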

5. Training Procedure

URSA-GAN operates in an unpaired, domain-adaptation setting with the following protocol:

  • Data Regime
    • Source domain data $X^S$ comprises clean or multi-source speech, typically condenser-microphone recordings.
    • Target domain data $X^T$ is limited to 40 unlabeled utterances per domain (e.g., HAT, VBD test sets).
    • Each epoch uses an equal ratio of source and target samples.
  • Optimization
    • 400 training epochs with the Adam optimizer (learning rate $2 \times 10^{-4}$).
    • Discriminator: no batch normalization. Generator: instance normalization.
    • Patch-wise sampling with $I=256$ queries per layer.
    • Loss coefficients: $\lambda_{\mathrm{NR}} = \lambda_{\mathrm{CC}} = 0.5$.
  • Data Simulation. After convergence, $G$ generates large synthetic datasets from the full source data with random target-embedding perturbations, providing paired data for downstream ASR/SE fine-tuning.
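The simulation stage above can be sketched as follows, with a stub in place of the trained generator and placeholder embedding banks (the real pipeline runs the GAN over the full source corpus):

```python
import numpy as np

rng = np.random.default_rng(0)

def generator_stub(x_src, n_emb, c_emb):
    """Placeholder for the trained generator G(X^S, N^T, C^T)."""
    return x_src + 0.1 * (n_emb.mean() + c_emb.mean())

def simulate_target_data(source_utts, noise_bank, chan_bank, sigma=0.2):
    """For each source utterance, draw target embeddings at random and
    perturb them (Section 4) before synthesis, yielding (clean, degraded)
    pairs for downstream ASR/SE fine-tuning."""
    pairs = []
    for x in source_utts:
        n = noise_bank[rng.integers(len(noise_bank))]
        c = chan_bank[rng.integers(len(chan_bank))]
        n = n + rng.normal(scale=sigma, size=n.shape)  # random perturbation
        c = c + rng.normal(scale=sigma, size=c.shape)
        pairs.append((x, generator_stub(x, n, c)))
    return pairs

source = [rng.normal(size=(50, 80)) for _ in range(5)]  # toy spectrograms
noise_bank = rng.normal(size=(40, 16))  # embeddings of 40 target utterances
chan_bank = rng.normal(size=(40, 16))
paired = simulate_target_data(source, noise_bank, chan_bank)
```

The clean source signal and its degraded counterpart form the paired supervision that the target domain itself never provided.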

6. Evaluation and Empirical Results

URSA-GAN was rigorously evaluated on compound and isolated domain shift scenarios across multiple benchmarks:

| Scenario | Baseline | URSA-GAN | Improvement |
|---|---|---|---|
| HAT-ESC (noise+channel, Whisper/DEMUCS) | CER = 32.43%, PESQ = 1.99 | CER = 27.19%, PESQ = 2.30 | −16.16% rel. CER / +15.58% rel. PESQ |
| HAT (channel, Whisper Tiny) | CER = 10.24% | CER = 8.14% | −20.51% rel. CER |
| TAT (channel, Whisper Tiny) | CER = 12.76% | CER = 11.50% | −9.87% rel. CER |
| VBD (noise, DEMUCS) | PESQ = 3.05, STOI = 95.2% | PESQ = 3.16, STOI = 95.3% | +0.11 PESQ / +0.1% STOI |
| MOS (simulated speech, HAT/TAT/VBD) | 2.90 / 2.55 / 1.49 | 4.06 / 3.09 / 2.51 | +1.16 / +0.54 / +1.02 |

Further, all Whisper variants benefit from a 10–25% relative CER reduction after URSA-GAN adaptation. Statistical significance was assessed via Friedman and Nemenyi tests.

  • Ablation Studies
    • Removing channel or noise embeddings causes substantial performance regressions.
    • Omitting stochastic perturbation, noise reconstruction, or using alternative fusion (concat/add/shared FiLM) reduces performance.
    • BEATs outperforms WavLM and Whisper as noise encoder for both CER and PESQ.
  • Embedding Visualization. UMAP projections indicate well-separated clusters for novel noise and channel types, evidencing the discriminative utility of the fine-tuned BEATs and MFA-Conformer embeddings.

7. Limitations and Future Directions

URSA-GAN requires careful GAN training (hyperparameter tuning, mode collapse mitigation). The use of large pre-trained encoders (BEATs, MFA-Conformer) imposes non-trivial training costs. Ablations confirm that both encoders and reconstruction losses are indispensable, with stochastic perturbation conferring additional robustness.

Anticipated research directions include:

  • Domain-aware or self-supervised pre-training for encoders,
  • Replacing the GAN backbone with diffusion or score-based models,
  • End-to-end joint optimization alongside ASR/SE objectives,
  • Extension to dynamically varying noise/channel and real-time deployment scenarios.

These avenues suggest that domain-aware generative adaptation, as operationalized by URSA-GAN, presents a comprehensive solution to mismatched environment robustness in speech systems while opening pathways to further improvements in generalization under real-world conditions (Wang et al., 4 Feb 2026).
