
MoE-Enhanced Speech-Conditioned LLM

Updated 5 January 2026
  • MoE-enhanced speech-conditioned LLMs are models that integrate sparsely activated expert modules to improve performance across speech-to-text tasks.
  • They utilize specialized architectures like Sparse Mixture of Projectors, Hierarchical Mixture of LoRA Experts, and prompt-aware connectors to balance accuracy and computational cost.
  • Empirical evaluations show reduced word error rates and improved scalability, demonstrating robustness in multi-modal and multi-domain applications.

A Mixture of Experts (MoE) Enhanced Speech-Conditioned LLM integrates sparse, selectively activated expert modules within or adjacent to a speech-to-text LLM to boost efficiency, scalability, and robustness across Automatic Speech Recognition (ASR), Audio-Visual Speech Recognition (AVSR), code-switching, multi-accent recognition, and related speech-language tasks. This approach exploits conditional computation and specialization among experts—often implemented as small, gated feedforward networks or low-rank adapters—while leveraging the generalization capacity and knowledge base of frozen or lightly-adapted LLMs (Cappellazzo et al., 20 May 2025, Cappellazzo et al., 5 Oct 2025, Mu et al., 12 Jul 2025, Mu et al., 2024).

1. Fundamental Architectures and Design Patterns

MoE-enhanced speech-conditioned LLMs interpose expert modules via several architectural variants, each tailored for specific problems in AVSR, code-switching, accent adaptation, and multi-task speech understanding:

  • Sparse Mixture of Projectors (SMoP) replaces the dense linear projection bridging audio/video encoders and LLMs with multiple expert projectors. A router (or routers) selects the top-K experts for each input token, activating only a sparse subset to maintain per-token computational cost (Cappellazzo et al., 20 May 2025).
  • Hierarchical Mixture of LoRA Experts (HDMoLE) generalizes low-rank adaptation by introducing multiple LoRA experts (one per accent, language, or domain) into every LLM layer. Hierarchical routing (global: accent/domain; local: per-layer) and dynamic thresholds control expert selection, mitigating catastrophic forgetting and adapting to fine-grained attributes (Mu et al., 2024, Mu et al., 12 Jul 2025).
  • Prompt-aware MoE Connectors and Task-Conditional MoE Fusion use multiple audio encoders or expert feature fusers, dynamically routed based on explicit task prompts or recognized domains to supply the most relevant features for ASR, speaker verification, or audio captioning (Shan et al., 21 Feb 2025).
  • Matryoshka MoE (MoME) harmonizes MoE routing with Matryoshka Representation Learning. Experts are reused across multiple token granularities, supporting elastic compression and dynamic tradeoff between runtime cost and accuracy (Cappellazzo et al., 5 Oct 2025).
  • MoE Connectors for Code-Switching bring together language-specialized and joint experts, orchestrated through a two-stage progressive training scheme. Initial alignment is language-specific, then joint activation and LoRA adaptation enable cross-lingual synergy (Zhang et al., 2024).

These architectures share common traits: frozen or lightly-adapted LLM backbones, expert selection via router networks (softmax + TopK or thresholded gating), and auxiliary regularizers to balance load and maintain expert diversity.
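The shared pattern — a router selecting a sparse subset of expert projectors between the encoder and the LLM — can be illustrated with a minimal numpy sketch. All dimensions, weight shapes, and the renormalization step are illustrative assumptions, not the exact formulation of any cited paper:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def smop_connector(x, router_w, expert_ws, top_k=2):
    """Sparse mixture of projectors: route each encoder token to its
    top-K expert projectors and mix their outputs.
    x: (T, d_enc); router_w: (d_enc, E); expert_ws: list of (d_enc, d_llm)."""
    probs = softmax(x @ router_w)                  # (T, E) routing probabilities
    top = np.argsort(-probs, axis=-1)[:, :top_k]   # indices of top-K experts
    out = np.zeros((x.shape[0], expert_ws[0].shape[1]))
    for t in range(x.shape[0]):
        w = probs[t, top[t]]
        w = w / w.sum()                            # renormalize over selected K
        for k, e in enumerate(top[t]):             # only K of E experts run per token
            out[t] += w[k] * (x[t] @ expert_ws[e])
    return out
```

Because only `top_k` projectors execute per token, per-token compute stays close to a single dense projection while total capacity grows with the number of experts.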

2. Mathematical Formulation and Routing Mechanisms

A central element of MoE-based speech-to-text models is the routing function $g(x)$, which determines, for each input $x$, which experts are activated and how their outputs are weighted. Let $x \in \mathbb{R}^{d_{in}}$ denote an input token (audio, video, or LLM hidden state).

  • Sparse Top-K Routing (SMoP, MoME):

$$g(x)_i = \begin{cases} 1 & \text{if } i \in \mathrm{TopK}(W_r x + b_r) \\ 0 & \text{otherwise} \end{cases} \qquad i = 1, \ldots, E$$

  • Softmax/Thresholded Gating (HDMoLE):

$$P_g(x) = \mathrm{Softmax}(\mathrm{Router}_G(X_{in}))$$

$$P_l = \mathrm{Softmax}(W_g H_{in})$$

Gates are thresholded:

$$m_g^i = \mathbf{1}\{P_g^i \ge \tau_g\} \qquad P_{ga}^i = \frac{m_g^i P_g^i}{\sum_j m_g^j P_g^j} \times \tau_g$$

$$P_{la}^i = \frac{m_l^i P_l^i}{\sum_j m_l^j P_l^j} \times \tau_l$$

The final expert weight is $P_a^i = P_{ga}^i + P_{la}^i$.
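The thresholded-gate arithmetic above can be sketched directly in numpy. The threshold values and the fallback when no gate clears the threshold are illustrative assumptions:

```python
import numpy as np

def thresholded_gate(p, tau):
    """Keep gates with probability >= tau, renormalize them, rescale by tau:
    m^i = 1{P^i >= tau},  P_a^i = m^i P^i / sum_j m^j P^j * tau."""
    m = (p >= tau).astype(float)            # hard mask over experts
    s = (m * p).sum()
    if s == 0:                              # assumed fallback: keep the argmax expert
        m = np.zeros_like(p)
        m[p.argmax()] = 1.0
        s = p.max()
    return m * p / s * tau

def hdmole_weights(p_global, p_local, tau_g=0.5, tau_l=0.5):
    """Final expert weight P_a^i = P_ga^i + P_la^i from global and local gates."""
    return thresholded_gate(p_global, tau_g) + thresholded_gate(p_local, tau_l)
```

With $\tau_g = \tau_l = 0.5$ the combined weights sum to $1$ whenever both gates keep at least one expert, so the global (accent/domain) and local (per-layer) decisions contribute equally.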

  • Expert Networks: Each expert $f_i(x)$ is typically a two-layer MLP or a low-rank update of the base LLM weights.

$$f_i(x) = W_{i,2}\,\sigma(W_{i,1} x + b_{i,1}) + b_{i,2}$$

or for adapters:

$$E_n(h) = W_{up}^n\,\phi(W_{down}^n h + b_{down}^n) + b_{up}^n$$

  • Output Aggregation:

$$y = \sum_{i=1}^{E} g(x)_i\, f_i(x)$$

(The sum is sparse when $K \ll E$.)

This structure enables both per-token and per-layer conditional computation, reducing average compute while scaling model capacity.
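The routing, expert, and aggregation formulas above compose as follows. This is a minimal numpy sketch with binary top-$K$ gates and two-layer MLP experts; dimensions and the ReLU choice for $\sigma$ are assumptions for illustration:

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def moe_forward(x, W_r, b_r, experts, top_k=2):
    """y = sum_i g(x)_i f_i(x) with binary top-K routing.
    x: (d_in,); W_r: (E, d_in); experts: list of (W1, b1, W2, b2) MLPs."""
    logits = W_r @ x + b_r                    # router logits, shape (E,)
    selected = np.argsort(-logits)[:top_k]    # TopK(W_r x + b_r)
    y = np.zeros(experts[0][2].shape[0])
    for i in selected:                        # g(x)_i = 1 only for selected i
        W1, b1, W2, b2 = experts[i]
        y += W2 @ relu(W1 @ x + b1) + b2      # f_i(x) = W2 sigma(W1 x + b1) + b2
    return y
```

Only the $K$ selected experts execute, so average FLOPs scale with $K$, not with the total expert count $E$.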

3. Configurations, Training, and Specialization

Key configuration strategies include:

  • Disjoint vs. Joint Experts/Routers:
    • DEDR (Disjoint Experts, Disjoint Routers): Optimal for AVSR with strong modality-specific encoders; uses separate routers and expert pools for audio and video (Cappellazzo et al., 20 May 2025).
    • JEJR/JEDR: Share experts or routers across modalities; these typically underperform DEDR in robustness and token error rates.
  • MoE Training Objectives:
    • The minimal objective is an autoregressive language-modeling negative log-likelihood.
    • Auxiliary losses: load balancing (Shazeer-style), z-loss (router logit norm penalty), and regularization on LoRA or fusion weights.
    • For code-switching or accented ASR, staged training is used: mono-domain alignment, then multi-domain/fine-tuning with all experts active (Zhang et al., 2024, Mu et al., 12 Jul 2025).
    • Special mechanisms like IDIT (Insertion and Deletion of Interruption Token) enforce tight alignment of output tokens for code-mixed utterances (Zhang et al., 2024).
  • Granularity and Elastic Inference: MRL/Matryoshka pooling and MoME modules allow dynamic adjustment of token compression at inference, sharing routers/experts across granularities for effective cross-scale generalization (Cappellazzo et al., 5 Oct 2025).
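The auxiliary regularizers mentioned above can be sketched as follows. These are the standard Switch-Transformer-style formulations, used here as an illustrative stand-in; the cited papers may use variants:

```python
import numpy as np

def moe_aux_losses(router_logits, top1_assign, num_experts):
    """Illustrative MoE regularizers.
    router_logits: (T, E) raw router outputs per token;
    top1_assign: (T,) index of each token's top-1 expert."""
    m = router_logits.max(-1, keepdims=True)
    probs = np.exp(router_logits - m)
    probs /= probs.sum(-1, keepdims=True)
    # Load balancing: E * sum_i (fraction of tokens sent to i) * (mean prob of i).
    # Equals 1 under perfectly uniform routing, grows when experts collapse.
    frac_tokens = np.bincount(top1_assign, minlength=num_experts) / len(top1_assign)
    load_balance = num_experts * (frac_tokens * probs.mean(axis=0)).sum()
    # z-loss: squared log-partition of the router, penalizing large logits.
    lse = m[:, 0] + np.log(np.exp(router_logits - m).sum(-1))
    z_loss = (lse ** 2).mean()
    return load_balance, z_loss
```

In training, these terms are added to the language-modeling loss with small coefficients so they steer routing without dominating the task objective.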

4. Empirical Evaluations and Comparative Results

Benchmarking across AVSR, code-switching, multi-accent, and multi-task datasets demonstrates the consistent superiority of MoE-enhanced LLMs over dense or fixed-adapter baselines given similar or reduced computational budgets:

| Model | Task | WER (LRS3/LibriSpeech/etc.) | Relative Reduction | Notes |
|---|---|---|---|---|
| Llama-SMoP DEDR | AVSR | 0.96% (Llama-3.1 8B) | 12–20% vs. baseline | Best across all tested SNRs and noise conditions (Cappellazzo et al., 20 May 2025) |
| MoME ($K=2$, $N_r=4$, $N_s=2$) | AVSR | 1.5% (LRS3, 4/2 pool) | 37.5% vs. baseline | Consistent gains at all compression scales (Cappellazzo et al., 5 Oct 2025) |
| HDMoLE GER | Accented ASR | 2.07% avg. (multi-accent) | 67.35% vs. Whisper | Near full fine-tuning performance with ~1/10 of the parameters (Mu et al., 12 Jul 2025, Mu et al., 2024) |
| SC-LLM+MoE+IDIT | Code-switch ASR | 7.76% MER | 10–20% over baselines | Outperforms multi-billion-parameter baselines (Zhang et al., 2024) |
| PaM | Multi-task | 3.65% (Libri-C), 42.8% (SNV) | 20–40% reductions | Surpasses all single/multi-encoder baselines (Shan et al., 21 Feb 2025) |

All results reported trace directly to the respective arXiv sources.

5. Analysis of Efficiency, Scalability, and Robustness

  • Computational Cost: Sparse MoE variants such as SMoP maintain per-token FLOPs nearly identical to their dense counterparts by activating only $K$ of $E$ experts ($K=2$, $E=4$ recommended) (Cappellazzo et al., 20 May 2025).
  • Parameter Efficiency: Methods like HDMoLE result in only ~10% of trainable parameters versus full fine-tuning, by restricting training to LoRA/expert weights and routers (Mu et al., 2024). MoME and related architectures use minimal adapter capacity for substantial accuracy gains (Cappellazzo et al., 5 Oct 2025).
  • Noise Robustness: Both SMoP-DEDR and MoME show lower word error rates under low SNR and additive babble compared to LLM-based and attention-fusion baselines. This suggests MoE routing enables dynamic capacity allocation where signal quality is low (Cappellazzo et al., 20 May 2025, Cappellazzo et al., 5 Oct 2025).
  • Adaptability: MoME architectures grant explicit control over runtime/accuracy tradeoffs by selecting token compression rates at inference, benefiting edge deployment (Cappellazzo et al., 5 Oct 2025).
  • Mitigation of Forgetting: HDMoLE's hierarchical routing and dynamic thresholding prevent any single expert from dominating, thus maintaining general-domain performance while specializing for domains such as accents (Mu et al., 2024).

6. Practical Implementation and Recommendations

Designing a high-performing MoE-enhanced speech-conditioned LLM requires:

  • Expert and Router Configuration:
    • Prefer disjoint modality-specific experts/routers in multimodal (audio-video) settings.
    • Use $E \approx 3$–$4$ experts per modality and $K=2$ activated per token for SMoP or similar modules.
    • For accented or multi-domain ASR, deploy one LoRA expert per domain and combine using HDMoLE's hierarchical gating.
  • Auxiliary Regularization:
    • Employ load-balancing losses ($\alpha_b \approx 0.01$) and z-losses ($\alpha_z \approx 0.001$) for stable expert utilization.
  • Training Pipeline:
    • Freeze backbone encoders and the LLM except for expert/router/adapter parameters.
    • Adopt staged specialty-to-generalist training in language or accent-diverse setups.
  • Inference Procedure:
    • Precompute encoder tokens, perform routing for each token, run only the selected experts, fuse outputs, and continue standard LLM decoding.
    • In MoME/MRL systems, select desired compression rate per resource constraint.
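The inference steps above can be sketched end to end. Simple mean-pooling over adjacent tokens stands in for the papers' Matryoshka-style pooling, and `route_and_fuse` is a hypothetical callback wrapping the router and experts:

```python
import numpy as np

def compress_tokens(tokens, rate):
    """Mean-pool groups of `rate` adjacent encoder tokens (elastic granularity;
    an assumed simplification of Matryoshka pooling)."""
    T = (len(tokens) // rate) * rate
    return tokens[:T].reshape(-1, rate, tokens.shape[1]).mean(axis=1)

def moe_infer(encoder_tokens, rate, route_and_fuse):
    """Precompute encoder tokens, pick a compression rate for the resource
    budget, then route each pooled token through the sparse experts before
    standard LLM decoding (not shown)."""
    pooled = compress_tokens(encoder_tokens, rate)   # fewer tokens => less compute
    return np.stack([route_and_fuse(t) for t in pooled])
```

Raising `rate` at inference trades accuracy for speed without retraining, since routers and experts are shared across granularities.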

These recommendations provide a reproducible approach for integrating sparse, efficient MoE mechanisms within speech-conditioned LLMs for AVSR and related tasks.

7. Interpretability, Specialization, and Future Directions

MoE-enhanced frameworks exhibit a high degree of interpretability: expert activation patterns are meaningful across domain, modality, or granularity (e.g., similar tokens at multiple compressions select similar experts) (Cappellazzo et al., 5 Oct 2025). This affirms that such architectures can learn stable, reusable specialization.

A plausible implication is that adding further experts for new domains (e.g., additional accents or languages) can be achieved modularly, without detrimental impact on legacy performance, especially when using hierarchical gating and thresholded routing (Mu et al., 12 Jul 2025, Mu et al., 2024). Furthermore, explicit prompt-aware routing signals support extensible multi-task deployment (Shan et al., 21 Feb 2025).

Key future research areas include automatic expert discovery, extension to more complex code-switching or high-variability domains, refined hard routing and auxiliary supervision for expert gates, and further integration with end-to-end multimodal LLMs. The demonstrated scaling, interpretability, and efficiency of these approaches motivate continued exploration in both foundational and practical speech-LM systems.
