Speech-to-LLM Projector Overview
- Speech-to-LLM projector is a neural module that converts continuous speech encoder outputs into an LLM's embedding space to enable seamless cross-modal processing.
- It employs diverse architectures—from simple linear/MLP and 1D CNNs to Transformers, MoE, and query-based designs—to optimize alignment, computational efficiency, and task transferability.
- By integrating speech with frozen LLMs using alignment losses like cross-entropy and cosine similarity, it enhances performance in ASR, spoken QA, translation, and speech synthesis tasks.
A Speech-to-LLM projector is a neural module designed to map continuous speech representations—typically the outputs of a frozen or pre-trained speech encoder—into the input space of an LLM so that the downstream LLM can perform tasks (e.g., ASR, spoken QA, translation, speech-to-speech generation) in a cross-modal pipeline. This adaptation strategy is a core enabler for LLMs to directly process raw audio, and the design of the projector governs cross-modal alignment, computational efficiency, and task transferability (Ma et al., 2024, Fong et al., 7 Aug 2025, Mohapatra et al., 28 Jan 2026).
1. Architectural Taxonomy and Projector Variants
The architectural landscape of Speech-to-LLM projectors comprises several broad classes:
- Simple Linear/MLP Projectors: The most common instantiation, as in SLAM-ASR, projects stacked or compressed speech encoder features (e.g., concatenated k frames) into the LLM embedding space via a single affine or small two-layer MLP with ReLU nonlinearity (Ma et al., 2024, Fong et al., 7 Aug 2025, Shao et al., 31 Dec 2025, Cappellazzo et al., 2024).
- CNN-based Projectors: Some systems, such as GOAT-TTS, interpose temporal 1-D convolutions (over frames) to capture local structure before a final linear projection (Song et al., 15 Apr 2025).
- Transformer-enhanced Projectors: SpeechMapper and Freeze-Omni introduce blocks of Transformer encoder layers—with or without convolutional temporal pooling—prior to the output mapping to support deeper cross-frame modeling (Mohapatra et al., 28 Jan 2026, Wang et al., 2024).
- Mixture-of-Experts (MoE) Projectors: Recent systems address cross-lingual diversity by dispatching speech tokens through a stabilized MoE, using either soft gating (SMEAR-MoE) or sparsely-gated expert pools (Llama-SMoP DEDR), where each expert is typically a small MLP (Pandey et al., 27 Jan 2026, Cappellazzo et al., 20 May 2025).
- CTC/Token-Posterior Projectors: LegoSLM dispenses with explicit projection and instead reconstructs speech embeddings as weighted sums over the LLM vocabulary embeddings, using CTC output posteriors to fuse acoustic-lexical and LLM information (Ma et al., 16 May 2025).
- Q-Former/Query-based Projectors: For more abstract semantic alignment (notably in speech-to-speech translation), Q-Former modules perform cross-attention from a set of learnable query tokens to the full speech encoder output, then project to the LLM embedding dimension (Arya et al., 22 Jan 2026).
- Dual Encoder and Language-Adaptive Projectors: Multilingual systems may leverage parallel encoders (e.g., Whisper + MMS) fused by a per-language weight gating mechanism, followed by a language-adapted connector (Xue et al., 2024).
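The most common of these variants—the linear/MLP projector with frame stacking—can be sketched in a few lines of PyTorch. This is a minimal illustration, not the exact SLAM-ASR implementation; the dimensions (`d_enc`, `d_llm`, `k`, `hidden`) are illustrative assumptions:

```python
import torch
import torch.nn as nn

class FrameStackMLPProjector(nn.Module):
    """Sketch of an MLP projector: stack k adjacent encoder frames to reduce
    the sequence length, then map into the LLM embedding space with a small
    two-layer MLP with ReLU. All dimensions are illustrative assumptions."""

    def __init__(self, d_enc=1024, d_llm=4096, k=5, hidden=2048):
        super().__init__()
        self.k = k
        self.mlp = nn.Sequential(
            nn.Linear(d_enc * k, hidden),
            nn.ReLU(),
            nn.Linear(hidden, d_llm),
        )

    def forward(self, h):                      # h: (B, T, d_enc)
        B, T, d = h.shape
        T = (T // self.k) * self.k             # drop trailing frames
        h = h[:, :T].reshape(B, T // self.k, d * self.k)  # k-frame stacking
        return self.mlp(h)                     # (B, T // k, d_llm)

z = FrameStackMLPProjector()(torch.randn(2, 103, 1024))
print(z.shape)  # torch.Size([2, 20, 4096])
```

Note that frame stacking both compresses the speech sequence (lowering LLM compute) and widens the input to the MLP, which is why even a single affine map over stacked frames can suffice.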
2. Mathematical Formalization
The objective of a Speech-to-LLM projector is to generate a sequence of vectors that can be prepended to, or injected into, the LLM input sequence, enabling the LLM to condition on speech for arbitrary downstream tasks. Canonical formulations include:
- Linear/MLP: $Z = \mathrm{MLP}(H)$, where $H$ is the (optionally $k$-frame-stacked) encoder output and the MLP is typically $\mathrm{MLP}(H) = W_2\,\mathrm{ReLU}(W_1 H + b_1) + b_2$ (Ma et al., 2024, Fong et al., 7 Aug 2025).
- CNN-Based: $H' = \mathrm{Conv1D}(H)$, with a final linear map $Z = W H' + b$ (Song et al., 15 Apr 2025).
- MoE: $Z = E_{\bar{\theta}}(H)$, where the merged parameters $\bar{\theta} = \sum_{m=1}^{M} g_m \theta_m$ are soft mixtures of expert parameters weighted by a gating network's utterance-level aggregate $g = (g_1, \dots, g_M)$ (Pandey et al., 27 Jan 2026).
- CTC/Posterior-weighted: $Z_t = P_t^{\top} E$, where $P_t$ is the CTC posterior over the vocabulary at frame $t$ and $E$ is the LLM's embedding table (Ma et al., 16 May 2025).
- Transformer/Adapter: $Z = \mathrm{Transformer}(H + P)$ for encoder outputs $H$ and an optional positional embedding $P$ (Wang et al., 2024).
- Q-Former: Cross-attention blocks iteratively pool encoder features into learnable queries, followed by projection into LLM space (Arya et al., 22 Jan 2026).
The LLM is then conditioned on the concatenation $[Z; E_{\mathrm{text}}]$ of projected speech vectors and text token embeddings, with position embeddings applied.
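The CTC/posterior-weighted formulation is simple enough to show concretely. The sketch below assumes a CTC head producing per-frame logits over the LLM vocabulary; the `temperature` knob (used by LegoSLM to trade off acoustic versus language priors) and all dimensions are illustrative:

```python
import torch
import torch.nn.functional as F

def ctc_posterior_embeddings(logits, embed_table, temperature=1.0):
    """LegoSLM-style reconstruction (sketch): each frame's pseudo-embedding is
    the CTC posterior-weighted sum of the LLM's vocabulary embeddings, so no
    learned projection matrix is needed. `temperature` sharpens or flattens
    the posterior, controlling acoustic vs. language-model priors."""
    post = F.softmax(logits / temperature, dim=-1)   # (B, T, |V|) posteriors
    return post @ embed_table                        # (B, T, d_llm)

V, d_llm = 1000, 64                                  # toy vocabulary and width
z = ctc_posterior_embeddings(torch.randn(2, 50, V), torch.randn(V, d_llm))
print(z.shape)  # torch.Size([2, 50, 64])
```

Because each output vector is a convex combination of rows of the LLM's own embedding table, the result lives in the LLM's embedding space by construction, which is what enables hot-swapping encoders and LLMs.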
3. Training Objectives, Losses, and Optimization
The standard training paradigm for projectors is alignment via next-token cross-entropy (CE) loss,
$$\mathcal{L}_{\mathrm{CE}} = -\sum_{t} \log p_{\mathrm{LLM}}(y_t \mid y_{<t}, Z),$$
with only the projector (and optionally the speech encoder) parameters updated; the LLM weights are usually frozen during alignment (Ma et al., 2024, Song et al., 15 Apr 2025, Arya et al., 22 Jan 2026).
- Auxiliary losses:
- Cosine alignment or MSE to ground-truth LLM embeddings (e.g., SpeechMapper) (Mohapatra et al., 28 Jan 2026).
- CTC loss at the speech encoder or connector output, especially in multilingual or fusion settings (Xue et al., 2024).
- Load-balancing losses for MoE routing, as in SMEAR-MoE and SMoP (Pandey et al., 27 Jan 2026, Cappellazzo et al., 20 May 2025).
- Multi-task or chain-of-thought objectives in TTS/S2ST, as in Spectron (Nachmani et al., 2023).
Optimization typically uses AdamW with modest, model-specific hyperparameters (e.g., 1k steps of linear warmup, a fixed learning rate, batch sizes of 4–128) (Ma et al., 2024, Shao et al., 31 Dec 2025, Mohapatra et al., 28 Jan 2026).
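The alignment CE above reduces to masking out the speech-prefix positions before computing next-token loss. The sketch below assumes a speech prefix of length P prepended to gold text embeddings, with `logits` coming from a frozen LLM forward over the concatenated sequence; all names and shapes are illustrative:

```python
import torch
import torch.nn.functional as F

def alignment_ce(logits, labels, prefix_len):
    """Next-token CE on text positions only (sketch).
    logits: (B, P + L, V) LLM outputs over [speech prefix; text embeddings];
    labels: (B, L) gold token ids; prefix_len: P, length of the speech prefix."""
    # Position P-1 (last speech frame) predicts y_0; position P+j predicts y_{j+1},
    # so the logits that predict the L text tokens are positions P-1 .. P+L-2.
    text_logits = logits[:, prefix_len - 1 : -1]
    return F.cross_entropy(
        text_logits.reshape(-1, text_logits.size(-1)),
        labels.reshape(-1),
    )

B, P, L, V = 2, 7, 5, 11                                # toy shapes
loss = alignment_ce(torch.randn(B, P + L, V), torch.randint(0, V, (B, L)), P)
```

Gradients flow through the frozen LLM into the projector, which is why this recipe works with only the projector's parameters registered in the optimizer.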
4. Empirical Insights: Alignment, Efficiency, and Generalization
The performance and practical adoption of different projector strategies are governed by several key findings:
- Sufficiency of Simple Architecture: Linear or shallow MLP projectors are sufficient for SOTA ASR on standard benchmarks. Heavyweight or deep projectors (Transformer, Q-Former, CNN) do not necessarily outperform simple MLPs for lexical tasks (Ma et al., 2024, Arya et al., 22 Jan 2026, Cappellazzo et al., 20 May 2025).
- Rapid Cross-Modal Alignment: Freezing both the LLM and the speech encoder enables the alignment (as measured by next-token accuracy) to “emerge” rapidly (as few as 1–2k steps), whereas training/fine-tuning encoders delays convergence (Ma et al., 2024).
- Transfer and Pretraining: Projectors pretrained on related, high-resource language data transfer effectively to low-resource settings, often requiring an order-of-magnitude less data to reach comparable WERs (e.g., 10–15 h fine-tuning vs. 100–200 h scratch) (Fong et al., 7 Aug 2025). Multilingual pretraining with balanced language mixes is most robust.
- Zero-shot Modularity: Architectures such as LegoSLM or standardized CTC head approaches support modular “hot-swap” of encoders and LLMs, enabling flexible, zero-shot deployment via a compatible projection protocol (Ma et al., 16 May 2025).
- Prompt Sensitivity: LLM-based ASR is highly sensitive to textual prompts (“prompt engineering”). Learnable prompt projector modules can reduce this sensitivity and consistently improve WER (Burdisso et al., 28 Jan 2026).
A representative table of projector architectures, as referenced in primary literature:
| Projector Type | Core Components | Example Papers |
|---|---|---|
| Linear/MLP | Affine map or small MLP (k-frame stacking, ReLU) | (Ma et al., 2024, Fong et al., 7 Aug 2025) |
| 1D CNN | 3× Conv1D + linear | (Song et al., 15 Apr 2025) |
| Transformer stack | Conv1D, 6×Transformer, FC/Linear | (Mohapatra et al., 28 Jan 2026, Wang et al., 2024) |
| MoE (Soft/Sparse) | Expert MLPs + (soft/hard) routing | (Pandey et al., 27 Jan 2026, Cappellazzo et al., 20 May 2025) |
| Q-Former | learnable queries, cross-attn, projection | (Arya et al., 22 Jan 2026) |
| CTC/posterior | CTC over vocab, weighted embedding sum | (Ma et al., 16 May 2025) |
| Language-adapted | Per-language gating/fusion, adapters | (Xue et al., 2024) |
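The query-based row of the table can be sketched as a single cross-attention block: a fixed set of learnable queries attends to the full encoder output, yielding a fixed-length speech prefix. Real Q-Formers stack several such blocks with self-attention; this single-block version and its dimensions are illustrative assumptions:

```python
import torch
import torch.nn as nn

class MiniQFormer(nn.Module):
    """One-block query-based projector (sketch): N learnable query vectors
    cross-attend to the speech encoder output and are projected to the LLM
    dimension, producing a fixed-length prefix regardless of audio length."""

    def __init__(self, d_enc, d_llm, n_queries=16, n_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(n_queries, d_enc) * 0.02)
        self.xattn = nn.MultiheadAttention(d_enc, n_heads, batch_first=True)
        self.proj = nn.Linear(d_enc, d_llm)

    def forward(self, h):                           # h: (B, T, d_enc)
        q = self.queries.expand(h.size(0), -1, -1)  # (B, N, d_enc)
        pooled, _ = self.xattn(q, h, h)             # queries attend to speech
        return self.proj(pooled)                    # (B, N, d_llm)

z = MiniQFormer(64, 32, n_queries=8, n_heads=4)(torch.randn(2, 40, 64))
print(z.shape)  # torch.Size([2, 8, 32])
```

The fixed output length is the design trade-off: it gives compact semantic pooling (useful for translation) at the cost of the frame-level temporal alignment that lexical tasks like ASR rely on.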
5. Multilinguality, Modality Adaptation, and Mixture-of-Experts
Scaling projectors to cross-lingual or multimodal (e.g., audio-visual) settings introduces several innovations:
- Dual Encoders and Language-weighted Fusion: Ideal-LLM fuses Whisper and MMS encoder representations via a per-language weight selector, optimizing for language-specific projection and mitigating performance asymmetries inherent to individual pretraining corpora. This yields significant relative WER reductions (up to 32%) over single-encoder connectors (Xue et al., 2024).
- Stabilized MoE Routing: SMEAR-MoE dynamically merges M=4 expert projectors with dense gradient flow via soft utterance-level gating, promoting interpretable linguistic clustering and preventing expert collapse (Pandey et al., 27 Jan 2026). Sparse MoE approaches such as SMoP show practical gains in computational scaling relative to monolithic projectors (Cappellazzo et al., 20 May 2025).
- CTC Alignment and Modularity: LegoSLM achieves language- and domain-agnostic modularity by decoupling speech encoder and LLM projection via CTC posteriors, with a softmax temperature controlling the strength of acoustic versus language priors during decoding (Ma et al., 16 May 2025).
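The soft, utterance-level gating described above can be sketched as follows. For simplicity this version mixes expert *outputs*; SMEAR proper merges expert *parameters* before the forward pass, which coincides only for linear experts. Expert count and dimensions are illustrative:

```python
import torch
import torch.nn as nn

class SoftMoEProjector(nn.Module):
    """Soft-gated MoE projector (sketch): an utterance-level gate produces
    soft weights over M expert MLPs. Every expert receives dense gradients,
    which is the property credited with preventing expert collapse.
    Note: mixes expert outputs, not merged parameters as in SMEAR proper."""

    def __init__(self, d_enc, d_llm, n_experts=4):
        super().__init__()
        self.gate = nn.Linear(d_enc, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_enc, d_llm), nn.ReLU(),
                          nn.Linear(d_llm, d_llm))
            for _ in range(n_experts)
        )

    def forward(self, h):                                    # h: (B, T, d_enc)
        g = torch.softmax(self.gate(h.mean(dim=1)), dim=-1)  # (B, M) per utterance
        outs = torch.stack([e(h) for e in self.experts], 1)  # (B, M, T, d_llm)
        return torch.einsum("bm,bmtd->btd", g, outs)         # soft mixture

z = SoftMoEProjector(64, 32, n_experts=4)(torch.randn(2, 10, 64))
print(z.shape)  # torch.Size([2, 10, 32])
```

Because the gate pools over the whole utterance, routing reflects utterance-level attributes such as language identity, which is consistent with the reported interpretable linguistic clustering of experts.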
6. Training Regimes, Zero-Shot and Task-Generalization
Distinct paradigms for projector optimization have arisen to address data availability and generalization trade-offs:
- Instruction-free Self-supervision: AZeroS uses self-generated supervision from a frozen LLM, removing the need for curated instruction labels and enabling broader task generalization; only the projector’s parameters are updated (Shao et al., 31 Dec 2025).
- Pretraining Without LLM Forward Passes: SpeechMapper separately pretrains projectors to reproduce the LLM’s embedding geometry, relying on MSE and cosine losses, followed by brief in-domain instruction tuning; this alleviates overfitting and improves data- and hardware-efficiency (Mohapatra et al., 28 Jan 2026).
- Unified Encoders and Text-only Supervision: TESU-LLM achieves competitive spoken QA and ASR performance without ever seeing speech data during training, leveraging a frozen speech-text shared encoder and a small trainable projector mapped via MSE (embedding) and cross-entropy (language modeling) losses (Kim et al., 1 Jun 2025).
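The embedding-geometry pretraining used by SpeechMapper (and, similarly, TESU-LLM's MSE term) amounts to regressing projector outputs onto target LLM embeddings without any LLM forward pass. A minimal sketch of such a combined MSE-plus-cosine objective, with an assumed weighting `w_cos`:

```python
import torch
import torch.nn.functional as F

def embedding_geometry_loss(z_pred, e_target, w_cos=1.0):
    """Projector pretraining objective (sketch): match projector outputs to
    target LLM embeddings via MSE plus a per-vector cosine-distance term.
    No LLM forward pass is required. `w_cos` is an assumed weighting."""
    mse = F.mse_loss(z_pred, e_target)
    cos = 1.0 - F.cosine_similarity(z_pred, e_target, dim=-1).mean()
    return mse + w_cos * cos

t = torch.randn(2, 5, 8)
zero = embedding_geometry_loss(t, t)   # identical tensors -> loss ~ 0
```

Combining the two terms targets both the magnitude (MSE) and the direction (cosine) of the embedding geometry, after which only brief instruction tuning through the LLM is needed.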
7. Current Limitations, Ablations, and Future Work
Major open directions and constraints:
- No single universal projector: Monolithic projectors underperform in massively multilingual regimes; language-adaptive or MoE-based designs are favored for scalability and robustness (Pandey et al., 27 Jan 2026, Xue et al., 2024).
- Ablation Gaps: Most studies report only end-to-end benchmarks. Direct ablations isolating projector efficacy (“with/without”, or replacement with random mapping) are rare, with the best available signals coming from WER/COMET/VoiceBench task metrics.
- Prompt Handling and Robustness: Prompt engineering remains brittle. Learnable prompt projectors are effective but might not fully generalize to code-switching, cross-domain, or non-lexical paralinguistics (Burdisso et al., 28 Jan 2026).
- Capacity/Alignment Trade-offs: Higher-capacity projectors (deep MLP, Transformer, Q-Former) converge faster but are more prone to overfit and less robust than lean, linear projectors, especially for speech-to-text tasks where temporal alignment is crucial (Arya et al., 22 Jan 2026).
- Speech Expansion and Downstream Interaction: Systems jointly modeling TTS, AST, and conversation flow (e.g., Freeze-Omni, Spectron) rely on cross-modal projectors capable of back-and-forth flow, but maintain independence of speech and language representations to prevent catastrophic forgetting (Wang et al., 2024, Nachmani et al., 2023).
In sum, the Speech-to-LLM projector is a central module in contemporary spoken language pipelines, and its precise implementation—linear, MoE, Transformer, CTC/posterior—determines the attainable cross-modal generalization, efficiency, and robustness of speech-enhanced LLMs (Ma et al., 2024, Fong et al., 7 Aug 2025, Mohapatra et al., 28 Jan 2026, Pandey et al., 27 Jan 2026, Cappellazzo et al., 20 May 2025, Xue et al., 2024, Shao et al., 31 Dec 2025, Nachmani et al., 2023, Arya et al., 22 Jan 2026, Song et al., 15 Apr 2025, Ma et al., 16 May 2025, Wang et al., 2024).