Speech/Text Reasoning Gap Analysis

Updated 28 January 2026
  • Speech/Text Reasoning Gap is defined as the performance difference between speech and text inputs, typically 10–25 percentage points and exceeding 60 points in extreme cases.
  • Analyses show that while deep layers achieve high directional alignment between speech and text, persistent magnitude discrepancies indicate unresolved modality-specific challenges.
  • Targeted interventions like angle projection on poorly aligned tokens improve performance, though broad length normalization may degrade accuracy by removing useful prosodic cues.

The Speech/Text Reasoning Gap refers to the systematic, measurable shortfall in reasoning, comprehension, and general language understanding performance exhibited by models when processing inputs as speech rather than as text. Despite recent advances in end-to-end Large Speech-Language Models (LSLMs), speech-based inputs continue to yield a significant drop in semantic understanding and complex reasoning compared to equivalent text-based inputs, even when controlling for content and task. This phenomenon has been observed consistently across diverse model architectures, benchmark suites, and alignment protocols, and is attributed to representational misalignment, architectural bottlenecks, and modality-specific challenges (Xiang et al., 14 Oct 2025).

1. Formal Definition and Quantification of the Modality Gap

The Speech/Text Reasoning Gap—often denoted $\Delta$ or "the modality gap"—is operationalized as the absolute degradation in benchmark performance for otherwise identical models when their inputs are cast in speech rather than text form. Let $M^t$ and $M^s$ represent the average scores or accuracies under text and speech inputs, respectively. The gap is then defined as:

$$\Delta = M^t - M^s$$

For instance, across benchmarks such as VoiceBench, Uro-Bench, and VERA, empirical values of $\Delta$ for modern LSLMs fall in the range of $10$–$25$ percentage points (Xiang et al., 14 Oct 2025), with extreme cases (e.g., long-chain math reasoning in VERA) showing gaps over $60$ points (e.g., $74.8\%$ for text vs. $6.1\%$ for voice on math) (Lin et al., 30 Sep 2025). This definition is consistent with other works, which define the modality gap as relative accuracy loss, Modality Recovery Rate (MRR), or performance deltas on multi-hop multimodal tasks (Wang et al., 9 Jan 2026, Kim et al., 22 Aug 2025).
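The definition above is simple arithmetic and can be made concrete in a few lines of Python. Only the VERA math pair ($74.8$ vs. $6.1$) comes from the text; the other benchmark numbers below are illustrative placeholders, not reported results:

```python
# Benchmark scores in percent accuracy. Only the VERA-math pair is from the
# source text; the other values are hypothetical placeholders.
text_scores = {"VoiceBench": 82.0, "Uro-Bench": 74.8, "VERA-math": 74.8}
speech_scores = {"VoiceBench": 65.5, "Uro-Bench": 60.1, "VERA-math": 6.1}

def modality_gap(m_text: float, m_speech: float) -> float:
    """Delta = M^t - M^s, the absolute degradation under speech input."""
    return m_text - m_speech

gaps = {bench: modality_gap(text_scores[bench], speech_scores[bench])
        for bench in text_scores}
# The VERA-math entry reproduces the extreme >60-point gap cited above.
```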

2. Alignment Mechanisms and Layerwise Representation Analysis

Systematic investigation reveals that although speech and text representations can become highly directionally aligned (as measured by cosine similarity) in deep network layers, there remain persistent discrepancies in magnitude (e.g., layerwise Euclidean norm drift). Formally, for layer $l$, mean-pooled speech and text hidden states $\bar{h}_l^s$ and $\bar{h}_l^t$ yield:

$$f_l^{\cos} = \cos(\bar{h}_l^s, \bar{h}_l^t) \quad\text{and}\quad f_l^{dist} = \|\bar{h}_l^s - \bar{h}_l^t\|_2$$

Cosine similarity $f_l^{\cos}$ increases with depth and training, approaching $0.9$ in LoRA-tuned LSLMs, indicating strong directional matching. However, the magnitude gap $f_l^{dist}$ also grows with depth, indicating incompletely resolved modality-specific norms (Xiang et al., 14 Oct 2025).
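Both layerwise diagnostics follow directly from the formulas: mean-pool each modality's hidden states per layer, then compare direction and magnitude. A minimal sketch with NumPy (the hidden-state shapes are assumptions about a typical LSLM, not a specific model's API):

```python
import numpy as np

def layerwise_alignment(h_speech: list, h_text: list):
    """Compute f_l^cos and f_l^dist for each layer.

    h_speech[l]: (T_s, d) array of speech hidden states at layer l
    h_text[l]:   (T_t, d) array of text hidden states at layer l
    """
    cos_sims, dists = [], []
    for hs, ht in zip(h_speech, h_text):
        s_bar = hs.mean(axis=0)  # mean-pool over speech tokens
        t_bar = ht.mean(axis=0)  # mean-pool over text tokens
        cos = float(s_bar @ t_bar /
                    (np.linalg.norm(s_bar) * np.linalg.norm(t_bar)))
        dist = float(np.linalg.norm(s_bar - t_bar))  # Euclidean norm drift
        cos_sims.append(cos)
        dists.append(dist)
    return cos_sims, dists
```

Tracking both lists over depth reproduces the qualitative pattern described above: rising cosine similarity alongside a growing magnitude gap.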

Correlational studies show that both sequence-averaged cosine similarity and Euclidean distance are strongly predictive of the observed modality gap, with $R^2 \approx 0.75$ (cosine) and up to $0.88$ (Euclidean) under LoRA finetuning. Full-parameter tuning weakens these associations, suggesting that low-rank adaptation better preserves representational similarity between modalities (Xiang et al., 14 Oct 2025).
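The reported $R^2$ values correspond to the coefficient of determination of a simple linear fit of the modality gap against an alignment metric across models. A minimal version of that computation (the pairing of models to data points is an assumption about the experimental setup):

```python
import numpy as np

def r_squared(metric: np.ndarray, gap: np.ndarray) -> float:
    """Coefficient of determination for a linear fit gap ~ metric."""
    slope, intercept = np.polyfit(metric, gap, deg=1)
    pred = slope * metric + intercept
    ss_res = np.sum((gap - pred) ** 2)   # residual sum of squares
    ss_tot = np.sum((gap - gap.mean()) ** 2)  # total sum of squares
    return float(1.0 - ss_res / ss_tot)
```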

3. Fine-Grained Token-Level Alignment and Predictive Metrics

Beyond sequence-level averages, token-wise alignment between paired speech and text representations provides even stronger explanatory power. The Alignment Path Score (APS) is computed by, for each text token $j$, identifying the most similar speech token $i_l^*(j)$ at each layer $l$, then aggregating the maximal similarity over all layers and positions:

$$\mathrm{APS}^{(\cdot)} = \frac{1}{L T_t} \sum_{l=1}^{L} \sum_{j=1}^{T_t} A_l^{(\cdot)}[i^*_l(j), j]$$

where $A_l^{(\cdot)}$ is the token-wise similarity matrix with entries $A_l^{\cos}[i,j] = \cos(h_{l,i}^s, h_{l,j}^t)$ (Xiang et al., 14 Oct 2025).
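The cosine variant of APS can be sketched directly from this definition: build the per-layer similarity matrix, take each text token's best-matching speech token, and average over layers and positions. A minimal NumPy sketch, with the same assumed hidden-state shapes as before:

```python
import numpy as np

def alignment_path_score(h_speech: list, h_text: list) -> float:
    """APS (cosine variant): for each text token j and layer l, take the
    best-matching speech token's cosine similarity, then average over
    all L layers and T_t text positions."""
    L = len(h_speech)
    T_t = h_text[0].shape[0]
    total = 0.0
    for hs, ht in zip(h_speech, h_text):
        # Row-normalize so the matrix product yields cosine similarities.
        hs_n = hs / np.linalg.norm(hs, axis=1, keepdims=True)
        ht_n = ht / np.linalg.norm(ht, axis=1, keepdims=True)
        A = hs_n @ ht_n.T                # A[i, j] = cos(h_{l,i}^s, h_{l,j}^t)
        total += A.max(axis=0).sum()     # best speech match per text token
    return total / (L * T_t)
```

For perfectly aligned representations the score is $1.0$; poorly aligned tokens pull the average down, which is what makes per-token inspection of `A.max(axis=0)` useful for ranking tokens, as the interventions below exploit.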

APS exhibits even tighter correlation with the modality gap ($R^2$ up to $0.95$), surpassing sequence-level metrics. This demonstrates that high token-level alignment precision is critical for closing the performance gap, and that small subsets of poorly aligned tokens can disproportionately impair speech input performance.

4. Targeted Model Interventions and Empirical Effects

Interventions designed based on representational findings include:

  • Angle projection: Projecting poorly aligned speech embeddings onto their corresponding text directions, preserving speech norm.
  • Length normalization: Scaling speech token norms to match text, keeping direction unchanged.
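The two interventions above are dual vector operations on a paired speech/text embedding: one fixes direction while preserving norm, the other fixes norm while preserving direction. A simplified per-token sketch:

```python
import numpy as np

def angle_projection(h_speech: np.ndarray, h_text: np.ndarray) -> np.ndarray:
    """Rotate a speech embedding onto its text token's direction,
    keeping the original speech norm."""
    direction = h_text / np.linalg.norm(h_text)
    return np.linalg.norm(h_speech) * direction

def length_normalization(h_speech: np.ndarray, h_text: np.ndarray) -> np.ndarray:
    """Rescale a speech embedding to the text token's norm,
    keeping the original speech direction."""
    direction = h_speech / np.linalg.norm(h_speech)
    return np.linalg.norm(h_text) * direction
```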

Empirically, targeted angle projection on the three worst-aligned tokens (as ranked by APS) consistently improves speech correctness (e.g., $+1.63$ points for Qwen2.5-7B; $+3.44$ for Llama3.1-8B). In contrast, applying length normalization broadly can degrade performance, indicating that norm differences may encode useful prosodic or modality-specific information (Xiang et al., 14 Oct 2025).

5. Theoretical Interpretation and Model Design Implications

Despite strong improvements in directional alignment, residual magnitude divergence is robust—suggesting that acoustic-prosodic and other modality-specific cues are neither fully suppressed nor need to be for good reasoning. This insight motivates several design principles:

  • Magnitude-aware alignment objectives: Supplementing cosine-based or contrastive objectives with norm-regularization or penalizing magnitude mismatches.
  • Token-level alignment supervision: Using explicit cross-modal alignment signals (e.g., forced alignments, attention priors) at the token level for content words.
  • Integrated architectural interventions: Introducing alignment or normalization layers explicitly into model architecture, particularly at key network depths.
  • Shared gating or alignment transformers: Mechanisms for enforcing convergence of both direction and scale of speech and text representations prior to feeding unified features into the LLM (Xiang et al., 14 Oct 2025).
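The first principle, a magnitude-aware alignment objective, can be sketched as a cosine alignment loss augmented with a norm-mismatch penalty. This is an illustrative construction consistent with the description above, not the loss from any cited paper; the weighting `lam` and the assumption of pre-paired token representations are ours:

```python
import numpy as np

def magnitude_aware_loss(h_s: np.ndarray, h_t: np.ndarray,
                         lam: float = 0.1) -> float:
    """Cosine alignment loss plus a penalty on norm mismatch.

    h_s, h_t: (T, d) paired speech/text representations, assumed already
    aligned token-to-token; lam weights the magnitude term.
    """
    norms_s = np.linalg.norm(h_s, axis=1)
    norms_t = np.linalg.norm(h_t, axis=1)
    cos = np.sum(h_s * h_t, axis=1) / (norms_s * norms_t)
    direction_loss = float(np.mean(1.0 - cos))          # directional term
    magnitude_loss = float(np.mean((norms_s - norms_t) ** 2))  # norm term
    return direction_loss + lam * magnitude_loss
```

The loss is zero only when paired representations match in both direction and scale, which is exactly the convergence the shared-gating proposal above aims to enforce architecturally.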

These recommendations are supported by complementary analyses in multimodal translation (Yin et al., 2023), continual pretraining (Shi et al., 24 Feb 2025), cross-modal distillation (Cuervo et al., 15 Oct 2025), and RL-based representational alignment frameworks (Wang et al., 9 Jan 2026).

6. Broader Empirical Impact and Benchmarking

Large, systematic benchmarking frameworks such as VERA (Lin et al., 30 Sep 2025) and CMR-SPB (Kim et al., 22 Aug 2025) consistently document persistent deficits across reasoning, retrieval, and multi-hop tasks for speech inputs. Gaps persist even under optimal ASR, with only marginal effects from extended processing windows or improved TTS fidelity. Notably, real-time requirements sharply limit the attainable accuracy of voice-mode systems, creating a "latency–accuracy desert." Cascaded architectures partially recover performance but pay a price in logical coherence and introduce new error signatures.

Moreover, the speech/text gap is not uniform across task families. It is most severe for multi-step mathematical or logical reasoning, and for multi-hop, cross-modal reasoning chains that require integrating and connecting speech and textual evidence (2505.15000, Kim et al., 22 Aug 2025). Strategies that explicitly decouple internal reasoning from delivery, as in dual-brain or think-verbalize-speak architectures, recover much of the lost performance while enabling rapid interactive use (Wu et al., 10 Oct 2025, Woo et al., 19 Sep 2025).

7. Future Directions and Open Problems

Continued progress on narrowing the Speech/Text Reasoning Gap depends on methodological innovations spanning data, objectives, and architecture. Key research directions identified include:

  • Magnitude-regularized and tokenwise alignment: Extending contrastive and distillation losses to account for both direction and magnitude discrepancies and supervising fine-grained correspondences (Xiang et al., 14 Oct 2025).
  • High-fidelity token-level supervision: Leveraging forced-alignments or explicit speech-text alignment corpora for content-rich tokens.
  • Integrated cross-modal transformers: Embedding small, shared modules dedicated to aligning speech and text representations, possibly coupled with dynamic gating or soft sharing (Xiang et al., 14 Oct 2025).
  • RL-based and multi-granular trajectory alignment: Trajectory-alignment RL (e.g. TARS) and multi-level online self-distillation (e.g. CORD) that regularize both hidden states and output behavior (Wang et al., 9 Jan 2026, Hu et al., 23 Jan 2026).
  • Domain-adapted curricula: Incorporating specialized spoken benchmarks to address domain-specific representation and reasoning deficits, such as mathematical symbolic processing (2505.15000).
  • Efficient data curation: Leveraging synthetic data selection to efficiently target underrepresented alignment regions (Cuervo et al., 15 Oct 2025).

These avenues together provide a blueprint for future research targeting a thorough closure of the speech/text reasoning gap, moving toward modality-agnostic, deeply reasoned spoken language understanding (Xiang et al., 14 Oct 2025, Wang et al., 9 Jan 2026).
