
Retrieval-Augmented Simultaneous Speech Translation

Updated 6 February 2026
  • RASST is a system that integrates streaming speech transcription with cross-modal retrieval to enhance translation of domain-specific and rare terms under real-time constraints.
  • It employs advanced modules like dual-encoder retrievers and incremental LLM agents, enabling dynamic glossary lookup and seamless translation generation.
  • Empirical evaluations show improved BLEU scores, higher Valid Information Proportion, and increased terminology accuracy compared to traditional simultaneous speech translation methods.

Retrieval-Augmented Simultaneous Speech Translation (RASST) denotes a class of systems that embed cross-modal retrieval mechanisms within simultaneous speech translation (SST) frameworks. These systems address the critical challenge of accurate, low-latency translation of domain-specific and rare terminology by interleaving streaming speech transcription, external knowledge retrieval, and incremental translation generation. RASST systems leverage recent advances in speech-capable large language models (Speech LLMs), cross-modal retrievers, and efficient streaming architectures to produce translated output incrementally as source speech unfolds, with improved fidelity for specialized vocabulary (Luo et al., 30 Jan 2026; Cheng et al., 2024).

1. Problem Formulation and Motivations

Simultaneous speech translation (SST) aims to incrementally translate partial source speech $s=(s_1,s_2,\dots)$ into target text under real-time constraints. While conventional SST models have achieved substantial quality improvements using Speech LLMs, these models often underperform on rare or in-domain terminology due to a lack of explicit access to domain knowledge. Retrieval augmentation, i.e., incorporating relevant external term mappings into the translation process, addresses this limitation. However, in the SST scenario this requires high-throughput, accurate cross-modal (speech-to-text) retrieval that can operate on partial, streaming input and inform generation decisions dynamically (Luo et al., 30 Jan 2026; Cheng et al., 2024).

2. System Architectures

Recent RASST implementations interleave the following core modules in a tight loop:

  • Streaming Speech Encoder: Processes incoming audio in fixed-length or semantic chunks and generates continuous feature representations. Example architectures include large Conformers or Qwen3-Omni Audio Transformers.
  • Cross-Modal Speech–Text Retriever: Aligns short windows of speech with candidate glossary entries, typically using dual-encoder models trained on weakly aligned (speech, term) pairs. Textual queries may use encoders such as BGE-M3, while speech windows utilize attention-based pooling and projection.
  • Speech LLM Agent: An incremental, decoder-only Transformer (e.g., Doubao LLM) which ingests audio and textual embeddings, retrieved terminological hints, and history to output interleaved transcription and translation segments, while maintaining translation memory.
  • Controller & Memory Modules: Manage read/write decisions, glossary retrieval events, state updates, and segment boundary prediction (cut-off times).

Each round of decoding executes five actions: ingesting the next audio chunk, retrieving the top-$k$ relevant glossary entries, loading transcript memory, generating translations/cut-off times, and updating system memory (Cheng et al., 2024; Luo et al., 30 Jan 2026).
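The five-action round can be sketched as a plain Python loop body; the `retriever` and `llm` objects and their method names are illustrative stand-ins for the modules above, not the papers' actual APIs:

```python
from dataclasses import dataclass, field

@dataclass
class SSTState:
    """Hypothetical rolling state for one translation session."""
    transcript: list = field(default_factory=list)
    translation: list = field(default_factory=list)

def decode_round(state, audio_chunk, retriever, llm, k=5):
    """One RASST decoding round: the five actions described above."""
    # 1. Ingest the next audio chunk into streaming features.
    features = retriever.encode_speech(audio_chunk)
    # 2. Retrieve the top-k relevant glossary entries.
    term_map = retriever.topk(features, k=k)
    # 3. Load transcript memory (recent history as context).
    history = state.transcript[-10:]
    # 4. Generate a transcription/translation segment and a cut-off time.
    segment, cutoff = llm.generate(features, term_map, history)
    # 5. Update system memory.
    state.transcript.append(segment.source)
    state.translation.append(segment.target)
    return segment, cutoff
```

A controller would invoke `decode_round` once per incoming chunk, feeding each returned cut-off time back into the segmenter.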

3. Cross-Modal Retrieval Methodologies

RASST retrievers address two key requirements: speed and accuracy under continual partial input.

Dual-Encoder Training: Both text and speech encoders project their respective modalities into a shared $d$-dimensional space ($d=1024$), trained via multi-positive InfoNCE, binary cross-entropy, or similar contrastive objectives on weakly aligned glossaries:

$$\mathcal{L}_{\mathrm{ret}} = -\mathbb{E}_{s,\mathcal{P},\mathcal{N}} \left[ \log \frac{\sum_{p\in\mathcal{P}} \exp(\mathrm{sim}(f^w_s, f^e_p)/\tau)}{\sum_{p\in\mathcal{P}} \exp(\mathrm{sim}(f^w_s, f^e_p)/\tau) + \sum_{n\in\mathcal{N}} \exp(\mathrm{sim}(f^w_s, f^e_n)/\tau)} \right]$$
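A minimal numpy sketch of the multi-positive InfoNCE objective above, computed for a single speech window; the inputs are precomputed similarities $\mathrm{sim}(f^w_s, f^e_\cdot)$ rather than encoder outputs, which is an illustrative simplification:

```python
import numpy as np

def multi_positive_infonce(sim_pos, sim_neg, tau=0.07):
    """Multi-positive InfoNCE loss for one speech window.

    sim_pos: similarities to positive glossary terms (set P).
    sim_neg: similarities to negative terms (set N).
    tau: temperature, as in the loss above.
    """
    pos = np.exp(np.asarray(sim_pos, dtype=float) / tau)
    neg = np.exp(np.asarray(sim_neg, dtype=float) / tau)
    # -log( sum_P / (sum_P + sum_N) ), matching the displayed objective.
    return -np.log(pos.sum() / (pos.sum() + neg.sum()))
```

The loss goes to zero as positive similarities dominate the negatives, pulling weakly aligned (speech, term) pairs together in the shared space.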

Sliding-Window Querying: At inference, a fixed window of recent audio is encoded; the resulting vector queries a prebuilt FAISS index of text terms, aggregating the top-$K_1$ terms per window. Chunk-wise results are consolidated, yielding a term map $\hat{G}_i$ for each SST step. Window size ($W$) and stride ($\delta$) are empirically tuned (e.g., $W=1.92$ s for optimal Recall@10, $\delta=0.48$ s) (Luo et al., 30 Jan 2026).
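The querying-and-consolidation step can be sketched as follows; a brute-force dot-product search over normalized vectors stands in for the FAISS index, and windows/strides are counted in frames rather than seconds, both simplifying assumptions:

```python
import numpy as np

def sliding_window_retrieve(speech_emb, index_vecs, terms, k1=10,
                            window=4, stride=1):
    """Sliding-window glossary retrieval sketch.

    speech_emb: (frames, d) array of streaming speech features.
    index_vecs: (n_terms, d) unit-normalized term embeddings.
    terms: the glossary strings, aligned with index_vecs rows.
    """
    term_map = {}
    for start in range(0, max(1, len(speech_emb) - window + 1), stride):
        q = speech_emb[start:start + window].mean(axis=0)  # pooled query
        q = q / (np.linalg.norm(q) + 1e-9)
        scores = index_vecs @ q                            # cosine similarity
        for i in np.argsort(-scores)[:k1]:
            # Consolidate chunk-wise hits, keeping each term's best score.
            term_map[terms[i]] = max(term_map.get(terms[i], -1.0),
                                     float(scores[i]))
    return term_map  # consolidated term map for this SST step
```

In production the inner search would be a FAISS index lookup, which is what keeps per-query cost logarithmic in the glossary size.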

Retrieval Efficiency: Each retrieval step incurs at most 16% overhead relative to LLM decoding, with $O(\log|\mathcal{G}|)$ cost per query, where $|\mathcal{G}|$ is the glossary size (Luo et al., 30 Jan 2026).

4. Translation Decoding, Read/Write Strategies, and Integration

The Speech LLM receives as input the current audio segment's encoding, the retrieved term map, historical memory, and instruction sequences. Integration is realized by formatting the term map, for example as a JSON prefix ("term_map: ...") (Luo et al., 30 Jan 2026) or as in-context prompt pairs $\langle k_1, v_1, \dots, k_K, v_K \rangle$ (Cheng et al., 2024).
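The two formatting styles can be sketched in a few lines; the exact prompt templates and the `<audio>` placeholder are assumptions, not the papers' verbatim formats:

```python
import json

def format_prompt(term_map, audio_placeholder="<audio>"):
    """Sketch of the two term-map integration styles described above:
    a JSON 'term_map' prefix, and in-context <key, value> pairs."""
    # Style 1: JSON prefix (Luo et al.-style).
    json_prefix = "term_map: " + json.dumps(term_map, ensure_ascii=False)
    # Style 2: in-context pairs <k1, v1> ... <kK, vK> (Cheng et al.-style).
    icl_pairs = " ".join(f"<{k}, {v}>" for k, v in term_map.items())
    return json_prefix + "\n" + icl_pairs + "\n" + audio_placeholder
```

Either rendering is simply prepended to the model's context, so the decoder can copy retrieved target-side terms during generation.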

Read/Write Controllers: CLASI (Cheng et al., 2024) replaces traditional wait-$k$ policies with a learned, chunk-based policy mimicking professional human interpreters. Chunk boundaries are predicted as soon as semantic completeness is detected, and the LLM emits translation and cut-off time accordingly.

$$\min_\theta\ \mathbb{E}_{t \sim U[1,M]} \left[ -\log p_\theta(Y_{1:j},\, t^j \mid X_{1:t}) \right], \quad j = \max\{j : t^j < t\}$$

Training with Synthetic Data: Where explicit parallel (speech, term_map, translation) data are unavailable, retrieval-augmented SST instances are synthesized. For each chunk, the system may randomly assign "standard" (ground truth + hard negatives), "none," or "all-wrong" term maps to teach the LLM robust reliance on retrieved hints (Luo et al., 30 Jan 2026).
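A minimal sketch of the synthetic term-map recipe above; the mixing probabilities are assumptions, since the paper does not state the exact proportions here:

```python
import random

def synthesize_term_map(gold_terms, hard_negatives, rng=None,
                        p_standard=0.6, p_none=0.2):
    """Randomly assign a term map to one training chunk.

    'standard' = ground-truth terms plus hard negatives,
    'none'     = empty map,
    'all-wrong'= hard negatives only,
    so the LLM learns when to trust retrieved hints.
    """
    if rng is None:
        rng = random.Random(0)
    r = rng.random()
    if r < p_standard:
        return dict(gold_terms, **hard_negatives)   # "standard"
    if r < p_standard + p_none:
        return {}                                   # "none"
    return dict(hard_negatives)                     # "all-wrong"
```

Mixing the three patterns during training is what teaches the model to exploit correct hints while ignoring spurious ones.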

Empirical Integration Benefits: Retrieval augmentation in terminology-heavy settings yields substantial VIP increases (~10 points), with in-context learning boosting recall to 79% and F1 to 82.6% (Cheng et al., 2024).

5. Evaluation Metrics, Benchmarks, and Quantitative Results

Standard SST QA Metrics:

  • BLEU (SacreBLEU), BLEURT, and COMET: Used for global translation quality.
  • Latency: Measured via metrics such as Average Lagging (AL), Length-Adaptive AL (LAAL), and First-Letter Appearance Lagging (FLAL) (Cheng et al., 2024), or StreamLAAL (Luo et al., 30 Jan 2026).

Terminology-Specific Metrics:

  • Terminology Accuracy: Fraction of reference glossary terms appearing in hypothesis translation.
  • Valid Information Proportion (VIP) (Cheng et al., 2024): Proportion of semantic fragments preserving key term correctness, meaning accuracy, and fluency.

$$\mathrm{VIP} = \frac{\#\,\text{valid semantic fragments}}{\#\,\text{all fragments}} \times 100\%$$
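Both terminology-specific metrics are straightforward to compute; a sketch, where VIP's fragment judgements are assumed to come from a human or automatic evaluator:

```python
def terminology_accuracy(hypothesis, reference_terms):
    """Fraction of reference glossary terms appearing in the hypothesis."""
    if not reference_terms:
        return 0.0
    hits = sum(1 for t in reference_terms if t in hypothesis)
    return hits / len(reference_terms)

def vip(fragment_judgements):
    """Valid Information Proportion: share of semantic fragments judged
    valid (key terms correct, meaning preserved, fluent), as a percentage."""
    if not fragment_judgements:
        return 0.0
    return 100.0 * sum(fragment_judgements) / len(fragment_judgements)
```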

Key Results:

| System | VIP (%) | BLEU | Term Acc. (%) | Latency (AL, s) |
|---|---|---|---|---|
| CLASI (zh→en) | 81.3 | 32.6 | — | 2.17 |
| Best commercial | ≤41.6 | 25.2 | — | — |
| RASST (En→Zh) | — | 48.0 | 88 | — |
| InfiniSST | — | 45.0 | 72 | — |
  • CLASI achieves 81.3% VIP (zh→en) vs. 35.4% for best commercial systems and maintains low latency (2.17 s AL) (Cheng et al., 2024).
  • RASST (En→Zh, ACL devset) improves terminology translation accuracy by up to 16 points (72%→88%) and BLEU by up to 3 points over baselines at comparable latency; runtime overhead for retrieval remains under 16% (Luo et al., 30 Jan 2026).
  • On challenging ("extremely hard") dataset splits, CLASI maintains ~70% VIP, while other systems achieve <13% (Cheng et al., 2024).
  • Sliding-window retrieval ($W=1.92$ s) achieves Recall@10 of 92.3% (paper-extracted glossary) (Luo et al., 30 Jan 2026).

6. Design Challenges, Ablations, and Limitations

Key Challenges:

  • Glossary Coverage Dependency: Absence of glossary entries for relevant terms limits retrieval efficacy.
  • Retriever Error Robustness: False positive terms may mislead the LLM; synthetic "all-wrong" patterns in training mitigate but do not eliminate vulnerability (Luo et al., 30 Jan 2026).
  • Trade-offs: Larger retrieval windows improve recall but entail greater memory use and increased per-chunk latency. Optimal values (e.g., $W=1.92$ s, $\delta=0.48$ s) balance these factors.
  • Action Skipping: CLASI's architecture enables potentially dynamic action skipping—e.g., bypassing retrieval for non-terminology segments—with future work suggested on this topic (Cheng et al., 2024).

Ablation Insights:

  • Negative Example Quality: BGE-M3-generated (retriever) negatives result in strongest downstream BLEU (47.5/48.8) and Term Acc (89.3/83.9), outperforming random or LLM-generated negatives (Luo et al., 30 Jan 2026).
  • Segment Policy: Learned, semantic chunk-based read/write yields deterministic and latency-stable operation without redundant translations (Cheng et al., 2024).

7. Future Directions

Prominent extensions proposed include:

  • Joint optimization of retriever+LLM (possibly via weak supervision based on translation correctness) for tighter coupling and robustness (Luo et al., 30 Jan 2026).
  • Enhanced glossaries, with dynamic updates during events (e.g., adding terms in real events/conferences).
  • Extending RASST to speech-to-speech and multilingual simultaneous translation, as well as video modalities (Cheng et al., 2024).
  • Further reduction in latency through hardware-aware design, streamlined audio adapters, and more efficient retrieval (Cheng et al., 2024).
  • Improvements to evaluation and reward functions, e.g., better metrics for complex long-form simultaneous interpretation and multi-modal RLHF (Cheng et al., 2024).
  • Hard constraints or lattice-based decoding to guarantee correct integration of retrieved terminology (Luo et al., 30 Jan 2026).

References

  • "RASST: Fast Cross-modal Retrieval-Augmented Simultaneous Speech Translation" (Luo et al., 30 Jan 2026).
  • "Towards Achieving Human Parity on End-to-end Simultaneous Speech Translation via LLM Agent" (Cheng et al., 2024).
