Hybrid-Autoregressive Inference Transducers
- The paper introduces HAINAN models that unify AR, NAR, and SAR inference, providing flexible trade-offs between decoding speed and accuracy.
- The models employ a joint network that fuses encoder and predictor outputs to generate tokens, durations, or visual embeddings for diverse applications.
- Empirical results demonstrate improved WER and latency across speech and multimodal tasks, underscoring HAINAN’s efficiency and versatility.
Hybrid-Autoregressive INference TrANsducers (HAINAN) are a class of sequence modeling architectures designed to bridge the gap between fully autoregressive (AR) and non-autoregressive (NAR) generation, providing a principled mechanism for hybrid inference in applications such as speech recognition and multimodal LLMs. HAINAN models unify the strengths of AR and NAR paradigms, enabling flexible trade-offs between decoding accuracy and speed, and support additional semi-autoregressive (SAR) refinement. The paradigm has been instantiated in speech (HAINAN for ASR (Xu et al., 2024)) and multimodal reasoning (reasoning-switchable MLLMs in SwimBird (Tong et al., 5 Feb 2026)), where it has demonstrated state-of-the-art performance and practical versatility.
1. Core Modeling Principles
HAINAN models jointly represent and factorize the conditional probability of output sequences, employing separate but coupled mechanisms for distinct output modalities (discrete tokens, continuous embeddings, durations, etc.), and support both AR and NAR inference by explicit architectural and training design.
Speech Sequence Transduction
In speech applications, HAINAN extends the Token-and-Duration Transducer (TDT) model. The encoder produces acoustic frame embeddings. A predictor network, which may be an LSTM or a stateless last-token embedding, models label context. At each step, a joint network fuses encoder and predictor embeddings to predict both the next output token and an explicit duration (the number of frames to advance). The transducer models the joint distribution

$$P(\mathbf{y} \mid \mathbf{x}) = \sum_{\mathbf{a} \in \mathcal{B}^{-1}(\mathbf{y})} P(\mathbf{a} \mid \mathbf{x}),$$

where $\mathbf{a}$ denotes an alignment of (token, duration) pairs mapping acoustic frames to output labels, and $\mathcal{B}$ maps alignments to their label sequences. Stochastic predictor masking during training ensures that the joint network can operate both with and without history context (Xu et al., 2024).
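The step loop above can be sketched as a greedy AR decode. The toy `joint` and `update_pred` functions below are hypothetical stand-ins for the trained joint network and predictor; all dimensions and weights are illustrative assumptions, not the paper's architecture.

```python
import numpy as np

# Greedy AR decoding sketch for a duration transducer: at each step the joint
# network produces token logits and duration logits, and decoding advances by
# the predicted duration. Weights are random placeholders for illustration.
rng = np.random.default_rng(0)
DIM, VOCAB, MAX_DUR, BLANK = 16, 8, 4, 0
W_tok = rng.standard_normal((DIM, VOCAB))
W_dur = rng.standard_normal((DIM, MAX_DUR))
emb = rng.standard_normal((VOCAB, DIM))

def joint(enc_frame, pred_state):
    fused = enc_frame + pred_state          # additive fusion of encoder/predictor
    return fused @ W_tok, fused @ W_dur     # token logits, duration logits

def update_pred(pred_state, token):
    return np.tanh(pred_state + emb[token])  # toy predictor-state update

def greedy_ar_decode(encoder_out):
    t, pred_state, tokens = 0, np.zeros(DIM), []
    while t < len(encoder_out):
        tok_logits, dur_logits = joint(encoder_out[t], pred_state)
        token = int(tok_logits.argmax())
        if token != BLANK:
            tokens.append(token)
            pred_state = update_pred(pred_state, token)
        # advance by the predicted duration (forced >= 1 frame here, so this
        # simplified loop always terminates)
        t += max(1, int(dur_logits.argmax()))
    return tokens
```

A real TDT additionally allows zero durations under constraints; they are clamped away here purely to keep the sketch loop-free.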
Multimodal Reasoning
For multimodal LLMs, as realized in SwimBird, HAINAN enables joint AR inference over discrete text tokens $y_t$ and continuous visual-thought embeddings $\mathbf{e}_t$ given an image $I$, by the factorization

$$P(\mathbf{o} \mid I) = \prod_{t} \begin{cases} P_{\text{tok}}(y_t \mid \mathbf{o}_{<t}, I) & \text{if position } t \text{ holds a text token,} \\ p_{\text{emb}}(\mathbf{e}_t \mid \mathbf{o}_{<t}, I) & \text{if position } t \text{ holds a visual embedding,} \end{cases}$$

where $\mathbf{o}$ is the interleaved output sequence.
The discrete (text) branch uses a standard transformer LLM head; the continuous (visual) branch regresses the next embedding. Dynamic interleaving is achieved by special delimiters defining text-only, vision-only, or interleaved reasoning blocks (Tong et al., 5 Feb 2026).
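The delimiter-driven interleaving can be illustrated with a small parser that partitions an output sequence into text and vision reasoning blocks. The delimiter strings `<bov>`/`<eov>` are hypothetical placeholders; the source does not specify the actual marker tokens.

```python
# Partition an interleaved output sequence into ("text", items) and
# ("vision", items) blocks, as signalled by begin/end-of-visual delimiters.
# Marker names are illustrative assumptions, not SwimBird's actual tokens.
def split_reasoning_blocks(seq, bov="<bov>", eov="<eov>"):
    blocks, mode, cur = [], "text", []
    for item in seq:
        if item == bov:
            if cur:
                blocks.append(("text", cur))
            mode, cur = "vision", []
        elif item == eov:
            if cur:
                blocks.append(("vision", cur))
            mode, cur = "text", []
        else:
            cur.append(item)
    if cur:
        blocks.append((mode, cur))
    return blocks
```

At inference time the same routing decides which head runs next: positions inside a vision block go to the embedding regressor, all others to the token softmax.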
2. Architectural Components
HAINAN models exhibit a modular structure conducive to both AR and NAR operation, with unified architectures across diverse modalities.
| Component | Speech HAINAN (Xu et al., 2024) | Multimodal HAINAN (SwimBird) (Tong et al., 5 Feb 2026) |
|---|---|---|
| Encoder | FastConformer stack over signal | Vision encoder (e.g., Qwen-ViT) |
| Predictor | LSTM (AR), stateless (NAR) | Transformer backbone for text/vision |
| Joint network | Combines encoder & predictor for token/dur | Fuses token & vision embeddings |
| Output heads | Token softmax, duration softmax | Token softmax, embedding regressor |
In speech, the joint network adds encoder and predictor representations, then predicts a token and explicit duration per step. In multimodal HAINAN, the backbone transformer, augmented with a multimodal projector, supports both token and embedding generation depending on context markers.
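The additive fusion in the speech joint network can be sketched in a few lines. The projection dimensions and the ReLU nonlinearity are illustrative assumptions.

```python
import numpy as np

# Minimal sketch of the additive joint network: project encoder and predictor
# states into a shared space, add them, apply a nonlinearity, then read out
# separate token and duration logits. All sizes/weights are placeholders.
rng = np.random.default_rng(0)
D_ENC, D_PRED, D_JOINT, VOCAB, MAX_DUR = 12, 8, 16, 32, 5
W_e = rng.standard_normal((D_ENC, D_JOINT))
W_p = rng.standard_normal((D_PRED, D_JOINT))
W_tok = rng.standard_normal((D_JOINT, VOCAB))
W_dur = rng.standard_normal((D_JOINT, MAX_DUR))

def joint_network(enc, pred):
    h = np.maximum(enc @ W_e + pred @ W_p, 0.0)  # add, then ReLU
    return h @ W_tok, h @ W_dur                  # token logits, duration logits

tok_logits, dur_logits = joint_network(rng.standard_normal(D_ENC),
                                       rng.standard_normal(D_PRED))
```

In NAR mode the predictor input is simply replaced by a fixed (e.g., zero) vector, so the same joint network serves both regimes.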
3. Training Objectives and Methods
HAINAN models are trained with unified loss formulations enabling hybrid inference.
- Speech HAINAN: Optimizes negative log-likelihood over the sum of all valid (token, duration) alignments. Random predictor masking (with probability 0.5) ensures the model can decode with or without label context. No explicit scheduled sampling is used beyond this stochastic masking (Xu et al., 2024).
- Multimodal HAINAN: Employs a hybrid loss,

$$\mathcal{L} = \lambda_{\text{text}}\,\mathcal{L}_{\text{CE}} + \lambda_{\text{vis}}\,\mathcal{L}_{\text{MSE}},$$

where $\mathcal{L}_{\text{CE}}$ is cross-entropy over tokens and $\mathcal{L}_{\text{MSE}}$ is MSE over visual embeddings, with scalar weights $\lambda_{\text{text}}$ and $\lambda_{\text{vis}}$ (Tong et al., 5 Feb 2026). This formulation supports samples containing only text, only embeddings, or both interleaved.
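A masked implementation of this hybrid objective might look as follows. The default weights and the masking scheme are illustrative assumptions, not the paper's values.

```python
import numpy as np

# Sketch of the hybrid objective: cross-entropy on text positions plus MSE on
# visual-embedding positions, combined with scalar weights. `text_mask` is 1
# at text positions and 0 at visual-embedding positions.
def hybrid_loss(tok_logits, tok_targets, text_mask, emb_pred, emb_target,
                lam_text=1.0, lam_vis=1.0):
    # log-softmax, then gather each position's target-token log-probability
    logp = tok_logits - np.log(np.exp(tok_logits).sum(-1, keepdims=True))
    ce = -logp[np.arange(len(tok_targets)), tok_targets]
    l_ce = (ce * text_mask).sum() / max(text_mask.sum(), 1.0)
    # mean-squared error on the complementary (visual) positions
    vis_mask = 1.0 - text_mask
    mse = ((emb_pred - emb_target) ** 2).mean(-1)
    l_mse = (mse * vis_mask).sum() / max(vis_mask.sum(), 1.0)
    return lam_text * l_ce + lam_vis * l_mse
```

Because each term is masked independently, text-only and embedding-only samples fall out as special cases of the same loss.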
4. Inference Paradigms
HAINAN supports three main inference paradigms, each with specific accuracy and latency trade-offs.
1. Autoregressive (AR) Inference
- Uses full label-history context via the predictor (speech) or full text/visual context (multimodal).
- Decoding proceeds token-by-token (or span-by-span) in sequential order.
- Offers highest accuracy but highest latency and lowest parallelism.
- In ASR, AR-HAINAN achieves best WERs (e.g., English: 7.10%; German: 4.75–9.22%) (Xu et al., 2024).
2. Non-Autoregressive (NAR) Inference
- Bypasses the predictor; only the encoder output is used for predictions.
- Decoding can be performed fully in parallel over input frames.
- Models token transitions via a directed acyclic graph (DAG) over frames and uses Viterbi decoding to find the best output path.
- Offers low latency (parity with CTC), but typically with slight degradation in WER (e.g., English: 7.19%; German: 5.11–9.63%) (Xu et al., 2024).
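The NAR path search can be sketched as a dynamic program over the frame-jump DAG: per-frame token and duration log-probabilities are computed in parallel from the encoder alone, then a Viterbi-style pass picks the best sequence of jumps. Blank handling and beam search are omitted; scoring details here are simplifying assumptions.

```python
import numpy as np

# Viterbi-style NAR decoding sketch over the duration DAG: from frame t,
# duration d >= 1 jumps to frame t+d, so the graph is acyclic and a single
# forward DP followed by a backtrace yields the best path.
def nar_viterbi(tok_logp, dur_logp):
    T, D = dur_logp.shape
    best = np.full(T + 1, -np.inf)        # best path score reaching each frame
    back = np.zeros(T + 1, dtype=int)     # predecessor frame on the best path
    best[0] = 0.0
    for t in range(T):
        if best[t] == -np.inf:
            continue
        emit = tok_logp[t].max()          # greedy token choice at this frame
        for d in range(1, D):             # durations >= 1 keep the DAG acyclic
            nxt = min(t + d, T)
            score = best[t] + emit + dur_logp[t, d]
            if score > best[nxt]:
                best[nxt], back[nxt] = score, t
    # backtrace the visited frames and emit each frame's argmax token
    frames, t = [], T
    while t > 0:
        t = back[t]
        frames.append(t)
    return [int(tok_logp[f].argmax()) for f in reversed(frames)]
```

Since the per-frame logits are independent of decoding order, the expensive model evaluation parallelizes over all frames, with only the cheap DP left sequential.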
3. Semi-Autoregressive (SAR) Inference
- Computes an initial hypothesis using NAR decoding.
- Iteratively refines each token prediction in parallel sweeps, each sweep conditioning on prior predictions.
- One to two SAR passes close most of the accuracy gap between NAR and AR, with marginal additional latency (e.g., SAR-1: English WER 7.13%, latency 45 s vs. AR latency 89 s) (Xu et al., 2024).
- In multimodal HAINAN, this decoding corresponds to alternating between text and embedding spans as signaled by delimiters (Tong et al., 5 Feb 2026).
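The SAR sweep structure reduces to a short loop. Here `repredict` is a hypothetical stand-in for the model's per-position conditional prediction; the point of the sketch is only the parallel-sweep control flow.

```python
# SAR refinement sketch: start from a NAR hypothesis and run a fixed number of
# parallel sweeps, each re-predicting every position conditioned only on the
# *previous* sweep's tokens (the context is frozen within a sweep).
def sar_refine(init_tokens, repredict, num_sweeps=2):
    tokens = list(init_tokens)
    for _ in range(num_sweeps):
        # all positions are updated in parallel against the frozen context
        tokens = [repredict(i, tokens) for i in range(len(tokens))]
    return tokens
```

Each sweep costs roughly one parallel model pass, which is why one or two sweeps add only marginal latency over pure NAR decoding.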
5. Empirical Performance and Trade-Offs
HAINAN demonstrates robust improvements in accuracy and efficiency compared to predecessors across modalities.
Speech Recognition
- On large English and German corpora, AR-HAINAN outperforms RNN-T and TDT by ≈0.03–0.05 absolute WER, NAR-HAINAN outperforms CTC by ≈0.2 absolute WER, and SAR recovers most of the remaining gap to AR at ≈10–20% latency overhead over NAR (Xu et al., 2024).
- In some ablations, Viterbi-based NAR decoding matches or even surpasses AR decoding.
Multimodal Reasoning
- In SwimBird's implementation, hybrid autoregressive reasoning allows dynamic switching between textual and visual reasoning (or their interleaving), yielding state-of-the-art performance on both text-centric and vision-centric tasks (Tong et al., 5 Feb 2026).
Summary Table: Speech ASR (English) (Xu et al., 2024)
| Model | Mode | avg WER% | Time (s, Libri-other) |
|---|---|---|---|
| RNN-T | AR | 7.15 | 179 |
| TDT | AR | 7.13 | 88 |
| CTC | NAR | 7.38 | 39 |
| HAINAN | AR | 7.10 | 89 |
| HAINAN | NAR | 7.19 | 41 |
| HAINAN SAR-1 | SAR | 7.13 | 45 |
| HAINAN SAR-2 | SAR | 7.12 | 48 |
6. Comparison to Other Methods
HAINAN unifies and generalizes AR, NAR, and SAR models. In ASR, it subsumes RNN-T (AR) and CTC (NAR), providing comparable or better accuracy in all regimes. Compared to hybrid approaches that independently combine CTC and TDT heads, a single HAINAN model achieves higher accuracy in both modes and naturally supports SAR refinement (Xu et al., 2024). In the multimodal domain, HAINAN's hybrid AR formulation advances beyond fixed-pattern or chain-of-thought-only reasoning by enabling adaptive, context-driven mode selection (Tong et al., 5 Feb 2026).
A notable property in speech is the suppression of degenerate zero-duration predictions, which prevents pathological infinite-loop behavior during decoding; this failure mode is not addressed in standard CTC or hybrid TDT-CTC frameworks (Xu et al., 2024).
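The zero-duration safeguard amounts to masking one logit before the duration argmax. The function name and interface below are illustrative, not from the paper.

```python
import numpy as np

# Sketch of zero-duration suppression: the d = 0 logit is masked out before
# the duration argmax, so decoding always advances at least one frame and
# cannot loop indefinitely on a single frame.
def pick_duration(dur_logits, allow_zero=False):
    logits = np.asarray(dur_logits, dtype=float).copy()
    if not allow_zero:
        logits[0] = -np.inf  # forbid staying on the same frame
    return int(logits.argmax())
```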
7. Applications and Future Directions
HAINAN's modality-agnostic AR/NAR/SAR transduction supports efficient, accurate inference in demanding applications such as real-time speech recognition, on-device/mobile ASR, and multimodal reasoning in LLMs. The unification of AR and NAR modes facilitates deployment flexibility by allowing inference-time trade-offs between speed and accuracy. Emerging directions include extending the hybrid transducer to additional modalities (e.g., audio-visual, text-to-speech), optimizing SAR refinement depths, and refining dynamic reasoning-mode selection in complex multimodal tasks (Xu et al., 2024, Tong et al., 5 Feb 2026).