Hybrid-Autoregressive Inference Transducers
- The paper introduces HAINAN models that unify AR, NAR, and SAR inference, providing flexible trade-offs between decoding speed and accuracy.
- The models employ a joint network that fuses encoder and predictor outputs to generate tokens, durations, or visual embeddings for diverse applications.
- Empirical results demonstrate improved WER and latency across speech and multimodal tasks, underscoring HAINAN’s efficiency and versatility.
Hybrid-Autoregressive INference TrANsducers (HAINAN) are a class of sequence modeling architectures designed to bridge the gap between fully autoregressive (AR) and non-autoregressive (NAR) generation, providing a principled mechanism for hybrid inference in applications such as speech recognition and multimodal LLMs. HAINAN models unify the strengths of AR and NAR paradigms, enabling flexible trade-offs between decoding accuracy and speed, and support additional semi-autoregressive (SAR) refinement. The paradigm has been instantiated in speech (HAINAN for ASR (Xu et al., 2024)) and multimodal reasoning (reasoning-switchable MLLMs in SwimBird (Tong et al., 5 Feb 2026)), where it has demonstrated state-of-the-art performance and practical versatility.
1. Core Modeling Principles
HAINAN models jointly represent and factorize the conditional probability of output sequences, employing separate but coupled mechanisms for distinct output modalities (discrete tokens, continuous embeddings, durations, etc.), and support both AR and NAR inference by explicit architectural and training design.
Speech Sequence Transduction
In speech applications, HAINAN extends the Token-and-Duration Transducer (TDT) model. The encoder produces acoustic frame embeddings. A predictor network, which may be an LSTM or a stateless last-token embedding, models label context. At each step, a joint network fuses encoder and predictor embeddings to predict both the next output token and an explicit duration (the number of frames to advance). The transducer models the joint distribution

$$P(\mathbf{y} \mid \mathbf{x}) = \sum_{\mathbf{a} \in \mathcal{B}^{-1}(\mathbf{y})} P(\mathbf{a} \mid \mathbf{x}),$$

where $\mathbf{a}$ denotes an alignment of (token, duration) pairs mapping acoustic frames to output labels, and $\mathcal{B}$ maps alignments to their label sequences. Stochastic predictor masking during training ensures that the joint network can operate both with and without history context (Xu et al., 2024).
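The step loop above can be sketched as a greedy AR decode. The toy `joint` and `update_pred` functions below are hypothetical stand-ins for the trained joint network and predictor; all dimensions and weights are illustrative assumptions, not the paper's architecture.

```python
import numpy as np

# Greedy AR decoding sketch for a duration transducer: at each step the joint
# network produces token logits and duration logits, and decoding advances by
# the predicted duration. Weights are random placeholders for illustration.
rng = np.random.default_rng(0)
DIM, VOCAB, MAX_DUR, BLANK = 16, 8, 4, 0
W_tok = rng.standard_normal((DIM, VOCAB))
W_dur = rng.standard_normal((DIM, MAX_DUR))
emb = rng.standard_normal((VOCAB, DIM))

def joint(enc_frame, pred_state):
    fused = enc_frame + pred_state          # additive fusion of encoder/predictor
    return fused @ W_tok, fused @ W_dur     # token logits, duration logits

def update_pred(pred_state, token):
    return np.tanh(pred_state + emb[token])  # toy predictor-state update

def greedy_ar_decode(encoder_out):
    t, pred_state, tokens = 0, np.zeros(DIM), []
    while t < len(encoder_out):
        tok_logits, dur_logits = joint(encoder_out[t], pred_state)
        token = int(tok_logits.argmax())
        if token != BLANK:
            tokens.append(token)
            pred_state = update_pred(pred_state, token)
        # advance by the predicted duration (forced >= 1 frame here, so this
        # simplified loop always terminates)
        t += max(1, int(dur_logits.argmax()))
    return tokens
```

A real TDT additionally allows zero durations under constraints; they are clamped away here purely to keep the sketch loop-free.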
Multimodal Reasoning
For multimodal LLMs, as realized in SwimBird, HAINAN enables joint AR inference over discrete text tokens $y_t$ and continuous visual-thought embeddings $\mathbf{e}_t$ given an image $I$, by the factorization

$$P(\mathbf{o} \mid I) = \prod_{t} \begin{cases} P_{\text{tok}}(y_t \mid \mathbf{o}_{<t}, I) & \text{if position } t \text{ holds a text token,} \\ p_{\text{emb}}(\mathbf{e}_t \mid \mathbf{o}_{<t}, I) & \text{if position } t \text{ holds a visual embedding,} \end{cases}$$

where $\mathbf{o}$ is the interleaved output sequence.
The discrete (text) branch uses a standard transformer LLM head; the continuous (visual) branch regresses the next embedding. Dynamic interleaving is achieved by special delimiters defining text-only, vision-only, or interleaved reasoning blocks (Tong et al., 5 Feb 2026).
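The delimiter-driven interleaving can be illustrated with a small parser that partitions an output sequence into text and vision reasoning blocks. The delimiter strings `<bov>`/`<eov>` are hypothetical placeholders; the source does not specify the actual marker tokens.

```python
# Partition an interleaved output sequence into ("text", items) and
# ("vision", items) blocks, as signalled by begin/end-of-visual delimiters.
# Marker names are illustrative assumptions, not SwimBird's actual tokens.
def split_reasoning_blocks(seq, bov="<bov>", eov="<eov>"):
    blocks, mode, cur = [], "text", []
    for item in seq:
        if item == bov:
            if cur:
                blocks.append(("text", cur))
            mode, cur = "vision", []
        elif item == eov:
            if cur:
                blocks.append(("vision", cur))
            mode, cur = "text", []
        else:
            cur.append(item)
    if cur:
        blocks.append((mode, cur))
    return blocks
```

At inference time the same routing decides which head runs next: positions inside a vision block go to the embedding regressor, all others to the token softmax.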
2. Architectural Components
HAINAN models exhibit a modular structure conducive to both AR and NAR operation, with unified architectures across diverse modalities.
| Component | Speech HAINAN (Xu et al., 2024) | Multimodal HAINAN (SwimBird) (Tong et al., 5 Feb 2026) |
|---|---|---|
| Encoder | FastConformer stack over signal | Vision encoder (e.g., Qwen-ViT) |
| Predictor | LSTM (AR), stateless (NAR) | Transformer backbone for text/vision |
| Joint network | Combines encoder & predictor for token/dur | Fuses token & vision embeddings |
| Output heads | Token softmax, duration softmax | Token softmax, embedding regressor |
In speech, the joint network adds encoder and predictor representations, then predicts a token and explicit duration per step. In multimodal HAINAN, the backbone transformer, augmented with a multimodal projector, supports both token and embedding generation depending on context markers.
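The additive fusion in the speech joint network can be sketched in a few lines. The projection dimensions and the ReLU nonlinearity are illustrative assumptions.

```python
import numpy as np

# Minimal sketch of the additive joint network: project encoder and predictor
# states into a shared space, add them, apply a nonlinearity, then read out
# separate token and duration logits. All sizes/weights are placeholders.
rng = np.random.default_rng(0)
D_ENC, D_PRED, D_JOINT, VOCAB, MAX_DUR = 12, 8, 16, 32, 5
W_e = rng.standard_normal((D_ENC, D_JOINT))
W_p = rng.standard_normal((D_PRED, D_JOINT))
W_tok = rng.standard_normal((D_JOINT, VOCAB))
W_dur = rng.standard_normal((D_JOINT, MAX_DUR))

def joint_network(enc, pred):
    h = np.maximum(enc @ W_e + pred @ W_p, 0.0)  # add, then ReLU
    return h @ W_tok, h @ W_dur                  # token logits, duration logits

tok_logits, dur_logits = joint_network(rng.standard_normal(D_ENC),
                                       rng.standard_normal(D_PRED))
```

In NAR mode the predictor input is simply replaced by a fixed (e.g., zero) vector, so the same joint network serves both regimes.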
3. Training Objectives and Methods
HAINAN models are trained with unified loss formulations enabling hybrid inference.
- Speech HAINAN: Optimizes negative log-likelihood over the sum of all valid (token, duration) alignments. Random predictor masking (with probability 0.5) ensures the model can decode with or without label context. No explicit scheduled sampling is used beyond this stochastic masking (Xu et al., 2024).
- Multimodal HAINAN: Employs a hybrid loss,

$$\mathcal{L} = \lambda_{\text{text}}\,\mathcal{L}_{\text{CE}} + \lambda_{\text{vis}}\,\mathcal{L}_{\text{MSE}},$$

where $\mathcal{L}_{\text{CE}}$ is cross-entropy over tokens and $\mathcal{L}_{\text{MSE}}$ is MSE over visual embeddings, with scalar weights $\lambda_{\text{text}}$ and $\lambda_{\text{vis}}$ (Tong et al., 5 Feb 2026). This formulation supports samples containing only text, only embeddings, or both interleaved.
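A masked implementation of this hybrid objective might look as follows. The default weights and the masking scheme are illustrative assumptions, not the paper's values.

```python
import numpy as np

# Sketch of the hybrid objective: cross-entropy on text positions plus MSE on
# visual-embedding positions, combined with scalar weights. `text_mask` is 1
# at text positions and 0 at visual-embedding positions.
def hybrid_loss(tok_logits, tok_targets, text_mask, emb_pred, emb_target,
                lam_text=1.0, lam_vis=1.0):
    # log-softmax, then gather each position's target-token log-probability
    logp = tok_logits - np.log(np.exp(tok_logits).sum(-1, keepdims=True))
    ce = -logp[np.arange(len(tok_targets)), tok_targets]
    l_ce = (ce * text_mask).sum() / max(text_mask.sum(), 1.0)
    # mean-squared error on the complementary (visual) positions
    vis_mask = 1.0 - text_mask
    mse = ((emb_pred - emb_target) ** 2).mean(-1)
    l_mse = (mse * vis_mask).sum() / max(vis_mask.sum(), 1.0)
    return lam_text * l_ce + lam_vis * l_mse
```

Because each term is masked independently, text-only and embedding-only samples fall out as special cases of the same loss.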
4. Inference Paradigms
HAINAN supports three main inference paradigms, each with specific accuracy and latency trade-offs.
1. Autoregressive (AR) Inference
- Uses full label-history context via the predictor (speech) or full text/visual context (multimodal).
- Decoding proceeds token-by-token (or span-by-span) in sequential order.
- Offers highest accuracy but highest latency and lowest parallelism.
- In ASR, AR-HAINAN achieves best WERs (e.g., English: 7.10%; German: 4.75–9.22%) (Xu et al., 2024).
2. Non-Autoregressive (NAR) Inference
- Bypasses the predictor; only the encoder output is used for predictions.
- Decoding can be performed fully in parallel over input frames.
- Models token transitions via a directed acyclic graph (DAG) over frames and uses Viterbi decoding to find the best output path.
- Offers low latency (parity with CTC), but typically with slight degradation in WER (e.g., English: 7.19%; German: 5.11–9.63%) (Xu et al., 2024).
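The NAR path search can be sketched as a dynamic program over the frame-jump DAG: per-frame token and duration log-probabilities are computed in parallel from the encoder alone, then a Viterbi-style pass picks the best sequence of jumps. Blank handling and beam search are omitted; scoring details here are simplifying assumptions.

```python
import numpy as np

# Viterbi-style NAR decoding sketch over the duration DAG: from frame t,
# duration d >= 1 jumps to frame t+d, so the graph is acyclic and a single
# forward DP followed by a backtrace yields the best path.
def nar_viterbi(tok_logp, dur_logp):
    T, D = dur_logp.shape
    best = np.full(T + 1, -np.inf)        # best path score reaching each frame
    back = np.zeros(T + 1, dtype=int)     # predecessor frame on the best path
    best[0] = 0.0
    for t in range(T):
        if best[t] == -np.inf:
            continue
        emit = tok_logp[t].max()          # greedy token choice at this frame
        for d in range(1, D):             # durations >= 1 keep the DAG acyclic
            nxt = min(t + d, T)
            score = best[t] + emit + dur_logp[t, d]
            if score > best[nxt]:
                best[nxt], back[nxt] = score, t
    # backtrace the visited frames and emit each frame's argmax token
    frames, t = [], T
    while t > 0:
        t = back[t]
        frames.append(t)
    return [int(tok_logp[f].argmax()) for f in reversed(frames)]
```

Since the per-frame logits are independent of decoding order, the expensive model evaluation parallelizes over all frames, with only the cheap DP left sequential.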
3. Semi-Autoregressive (SAR) Inference
- Computes an initial hypothesis using NAR decoding.
- Iteratively refines each token prediction in parallel sweeps, each sweep conditioning on prior predictions.
- One to two SAR passes close most of the accuracy gap between NAR and AR, with marginal additional latency (e.g., SAR-1: English WER 7.13%, latency 45 s vs. AR latency 89 s) (Xu et al., 2024).
- In multimodal HAINAN, this decoding corresponds to alternating between text and embedding spans as signaled by delimiters (Tong et al., 5 Feb 2026).
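The SAR sweep structure reduces to a short loop. Here `repredict` is a hypothetical stand-in for the model's per-position conditional prediction; the point of the sketch is only the parallel-sweep control flow.

```python
# SAR refinement sketch: start from a NAR hypothesis and run a fixed number of
# parallel sweeps, each re-predicting every position conditioned only on the
# *previous* sweep's tokens (the context is frozen within a sweep).
def sar_refine(init_tokens, repredict, num_sweeps=2):
    tokens = list(init_tokens)
    for _ in range(num_sweeps):
        # all positions are updated in parallel against the frozen context
        tokens = [repredict(i, tokens) for i in range(len(tokens))]
    return tokens
```

Each sweep costs roughly one parallel model pass, which is why one or two sweeps add only marginal latency over pure NAR decoding.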
5. Empirical Performance and Trade-Offs
HAINAN demonstrates robust improvements in accuracy and efficiency compared to predecessors across modalities.
Speech Recognition
- On large English and German corpora, AR-HAINAN outperforms RNN-T and TDT by ≈0.03–0.05 absolute WER, NAR-HAINAN outperforms CTC by ≈0.2 absolute WER, and SAR recovers most of the remaining gap to AR at ≈10–20% latency overhead over NAR (Xu et al., 2024).
- In some ablations, Viterbi-based NAR decoding matches or even surpasses AR decoding.
Multimodal Reasoning
- In SwimBird's implementation, hybrid autoregressive reasoning allows dynamic switching between textual and visual reasoning (or their interleaving), yielding state-of-the-art performance on both text-centric and vision-centric tasks (Tong et al., 5 Feb 2026).
Summary Table: Speech ASR (English) (Xu et al., 2024)
| Model | Mode | avg WER% | Time (s, Libri-other) |
|---|---|---|---|
| RNN-T | AR | 7.15 | 179 |
| TDT | AR | 7.13 | 88 |
| CTC | NAR | 7.38 | 39 |
| HAINAN | AR | 7.10 | 89 |
| HAINAN | NAR | 7.19 | 41 |
| HAINAN SAR-1 | SAR | 7.13 | 45 |
| HAINAN SAR-2 | SAR | 7.12 | 48 |
6. Comparison to Other Methods
HAINAN unifies and generalizes AR, NAR, and SAR models. In ASR, it subsumes RNN-T (AR) and CTC (NAR), providing comparable or better accuracy in all regimes. Compared to hybrid approaches that independently combine CTC and TDT heads, a single HAINAN model achieves higher accuracy in both modes and naturally supports SAR refinement (Xu et al., 2024). In the multimodal domain, HAINAN's hybrid AR formulation advances beyond fixed-pattern or chain-of-thought-only reasoning by enabling adaptive, context-driven mode selection (Tong et al., 5 Feb 2026).
A notable property in speech is the suppression of degenerate zero-duration predictions, which prevents pathological infinite-loop behavior during decoding; this failure mode is not addressed in standard CTC or hybrid TDT-CTC frameworks (Xu et al., 2024).
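The zero-duration safeguard amounts to masking one logit before the duration argmax. The function name and interface below are illustrative, not from the paper.

```python
import numpy as np

# Sketch of zero-duration suppression: the d = 0 logit is masked out before
# the duration argmax, so decoding always advances at least one frame and
# cannot loop indefinitely on a single frame.
def pick_duration(dur_logits, allow_zero=False):
    logits = np.asarray(dur_logits, dtype=float).copy()
    if not allow_zero:
        logits[0] = -np.inf  # forbid staying on the same frame
    return int(logits.argmax())
```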
7. Applications and Future Directions
HAINAN's modality-agnostic AR/NAR/SAR transduction supports efficient, accurate inference in demanding applications such as real-time speech recognition, on-device/mobile ASR, and multimodal reasoning in LLMs. The unification of AR and NAR modes facilitates deployment flexibility by allowing inference-time trade-offs between speed and accuracy. Emerging directions include extending the hybrid transducer to additional modalities (e.g., audio-visual, text-to-speech), optimizing SAR refinement depths, and refining dynamic reasoning-mode selection in complex multimodal tasks (Xu et al., 2024, Tong et al., 5 Feb 2026).