
Hybrid-Autoregressive Inference Transducers

Updated 10 February 2026
  • The paper introduces HAINAN models that unify AR, NAR, and SAR inference, providing flexible trade-offs between decoding speed and accuracy.
  • The models employ a joint network that fuses encoder and predictor outputs to generate tokens, durations, or visual embeddings for diverse applications.
  • Empirical results demonstrate improved WER and latency across speech and multimodal tasks, underscoring HAINAN’s efficiency and versatility.

Hybrid-Autoregressive INference TrANsducers (HAINAN) are a class of sequence modeling architectures designed to bridge the gap between fully autoregressive (AR) and non-autoregressive (NAR) generation, providing a principled mechanism for hybrid inference in applications such as speech recognition and multimodal LLMs. HAINAN models unify the strengths of AR and NAR paradigms, enabling flexible trade-offs between decoding accuracy and speed, and support additional semi-autoregressive (SAR) refinement. The paradigm has been instantiated in speech (HAINAN for ASR (Xu et al., 2024)) and multimodal reasoning (reasoning-switchable MLLMs in SwimBird (Tong et al., 5 Feb 2026)), where it has demonstrated state-of-the-art performance and practical versatility.

1. Core Modeling Principles

HAINAN models jointly represent and factorize the conditional probability of output sequences, employing separate but coupled mechanisms for distinct output modalities (discrete tokens, continuous embeddings, durations, etc.), and support both AR and NAR inference by explicit architectural and training design.

Speech Sequence Transduction

In speech applications, HAINAN extends the Token-and-Duration Transducer (TDT) model. The encoder produces acoustic frame embeddings. A predictor network, which may be an LSTM or a stateless last-token embedding, models label context. At each step, a joint network fuses encoder and predictor embeddings to predict both the next output token and an explicit duration (the number of frames to advance). The transducer models the joint distribution

$$P(y \mid x) = \sum_{\pi \to y} \prod_{k=1}^{K} \left[\, p_{\text{token}}(v_k \mid t_k, u_k) \cdot p_{\text{dur}}(n_k \mid t_k, u_k) \,\right]$$

where $\pi$ denotes an alignment of (token, duration) pairs mapping acoustic frames to output labels. Stochastic predictor masking during training ensures that the joint network can operate both with and without history context (Xu et al., 2024).
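As a concrete illustration of the factorization above, the following toy sketch scores a single (token, duration) alignment; $P(y \mid x)$ would sum such scores over all alignments compatible with $y$. The distributions here are hand-set for illustration (`alignment_prob`, `p_token`, and `p_dur` are hypothetical names, not the paper's trained model):

```python
# Toy sketch of the HAINAN/TDT alignment probability: an alignment is a
# sequence of (token, duration) pairs, and its probability is the product
# of p_token * p_dur at each decoding step.

def alignment_prob(path, p_token, p_dur):
    """Probability of one alignment: product over steps of
    p_token(v_k | state) * p_dur(n_k | state)."""
    prob = 1.0
    for state, (v, n) in enumerate(path):
        prob *= p_token[state][v] * p_dur[state][n]
    return prob

# Two decoding states, vocab {0: blank, 1: 'a'}, durations {0, 1, 2}.
p_token = [{0: 0.3, 1: 0.7}, {0: 0.6, 1: 0.4}]
p_dur   = [{0: 0.1, 1: 0.5, 2: 0.4}, {0: 0.1, 1: 0.6, 2: 0.3}]

# Emit 'a' and advance 1 frame, then emit blank and advance 2 frames.
path = [(1, 1), (0, 2)]
p = alignment_prob(path, p_token, p_dur)   # 0.7*0.5 * 0.6*0.3 = 0.063
```

Summing `alignment_prob` over all valid paths for a target sequence yields the transducer likelihood used in training.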

Multimodal Reasoning

For multimodal LLMs, as realized in SwimBird, HAINAN enables joint AR inference over discrete text tokens $w_{1:T}$ and continuous visual-thought embeddings $z_{1:K}$ given image $x$, by the factorization

$$p_\theta(w_{1:T}, z_{1:K} \mid x) = \prod_{t=1}^{T} p_\theta(w_t \mid w_{<t}, z_{<0}, x) \times \prod_{k=1}^{K} p_\theta(z_k \mid w_{1:T}, z_{<k}, x)$$

The discrete (text) branch uses a standard transformer LLM head; the continuous (visual) branch regresses the next embedding. Dynamic interleaving is achieved by special delimiters defining text-only, vision-only, or interleaved reasoning blocks (Tong et al., 5 Feb 2026).

2. Architectural Components

HAINAN models exhibit a modular structure conducive to both AR and NAR operation, with unified architectures across diverse modalities.

| Component | Speech HAINAN (Xu et al., 2024) | Multimodal HAINAN (SwimBird) (Tong et al., 5 Feb 2026) |
|---|---|---|
| Encoder | FastConformer stack over acoustic signal | Vision encoder (e.g., Qwen-ViT) |
| Predictor | LSTM (AR), stateless (NAR) | Transformer backbone for text/vision |
| Joint network | Combines encoder & predictor for token/duration | Fuses token & vision embeddings |
| Output heads | Token softmax, duration softmax | Token softmax, embedding regressor |

In speech, the joint network adds encoder and predictor representations, then predicts a token and explicit duration per step. In multimodal HAINAN, the backbone transformer, augmented with a multimodal projector, supports both token and embedding generation depending on context markers.
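A minimal sketch of the additive fusion described above, with made-up two-dimensional embeddings and weights (`joint`, `w_token`, and `w_dur` are illustrative names, not the paper's parameterization):

```python
# Sketch of an additive joint network: encoder and predictor embeddings
# are summed, then scored by separate token and duration heads.
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def joint(enc_vec, pred_vec, w_token, w_dur):
    """Fuse by addition, then score token and duration vocabularies."""
    h = [e + p for e, p in zip(enc_vec, pred_vec)]   # additive fusion
    token_logits = [sum(hi * wi for hi, wi in zip(h, row)) for row in w_token]
    dur_logits   = [sum(hi * wi for hi, wi in zip(h, row)) for row in w_dur]
    return softmax(token_logits), softmax(dur_logits)

# 2-dim embeddings, 3 tokens, 2 durations (all weights made up).
p_tok, p_dur = joint([1.0, 0.5], [0.2, -0.1],
                     w_token=[[1, 0], [0, 1], [0.5, 0.5]],
                     w_dur=[[1, 1], [-1, 1]])
```

In NAR mode the predictor vector is simply absent (or zeroed by masking), so the same heads score encoder-only representations.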

3. Training Objectives and Methods

HAINAN models are trained with unified loss formulations enabling hybrid inference.

  • Speech HAINAN: Optimizes negative log-likelihood over the sum of all valid (token, duration) alignments. Random predictor masking (with probability 0.5) ensures the model can decode with or without label context. No explicit scheduled sampling is used beyond this stochastic masking (Xu et al., 2024).
  • Multimodal HAINAN: Employs a hybrid loss,

$$\mathcal{L} = \lambda_{\text{text}} \left( -\sum_{t=1}^{T} \log p_\theta(w_t \mid w_{<t}, x) \right) + \lambda_{\text{vis}} \left( \sum_{k=1}^{K} \| \hat{z}_k - z_k \|_2^2 \right)$$

where $\mathcal{L}_{\text{text}}$ is cross-entropy over tokens and $\mathcal{L}_{\text{vis}}$ is MSE over visual embeddings, with weights $\lambda_{\text{text}}, \lambda_{\text{vis}}$ (e.g., $\lambda_{\text{vis}} = 0.2$) (Tong et al., 5 Feb 2026). This formulation supports samples containing only text, only embeddings, or both interleaved.
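A minimal numeric sketch of this loss on toy values ($\lambda_{\text{vis}} = 0.2$ follows the text; $\lambda_{\text{text}} = 1.0$ is an assumption):

```python
# Sketch of the hybrid objective: cross-entropy over text tokens plus
# MSE over visual-thought embeddings, combined with scalar weights.
import math

def hybrid_loss(token_probs, z_hat, z, lam_text=1.0, lam_vis=0.2):
    """lam_text * sum(-log p(w_t)) + lam_vis * sum ||z_hat - z||^2."""
    ce = -sum(math.log(p) for p in token_probs)          # text branch
    mse = sum(sum((a - b) ** 2 for a, b in zip(zh, zk))  # vision branch
              for zh, zk in zip(z_hat, z))
    return lam_text * ce + lam_vis * mse

loss = hybrid_loss(token_probs=[0.9, 0.8],
                   z_hat=[[0.5, 0.5]], z=[[0.0, 1.0]])
```

A text-only sample contributes only the first term, an embedding-only sample only the second, which is what makes mixed batches straightforward.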

4. Inference Paradigms

HAINAN supports three main inference paradigms, each with specific accuracy and latency trade-offs.

1. Autoregressive (AR) Inference

  • Uses full label-history context via the predictor (speech) or full text/visual context (multimodal).
  • Decoding proceeds token-by-token (or span-by-span) in sequential order.
  • Offers highest accuracy but highest latency and lowest parallelism.
  • In ASR, AR-HAINAN achieves best WERs (e.g., English: 7.10%; German: 4.75–9.22%) (Xu et al., 2024).
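The AR loop can be sketched as follows, with a stub in place of the real joint network (the `preds` table and `BLANK = 0` are assumptions for illustration); note the guard against zero durations, which the paper identifies as a source of infinite loops:

```python
# Sketch of AR greedy decoding with explicit durations: each step emits a
# token plus a duration, and the frame pointer advances by that duration,
# so the number of steps can be far smaller than the number of frames.

BLANK = 0

def ar_greedy_decode(num_frames, step_fn):
    """step_fn(frame, last_token) -> (token, duration)."""
    t, last, out = 0, BLANK, []
    while t < num_frames:
        tok, dur = step_fn(t, last)
        if tok != BLANK:
            out.append(tok)
            last = tok
        t += max(dur, 1)   # suppress zero durations to avoid infinite loops
    return out

# Stub predictions per frame: (token, duration).
preds = {0: (1, 2), 2: (BLANK, 1), 3: (2, 3)}
hyp = ar_greedy_decode(6, lambda t, last: preds[t])   # -> [1, 2]
```

Decoding is strictly sequential because each step conditions on the last emitted label, which is the source of both the accuracy and the latency of AR mode.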

2. Non-Autoregressive (NAR) Inference

  • Bypasses the predictor; only the encoder output is used for predictions.
  • Decoding can be performed fully in parallel over input frames.
  • Models transitions via a directed acyclic graph (DAG) and uses Viterbi decoding to find best output paths.
  • Offers low latency (parity with CTC), but typically with slight degradation in WER (e.g., English: 7.19%; German: 5.11–9.63%) (Xu et al., 2024).
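A simplified sketch of the NAR mode, substituting per-frame greedy argmax for the paper's DAG-based Viterbi search (the logits below are made up); the key property is that every frame is scored independently of the others:

```python
# Simplified NAR decoding sketch: each frame is scored from the encoder
# alone (no predictor), so all frames can be processed in parallel;
# repeats and blanks are then collapsed, CTC-style.

BLANK = 0

def nar_greedy_decode(frame_logits):
    """frame_logits: one score list per frame (encoder-only)."""
    # Parallel step: argmax per frame, with no cross-frame dependency.
    best = [max(range(len(l)), key=l.__getitem__) for l in frame_logits]
    # Collapse consecutive repeats, then drop blanks.
    out, prev = [], None
    for b in best:
        if b != prev and b != BLANK:
            out.append(b)
        prev = b
    return out

logits = [[0.1, 2.0, 0.3],   # frame 0 -> token 1
          [0.2, 1.8, 0.1],   # frame 1 -> token 1 (repeat)
          [3.0, 0.1, 0.2],   # frame 2 -> blank
          [0.1, 0.2, 2.5]]   # frame 3 -> token 2
hyp = nar_greedy_decode(logits)   # -> [1, 2]
```

The Viterbi search described above replaces the per-frame argmax with a best-path search over the (token, duration) DAG, at similar parallel cost.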

3. Semi-Autoregressive (SAR) Inference

  • Computes an initial hypothesis using NAR decoding.
  • Iteratively refines each token prediction in parallel sweeps, each sweep conditioning on prior predictions.
  • One to two SAR passes close most of the accuracy gap between NAR and AR, with marginal additional latency (e.g., SAR-1: English WER 7.13%, latency 45 s vs. AR latency 89 s) (Xu et al., 2024).
  • In multimodal HAINAN, this decoding corresponds to alternating between text and embedding spans as signaled by delimiters (Tong et al., 5 Feb 2026).
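The refinement loop can be sketched as follows (the `refine_fn` stub is hypothetical); each sweep re-predicts all positions in parallel, conditioning only on the previous sweep's hypothesis:

```python
# Sketch of SAR refinement: start from a NAR hypothesis, then run parallel
# sweeps in which every position is re-predicted given the prior sweep's
# tokens; one or two sweeps typically recover most of the AR accuracy.

def sar_refine(init_hyp, refine_fn, num_sweeps=1):
    """refine_fn(i, context) -> refined token for position i; each sweep
    re-predicts all positions given the *previous* hypothesis."""
    hyp = list(init_hyp)
    for _ in range(num_sweeps):
        hyp = [refine_fn(i, hyp) for i in range(len(hyp))]
    return hyp

# Stub refiner: correct position 1 to token 5 when its left neighbor is 1.
def refine(i, ctx):
    if i == 1 and ctx[0] == 1:
        return 5
    return ctx[i]

out = sar_refine([1, 2, 3], refine, num_sweeps=1)   # -> [1, 5, 3]
```

Because every position in a sweep reads only the frozen previous hypothesis, each sweep costs one parallel forward pass, which is why SAR latency sits close to NAR rather than AR.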

5. Empirical Performance and Trade-Offs

HAINAN demonstrates robust improvements in accuracy and efficiency compared to predecessors across modalities.

Speech Recognition

  • On large English and German corpora, AR-HAINAN outperforms RNN-T and TDT by ≈0.03–0.05 absolute WER, NAR-HAINAN outperforms CTC by ≈0.2 absolute WER, and SAR refinement further improves over NAR at only ≈10–20% additional latency (Xu et al., 2024).
  • Viterbi-based NAR decoding further matches or surpasses AR in some ablations.

The main speech recognition results are summarized below:

| Model | Mode | Avg. WER (%) | Time (s, Libri-other) |
|---|---|---|---|
| RNN-T | AR | 7.15 | 179 |
| TDT | AR | 7.13 | 88 |
| CTC | NAR | 7.38 | 39 |
| HAINAN | AR | 7.10 | 89 |
| HAINAN | NAR | 7.19 | 41 |
| HAINAN (SAR-1) | SAR | 7.13 | 45 |
| HAINAN (SAR-2) | SAR | 7.12 | 48 |

Multimodal Reasoning

In SwimBird, the same hybrid AR formulation supports reasoning-switchable multimodal inference, where it has demonstrated state-of-the-art performance across multimodal reasoning tasks (Tong et al., 5 Feb 2026).

6. Comparison to Other Methods

HAINAN unifies and generalizes AR, NAR, and SAR models. In ASR, it subsumes RNN-T (AR) and CTC (NAR), providing comparable or better accuracy in all regimes. Compared to hybrid approaches that independently combine CTC and TDT heads, a single HAINAN model achieves higher accuracy in both modes and naturally supports SAR refinement (Xu et al., 2024). In the multimodal domain, HAINAN's hybrid AR formulation advances beyond fixed-pattern or chain-of-thought-only reasoning by enabling adaptive, context-driven mode selection (Tong et al., 5 Feb 2026).

A notable property in speech is suppression of degenerate zero-duration predictions, preventing pathological infinite-loop behaviors, which is not addressed in standard CTC or hybrid-TDT-CTC frameworks (Xu et al., 2024).

7. Applications and Future Directions

HAINAN's modality-agnostic AR/NAR/SAR transduction supports efficient, accurate inference in demanding applications such as real-time speech recognition, on-device/mobile ASR, and multimodal reasoning in LLMs. The unification of AR and NAR modes facilitates deployment flexibility by allowing inference-time trade-offs between speed and accuracy. Emerging directions include extending the hybrid transducer to additional modalities (e.g., audio-visual, text-to-speech), optimizing SAR refinement depths, and refining dynamic reasoning-mode selection in complex multimodal tasks (Xu et al., 2024, Tong et al., 5 Feb 2026).
