
Auxiliary Multi-Token Prediction

Updated 6 February 2026
  • Auxiliary multi-token prediction objectives are methodological augmentations to next-token prediction that improve sample efficiency, representation quality, and inference speed.
  • They employ various architectural designs—parallel heads, serial modules, and joint prediction—to forecast multiple future tokens with minimal overhead.
  • Empirical studies reveal enhanced generalization and throughput across modalities, making these objectives vital for large-scale sequence modeling.

Auxiliary multi-token prediction objectives comprise a family of methodological augmentations to the standard next-token prediction (NTP) paradigm in deep sequence models. These objectives require a model to make predictions about multiple future discrete tokens—rather than only the immediate successor—based on a given context or state. Auxiliary multi-token prediction has been deployed as a training signal in a variety of modalities (language, speech, code, vision) and architectures, with demonstrable benefits for sample efficiency, representational structure, downstream accuracy, and inference throughput. Architecturally, these objectives range from simple parallel heads sharing the base model trunk, to serial or joint-predictive modules, to additional control or routing mechanisms. Training regimes include pure auxiliary loss addition, teacher forcing, and permutation-invariant token assignments. This article provides a comprehensive technical synopsis and contextualization of auxiliary multi-token prediction objectives, with particular detail given to their formalization, architectures, empirical effectiveness, and targeted applications.

1. Formal Definition and Core Objective

Auxiliary multi-token prediction augments the canonical autoregressive cross-entropy objective by training the model to predict $n$ future tokens $\{x_{t+1}, \dots, x_{t+n}\}$ given a context $x_{1:t}$:

$$L_{n} = -\sum_{t} \sum_{i=1}^{n} \log P_\theta(x_{t+i} \mid x_{1:t})$$

This is most frequently instantiated as $n$ parallel cross-entropies, each corresponding to its own “head” projected from a shared model trunk, a paradigm central in “Better & Faster LLMs via Multi-token Prediction” (Gloeckle et al., 2024). Alternative variants model the joint probability of the entire block of tokens (e.g., via tensor decomposition (Basharin et al., 2024)) or use serial prediction modules that sequentially condition on previously predicted tokens, as in the Auxiliary-Token Serial Prediction module of CodeSep (Du et al., 19 Jan 2026).

Auxiliary multi-token prediction can be decomposed as:

  • Marginal (per-head, parallel): $n$ independent predictions, each requiring only that $x_{t+i}$ be predicted from $x_{1:t}$.
  • Serial or autoregressive: each prediction for offset $i$ conditions on the ground-truth (or previously predicted) intermediate tokens, e.g., $P(x_{t+2} \mid x_{1:t}, x_{t+1})$.
  • Joint (blockwise, low-rank, or mixture): direct modeling of the full joint $P(x_{t+1}, \dots, x_{t+n} \mid x_{1:t})$, often via low-rank mixture-of-experts decompositions (Basharin et al., 2024).
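The distinction between the marginal and serial decompositions can be made concrete with a toy numerical sketch (pure Python; the probability tables are illustrative assumptions, not values from any cited paper):

```python
import math

# Toy conditional distributions over a 2-token vocabulary {"a", "b"},
# used only to illustrate the factorizations (all values are made up).
p_next = {"a": 0.7, "b": 0.3}             # P(x_{t+1} | x_{1:t})
p_next2_marginal = {"a": 0.6, "b": 0.4}   # P(x_{t+2} | x_{1:t}) -- marginal head
p_next2_serial = {                        # P(x_{t+2} | x_{1:t}, x_{t+1}) -- serial head
    "a": {"a": 0.9, "b": 0.1},
    "b": {"a": 0.2, "b": 0.8},
}

# Marginal (parallel heads): each offset is scored independently of the others.
loss_marginal = -(math.log(p_next["a"]) + math.log(p_next2_marginal["a"]))

# Serial: the second prediction conditions on the ground-truth x_{t+1} = "a",
# so the two factors multiply to a proper joint P(x_{t+1}, x_{t+2} | x_{1:t}).
joint_serial = p_next["a"] * p_next2_serial["a"]["a"]
loss_serial = -math.log(joint_serial)
```

Here the serial factorization assigns the target pair a lower loss because the second factor exploits knowledge of $x_{t+1}$, which the marginal head cannot.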

The overall training loss usually combines the standard next-token loss with the auxiliary multi-token loss, often with a tradeoff coefficient:

$$L_{\text{total}} = L_{\text{NTP}} + \lambda\, L_{\text{MTP}}$$
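As a minimal sketch, the combined objective is just a weighted sum of per-offset cross-entropies on the ground-truth tokens (pure Python; the probabilities and the weight $\lambda = 0.3$ are illustrative assumptions):

```python
import math

def cross_entropy(p_correct):
    """Negative log-likelihood assigned to the ground-truth token."""
    return -math.log(p_correct)

# Probabilities the model assigns to the ground-truth token (made-up numbers).
ntp_probs = [0.8, 0.6, 0.9]       # next-token predictions at positions 1..3
mtp_probs = [[0.5, 0.4],          # auxiliary offsets t+2, t+3 at position 1
             [0.7, 0.3],          # ... at position 2
             [0.6, 0.5]]          # ... at position 3

l_ntp = sum(cross_entropy(p) for p in ntp_probs)
l_mtp = sum(cross_entropy(p) for row in mtp_probs for p in row)

lam = 0.3                         # auxiliary-loss weight (hyperparameter)
l_total = l_ntp + lam * l_mtp
```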

2. Architectural Realizations

Auxiliary multi-token prediction objectives give rise to a range of architectural augmentations:

  • Independent output heads: Each future offset receives its own one-layer (or deeper) projection (“Multi-token prediction via parallel heads” (Gloeckle et al., 2024, Zhang et al., 20 Jul 2025)). With $n$ future tokens predicted, there are $n$ additional small heads atop the shared transformer trunk.
  • Shared or parameter-efficient heads: For resource constraints, one can share or re-use the unembedding projections, applying only lightweight adapters (e.g., LoRA), instead of full new projection layers (Zhang et al., 20 Jul 2025).
  • Serial prediction modules: In CodeSep’s ATSP, a stack of sub-predictors serially predicts auxiliary tokens, such that the $n$th sub-predictor receives as input the embeddings of base tokens plus all previously predicted auxiliary tokens, producing the next token distribution (Du et al., 19 Jan 2026). This chain implements the conditional product $\prod_n P(\text{aux}_n \mid \text{base}, \text{aux}_{<n})$.
  • Low-rank MoE decompositions: Instead of $n$ independent heads, tensor-factorized heads approximate the joint conditional via a mixture of experts (Basharin et al., 2024).
  • Register-token interleaving: MuToR interleaves learnable register tokens into the sequence, each tasked with predicting a future token at a specified offset, with register predictions masked from regular token flows (Gerontopoulos et al., 15 May 2025).

Typical modifications do not increase trunk model depth. Instead, the added computation is (a) shallow compared to the trunk, and (b) limited to feed-forward or output-projection layers, leading to negligible parameter overhead and minimal throughput loss for small to moderate $n$ (Gloeckle et al., 2024).
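The parallel-head design can be sketched minimally: $n$ small output heads share one trunk representation, and only the heads are offset-specific (pure Python with toy dimensions; the weight values are arbitrary placeholders, and a real model would use a transformer trunk and learned projections):

```python
import math

def linear(weights, x):
    """Apply a small dense layer: one weight row per output logit."""
    return [sum(w_i * x_i for w_i, x_i in zip(row, x)) for row in weights]

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

# Shared trunk output for one position (in practice, the transformer's
# final hidden state; here a toy 3-dimensional vector).
hidden = [0.2, -0.1, 0.5]

# n = 2 auxiliary heads over a 4-token vocabulary, each its own projection.
heads = [
    [[0.1, 0.0, 0.3], [0.2, 0.1, -0.1], [0.0, 0.4, 0.2], [-0.3, 0.2, 0.1]],
    [[0.2, -0.2, 0.1], [0.0, 0.3, 0.0], [0.1, 0.1, 0.1], [0.3, 0.0, -0.2]],
]

# One trunk forward pass yields a distribution per future offset.
dists = [softmax(linear(w, hidden)) for w in heads]
```

The trunk cost is paid once; each extra head adds only one small matrix-vector product, which is why the parameter and throughput overhead stays small.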

3. Training Regimes and Losses

  • Teacher-forcing cross-entropy: For sequential modules such as CodeSep’s ATSP, training uses ground-truth future tokens at every prediction step, rather than the model’s own previous predictions, which stabilizes training and prevents error propagation (Du et al., 19 Jan 2026).
  • Permutation-invariant supervision: In multi-source problems (e.g., speech separation), permutation-invariant cross-entropy is used for source-agnostic objectives, though auxiliary-token modules anchored to individual entities typically do not require this (Du et al., 19 Jan 2026).
  • Register-based auxiliary loss: MuToR applies register-conditional cross-entropy only to register-token positions, summed with the primary NTP objective, with register tokens masked during inference (Gerontopoulos et al., 15 May 2025).
  • Weighted or discounted multi-token loss: Some models apply a discount factor for more distant tokens, emphasizing short-horizon prediction.
  • Representation bottlenecking: JTP introduces a small bottleneck module to prevent trivial copying when teacher-forced tokens are available; an attention-based mechanism combines backbone hidden state and ground-truth prefixes before the auxiliary prediction head (Ahn et al., 24 Mar 2025).

The cumulative loss is typically a sum or linear combination of these terms, with the auxiliary term’s weight tuned according to downstream performance and dataset/task properties (Gerontopoulos et al., 15 May 2025).
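Teacher forcing for a serial module can be sketched as follows: at training time each sub-predictor is conditioned on the ground-truth prefix of future tokens, never on its own earlier samples (pure Python; `predict_next` is a hypothetical stand-in for a learned sub-predictor, not an API from any cited work):

```python
# Hypothetical stand-in for a learned serial sub-predictor: it maps the
# context plus already-revealed future tokens to a "prediction". A real
# module would run a network; here we echo a deterministic function of
# the input sizes so the control flow is visible.
def predict_next(context, revealed_future):
    return f"pred<{len(context)}+{len(revealed_future)}>"

def serial_teacher_forced_steps(context, future_targets):
    """At step i, condition on the GROUND-TRUTH tokens future_targets[:i],
    never on the module's own earlier predictions (teacher forcing)."""
    steps = []
    for i, target in enumerate(future_targets):
        pred = predict_next(context, future_targets[:i])
        steps.append((pred, target))  # (prediction to score, label for the CE loss)
    return steps

steps = serial_teacher_forced_steps(["x1", "x2"], ["y1", "y2", "y3"])
```

Because every step sees correct intermediate tokens, a mistake at one offset cannot corrupt the supervision signal for later offsets, which is the stabilization effect noted above.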

4. Empirical Efficacy and Characterization

Extensive experiments substantiate several general trends regarding auxiliary multi-token prediction:

  • Improved generalization on generative and planning tasks: On HumanEval and MBPP code benchmarks, multi-token prediction yields +12–17% absolute gains over NTP-only models for large model scales (e.g., 13B) (Gloeckle et al., 2024). Visual-planning models with multi-token heads outperform baselines by 3–7% on task success rate for COIN/CrossTask (Zhang et al., 20 Jul 2025).
  • Enrichment of representations: Models trained with these objectives display smoother, more semantically coherent top-layer embeddings, with improved ability to encode global (non-local) future context. This is evidenced by improved classification accuracy on downstream tasks using frozen representations and by increased topic coherence in generation (Walker, 2024).
  • Sample efficiency and scaling: Gains are increasingly prominent for models at or above ~1B parameters and on large, generative datasets (Gloeckle et al., 2024, Zhang et al., 20 Jul 2025). Small models and non-generative tasks exhibit inconsistent or sometimes negative effects (Zuhri et al., 26 Aug 2025).
  • Inference throughput: Use of auxiliary multi-token prediction facilitates blockwise or speculative decoding, amortizing cost across up to $n$ tokens per forward pass and realizing up to 3–6× speedups, especially when paired with speculative verification (Gloeckle et al., 2024, Basharin et al., 2024, Samragh et al., 16 Jul 2025).
  • Inter-step dependency modeling: Auxiliary objectives enable the model to more reliably internalize longer-range dependencies and model induction, improving both algorithmic reasoning and structured output generation (Gloeckle et al., 2024, Ahn et al., 24 Mar 2025, Du et al., 19 Jan 2026).
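The throughput gain comes from speculative verification: the auxiliary heads draft a block of tokens, the base model scores them in one pass, and the longest agreeing prefix is accepted. A greedy-acceptance sketch in pure Python (the token strings and the `drafted`/`verified` split are illustrative; real systems use probabilistic acceptance rules):

```python
def accept_prefix(drafted, verified):
    """Accept the longest prefix where the draft agrees with the base
    model's own greedy choice; the first mismatch is replaced by the
    verified token (greedy speculative-decoding acceptance)."""
    out = []
    for d, v in zip(drafted, verified):
        if d == v:
            out.append(d)
        else:
            out.append(v)  # take the base model's token and stop
            break
    return out

# Draft block of n = 4 tokens vs. the base model's greedy tokens
# from a single verification forward pass.
drafted = ["the", "cat", "sat", "down"]
verified = ["the", "cat", "on", "the"]
accepted = accept_prefix(drafted, verified)
```

In this example three tokens are emitted from one verification pass instead of one, which is where the amortization over the block comes from.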

5. Practical Considerations and Hyperparameterization

Practical deployment of auxiliary multi-token objectives requires attention to the following:

| Hyperparameter / Architecture | Common Value / Design | Reference Example |
|---|---|---|
| Number of auxiliary heads $n$ | $n = 3$–$4$ | (Gloeckle et al., 2024, Zhang et al., 20 Jul 2025) |
| Serial vs. parallel heads | Both used | (Gloeckle et al., 2024, Du et al., 19 Jan 2026) |
| Register count / offset | 3–4 steps in language | (Gerontopoulos et al., 15 May 2025) |
| Loss weighting $a$ or $\lambda$ | 0.1–0.5 | (Gerontopoulos et al., 15 May 2025) |
| Use of teacher forcing | Yes, for serial heads | (Du et al., 19 Jan 2026, Ahn et al., 24 Mar 2025) |
| Architecture overhead | ≤10% for $n = 4$ | (Gloeckle et al., 2024, Gerontopoulos et al., 15 May 2025) |
  • Larger $n$ usually improves results up to a scale-dependent threshold; excessive $n$ can make the auxiliary task too difficult, with greater benefit on byte-level or longer-context data (Gloeckle et al., 2024).
  • Parameter-efficient variants (e.g., LoRA-based lightweight heads) are required for memory/resource-constrained settings in large multimodal models (Zhang et al., 20 Jul 2025).
  • Careful balancing of auxiliary and primary loss ($a$, $\lambda$) is necessary to avoid over-regularization or slow convergence, especially on domain-specific datasets (Gerontopoulos et al., 15 May 2025).
  • For models requiring accurate alignment (e.g., speech separation), abstaining from permutation-invariant losses on auxiliary-token modules is necessary to preserve tractability (Du et al., 19 Jan 2026).
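The weighting choices above can be made concrete with a short sketch combining a global auxiliary weight $\lambda$ with the per-offset discount mentioned in Section 3, so that distant offsets contribute less (all numeric values here are illustrative assumptions, not tuned settings from any cited paper):

```python
# Global auxiliary weight lam and per-offset discount gamma**(i-1)
# for offsets i = 1..4 (all values are made-up illustrations).
lam, gamma = 0.3, 0.8
l_ntp = 2.0                            # assumed primary next-token loss
per_offset_ce = [1.2, 1.5, 1.9, 2.4]   # assumed cross-entropy per offset i = 1..4

# enumerate index k = i - 1, so the weight is gamma**(i-1).
l_mtp = sum(gamma ** k * ce for k, ce in enumerate(per_offset_ce))
l_total = l_ntp + lam * l_mtp
```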

6. Limitations and Challenges

Not all instantiations of auxiliary multi-token prediction yield uniform improvements across tasks and model sizes:

  • Task and scale sensitivity: On standard NLP benchmarks, especially with small models (<1B), MTP objectives can underperform or even degrade next-token accuracy except for certain code/math or planning settings (Zuhri et al., 26 Aug 2025, Gloeckle et al., 2024).
  • Specialization for next-token: In models pretrained on pure next-token prediction, hidden representations become “early-specialized” for this objective, which limits the information available to auxiliary MTP heads grafted post-hoc. Joint pretraining from scratch with MTP objectives can partially remedy this bottleneck, but cannot fully overcome it (Mehra et al., 13 Feb 2025).
  • Hyperparameter sensitivity: The optimal number of heads, loss weight, and prediction horizon are all highly task, dataset, and size dependent. Excessive heads or loss weight may slow convergence or dilute signal (Gerontopoulos et al., 15 May 2025).
  • Compute tradeoffs: While inference is accelerated (via blockwise decoding), training requires more computation per update. For low $n$ (up to 4), this overhead is minor; for larger $n$, memory constraints may become significant (Gloeckle et al., 2024).
  • Complexity of latent dependencies: For sequence types with weak short-range dependency (e.g., certain classification settings), auxiliary MTP objectives may offer little or no benefit (Gloeckle et al., 2024, Zuhri et al., 26 Aug 2025).

7. Future Directions and Extensions

Current research indicates several promising lines of development:

  • Blockwise, speculative, and parallel decoding: Low-rank or blockwise multi-token objectives support parallel token generation, easing the sampling-throughput bottleneck in large-scale sequence generation (Basharin et al., 2024, Draxler et al., 24 Dec 2025, Samragh et al., 16 Jul 2025).
  • Serial prediction and representation bottlenecking: Architectures such as ATSP (Du et al., 19 Jan 2026) and JTP (Ahn et al., 24 Mar 2025) implement serially conditional modules or bottlenecks to explicitly capture the residual and belief-state information missed by shallow parallel heads.
  • Auxiliary planning and structure modeling: In structured tasks (visual planning (Zhang et al., 20 Jul 2025), algorithmic/logical reasoning (Ahn et al., 24 Mar 2025)), auxiliary multi-token objectives facilitate long-horizon credit assignment and strengthen emergent planning behavior.
  • Hybrid objectives and summary heads: Future-summary-prediction (Mahajan et al., 16 Oct 2025) and listwise token ranking (Zuhri et al., 26 Aug 2025) demonstrate that relaxing the auxiliary target (to bags-of-tokens, summary representations, or ordering rather than exact string matching) can yield more robust benefits, especially for long-horizon or complex generation tasks.
  • Permutation-invariant and multi-source extensions: For scenarios involving entity disentanglement or source separation (as in CodeSep), the integration of serial auxiliary prediction with permutation-invariant supervision enables scalable and interpretable sequence modeling with minimal output bandwidth (Du et al., 19 Jan 2026).
  • Compositional and uncertainty-driven register/token placement: Register-based schemes can be extended with smarter placements (uncertainty, predicted choice points) and adaptive horizon selection.

Auxiliary multi-token prediction has thus emerged as a foundational ingredient for improving sample efficiency, planning, and throughput in large-scale sequence models. State-of-the-art models leverage these objectives for reasoning, compositional generalization, and efficient deployment, and ongoing research is progressing toward further architectural and loss-structural innovation across modalities and domains (Du et al., 19 Jan 2026, Gloeckle et al., 2024, Gerontopoulos et al., 15 May 2025, Zhang et al., 20 Jul 2025, Mahajan et al., 16 Oct 2025).
