Leap Multi-Token Prediction (L-MTP)

Updated 3 December 2025
  • L-MTP is a family of sequence modeling techniques that predicts non-adjacent tokens, broadening contextual coverage and mitigating autoregressive bottlenecks.
  • It integrates leap heads, backward-filling decoders, and gated LoRA adapters to efficiently generate tokens and reduce error accumulation.
  • Empirical results demonstrate significant speed-ups and improved performance on benchmarks in code, math, and speech, highlighting its practical advantages.

Leap Multi-Token Prediction (L-MTP) is a family of sequence modeling techniques that generalize next-token prediction and conventional multi-token prediction (MTP) by enabling models to predict multiple, potentially non-adjacent future tokens in each forward pass. L-MTP combines architectural modifications, alternate supervision objectives, and specialized decoding strategies to overcome the inherent sequential bottlenecks and short-horizon myopia of autoregressive LLMs. By strategically "leaping" over intermediate positions during both training and inference, L-MTP methods promote broader contextual coverage, accelerate generation, and can enhance long-range reasoning, algorithmic generalization, and creative planning.

1. Principle and Taxonomy

L-MTP extends the blockwise multi-token prediction paradigm by allowing each prediction head to target a distinct, non-adjacent token offset from the input context: for a shared trunk encoding $z_t$ of input $x_{1..t}$, the $i$-th head outputs $p(x_{t+k(i-1)+1} \mid z_t)$, where $k$ is the "leap stride" (Liu et al., 23 May 2025). Standard MTP corresponds to the special case $k=1$. Compare next-token prediction (NTP), which is strictly sequential (one token per pass), and blockwise MTP, which generates $n$ adjacent tokens per step. L-MTP generalizes this to $n$ non-sequential positions, widening context exposure.

Table: Token targets for various paradigms

| Method | Token positions per step | Typical stride ($k$) |
|--------|--------------------------|----------------------|
| NTP    | $t+1$ | $1$ |
| MTP    | $t+1, t+2, \ldots, t+n$ | $1$ |
| L-MTP  | $t+1, t+1+k, \ldots, t+1+k(n-1)$ | $k \geq 1$ |
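The position arithmetic is easy to make concrete. The helper below is a minimal sketch (its name is illustrative, not from the cited papers) that enumerates the positions one forward pass targets under each paradigm:

```python
# Hypothetical helper: positions predicted in one forward pass from
# context position t, per the leap formula t + 1 + k*(i - 1).
def target_positions(t: int, n: int, k: int) -> list[int]:
    return [t + 1 + k * (i - 1) for i in range(1, n + 1)]

assert target_positions(t=10, n=1, k=1) == [11]              # NTP
assert target_positions(t=10, n=4, k=1) == [11, 12, 13, 14]  # blockwise MTP
assert target_positions(t=10, n=4, k=2) == [11, 13, 15, 17]  # L-MTP, stride 2
```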

This leap-based mechanism offers both theoretical and empirical advantages for speeding up inference, uncovering longer-range dependencies, and mitigating error accumulation (Liu et al., 23 May 2025, Samragh et al., 16 Jul 2025, Gloeckle et al., 2024).

2. Architectural Variants

Core L-MTP instantiations employ a shared Transformer block ("trunk") with multiple output heads (a minimal sketch follows the list):

  • Leap heads: Each head is configured to predict a non-adjacent token offset, determined by the stride $k$.
  • Backward-filling cache: Non-leap positions are filled from cached previous predictions, using a backward-filling schedule (Liu et al., 23 May 2025).
  • Masked token framing (for speculative decoding): Introduce $k$ learned mask embeddings appended to each sequence, enabling autoregressive LLMs to predict several future tokens in parallel (Samragh et al., 16 Jul 2025).
  • Gated low-rank adapters (LoRA): During fine-tuning, gated LoRA adapters activate for masked ("future") positions only, preserving the backbone's NTP pathway (Samragh et al., 16 Jul 2025).
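A minimal PyTorch sketch of this layout, assuming a generic trunk module (the class and argument names are illustrative, not from the cited implementations):

```python
import torch
import torch.nn as nn

class LeapHeadModel(nn.Module):
    """Illustrative trunk-plus-leap-heads layout (not the reference code)."""

    def __init__(self, trunk: nn.Module, d_model: int, vocab_size: int,
                 n_heads: int = 4, stride: int = 2):
        super().__init__()
        self.trunk = trunk      # shared Transformer stack
        self.stride = stride    # leap stride k
        # Head i (0-indexed) predicts the token at position t + 1 + stride*i.
        self.heads = nn.ModuleList(
            nn.Linear(d_model, vocab_size) for _ in range(n_heads)
        )

    def forward(self, x: torch.Tensor) -> list[torch.Tensor]:
        z = self.trunk(x)                        # (batch, seq, d_model)
        return [head(z) for head in self.heads]  # one logit tensor per offset
```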

In speech modeling, MTP (and by extension L-MTP) employs stacks of causal Transformer layers, each with its own hidden state and output head, preserving temporal dependencies across predicted tokens (Wang et al., 5 Apr 2025).

3. Objective Functions and Training Schedules

L-MTP generally adopts a multi-token cross-entropy loss targeting leap positions:

$$\mathcal{L}_{\text{L-MTP}} = -\sum_{t} \sum_{i=1}^{n} \log p_\theta\left(x_{t+k(i-1)+1} \mid z_t\right)$$
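A minimal PyTorch sketch of this objective (the function name is illustrative; it assumes each head produces logits over the same token grid):

```python
import torch
import torch.nn.functional as F

def l_mtp_loss(logits: list[torch.Tensor], tokens: torch.Tensor, k: int) -> torch.Tensor:
    """Leap cross-entropy over n heads (illustrative sketch).

    logits[i]: (batch, seq, vocab) from head i (0-indexed), supervised by
    the token k*i + 1 positions ahead of each context position t.
    """
    total = torch.zeros(())
    for i, head_logits in enumerate(logits):
        offset = k * i + 1                  # t + 1 + k*(i-1) with 1-indexed heads
        pred = head_logits[:, :-offset, :]  # positions that have a valid target
        gold = tokens[:, offset:]           # shifted gold tokens
        total = total + F.cross_entropy(
            pred.reshape(-1, pred.size(-1)), gold.reshape(-1)
        )
    return total
```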

Recent implementations use a two-stage training schedule (a freeze/unfreeze sketch follows the list):

  1. Head warm-up: Freeze backbone, train new leap heads via self-distillation.
  2. Full model tuning: Jointly update backbone and heads, balancing next-token and multi-leap losses via a hyperparameter $\beta$ (Liu et al., 23 May 2025).
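A toy sketch of the freeze/unfreeze mechanics in PyTorch (module sizes, learning rates, and the $\beta$ value here are assumptions, not values from the paper):

```python
import torch
import torch.nn as nn

# Toy stand-ins for the backbone and leap heads (sizes are arbitrary).
trunk = nn.Linear(16, 16)
heads = nn.ModuleList(nn.Linear(16, 100) for _ in range(4))

# Stage 1 -- head warm-up: freeze the backbone, train only the new heads
# against self-distilled targets.
for p in trunk.parameters():
    p.requires_grad = False
warmup_opt = torch.optim.AdamW(heads.parameters(), lr=1e-4)

# Stage 2 -- full tuning: unfreeze everything; the per-batch objective
# balances NTP and leap losses, e.g. total = ntp_loss + beta * leap_loss.
for p in trunk.parameters():
    p.requires_grad = True
full_opt = torch.optim.AdamW(
    list(trunk.parameters()) + list(heads.parameters()), lr=1e-5
)
beta = 0.3  # assumed value; the paper treats this as a tuned hyperparameter
```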

Mask-based L-MTP may also use auxiliary objectives (see the weighting sketch after this list):

  • Latent Consistency Matching (LCM): Enforces alignment between leap-token hidden states and autoregressive references.
  • Loss weighting: Decay the loss contribution for longer leap distances (e.g., exponential decay $\alpha^{k-1}$ for the $k$-th head) (Wang et al., 5 Apr 2025).
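For instance, the decayed weights could be computed as follows ($\alpha = 0.8$ is an assumed value, not taken from the paper):

```python
# Distance-decayed loss weights for n heads; the k-th head's weight
# is alpha^(k-1), so farther leap targets contribute less loss.
alpha, n_heads = 0.8, 4
weights = [alpha ** (k - 1) for k in range(1, n_heads + 1)]
print(weights)  # approximately [1.0, 0.8, 0.64, 0.512]
```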

Alternatives such as teacherless multi-token training (a global objective over non-autoregressive prefixes) and discrete diffusion (reverse denoising over the entire output sequence) amplify diversity and planning capabilities (Nagarajan et al., 21 Apr 2025).

4. Decoding Strategies and Inference Acceleration

L-MTP unlocks throughput gains by widening the prediction horizon (a verification sketch follows the list):

  • Backward-filling decoding: After generating leap tokens, gaps are backfilled from previous cache states, reducing redundant passes (Liu et al., 23 May 2025).
  • Speculative blockwise decoding: The model outputs $k$ candidate tokens; each speculative token is verified via autoregressive comparison. Advanced variants deploy quadratic decoding trees to maintain high acceptance rates (Samragh et al., 16 Jul 2025).
  • Streaming chunked attention: In speech settings, each MTP/L-MTP pass attends only to a bounded history window, enabling real-time generation (Wang et al., 5 Apr 2025).
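A minimal sketch of speculative verification under greedy decoding (the `verifier_next` callable is an assumption; production systems compare full token distributions and manage KV caches):

```python
def verify(draft: list[int], verifier_next) -> list[int]:
    """Accept the longest drafted prefix the verifier agrees with.

    verifier_next(prefix) is assumed to return the verifier model's
    greedy next token after `prefix`.
    """
    accepted: list[int] = []
    for tok in draft:
        expected = verifier_next(accepted)
        if tok != expected:
            accepted.append(expected)  # take the verifier's token, then stop
            break
        accepted.append(tok)
    return accepted

# With a trivial verifier that always predicts 0, the third draft token
# is rejected and replaced:
assert verify([0, 0, 7], lambda prefix: 0) == [0, 0, 0]
```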

Theoretical analysis shows that with proper attenuation and consistency assumptions, L-MTP's accepted token length per iteration $\mathbb{E}[L]_{\text{leap}}$ dominates that of vanilla MTP as $n$ grows (explicit theorem in Liu et al., 23 May 2025).

Empirical speed-up scales with the stride and the number of heads: for instance, $n=4$, $k=2$ yields 7 tokens per step, since the heads emit positions $t+1, t+3, t+5, t+7$ and the three gaps are backfilled, covering $k(n-1)+1 = 7$ positions (a 75% speed-up over NTP and 40% over MTP) (Liu et al., 23 May 2025); $k=8$ mask heads provide up to $5\times$ autoregressive throughput (Samragh et al., 16 Jul 2025).
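The coverage arithmetic checks out in a couple of lines:

```python
# Per-step coverage with backward filling: the farthest head targets
# t + 1 + k*(n-1), and every gap before it is backfilled from cache.
def tokens_per_step(n: int, k: int) -> int:
    return k * (n - 1) + 1

assert tokens_per_step(n=4, k=2) == 7  # matches the 7 tokens/step above
```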

5. Empirical Performance and Analysis

L-MTP consistently matches or outperforms NTP and adjacent-block MTP baselines on code, mathematics, general knowledge, and speech benchmarks (Liu et al., 23 May 2025, Samragh et al., 16 Jul 2025, Gloeckle et al., 2024, Wang et al., 5 Apr 2025). Representative results:

| Method | Benchmark | Accuracy / Pass@1 (Llama 3.2-3B) |
|--------|-----------|----------------------------------|
| NTP    | GSM8K     | 3.71 |
| MTP    | GSM8K     | 3.87 |
| L-MTP  | GSM8K     | 5.91 |

6. Practical Considerations and Limitations

  • Model size and overhead: Adding leap/extra heads increases size slightly, requiring careful parameter balancing.
  • Stride/head selection ($k$, $n$): A wider leap improves speed, but excessive stride attenuates prediction confidence. Empirically, $k=2$, $n=4$ balances accuracy and signal quality (Liu et al., 23 May 2025).
  • Acceptance rate vs. horizon: Speculative acceptance drops for large $k$ or unpredictable text; the method falls back to AR calls in creative settings (Samragh et al., 16 Jul 2025).
  • Training requirements: Head warm-up and sufficient self-distilled data are necessary for leap heads to function reliably.
  • Streaming or real-time generation: Chunked attention masks support latency-sensitive tasks (Wang et al., 5 Apr 2025).
  • Applicability: Benefits scale with model capacity and dataset size; marginal on pure multiple-choice benchmarks.

7. Extensions, Future Directions, and Context

Research highlights several avenues for advancing L-MTP:

  • Adaptive stride/head scheduling: Dynamically choose leap parameters based on prediction confidence or entropy (Liu et al., 23 May 2025).
  • Integration with RLHF: Align leap decisions with end-task rewards.
  • Compression: Combine with quantization/pruning/mixture-of-experts to mitigate head overhead.
  • Pretraining with leap objectives: Models trained from scratch may internalize broader planning (Samragh et al., 16 Jul 2025).
  • Seed-conditioning for creative diversity: Hash-conditioned input (input noise) enables coherent planning and high diversity, surpassing temperature sampling (Nagarajan et al., 21 Apr 2025).
  • Non-autoregressive and diffusion hybrids: Global sequence or iterative denoising further break the AR bottleneck (Nagarajan et al., 21 Apr 2025).

L-MTP is situated among a broader ecosystem of alternatives to next-token prediction, including plan-then-generate, latent reasoning, continuous generation, and non-Transformer architectures (Wyatt et al., 29 Sep 2025). Collectively, these trends reflect an emerging consensus that sequential, strictly local generation is insufficient for scalable, efficient, and creative language modeling.

