Orthrus: Memory-Efficient Parallel Token Generation via Dual-View Diffusion

Published 12 May 2026 in cs.LG and cs.AI | (2605.12825v1)

Abstract: We introduce Orthrus, a simple and efficient dual-architecture framework that unifies the exact generation fidelity of autoregressive LLMs with the high-speed parallel token generation of diffusion models. The sequential nature of standard autoregressive decoding represents a fundamental bottleneck for high-throughput inference. While diffusion LLMs attempt to break this barrier via parallel generation, they suffer from significant performance degradation, high training costs, and a lack of rigorous convergence guarantees. Orthrus resolves this dichotomy natively. Designed to seamlessly integrate into existing Transformers, the framework augments a frozen LLM with a lightweight, trainable module to create a parallel diffusion view alongside the standard autoregressive view. In this unified system, both views attend to the exact same high-fidelity Key-Value (KV) cache; the autoregressive head executes context pre-filling to construct accurate KV representations, while the diffusion head executes parallel generation. By employing an exact consensus mechanism between the two views, Orthrus guarantees lossless inference, delivering up to a 7.8x speedup with only an O(1) memory cache overhead and minimal parameter additions.


Authors (6)

Summary

The paper introduces Orthrus, unifying AR and diffusion pathways in a single transformer to enable memory-efficient, parallel token generation with exact fidelity.
The approach achieves up to a 7.8x inference speedup through dual-view consensus, adds only constant O(1) KV cache memory overhead, and matches the accuracy of the AR model.
Only the diffusion head is fine-tuned (about 16% of total parameters), and the method supports plug-and-play integration with a frozen LLM, which suits high-throughput, resource-constrained LLM deployments.

Motivation and Problem Setting

Efficient inference in LLMs is constrained by the inherently sequential nature of standard autoregressive (AR) decoding, which limits throughput and underutilizes modern hardware. Diffusion LLMs (DLMs) enable highly parallel token generation but suffer from significant losses in quality, increased training costs, and model convergence issues compared to AR models. Attempts to adapt pre-trained AR models with diffusion architectures have not succeeded in matching the fidelity of the original distributions and remain computationally intensive.

Orthrus Architecture

Orthrus directly addresses the speed–fidelity dichotomy by structurally unifying both AR and diffusion paradigms inside a single transformer backbone. The architecture consists of a dual-view design comprising:

A frozen AR head, responsible for building the contextual representations (pre-filling) and for verifying proposed tokens.
A lightweight, trainable diffusion head, injected in parallel to each transformer block, used for high-speed blockwise parallel generation.

Both heads attend to a shared, high-fidelity Key-Value (KV) cache. During generation, the diffusion head leverages this cache to propose K tokens in parallel, while the AR head is used for context pre-filling and for strict verification of the diffusion proposals.

A key architectural principle is that only the diffusion head is trained (∼16% of total parameters), leaving the AR backbone unchanged. This enables plug-and-play integration with any high-quality frozen LLM.

Consensus Mechanism and Exact Inference

Orthrus implements an intra-model consensus: during inference, the diffusion head proposes a block of tokens, and the AR head greedily verifies each position left to right against its own predictive distribution. A proposed token is accepted only if it matches the token the AR head itself would have produced at that position. The process then repeats from the next anchor token. Because every emitted token agrees with the base AR model, the output follows the model's exact causal likelihood and avoids conditional drift. This makes inference lossless: the output distribution is strictly aligned with that of the underlying AR LLM.
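The greedy consensus loop described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's implementation: `toy_ar_next` and `toy_draft` are hypothetical stand-ins for the AR head and the diffusion head, verification is written as sequential calls for clarity (a real system would check the whole block in one forward pass), and emitting the AR head's own token at the first mismatch is an assumption borrowed from standard draft-and-verify decoding.

```python
from typing import Callable, List, Sequence, Tuple

Token = int


def verify_block(ar_next: Callable[[Sequence[Token]], Token],
                 context: List[Token],
                 proposal: Sequence[Token]) -> List[Token]:
    """Accept the longest prefix of `proposal` that matches the AR head's greedy choice."""
    accepted: List[Token] = []
    for tok in proposal:
        target = ar_next(context + accepted)
        if tok != target:
            # Assumption: the AR head's own token replaces the first mismatch,
            # so every pass makes progress by at least one token.
            accepted.append(target)
            return accepted
        accepted.append(tok)
    return accepted


def decode(ar_next, draft, prompt: Sequence[Token],
           max_new: int, k: int) -> Tuple[List[Token], int]:
    """Propose-and-verify decoding; returns the sequence and the number of passes."""
    out = list(prompt)
    passes = 0
    while len(out) - len(prompt) < max_new:
        out.extend(verify_block(ar_next, out, draft(out, k)))
        passes += 1
    return out[:len(prompt) + max_new], passes


def toy_ar_next(context: Sequence[Token]) -> Token:
    """Hypothetical deterministic 'AR head' over a vocabulary of 11 tokens."""
    return (3 * context[-1] + len(context)) % 11


def toy_draft(context: Sequence[Token], k: int) -> List[Token]:
    """Hypothetical 'diffusion head': correct except for the 4th token of each block."""
    ctx, block = list(context), []
    for i in range(k):
        tok = toy_ar_next(ctx)
        if i == 3:
            tok = (tok + 1) % 11  # deliberate error to exercise rejection
        block.append(tok)
        ctx.append(tok)
    return block


sequence, num_passes = decode(toy_ar_next, toy_draft, prompt=[1], max_new=12, k=8)
```

In this toy setup each pass keeps three proposed tokens plus one corrected token, so 12 new tokens take 3 passes, and the result is identical to plain greedy decoding with `toy_ar_next` alone.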

For non-greedy sampling (temperature $T > 0$), the framework supports exact rejection sampling to preserve losslessness.
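The summary does not spell out the rejection rule, so the sketch below assumes the standard speculative-sampling rule: accept a proposed token x with probability min(1, p(x)/q(x)), and otherwise resample from the normalized residual max(0, p - q). Here p stands for the AR distribution and q for the proposal distribution. The function names are illustrative. Exact rational arithmetic is used so the check that the combined output distribution equals p holds without rounding error.

```python
from fractions import Fraction
from typing import List, Sequence


def accept_prob(p: Sequence[Fraction], q: Sequence[Fraction], x: int) -> Fraction:
    """Probability of keeping proposed token x (assumed speculative-sampling rule)."""
    if q[x] == 0:
        return Fraction(0)  # x is never proposed, so the value is irrelevant
    return min(Fraction(1), p[x] / q[x])


def residual(p: Sequence[Fraction], q: Sequence[Fraction]) -> List[Fraction]:
    """Distribution to resample from after a rejection: normalized max(0, p - q)."""
    raw = [max(Fraction(0), pi - qi) for pi, qi in zip(p, q)]
    total = sum(raw)
    if total == 0:
        return list(p)  # p == q, so rejection never happens
    return [r / total for r in raw]


def output_distribution(p: Sequence[Fraction], q: Sequence[Fraction]) -> List[Fraction]:
    """Marginal distribution of the emitted token under accept-or-resample."""
    kept = [q[x] * accept_prob(p, q, x) for x in range(len(p))]
    reject_mass = 1 - sum(kept)
    res = residual(p, q)
    return [k + reject_mass * r for k, r in zip(kept, res)]


# Hypothetical three-token example.
p_ar = [Fraction(1, 2), Fraction(1, 3), Fraction(1, 6)]
q_draft = [Fraction(1, 4), Fraction(1, 4), Fraction(1, 2)]
emitted = output_distribution(p_ar, q_draft)
```

For these values `emitted` equals `p_ar` exactly, which is the sense in which rejection sampling keeps the output distribution lossless.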

Training Methodology

The diffusion head is optimized via KL distillation against the frozen AR distribution. The training protocol randomly samples anchor positions in training sequences, forms masked blocks, and requires the diffusion view to predict the masked tokens conditioned only on the visible anchors and the shared AR-generated context, using specialized block-wise attention and masking (FlexAttention). This soft distillation trains the diffusion view to match the AR head's full predictive distribution at each masked position rather than only its most likely token.
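As a rough illustration of the distillation objective, the sketch below computes a mean KL divergence between teacher (AR) and student (diffusion) distributions over a set of masked positions. The direction KL(teacher || student), the per-position averaging, and all function names are assumptions made for this example; the summary states only that KL distillation against the frozen AR distribution is used.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float], temperature: float = 1.0) -> List[float]:
    """Numerically stable softmax over one position's logits."""
    peak = max(logits)
    exps = [math.exp((x - peak) / temperature) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """KL(p || q) for two discrete distributions with q > 0 wherever p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def distillation_loss(teacher_logits: Sequence[Sequence[float]],
                      student_logits: Sequence[Sequence[float]]) -> float:
    """Mean KL(teacher || student) over masked positions (assumed reduction)."""
    losses = [
        kl_divergence(softmax(t), softmax(s))
        for t, s in zip(teacher_logits, student_logits)
    ]
    return sum(losses) / len(losses)


# Hypothetical logits for two masked positions over a 3-token vocabulary.
teacher = [[2.0, 0.5, -1.0], [0.1, 0.2, 3.0]]
student = [[1.5, 0.7, -0.5], [0.0, 0.0, 2.0]]
loss = distillation_loss(teacher, student)
```

The loss is zero only when the student reproduces the teacher's distribution at every position, which is a stricter target than agreeing on the top-1 token.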

Empirical Results

Inference Speed and Throughput

Orthrus substantially accelerates inference. On reasoning and algorithmic tasks, it achieves up to a 7.8x speedup (measured as effective tokens per forward pass, TPF) at the 8B model scale. The parallel block size can be tuned to maximize throughput with no increase in latency or redundant memory usage. For instance, with $K=32$, Orthrus reaches 6.35 TPF on MATH-500 without a latency penalty.
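To make the TPF metric concrete, the following sketch computes it from a per-pass log of emitted tokens. The log values are invented for illustration and are not measurements from the paper; the function name is likewise hypothetical.

```python
from typing import Sequence


def tokens_per_forward(tokens_emitted_per_pass: Sequence[int]) -> float:
    """Effective tokens per forward pass: total emitted tokens / number of passes."""
    if not tokens_emitted_per_pass:
        raise ValueError("need at least one forward pass")
    return sum(tokens_emitted_per_pass) / len(tokens_emitted_per_pass)


# Hypothetical log: tokens emitted in each of five verification passes.
parallel_tpf = tokens_per_forward([8, 5, 7, 6, 6])

# Standard AR decoding emits exactly one token per pass.
baseline_tpf = tokens_per_forward([1] * 32)
```

Under this definition plain AR decoding has a TPF of 1, so a TPF above 1 is the number of sequential forward passes saved per generated token.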

Quality and Fidelity Benchmarking

Unlike pure diffusion models or adaptation strategies, Orthrus exhibits no loss in generation quality or accuracy compared to the AR baseline. It matches the zero-shot accuracy of the underlying AR model (e.g., Qwen3-8B) across diverse reasoning (GSM8K, MATH-500, AIME) and structural code generation tasks. Competing diffusion models and blockwise adaptation methods (such as Fast-dLLM-v2) show accuracy drops of up to 11 points on challenging benchmarks.

Parameter and Memory Efficiency

The architectural integration requires only minimal fine-tuning (about 16% of parameters, fewer than 1B training tokens, and under 24 hours on 8×H200), with a strictly constant $O(1)$ KV cache memory overhead regardless of sequence length. This contrasts strongly with speculative decoding systems and adapted DLMs, which introduce redundant caches or significant VRAM penalties.
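The difference between a constant overhead and a redundant cache can be illustrated with the usual KV cache size formula (keys plus values, per layer, per KV head, per position). Everything specific below is an assumption for the example: the model dimensions are hypothetical, the external drafter is modeled as a single extra layer with its own cache, and the constant overhead is modeled as one block of `block_size` positions. The summary states only that the overhead is $O(1)$ in sequence length.

```python
def kv_cache_bytes(positions: int, layers: int, kv_heads: int,
                   head_dim: int, bytes_per_value: int = 2) -> int:
    """Bytes needed to cache keys and values for `positions` tokens."""
    return 2 * layers * kv_heads * head_dim * positions * bytes_per_value


# Hypothetical model dimensions, chosen only for illustration.
LAYERS, KV_HEADS, HEAD_DIM = 32, 8, 128


def drafter_cache_overhead(seq_len: int, drafter_layers: int = 1) -> int:
    """Extra cache for an external drafter that keeps its own KV entries."""
    return kv_cache_bytes(seq_len, drafter_layers, KV_HEADS, HEAD_DIM)


def constant_block_overhead(seq_len: int, block_size: int = 32) -> int:
    """Extra cache modeled as one block of positions; seq_len is deliberately unused."""
    return kv_cache_bytes(block_size, LAYERS, KV_HEADS, HEAD_DIM)


short_ctx, long_ctx = 1_000, 100_000
drafter_growth = drafter_cache_overhead(long_ctx) / drafter_cache_overhead(short_ctx)
block_growth = constant_block_overhead(long_ctx) / constant_block_overhead(short_ctx)
```

Going from 1,000 to 100,000 tokens of context multiplies the drafter's cache by 100 in this model, while the block-sized overhead does not change.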

Comparison with Speculative Decoding

Orthrus structurally obviates the need for an external drafter model, instead using the parallel diffusion head in situ. This leads to much higher average acceptance lengths (i.e., verified tokens per forward pass): on MATH-500, Orthrus yields 11.7 accepted tokens versus 7.9 for DFlash and 3.5 for EAGLE-3, with zero redundant cache burden.

Ablation Studies

Empirical ablations confirm that:

The throughput gain scales directly with the parallel block size ($K$), saturating at $K=32$.
A single-step projection is optimal; multi-step denoising degrades throughput without any accuracy benefit.
Soft KL distillation is essential for high TPF; cross-entropy (hard labels) reduces effective throughput by 8%.

Limitations and Practical Implications

Orthrus's generative power is strictly upper-bounded by the frozen AR foundation model. The method accelerates inference throughput but does not improve base model quality or capabilities, and any biases or shortcomings in the AR model are inherited unchanged. However, Orthrus enables high-throughput deployment of strong LLMs without sacrificing output quality, making previously memory- or latency-bound applications feasible even on resource-constrained hardware.

The framework's symmetry and plug-and-play fine-tuning allow rapid adaptation to any modern AR LLM without extensive retraining, positioning it as a practical upgrade path for production systems needing both high quality and high inference throughput.

Future Directions

Open research directions include extending the dual-view consensus to support controlled output diversity beyond greedy or rejection sampling, scaling to even larger context lengths, and investigating generalization under domain shift (since only the diffusion head is retrained). The method also suggests new architectural paradigms wherein parallel generation modules are added and distilled onto strong frozen bases in other generative domains (e.g., vision or speech), not just language.

Conclusion

Orthrus structurally resolves the speed–fidelity trade-off in sequence generation by unifying AR and diffusion pathways inside a single transformer, enabling parallel token emission with mathematically guaranteed fidelity. The framework sets a new efficiency-quality Pareto frontier for high-throughput LLM inference while maintaining extreme parameter and memory efficiency, offering a direct migration path for existing models and opening new avenues in inference-time acceleration research.
