
Dynamic Token Routing in Neural Models

Updated 19 January 2026
  • Dynamic token routing is a method that adaptively assigns computational resources to individual tokens based on their contextual complexity.
  • It leverages techniques such as sequence-global expert selection, token-level gating, and hybrid compute paths, yielding notable accuracy and efficiency gains.
  • This approach underpins modern architectures in NLP and vision, enabling improved memory utilization, reduced compute load, and enhanced scalability.

Dynamic token routing is a suite of architectural and algorithmic innovations that enable neural models—especially Transformers and Mixture-of-Experts (MoE) architectures—to allocate computation adaptively at token granularity. Rather than statically applying the same compute operations to every token in a sequence, dynamic routing mechanisms learn to distribute attention, expert capacity, or computational depth based on the contextual difficulty or utility of each token. This results in substantially improved efficiency, memory utilization, and robustness, as well as strong gains in model accuracy and scalability. Dynamic token routing has been realized in multiple forms: sequence-global expert selection (e.g. SeqTopK in MoE), adaptive per-token gating, similarity-driven expert assignment, and per-layer token-depth variation. These mechanisms are now central to both LLMs and high-resolution vision transformers.

1. Sequence-Global and Adaptive Expert Routing

Dynamic expert routing in MoE architectures was initially token-centric, with each token independently assigned its top-$K$ experts via local gating scores. Sequence-Level TopK (SeqTopK) (Wen et al., 9 Nov 2025) reformulates this process so that the expert budget, $T \cdot K$ for $T$ tokens and $K$ experts per token, is allocated globally over the whole sequence. Rather than selecting the top $K$ experts per individual token, SeqTopK flattens the gating scores into a $T \times N$ matrix and selects the top $T \cdot K$ expert assignments globally. Mathematically, if $g_{t,i}$ is the gating score for token $t$ and expert $i$, the expert routing mask is set by a global threshold:

R_i(h_t) = \begin{cases} 1 & \text{if } g_{t,i} \geq \tau_{\text{seq}} \\ 0 & \text{otherwise} \end{cases}

where $\tau_{\text{seq}}$ is the $T \cdot K$-th largest element among all $g_{t,i}$.

This mechanism naturally assigns more experts to tokens with high context-complexity and fewer experts to trivial tokens, without changing the overall budget. SeqTopK requires merely a global top-K selection step and retains full compatibility with pretrained MoE checkpoints, incurring less than 1% computation and memory overhead. Empirical results show up to 16.9% accuracy improvements under extreme sparsity, with the greatest gains on mathematics, coding, law, and writing tasks (Wen et al., 9 Nov 2025).
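As a concrete illustration, the sequence-global selection step can be sketched in a few lines of NumPy. This is a minimal toy with random gating scores and names of my own choosing, not the authors' implementation:

```python
import numpy as np

def seq_topk_route(gates: np.ndarray, k: int) -> np.ndarray:
    """Select the top T*k (token, expert) assignments over the whole
    sequence, rather than k per token.  Returns a binary (T, N) mask."""
    T, N = gates.shape
    budget = T * k
    flat = gates.ravel()
    # indices of the T*k largest gating scores across all tokens;
    # the cutoff value corresponds to the tau_seq threshold in the text
    top = np.argsort(flat)[::-1][:budget]
    mask = np.zeros(T * N, dtype=bool)
    mask[top] = True
    return mask.reshape(T, N)

rng = np.random.default_rng(0)
gates = rng.random((4, 8))
mask = seq_topk_route(gates, k=2)
# total budget preserved, but per-token expert counts can now vary
assert mask.sum() == 4 * 2
```

Because the selection is made over the flattened score matrix, a difficult token can receive more than $k$ experts while an easy token receives fewer, at the same total budget.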

Adaptive expert allocation has also been realized with learnable continuous relaxation (LD-MoLE), similarity graphs (S-MoE/A-MoE), and bidirectional token/expert selection (ETR MoE). Key advances include differentiable routing functions such as Sparsegen, token-layer adaptive sparsity (LD-MoLE (Zhuang et al., 30 Sep 2025)), and affinity-driven selection with grouped average pooling (Li et al., 2024). These innovations dynamically tune the number of experts per token and mitigate expert collapse, underfitting of rare tokens, and late-stage routing fluctuations (Nguyen et al., 1 May 2025, Su et al., 2024).

2. Token-Level Gating, Routers, and Hybrid Compute Paths

At the core of dynamic token routing lies a trainable gating function or router—typically a lightweight MLP or transformer subnetwork—that ingests per-token embeddings and context features, producing probabilities or hard decisions over computational units (experts, layers, attention blocks).
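A minimal sketch of such a router, assuming a plain 2-layer ReLU MLP with randomly initialized placeholder weights (in practice the router is trained jointly with the model, and its inputs may include context features beyond the raw embedding):

```python
import numpy as np

def softmax(x, axis=-1):
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

class TokenRouter:
    """Lightweight 2-layer MLP router: per-token embedding -> probability
    distribution over computational units (experts / layers / paths).
    Weights here are random placeholders for illustration only."""
    def __init__(self, d_model: int, n_units: int, d_hidden: int = 64, seed: int = 0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.normal(0, 0.02, (d_model, d_hidden))
        self.b1 = np.zeros(d_hidden)
        self.W2 = rng.normal(0, 0.02, (d_hidden, n_units))
        self.b2 = np.zeros(n_units)

    def __call__(self, h: np.ndarray) -> np.ndarray:
        # h: (T, d_model) token embeddings -> (T, n_units) routing probs
        z = np.maximum(h @ self.W1 + self.b1, 0.0)  # ReLU hidden layer
        return softmax(z @ self.W2 + self.b2, axis=-1)

router = TokenRouter(d_model=16, n_units=4)
probs = router(np.random.default_rng(1).normal(size=(10, 16)))
hard_choice = probs.argmax(axis=-1)  # hard selection at inference time
```

The soft probabilities carry gradients during training, while the argmax provides the hard routing decision used at inference.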

In MambaFormer (Khan et al., 3 Jan 2026), a 2-layer MLP router ingests embedding vectors, normalized sequence length, and domain flags, and outputs softmax scores over two expert types: a Transformer expert ($E_{\text{T5}}$) or a State Space Model expert ($E_{\text{Mamba}}$). Hard selection is enforced at inference. A utility-guided multi-objective loss ensures load balance, latency control, and accuracy via penalty terms and balancing objectives.

DTRNet (Sharma et al., 31 Aug 2025) routes each token at each layer through either full attention (quadratic) or a lightweight linear update, as decided by a two-layer router with SiLU activation and softmax. Only ≈10% of tokens use attention per layer, yielding a substantial reduction in FLOPs: 15–25% compute savings with minimal loss in perplexity or accuracy.

In computer vision, dynamic routers decide whether to send tokens through costly global attention or lightweight refiners (MEMatte (Lin et al., 2024)) and orchestrate multi-path propagation, branching, or skipping (DiT (Ma et al., 2023)). These routers often employ token-wise Gumbel-Softmax gating with budget or compression constraints, enabling end-to-end learning of routing policies rather than reliance on static thresholds.
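A toy version of token-wise Gumbel-Softmax gating with a budget regularizer might look as follows. The binary {skip, compute} decision, the temperature `tau`, and the quadratic budget penalty are illustrative assumptions, not any specific paper's formulation:

```python
import numpy as np

def gumbel_softmax_gate(logits, tau=1.0, rng=None):
    """Token-wise Gumbel-Softmax over a binary {skip, compute} decision.
    logits: (T, 2).  Returns soft probabilities (for gradient flow) and
    the hard one-hot choice (used in the forward pass, straight-through
    style)."""
    rng = rng or np.random.default_rng()
    g = -np.log(-np.log(rng.uniform(size=logits.shape)))  # Gumbel noise
    y = (logits + g) / tau
    y = np.exp(y - y.max(-1, keepdims=True))
    soft = y / y.sum(-1, keepdims=True)
    hard = np.eye(logits.shape[-1])[soft.argmax(-1)]
    return soft, hard

def budget_penalty(soft, budget=0.1):
    """Quadratic penalty encouraging the expected fraction of tokens
    routed to the expensive path (column 1) to stay near the budget."""
    return (soft[:, 1].mean() - budget) ** 2

rng = np.random.default_rng(0)
logits = rng.normal(size=(6, 2))
soft, hard = gumbel_softmax_gate(logits, tau=0.5, rng=rng)
```

Lowering `tau` sharpens the soft distribution toward the hard decision, so the routing policy can be annealed toward discrete behavior during training.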

3. Graph- and Attention-Based Coupled Routing

Addressing instability and fluctuations in independent per-token routing, recent models let tokens influence each other's expert assignments by leveraging similarity graphs or the attention matrix itself.

Similarity-Aware MoE (S-MoE) (Nguyen et al., 1 May 2025) builds a token similarity matrix and blends gate scores across neighbors, reducing selection entropy and improving robustness. The updated gate for token tt incorporates its own score and a weighted average of neighbors' scores. Attention-Aware MoE (A-MoE) further integrates self-attention matrices to link routing across tokens, coupling expert assignment to contextual interaction.
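A sketch of similarity-blended gating under simple assumptions (cosine similarity between token embeddings, a fixed blending weight `alpha`); the actual S-MoE formulation may differ in its similarity graph and normalization:

```python
import numpy as np

def similarity_blended_gates(gates, tokens, alpha=0.7):
    """Blend each token's gate scores with a similarity-weighted average
    of its neighbours' scores (S-MoE-style sketch).
    gates:  (T, N) raw gate scores over N experts.
    tokens: (T, d) token embeddings.
    alpha:  weight kept on the token's own scores (hyperparameter)."""
    # cosine similarity between tokens
    norm = tokens / np.linalg.norm(tokens, axis=1, keepdims=True)
    sim = norm @ norm.T
    np.fill_diagonal(sim, 0.0)          # exclude self from the neighbourhood
    sim = np.maximum(sim, 0.0)          # keep only positively similar tokens
    row = sim.sum(axis=1, keepdims=True)
    weights = np.divide(sim, row, out=np.zeros_like(sim), where=row > 0)
    return alpha * gates + (1 - alpha) * weights @ gates

rng = np.random.default_rng(0)
tokens = rng.normal(size=(5, 8))
gates = rng.random((5, 4))
blended = similarity_blended_gates(gates, tokens, alpha=0.7)
```

Setting `alpha = 1` recovers the original independent per-token gates, so the blending strength interpolates between isolated and fully coupled routing.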

The key theoretical result is that soft merging of gate distributions cannot increase entropy; instead, similarity- or attention-coupled gates sharpen selection and statistically stabilize routing, as evidenced by reductions in fluctuation rate, improved load balance, and consistent accuracy gains across modalities.

4. Capacity Constraints, Flow Formulations, and Hardware Efficiency

Hardware efficiency for deployed systems depends critically on the routing algorithm's ability to avoid token dropping (over-saturated experts) and padding (under-utilized experts). Maximum Score Routing (MaxScore) (Dong et al., 18 Aug 2025) frames token-to-expert routing as a minimum-cost maximum-flow problem:

\max_{P \in U'(c, k)} \sum_{i=1}^{n} \sum_{j=1}^{e} P_{ij} A_{ij}

subject to feasible assignment and capacity constraints, where $A_{ij}$ is the affinity score between token $i$ and expert $j$, and $P$ is a binary token-expert assignment matrix.

A differentiable SoftTopK operator allows gradient learning while ensuring token quota and expert capacity are both satisfied. Compared to prior iterative or OT-based methods, MaxScore achieves zero token drop, near-perfect load balancing, efficient batched execution, and up to 1.3% absolute accuracy improvement at matched FLOPs over standard GShard or DropLess baselines.
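The flavor of capacity-constrained routing can be conveyed with a greedy stand-in: assign (token, expert) pairs in descending affinity order while capping each expert's load. This is only an approximation for illustration, not the exact min-cost max-flow solution that MaxScore computes:

```python
import numpy as np

def greedy_capacity_route(affinity, capacity):
    """Greedy stand-in for capacity-constrained token-to-expert routing.
    Assigns pairs in descending affinity order, with a hard per-expert
    capacity so no expert is over-saturated.  If capacity * e >= n,
    every token is assigned (no token dropping).
    affinity: (n, e) token-expert affinity scores.
    Returns a binary (n, e) assignment with at most one expert per token."""
    n, e = affinity.shape
    P = np.zeros((n, e), dtype=bool)
    load = np.zeros(e, dtype=int)
    assigned = np.zeros(n, dtype=bool)
    # iterate over all pairs from highest to lowest affinity
    for idx in np.argsort(affinity, axis=None)[::-1]:
        i, j = divmod(idx, e)
        if not assigned[i] and load[j] < capacity:
            P[i, j] = True
            assigned[i] = True
            load[j] += 1
        if assigned.all():
            break
    return P

rng = np.random.default_rng(0)
A = rng.random((6, 3))
P = greedy_capacity_route(A, capacity=2)
```

Unlike this greedy heuristic, the flow formulation is globally optimal and, via SoftTopK, remains differentiable for end-to-end training.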

Bidirectional routing frameworks (ETR MoE) (Li et al., 2024) further dynamically switch between token-choice routing (maximizing the initial training success rate under high irrelevant-token density) and expert-choice routing (maintaining specialization as experts mature), theoretically reducing the expert capacity requirement by up to 40%.
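The two routing directions differ only in which axis the top-$k$ selection is taken over, which a short sketch makes concrete (token-choice selects along the expert axis, expert-choice along the token axis):

```python
import numpy as np

def token_choice(gates, k):
    """Token-choice routing: each token picks its top-k experts,
    so every row of the mask has exactly k ones."""
    idx = np.argsort(gates, axis=1)[:, ::-1][:, :k]
    mask = np.zeros_like(gates, dtype=bool)
    np.put_along_axis(mask, idx, True, axis=1)
    return mask

def expert_choice(gates, capacity):
    """Expert-choice routing: each expert picks its top-'capacity'
    tokens, so expert load is balanced by construction."""
    idx = np.argsort(gates, axis=0)[::-1][:capacity, :]
    mask = np.zeros_like(gates, dtype=bool)
    np.put_along_axis(mask, idx, True, axis=0)
    return mask

rng = np.random.default_rng(0)
g = rng.random((6, 4))
tc = token_choice(g, k=2)
ec = expert_choice(g, capacity=3)
```

A bidirectional scheme in the spirit of ETR MoE would switch between these two selection modes over the course of training.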

5. Dynamic Depth and Layer Routing Across Model Architectures

Beyond expert allocation, dynamic token routing can be applied to control per-token depth, enabling skipping or repeating layers as needed. Radial Networks (Dotzel et al., 2024) employ a per-token router—a small MLP—to select, at each step, which transformer layer to visit next (including a special "output" layer to terminate the path). This design allows variable compute per token and decouples model depth from parameter count. Profiling reveals that deep residual blocks often contribute marginally to representation, so skipping layers per token yields substantial compute savings and enables larger capacity models within resource constraints.
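A per-token depth router of this kind can be sketched as a loop that repeatedly asks a router which layer to apply next, with a reserved "output" index to terminate. The toy layers and the threshold-based router rule below are purely illustrative:

```python
import numpy as np

def radial_route(h, layers, router, max_steps=8, output_id=None):
    """Per-token depth routing sketch: a router selects which layer to
    apply next, or the special 'output' index to stop, so the visited
    path length varies per token and depth is decoupled from parameters.
    h:      (d,) one token's hidden state.
    layers: list of callables h -> h.
    router: callable h -> logits over len(layers) + 1 choices."""
    output_id = len(layers) if output_id is None else output_id
    path = []
    for _ in range(max_steps):          # hard cap on compute per token
        choice = int(np.argmax(router(h)))
        if choice == output_id:
            break
        h = layers[choice](h)
        path.append(choice)
    return h, path

layers = [lambda x: x + 1.0, lambda x: x * 2.0]
def router(h):
    # toy rule: keep applying layer 0 until the state reaches 3, then exit
    return np.array([1.0, -1.0, 0.0]) if h[0] < 3 else np.array([-1.0, -1.0, 1.0])

h, path = radial_route(np.array([0.0]), layers, router)
```

Here the token visits layer 0 three times and then terminates, illustrating how an "easy" token could instead exit after zero or one layer under the same parameter budget.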

Recursive Transformers with Mixture-of-Recursions (MoR) (Bae et al., 14 Jul 2025) further unify parameter sharing with dynamic depth. Routers select, for each token, the number of recursions to apply, focusing computation on hard tokens and enabling parameter-efficient scaling.

6. Practical Implications, Scalability, and Limitations

Dynamic token routing mechanisms are critical for scaling transformers and MoE models to increasing data sizes and sequence lengths. Empirical evidence from SeqTopK (Wen et al., 9 Nov 2025), LD-MoLE (Zhuang et al., 30 Sep 2025), DTRNet (Sharma et al., 31 Aug 2025), mixSGA (Song et al., 16 Jun 2025), and MEMatte (Lin et al., 2024) demonstrates consistent accuracy improvements, substantial compute/memory savings (up to 88% memory reduction in vision tasks), and better load balancing.

The gains from dynamic routing are most pronounced under conditions of extreme sparsity: when the total budget or expert quota is small relative to model capacity, and when tokens' computational requirements are highly heterogeneous.

Current limitations include non-causality in some global routing schemes (addressed via online modifications), rare token over-concentration (mitigated by simple token-level caps), and the need for per-layer or per-token regularization to prevent trivial router collapse or capacity wastage. The field is rapidly advancing toward even more global, hierarchical, and context-aware routing strategies, including multimodal alignment (Mixture of States (Liu et al., 15 Nov 2025)) and highly fine-grained masking (Pure-Pass (Wu et al., 2 Oct 2025)).

7. Key Technical Results and Benchmark Findings

Across text, vision, retrieval, and multimodal generation, dynamic token routing strategies achieve Pareto-optimal trade-offs between accuracy and computation. Representative results include:

| Model/Method | Domain | Notable Metrics | Reference |
| --- | --- | --- | --- |
| SeqTopK | LLMs | +5.9–16.9% gain under increasing sparsity | (Wen et al., 9 Nov 2025) |
| LD-MoLE | LLM-MoE | +3–4% avg. across diverse reasoning tasks | (Zhuang et al., 30 Sep 2025) |
| MaxScore | NLU | +1.33% average accuracy vs. GShard | (Dong et al., 18 Aug 2025) |
| DTRNet | LLMs | 10% of tokens use attention (0.79× FLOPs at 20k seq) | (Sharma et al., 31 Aug 2025) |
| MEMatte | Vision | 88% lower memory, 50% lower latency | (Lin et al., 2024) |
| MambaFormer | Clinical QA | 0.918 F1 at 0.077 s latency (24.4× speedup vs. T5-Large) | (Khan et al., 3 Jan 2026) |
| DiT | ImageNet | +1.0% Top-1 accuracy at 10 GFLOPs | (Ma et al., 2023) |
| CITADEL | Retrieval | 40× GPU speedup over ColBERT-v2 | (Li et al., 2022) |

These results anchor dynamic token routing as a foundational technique in the efficient scaling of large models for real-world tasks.


Dynamic token routing encompasses a spectrum of methods that intelligently direct compute and memory resources at the token level. By learning sequence-global, similarity-driven, or hybrid routing policies, modern architectures achieve improved efficiency, robustness, and accuracy, enabling new frontiers in deep model deployment under resource constraints.
