Papers
Topics
Authors
Recent
Search
2000 character limit reached

Sparse Layers are Critical to Scaling Looped Language Models

Published 9 May 2026 in cs.LG and cs.CL | (2605.09165v1)

Abstract: Looped LLMs repeat a set of transformer layers through depth, reducing memory costs and providing natural early-exit points at loop boundaries. However, looped models do not scale as favorably as standard transformers with unique layers. We compare standard and Mixture-of-Experts (MoE) transformers, with and without looping, and find two main results. First, we find Looped-MoE models scale better than the standard baseline while dense looped models do not. We trace this to routing divergence between loops: in Looped-MoE models, different experts are activated on each pass through the same shared layers, recovering expressivity without additional parameters. Our second finding is that looped models have better compute-quality trade-offs with early exits than standard models. Because each loop ends with the same layers that produce the final output, loop boundaries are superior exit points, as confirmed by earlier output convergence at these points. In sum, we provide a clear direction for scaling looped models: a Looped-MoE model with early exits can not only beat standard transformers at scale, but also enable significant memory and inference savings with minimal degradation in quality.

Summary

  • The paper demonstrates that replacing dense feedforward layers with sparse MoE layers in looped transformers recovers expressivity lost through weight tying.
  • The methodology employs parameter reuse, isoFLOP scaling laws, and precise numerical benchmarks to compare looped architectures with dense baselines.
  • The findings suggest that sparse expert routing in looped models optimizes compute-quality tradeoffs and early-exit strategies, paving the way for efficient deployment.

Sparse Layers are Critical to Scaling Looped LLMs

Introduction

Looped LLMs (LoopLMs) employ parameter reuse by iteratively applying a stack of transformer layers through depth, minimizing memory costs and enabling early-exit strategies during inference. However, empirical results consistently show that naïvely looping dense transformer layers leads to inferior scaling compared to standard non-looped transformers, particularly under fixed-compute scenarios governed by established scaling laws [kaplan_scaling_2020], [hoffmann_training_2022]. This work systematically investigates the role of sparse Mixture-of-Experts (MoE) feedforward layers as a mechanism to break this scaling barrier in looped architectures. By explicitly isolating the contribution of sparse computation, the study sharply delineates the regime where looped architectures, with proper sparsification, can beat traditional transformers both in scaling and inference quality-cost profile. Figure 1

Figure 1

Figure 1: Looped-MoE models (with repeated stacks and sparse MoE FFNs) outperform dense base transformers in isoFLOP test loss and compute-quality early-exit tradeoffs. Larger numbers of loops further improve these tradeoffs.

Architectural Variants and Parameterization

The comparative analysis centers on four classes of decoder-only transformer models, distinguished by two axes: (a) dense versus sparse (MoE) feedforward layers, and (b) unique (Base/MoE) versus looped (Looped/Looped-MoE) stacking. All employ rotary positional embeddings, pre-RMSNorm, and SwiGLU activations. In looped variants, a reduced set of LL unique layers is repeatedly applied RR times such that effective depth matches non-looped baselines. MoE layers use token-level top-kk routing with k=2k=2 out of E=8E=8 experts per position, with auxiliary load balancing and router z-losses to maintain expertise diversity and stability.

Optimal learning rate transfer across widths is validated via Maximal Update Parameterization (μ\muP) [yang_tensor_2022], achieving consistent loss curves and <1% deviation in optimality—a critical property for efficient isoFLOP scaling law studies. Figure 2

Figure 2: The µP transfer test demonstrates robust learning rate transfer across model widths; optimal small-model hyperparameters remain close to optimal as scale increases.

Scaling Laws and Numerical Outcomes

IsoFLOP scaling laws—stated as test loss LL obeying LNαL \propto N^{-\alpha} at fixed training compute C=6NDC=6ND—reveal Looped-MoE models decisively close the scaling gap induced by dense looping. In head-to-head comparisons at fixed compute and parameter count, Looped-MoE achieves lower test loss than both dense Looped and dense Base transformers. Derived scaling exponents α0.076\alpha\sim 0.076RR0 match classic results for standard transformers, demonstrating that it is the sparse routing within looped models, not looping itself, that is essential to recover expressivity.

Critically, in compute-matched scenarios, Looped-MoE models with parameter tying and expert sparsity not only outperform their dense looped counterparts but also match or exceed both base dense and ungated MoE transformers for both test loss and average benchmark performance. For example, on the AI2 OLMES Core-9 benchmark suite, Looped-MoE achieves the highest mean score (39.6) with fewer unique parameters than Base (216M vs 246M), while vanilla MoE—despite best test loss—scores lowest on these downstream tasks, likely due to limited per-token expert coverage. Figure 3

Figure 3

Figure 3: IsoFLOP scaling curves: Left, base transformer; Right, Looped-MoE. Compute-optimal model sizes at each budget and fitted RR1 power-laws confirm that sparse expert routing in the looped regime restores scaling.

Figure 4

Figure 4: Combined scaling for all architectures. MoE and Looped-MoE display steeper scaling offsets compared to dense models, reaffirming that sparsity, not merely looping, drives improved scaling.

Early-Exit Properties and Compute-Quality Tradeoffs

Early exit mechanisms, especially entropy-threshold-based criteria, are explored systematically in both dense and looped architectures. Looped and Looped-MoE models exhibit strictly superior compute–quality Pareto frontiers compared to dense models, as measured by perplexity at fixed fractions of skipped FLOPs. At 10% FLOPs saved, Looped/Looped-MoE maintain significantly lower perplexity than both base and MoE, e.g., Looped-MoE RR2 at 42.0 vs Base at 55.4 and MoE at 75.7.

This advantage is traced to the structural alignment between loop boundaries and final output layers in looped models: at loop boundaries, the activations are processed by the same stack responsible for the model’s end-of-depth output, facilitating much earlier convergence to near-final distributions. Jensen-Shannon divergence (JSD) analyses further corroborate that at these boundaries, a substantially higher fraction of tokens reach near-final output distribution states, both reducing wasted computation and improving early-exit accuracy. Figure 5

Figure 5

Figure 5: Schematic and empirical compute–quality Pareto curves demonstrate looped architectures yield inherently better early-exit profiles—tokens exit at pre-aligned loop boundaries, saving computation with limited quality degradation.

Figure 6

Figure 6: Distributional JSD analysis: tokens in looped models converge more quickly to their eventual output distributions at well-defined loop boundaries, confirming architectural benefits for early exit.

Mechanistic Explanation: Routing Divergence Across Loops

The core architectural insight is that sparse MoE routing enables looped models to sidestep the expressivity loss entailed by weight tying. In Looped-MoE, the expert routing decisions for a given token heavily diverge between repeated passes through the same physical layer, yielding different functional computations per loop iteration without additional parameters. Analysis across layers demonstrates that for the vast majority of tokens and physical layers, the assigned experts on subsequent loop passes are largely non-overlapping. This per-loop specialization is the dominant source of recovered expressivity. Figure 7

Figure 7: Expert assignment overlap between loop passes in Looped-MoE shows most tokens access distinct experts on different passes, demonstrating how expressivity is restored via per-loop specialization.

Effect of Loop Depth on Efficiency

Varying the number of loops (and thus, natural early-exit points) further improves compute-quality tradeoffs. As the number of loops increases (e.g., RR3 versus RR4 for 16 effective layers), more exit opportunities arise, each corresponding to a point where the model’s output is well-formed. Figure 8

Figure 8: Increasing loop count enhances the compute-quality tradeoffs—more loop boundaries provide more high-quality exit opportunities; Looped-MoE configurations dominate non-looped MoE regardless of exit restrictions.

Practical and Theoretical Implications

The empirical evidence shows that the simple substitution of dense FFNs with sparse MoE layers in looped transformers is both necessary and sufficient to close the scaling gap relative to dense baselines. This positions Looped-MoE as the optimal design point for practitioners aiming to minimize storage and inference costs at a given quality level. The mechanism is purely architectural; no additional supervision or training regime is required.

On the theoretical side, this work clarifies that the scaling deficit of looped architectures is caused by the interaction between weight tying and dense, input-invariant computation. Sparse, per-token expert routing injects the required per-depth heterogeneity for scaling laws to hold under weight sharing. The finding that looped models with MoE routing can outperform larger dense models on downstream tasks under matched compute suggests fertile ground for further investigation, particularly in test-time adaptive computation and hardware specialization.

Conclusion

Sparse Layers are Critical to Scaling Looped LLMs establishes that parameter tying in transformers, when combined with MoE routing, enables parameter- and compute-efficient architectures that scale comparably or better than traditional dense stacks. Looped-MoE not only matches or exceeds dense baselines in loss and downstream benchmark accuracy, but also unlocks superior early-exit strategies for inference acceleration. The architectural mechanisms underlying these results—specifically loop-divergent expert assignment and loop-boundary alignment for early exit—indicate clear, robust paths for the efficient deployment of increasingly deep LLMs. Future directions include extending to larger parameter regimes, integrating adaptive loop-depth mechanisms, and exploring the alignment of these architectural properties with emerging compute and memory-constrained hardware platforms.

Paper to Video (Beta)

No one has generated a video about this paper yet.

Whiteboard

No one has generated a whiteboard explanation for this paper yet.

Open Problems

We haven't generated a list of open problems mentioned in this paper yet.

Collections

Sign up for free to add this paper to one or more collections.

Tweets

Sign up for free to view the 12 tweets with 3 likes about this paper.