
Memory-Efficient Looped Transformer: Decoupling Compute from Memory in Looped Language Models

Published 8 May 2026 in cs.CL, cs.AI, and cs.LG | (2605.07721v1)

Abstract: Recurrent LLM architectures have emerged as a promising approach for improving reasoning, as they enable multi-step computation in the embedding space without generating intermediate tokens. Models such as Ouro perform reasoning by iteratively updating internal representations while retaining a standard Key-Value (KV) cache across iterations, causing memory consumption to grow linearly with reasoning depth. Consequently, increasing the number of reasoning iterations can lead to prohibitive memory usage, limiting the practical scalability of such architectures. In this work, we propose Memory-Efficient Looped Transformer (MELT), a novel architecture that decouples reasoning depth from memory consumption. Instead of using a standard KV cache per layer and loop, MELT maintains a single KV cache per layer that is shared across reasoning loops. This cache is updated over time via a learnable gating mechanism. To enable stable and efficient training under this architecture, we propose to train MELT using chunk-wise training in a two phase procedure: interpolated transition, followed by attention-aligned distillation, both from the LoopLM starting model to MELT. Empirically, we show that MELT models fine-tuned from pretrained Ouro parameters outperform standard LLMs of comparable size, while maintaining a memory footprint comparable to those models and dramatically smaller than Ouro's. Overall, MELT achieves constant-memory iterative reasoning without sacrificing LoopLM performance, using only a lightweight post-training procedure.

Summary

  • The paper presents MELT, which decouples compute from memory by updating a single KV cache per layer using a learnable gating mechanism.
  • MELT achieves constant-memory complexity and outperforms baseline models on mathematical and general reasoning benchmarks.
  • The study uses attention-aligned distillation and chunk-wise training to adapt pretrained Ouro models to MELT, and ablations show that each component contributes to the final performance.


Introduction and Motivation

Scaling inference-time compute, rather than merely increasing parameter count or training steps, has become a dominant strategy for advancing the reasoning capability of LLMs. Looped transformer architectures such as LoopLM and Ouro leverage recurrent application of transformer layers in the embedding space, enabling multi-step latent reasoning without additional token generation. However, existing looped approaches maintain a separate KV cache per layer for every reasoning loop, forcing memory consumption to scale linearly with the number of iterations. Consequently, while architectural recurrence unlocks significant gains in reasoning, practical deployments of looped LLMs are undermined by prohibitive memory requirements as depth increases.

The MELT (Memory-Efficient Looped Transformer) framework addresses this bottleneck by decoupling iterative compute from memory usage. MELT maintains a single KV cache per layer, dynamically updated across recurrent reasoning steps by a learnable gating mechanism, thereby achieving constant memory with respect to reasoning depth.

Architectural Innovations

MELT adapts the LoopLM paradigm with one key difference: per-layer KV caches are not appended to on each loop but updated in place by a learnable element-wise gating function. This is implemented through a latent state that is updated recurrently within each layer. Whereas the standard approach requires $O(N \cdot L \cdot T)$ memory (with $N$ layers, sequence length $L$, and $T$ loops), MELT attains $O(N \cdot L)$ memory complexity. The gating function manages information flow across loops, allowing each token, at each step, to attend to keys and values that integrate all prior reasoning.
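To make the two memory scalings concrete, the following sketch counts KV cache bytes for per-loop caches versus a single cache shared across loops. The function name, the configuration values (24 layers, 16 KV heads, head dimension 128, 4 loops, 2-byte values), and the assumption that each token stores one key and one value vector per layer are illustrative only and are not taken from the paper.

```python
def kv_cache_bytes(num_layers, seq_len, num_loops, kv_heads, head_dim,
                   bytes_per_value=2, shared_across_loops=False):
    """Rough KV cache size in bytes.

    With shared_across_loops=False, every loop keeps its own cache per layer,
    giving O(N * L * T) growth. With shared_across_loops=True, one cache per
    layer is reused by all loops, giving O(N * L).
    """
    # One key vector and one value vector per token per layer.
    per_token_per_layer = 2 * kv_heads * head_dim * bytes_per_value
    caches_per_layer = 1 if shared_across_loops else num_loops
    return num_layers * seq_len * caches_per_layer * per_token_per_layer


# Hypothetical configuration, chosen only for illustration.
cfg = dict(num_layers=24, seq_len=32_768, num_loops=4, kv_heads=16, head_dim=128)

per_loop = kv_cache_bytes(**cfg)
shared = kv_cache_bytes(**cfg, shared_across_loops=True)

print(f"per-loop caches: {per_loop / 2**30:.1f} GiB")
print(f"shared cache:    {shared / 2**30:.1f} GiB")
print(f"ratio:           {per_loop // shared}x")
```

Under these assumptions the ratio between the two layouts equals the loop count, which is the sense in which a shared cache makes memory independent of reasoning depth.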

The latent state update follows:

$$h_t = z_t \odot h_{t-1} + (1 - z_t) \odot x_t,$$

where $x_t$ is the current hidden state, $h_{t-1}$ is the state from the previous loop, and $z_t$ is a vector-valued gate, all per token and per layer. This state is projected to produce the key and value representations for attention, ensuring that only a single, semantically aligned KV pair per token and layer is retained throughout all loops. By preserving query-key alignment and supporting full latent reasoning, MELT effectively replicates the computational power of deep looped stacks at a fixed memory budget.
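The gated update can be sketched for a single token and layer in plain Python. This is a minimal illustration, not the paper's implementation: the sigmoid parameterization of the gate, the zero initial state, and the names `gated_update` and `run_loops` are assumptions, and the gate logits are passed in directly rather than produced by a learned projection.

```python
import math


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def gated_update(h_prev, x, gate_logits):
    """Element-wise update h_t = z_t * h_{t-1} + (1 - z_t) * x_t.

    gate_logits stand in for the output of a learned gate (assumed sigmoid).
    """
    z = [sigmoid(g) for g in gate_logits]
    return [zi * hp + (1.0 - zi) * xi for zi, hp, xi in zip(z, h_prev, x)]


def run_loops(loop_inputs, loop_gate_logits, h_init):
    """Apply the update once per loop, overwriting a single state vector."""
    h = list(h_init)
    for x, logits in zip(loop_inputs, loop_gate_logits):
        h = gated_update(h, x, logits)  # state size never grows with loops
    return h


# Three loops over a 2-dimensional toy state (hypothetical values).
xs = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
logits = [[0.0, 0.0]] * 3  # zero logits give z = 0.5 in every coordinate
h_final = run_loops(xs, logits, h_init=[0.0, 0.0])
print(h_final, len(h_final))
```

A gate near 1 keeps the previous state and a gate near 0 overwrites it with the current hidden state, so each coordinate can choose independently how much earlier reasoning to retain.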

Training Procedure

Transitioning pretrained looped transformers (Ouro) to MELT dynamics presents optimization challenges, given the non-trivial shift in memory and loop handling. MELT employs a two-phase, data-efficient adaptation protocol:

  1. Interpolated Transition: For a smooth shift from LoopLM to MELT, forward passes compute both standard and MELT-style KV pairs at each layer and loop, using a mixing coefficient $\alpha$ that anneals from $0$ (pure LoopLM) to $1$ (pure MELT) during fine-tuning.
  2. Attention-Aligned Distillation: Once MELT dynamics are adopted, a frozen LoopLM teacher guides MELT by enforcing layer-and-loop-wise alignment of post-attention representations through an auxiliary loss and standard KD. This substantially mitigates representation drift and stabilizes convergence.
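The two phases above can be sketched with toy vectors. Several details here are assumptions rather than facts from the paper: the coefficient is taken to run linearly from 0 (pure LoopLM) to 1 (pure MELT), the two KV variants are blended by a convex combination, and the auxiliary alignment loss is written as a mean squared error. The function names are hypothetical.

```python
def alpha_schedule(step, total_steps):
    """Mixing coefficient at a training step (assumed linear anneal 0 -> 1)."""
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    return min(1.0, max(0.0, step / total_steps))


def mix_kv(kv_looplm, kv_melt, alpha):
    """Blend standard and MELT-style KV vectors (assumed convex combination)."""
    return [(1.0 - alpha) * a + alpha * b for a, b in zip(kv_looplm, kv_melt)]


def alignment_loss(student, teacher):
    """Auxiliary loss between post-attention vectors (assumed to be MSE)."""
    return sum((s - t) ** 2 for s, t in zip(student, teacher)) / len(student)


# Phase 1: the blended keys move from the LoopLM values to the MELT values.
k_looplm, k_melt = [1.0, 2.0], [3.0, 6.0]
for step in (0, 50, 100):
    a = alpha_schedule(step, total_steps=100)
    print(step, a, mix_kv(k_looplm, k_melt, a))

# Phase 2: penalize drift of the student from a frozen teacher.
print(alignment_loss([1.0, 2.0], [1.0, 4.0]))
```

In a full training loop the alignment term would be summed over layers and loops and added to a standard knowledge-distillation loss; that combination is omitted here to keep the sketch short.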

Chunk-wise training is used for efficiency, trading off exact autoregressive fidelity against throughput. The sequence is split into chunks, enabling parallelism within chunks and capturing the sequential KV update dependencies across chunk boundaries.
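The control flow of chunk-wise processing can be sketched as follows. The helper names and the toy `process_chunk` function are hypothetical stand-ins: a real implementation would run the looped transformer on all tokens of a chunk in parallel and return the updated per-layer KV cache, whereas here the "cache" is a single running number so that the dependency across chunk boundaries is easy to follow.

```python
def split_into_chunks(tokens, chunk_size):
    """Split a sequence into consecutive chunks of at most chunk_size items."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]


def run_chunkwise(tokens, chunk_size, process_chunk, initial_cache):
    """Process chunks in order, carrying the cache across chunk boundaries."""
    cache = initial_cache
    outputs = []
    for chunk in split_into_chunks(tokens, chunk_size):
        # Tokens inside a chunk are handled together (parallel in practice);
        # only the cache links one chunk to the next.
        chunk_out, cache = process_chunk(chunk, cache)
        outputs.extend(chunk_out)
    return outputs, cache


def toy_process_chunk(chunk, cache):
    """Hypothetical stand-in: each token sees the cache from earlier chunks."""
    return [t + cache for t in chunk], cache + sum(chunk)


outs, final_cache = run_chunkwise([1, 2, 3, 4, 5], 2, toy_process_chunk, 0)
print(outs, final_cache)
```

In this toy version, tokens within a chunk see only the cache state from the start of the chunk, which illustrates one way that chunking could trade exact token-by-token fidelity for throughput.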

Empirical Results

MELT was evaluated comprehensively against both recurrent and non-recurrent LLMs (Qwen3, Gemma, DeepSeek, Ouro) on a suite of mathematical and general reasoning benchmarks. Across all mathematical reasoning datasets (AIME, MATH-500, AMC, OlympiadBench), MELT-1.6B surpassed non-looped baselines of similar parameter counts, while matching or closely approaching Ouro's performance but with up to a 3-4x reduction in KV cache memory.

On HumanEval, MELT outperformed Ouro, and on MMLU it matched or exceeded larger transformers. The results demonstrate that constant-memory iterative reasoning is achievable without sacrificing the gains expected from deep recurrent compute. Memory profiling for 32k-token contexts revealed that MELT's VRAM requirements closely track those of standard transformers and are dramatically lower than those of uncompressed looped architectures.

Ablation studies confirmed that MELT's performance depends on the full set of training components: removing attention-aligned distillation, interpolated transition, KD, or chunk-wise training each resulted in consistent and significant degradation, which supports the need for the complete adaptation pipeline.

Analysis of Gating and Memory Dynamics

Replacing the element-wise gating with coarser alternatives (scalar gating, or mean, EMA, or last-step aggregation) resulted in uniformly lower accuracy, indicating that fine-grained, learnable element-wise gating is critical for integrating loop-specific reasoning traces. Theoretical analysis shows that the gating mechanism provides spectral regulation: it ensures a constant Jacobian norm in the saturated regime and thus mitigates vanishing and exploding gradients, yielding a stable "gradient superhighway" for deep loop optimization.
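The intuition behind the saturated-regime claim can be sketched directly from the update rule $h_t = z_t \odot h_{t-1} + (1 - z_t) \odot x_t$. The derivation below is an illustrative sketch, not the paper's analysis. It assumes a sigmoid gate $z_t = \sigma(a_t)$ with pre-activation $a_t$, and it treats $x_t$ as fixed with respect to $h_{t-1}$.

```latex
\frac{\partial h_t}{\partial h_{t-1}}
  = \operatorname{diag}(z_t)
  + \operatorname{diag}(h_{t-1} - x_t)\,
    \frac{\partial z_t}{\partial h_{t-1}},
\qquad
\frac{\partial z_t}{\partial h_{t-1}}
  = \operatorname{diag}\bigl(z_t \odot (1 - z_t)\bigr)\,
    \frac{\partial a_t}{\partial h_{t-1}} .

% Saturated regime: each gate coordinate approaches 0 or 1,
% so z_t \odot (1 - z_t) \to 0 and the second term vanishes:
\frac{\partial h_t}{\partial h_{t-1}} \;\approx\; \operatorname{diag}(z_t).
```

Under these assumptions, coordinates whose gate saturates at 1 pass gradients through with unit gain across loops, so repeated products of such Jacobians neither shrink nor grow along those coordinates.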

Limitations and Prospective Directions

MELT currently inherits Ouro's fixed loop count at inference and does not yet support multi-query attention (MQA), which could further compress KV memory usage. The sequential dependency in chunk-wise training prevents full parallelism, which may hinder scalability to larger models or more complex tasks. Future research could incorporate adaptive, token-wise loop counts (early exiting), explore integration with head-sharing mechanisms, and develop more parallelizable training approaches.

The constant-memory latent-state update architecture is naturally suited to adaptive loop-depth strategies, which can dynamically allocate computational budget based on task or token complexity and may unlock further capabilities on resource-constrained hardware.

Conclusion

MELT introduces a principled mechanism for decoupling compute from memory in looped LLMs, enabling deep latent reasoning within a fixed memory envelope. By consolidating KV state via a gated recurrent latent update and employing careful distillation-based adaptation, MELT maintains the strong reasoning of looped architectures while eliminating their key practical bottleneck. This work establishes that enhanced reasoning through iterative computation does not require accepting unsustainable memory usage, making looped LLMs viable for longer contexts and broader deployment.

The implications are significant for both efficient LLM scaling and for architectures that seek to adaptively allocate compute at inference, permitting a new class of memory-aware, performant reasoning models. Given its generality, MELT points toward potential advances in dynamic depth, universal computation, and resource-constrained LLM deployment scenarios.


Reference: "Memory-Efficient Looped Transformer: Decoupling Compute from Memory in Looped LLMs" (2605.07721)
