- The paper presents MELT, which decouples compute from memory by updating a single KV cache per layer using a learnable gating mechanism.
- MELT achieves constant-memory complexity and outperforms baseline models on mathematical and general reasoning benchmarks.
- The study uses attention-aligned distillation and chunk-wise training to adapt a pretrained looped model to MELT, and evaluates the resulting efficiency and performance gains.
Introduction and Motivation
Scaling inference-time compute, rather than merely increasing parameter count or training steps, has become a dominant strategy for advancing the reasoning capability of LLMs. Looped transformer architectures such as LoopLM and Ouro leverage recurrent application of transformer layers in the embedding space, enabling multi-step latent reasoning without additional token generation. However, existing looped approaches maintain a separate KV cache per layer for every reasoning loop, forcing memory consumption to scale linearly with the number of iterations. Consequently, while architectural recurrence unlocks significant gains in reasoning, practical deployments of looped LLMs are undermined by prohibitive memory requirements as depth increases.
The MELT (Memory-Efficient Looped Transformer) framework addresses this bottleneck by decoupling iterative compute from memory usage. MELT maintains a single KV cache per layer, dynamically updated across recurrent reasoning steps by a learnable gating mechanism, thereby achieving constant memory with respect to reasoning depth.
Architectural Innovations
MELT adapts the LoopLM paradigm with one key change: per-layer KV caches are not appended to on each loop but updated in place by a learnable element-wise gating function. This is implemented through a latent state that is updated recurrently within each layer. Whereas the standard looped approach requires $O(N \cdot L \cdot T)$ memory (with $N$ layers, sequence length $L$, and $T$ loops), MELT attains $O(N \cdot L)$ memory complexity. The gating function manages information flow across loops, allowing each token, at each step, to attend to keys and values that integrate all prior reasoning.
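To make the scaling difference concrete, the following sketch compares the two cache layouts by simple arithmetic. The function name, the configuration values, and the 2-byte element size are illustrative assumptions, not figures from the paper.

```python
def kv_cache_bytes(num_layers, seq_len, num_loops, kv_dim,
                   bytes_per_value=2, shared_across_loops=False):
    """Approximate KV cache size in bytes.

    Each cached token stores one key and one value vector of width kv_dim.
    With shared_across_loops=True, a single cache per layer is reused for
    every loop, so the loop count drops out of the total.
    """
    copies = 1 if shared_across_loops else num_loops
    return 2 * num_layers * seq_len * copies * kv_dim * bytes_per_value


# Hypothetical configuration, chosen only for illustration.
N, L, T, D = 24, 32_768, 4, 2048

looped = kv_cache_bytes(N, L, T, D)                          # grows with T
single = kv_cache_bytes(N, L, T, D, shared_across_loops=True)  # independent of T

print(f"one cache per loop: {looped / 2**30:.1f} GiB")
print(f"single cache:       {single / 2**30:.1f} GiB")
print(f"ratio:              {looped / single:.0f}x")
```

Under these assumptions the ratio equals the loop count T, which is the factor that an in-place update removes from the memory bill.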
The latent state update follows:
$$h_t = z_t \odot h_{t-1} + (1 - z_t) \odot x_t,$$
where $x_t$ is the current hidden state, $h_{t-1}$ is the state from the previous loop, and $z_t$ is a vector-valued gate, all per token and per layer. This state is projected to produce the key and value representations for attention, ensuring that only a single, semantically aligned KV pair per token/layer is retained throughout all loops. By preserving query-key alignment and supporting full latent reasoning, MELT effectively replicates the computational power of deep looped stacks at a fixed memory budget.
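A minimal sketch of this update for one token in one layer is shown below. The gate values are supplied by hand because the summary does not say how the gate is computed, and the zero initial state is likewise an assumption made for illustration.

```python
def gated_update(h_prev, x, z):
    """Element-wise h_t = z * h_prev + (1 - z) * x for one token in one layer."""
    return [zi * hi + (1.0 - zi) * xi for hi, xi, zi in zip(h_prev, x, z)]


# Toy 4-dimensional latent state refined over three loops.
h = [0.0, 0.0, 0.0, 0.0]          # assumed initial state

loop_inputs = [                    # x_t for loops 1..3 (made-up values)
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 2.0, 2.0, 2.0],
    [0.0, 1.0, 0.0, 1.0],
]
gates = [                          # z_t for loops 1..3 (made-up values)
    [0.0, 0.0, 0.0, 0.0],          # gate closed: take x_t entirely
    [0.5, 1.0, 0.0, 0.5],
    [1.0, 0.5, 0.5, 0.0],
]

for x, z in zip(loop_inputs, gates):
    h = gated_update(h, x, z)      # the state is overwritten, never appended
    print(h)
```

The state keeps the same size after every loop, which is what allows a single key/value pair per token and layer to be derived from it regardless of how many loops run.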
Training Procedure
Transitioning pretrained looped transformers (Ouro) to MELT dynamics presents optimization challenges, given the non-trivial shift in memory and loop handling. MELT employs a two-phase, data-efficient adaptation protocol:
- Interpolated Transition: For a smooth shift from LoopLM to MELT, forward passes compute both standard and MELT-style KV pairs at each layer/loop, using a mixing coefficient α that anneals from 0 (pure LoopLM) to 1 (pure MELT) during fine-tuning.
- Attention-Aligned Distillation: Once MELT dynamics are adopted, a frozen LoopLM teacher guides MELT by enforcing layer-and-loop-wise alignment of post-attention representations through an auxiliary loss, combined with standard knowledge distillation (KD). This substantially mitigates representation drift and stabilizes convergence.
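The two phases above can be sketched as follows. This is an illustration under stated assumptions rather than the paper's implementation: α is taken to run from 0 (pure LoopLM) to 1 (pure MELT), the linear ramp is an assumed schedule, the mean-squared-error form of the alignment loss is an assumed choice, and all function names are hypothetical.

```python
def alpha_schedule(step, total_steps):
    """Mixing coefficient for a given fine-tuning step.

    Assumption: a linear ramp from 0 to 1; the annealing shape is not specified.
    """
    if total_steps <= 1:
        return 1.0
    return min(1.0, max(0.0, step / (total_steps - 1)))


def mix_kv(kv_looplm, kv_melt, alpha):
    """Blend standard and MELT-style KV vectors element-wise."""
    return [(1.0 - alpha) * a + alpha * b for a, b in zip(kv_looplm, kv_melt)]


def alignment_loss(student_repr, teacher_repr):
    """Assumed auxiliary loss: mean squared error between the student's and the
    frozen teacher's post-attention representations for one layer and loop."""
    diffs = [(s - t) ** 2 for s, t in zip(student_repr, teacher_repr)]
    return sum(diffs) / len(diffs)


# Phase 1: the KV pair drifts from the LoopLM value to the MELT value.
for step in range(5):
    a = alpha_schedule(step, 5)
    print(step, a, mix_kv([1.0, 2.0], [3.0, 4.0], a))

# Phase 2: the student is penalized for deviating from the teacher.
print(alignment_loss([1.0, 2.0], [1.0, 4.0]))
```

In a real training loop the alignment term would be summed over layers and loops and added to the KD objective; the weighting between the two is not given in this summary.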
Chunk-wise training is used for efficiency, trading off exact autoregressive fidelity against throughput. The sequence is split into chunks, enabling parallelism within chunks and capturing the sequential KV update dependencies across chunk boundaries.
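The control flow of such a scheme might look like the sketch below. It shows only the chunking and the sequential hand-off of cache state between chunks; `process_chunk` is a hypothetical stand-in for the model's forward pass, and the toy version here merely records how many earlier cache entries each token could see.

```python
def split_into_chunks(tokens, chunk_size):
    """Split a token sequence into consecutive chunks of at most chunk_size."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]


def run_chunkwise(tokens, chunk_size, process_chunk):
    """Process chunks in order, carrying the cache across chunk boundaries.

    Tokens inside a chunk are handed to process_chunk together (parallel in a
    real model); each chunk only sees cache entries from earlier chunks.
    """
    cache, outputs = [], []
    for chunk in split_into_chunks(tokens, chunk_size):
        chunk_outputs, new_entries = process_chunk(chunk, cache)
        outputs.extend(chunk_outputs)
        cache.extend(new_entries)   # later chunks depend on this update
    return outputs, cache


def toy_process_chunk(chunk, cache):
    """Hypothetical stand-in: pair each token with the visible cache length."""
    visible = len(cache)
    return [(tok, visible) for tok in chunk], list(chunk)


outputs, cache = run_chunkwise(list(range(5)), 2, toy_process_chunk)
print(outputs)
print(cache)
```

The loop over chunks is the sequential part; a larger chunk size means more parallel work per step and fewer sequential hand-offs, at the cost of departing further from token-by-token autoregressive processing.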
Empirical Results
MELT was evaluated comprehensively against both recurrent and non-recurrent LLMs (Qwen3, Gemma, DeepSeek, Ouro) on a suite of mathematical and general reasoning benchmarks. Across all mathematical reasoning datasets (AIME, MATH-500, AMC, OlympiadBench), MELT-1.6B surpassed non-looped baselines of similar parameter counts, while matching or closely approaching Ouro's performance but with up to a 3-4x reduction in KV cache memory.
On HumanEval, MELT outperformed Ouro, and on MMLU it matched or exceeded larger transformers. The results demonstrate that constant-memory iterative reasoning is achievable without sacrificing the gains expected from deep recurrent compute. Memory profiling for 32k-token contexts revealed that MELT's VRAM requirements track standard transformers closely and are dramatically lower than uncompressed looped architectures.
Ablation studies confirmed that MELT's performance depends on the full set of training components: removing attention-aligned distillation, interpolated transition, KD, or chunk-wise training each caused consistent and significant degradation, indicating that the complete adaptation pipeline is needed.
Analysis of Gating and Memory Dynamics
Replacing the element-wise gating with coarser alternatives (scalar gating, or mean, EMA, or last-step aggregation) resulted in uniformly lower accuracy, indicating that fine-grained, learnable element-wise gating is critical for integrating loop-specific reasoning traces. Theoretical analysis shows that the gating mechanism provides spectral regulation, ensuring a constant Jacobian norm in the saturated regime and thus mitigating vanishing and exploding gradients, which yields a stable "gradient superhighway" for deep loop optimization.
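A simplified reading of this argument can be worked out directly from the update rule $h_t = z_t \odot h_{t-1} + (1 - z_t) \odot x_t$. The derivation below is a sketch under two assumptions that are not taken from the paper: the gate entries lie in $[0, 1]$ (as with a sigmoid), and the dependence of $z_t$ and $x_t$ on $h_{t-1}$ is ignored so that only the direct path through the state is considered.

```latex
% Direct-path Jacobian of one gated update
\frac{\partial h_t}{\partial h_{t-1}} = \operatorname{diag}(z_t),
\qquad
\left\lVert \frac{\partial h_t}{\partial h_{t-1}} \right\rVert_2
  = \max_i \, z_{t,i} \;\le\; 1 .

% Across T loops the direct-path Jacobians multiply element-wise
\frac{\partial h_T}{\partial h_0}
  = \prod_{t=1}^{T} \operatorname{diag}(z_t)
  = \operatorname{diag}\!\Bigl(\prod_{t=1}^{T} z_t\Bigr).

% Saturated gates (z_{t,i} \to 1): the product is the identity
z_{t,i} \to 1 \;\;\Longrightarrow\;\;
\frac{\partial h_T}{\partial h_0} \to I,
\qquad
\left\lVert \frac{\partial h_T}{\partial h_0} \right\rVert_2 \to 1 .
```

Under these assumptions the direct path can never amplify gradients, since every diagonal entry is at most 1, and for coordinates whose gates saturate near 1 it does not shrink them either, regardless of the number of loops.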
Limitations and Prospective Directions
MELT currently inherits Ouro's fixed loop count at inference and does not yet support multi-query attention (MQA), which could further compress KV memory usage. The sequential dependency in chunk-wise training prevents full parallelism, which may hinder scalability to larger models or more complex tasks. Future research could incorporate adaptive, token-wise loop counts (early exiting), explore integration with head-sharing mechanisms, and develop more parallelizable training approaches.
The constant-memory latent-state update architecture is naturally suited for adaptive loop-depth strategies, which can dynamically allocate computational budget based on task or token complexity—potentially unlocking further capabilities on resource-constrained hardware.
Conclusion
MELT introduces a principled mechanism for decoupling compute from memory in looped LLMs, enabling deep latent reasoning within a fixed memory envelope. By consolidating KV state via a gated recurrent latent update and employing careful distillation-based adaptation, MELT maintains the strong reasoning of looped architectures while eliminating their key practical bottleneck. This work establishes that enhanced reasoning through iterative computation does not require accepting unsustainable memory usage, making looped LLMs viable for longer contexts and broader deployment.
The implications are significant for both efficient LLM scaling and for architectures that seek to adaptively allocate compute at inference, permitting a new class of memory-aware, performant reasoning models. Given its generality, MELT points toward potential advances in dynamic depth, universal computation, and resource-constrained LLM deployment scenarios.
Reference: "Memory-Efficient Looped Transformer: Decoupling Compute from Memory in Looped LLMs" (2605.07721)