
Test-Time Training Done Right

Published 29 May 2025 in cs.LG, cs.CL, and cs.CV | arXiv:2505.23884v1

Abstract: Test-Time Training (TTT) models context dependencies by adapting part of the model's weights (referred to as fast weights) during inference. This fast weight, akin to recurrent states in RNNs, stores temporary memories of past tokens in the current sequence. Existing TTT methods struggled to show effectiveness in handling long-context data, due to their inefficiency on modern GPUs. The TTT layers in many of these approaches operate with extremely low FLOPs utilization (often <5%) because they deliberately apply small online minibatch sizes (e.g., updating fast weights every 16 or 64 tokens). Moreover, a small minibatch implies fine-grained block-wise causal dependencies in the data, unsuitable for data beyond 1D ordered sequences, like sets or N-dimensional grids such as images or videos. In contrast, we pursue the opposite direction by using an extremely large chunk update, ranging from 2K to 1M tokens across tasks of varying modalities, which we refer to as Large Chunk Test-Time Training (LaCT). It improves hardware utilization by orders of magnitude, and more importantly, facilitates scaling of nonlinear state size (up to 40% of model parameters), hence substantially improving state capacity, all without requiring cumbersome and error-prone kernel implementations. It also allows easy integration of sophisticated optimizers, e.g. Muon for online updates. We validate our approach across diverse modalities and tasks, including novel view synthesis with image set, LLMs, and auto-regressive video diffusion. Our approach can scale up to 14B-parameter AR video diffusion model on sequences up to 56K tokens. In our longest sequence experiment, we perform novel view synthesis with 1 million context length. We hope this work will inspire and accelerate new research in the field of long-context modeling and test-time training. Website: https://tianyuanzhang.com/projects/ttt-done-right

Summary

  • The paper's main contribution is LaCT, which updates weights in large token chunks to boost hardware utilization and performance.
  • It introduces a novel architecture combining window attention with large-chunk updates to efficiently manage long-context sequences.
  • Results show significant improvements in language modeling, novel view synthesis, and autoregressive video generation.

"Test-Time Training Done Right": A Technical Overview

Introduction

The paper "Test-Time Training Done Right" addresses the challenges of efficiently handling long-context sequences during inference. Test-Time Training (TTT) enhances model adaptability by updating certain weights during inference, akin to recurrent states in RNNs. However, existing TTT methods suffer from low hardware utilization and inefficiencies in handling long sequences. This paper proposes Large Chunk Test-Time Training (LaCT) to overcome these limitations by updating weights in large chunks, significantly improving hardware utilization and performance across various tasks.

Large Chunk Test-Time Training

LaCT diverges from conventional TTT approaches by operating on large token chunks ranging from 2K to 1M, as opposed to the typical 16 to 64 tokens. This shift enhances GPU utilization and enables scaling of nonlinear state sizes, dramatically improving the model's memory capacity. The method eschews complex kernel implementations, facilitating integration with sophisticated optimizers like Muon for online memory updates (Figure 1).

Figure 1: Using larger chunk sizes significantly improves GPU utilization compared to the original TTT method.
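The utilization gain has a simple arithmetic core: each chunk triggers one fast-weight update, so the number of (small, GPU-starving) updates per sequence falls inversely with chunk size. A rough, illustrative sketch (not from the paper's code; the chunk sizes below are example values):

```python
def num_updates(seq_len: int, chunk_size: int) -> int:
    """Fast-weight updates needed for seq_len tokens (ceiling division)."""
    return -(-seq_len // chunk_size)

# A 1M-token sequence, the paper's longest setting:
print(num_updates(1_000_000, 64))    # small online minibatch: 15625 updates
print(num_updates(1_000_000, 4096))  # large chunk: 245 updates
```

Fewer, larger updates mean each update is a big, dense matrix operation that modern GPUs run at high FLOPs utilization, instead of thousands of tiny ones.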

Model Architecture and Implementation

The core of LaCT's architecture is its ability to handle N-dimensional data through large-chunk operations complemented by window attention for local dependencies within chunks. The LaCT block comprises a window attention layer, a large-chunk TTT layer, and a feed-forward layer, with residual connections and layer normalization following standard Transformer practice (Figure 2).

Figure 2: The basic diagram for a LaCT block.
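The block structure described above can be sketched in a few lines of NumPy. This is a deliberately minimal, hypothetical rendering (function names are ours): the fast weight is reduced to a single fixed linear map purely to show where it sits in the block, whereas the paper uses a nonlinear fast weight that is updated online; attention is single-head and unmasked within each window.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def window_attention(x, window):
    """Single-head attention restricted to non-overlapping windows:
    captures local dependencies inside a chunk."""
    T, d = x.shape
    out = np.empty_like(x)
    for s in range(0, T, window):
        blk = x[s:s + window]
        scores = blk @ blk.T / np.sqrt(d)
        out[s:s + window] = softmax(scores) @ blk
    return out

def lact_block(x, fast_w, ffn_w1, ffn_w2, window=4):
    """One pre-norm LaCT block: window attention -> fast-weight apply -> FFN,
    each wrapped in a residual connection."""
    x = x + window_attention(layer_norm(x), window)
    x = x + layer_norm(x) @ fast_w                   # stand-in for TTT apply
    h = np.maximum(layer_norm(x) @ ffn_w1, 0.0)      # ReLU FFN
    return x + h @ ffn_w2
```

The division of labor is the point: window attention handles short-range structure cheaply, while the fast weight (updated once per large chunk) carries long-range context.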

Large-Chunk TTT Layer

Unlike per-token updates, LaCT aggregates gradients over large chunks for weight updates. This approach facilitates better parallelism and significantly enhances the scaling and effectiveness of fast weights, achieving state-to-parameter size ratios up to 40%.
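A minimal sketch of gradient aggregation over a chunk, assuming a linear fast weight trained with a squared key-to-value reconstruction loss (the paper's fast weights are nonlinear and can be updated with optimizers such as Muon; the names here are illustrative):

```python
import numpy as np

def chunk_update(W, K, V, lr=0.1):
    """One large-chunk TTT step: average the gradient of ||K @ W - V||^2
    over every token in the chunk, then apply a single weight update.
    Per-token TTT would run this once per token (or per 16-64 tokens),
    turning the update into many tiny, GPU-starving matmuls."""
    grad = K.T @ (K @ W - V) / len(K)  # one big, dense gradient matmul
    return W - lr * grad

# Toy demo: memorize a linear key->value map over a 2048-token chunk.
rng = np.random.default_rng(0)
K = rng.standard_normal((2048, 4))
V = K @ rng.standard_normal((4, 4))
W = np.zeros((4, 4))
for _ in range(20):
    W = chunk_update(W, K, V)
```

Because the gradient is a single `K.T @ (...)` product over thousands of tokens, the update is compute-dense, which is what lets the fast weight grow to a large fraction of model parameters without tanking throughput.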

Execution and Parallelism Strategies

The model supports different execution orders for the update and apply operations, shaping data dependencies much as attention masks do in Transformers. Parallelism across the sequence dimension is achieved through context parallelism and tensor parallelism, implemented efficiently in plain PyTorch.
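One way to picture the two basic execution orders (a hedged sketch under our own naming; `step` stands in for any fast-weight update rule): applying before updating keeps strict chunk-wise causality, which ordered sequences need, while updating first lets tokens read the memory written by their own chunk, which suits unordered data such as image sets.

```python
import numpy as np

def apply_then_update(W, chunks, step):
    """Chunk-wise causal: each chunk reads only memory of earlier chunks."""
    outs = []
    for K, V in chunks:
        outs.append(K @ W)   # apply: read the current fast weight
        W = step(W, K, V)    # update: write this chunk into the fast weight
    return outs, W

def update_then_apply(W, chunks, step):
    """Each chunk also reads the memory written from its own tokens;
    no intra-chunk ordering is imposed (useful for sets / N-D grids)."""
    outs = []
    for K, V in chunks:
        W = step(W, K, V)
        outs.append(K @ W)
    return outs, W

# A simple gradient step usable as `step` (same toy loss as above):
def grad_step(W, K, V, lr=0.01):
    return W - lr * K.T @ (K @ W - V) / len(K)
```

Note that both orders apply the identical sequence of updates; they differ only in which state each chunk's outputs are read from, i.e. in the induced dependency pattern.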

Applications and Results

LaCT's versatility is demonstrated across tasks such as novel view synthesis, language modeling, and autoregressive video generation:

  • Novel View Synthesis: LaCT processes up to 1M tokens and outperforms methods like 3D Gaussian Splatting on high-resolution datasets, achieving competitive rendering speed and quality (Figure 3).

    Figure 3: Achievements in novel view synthesis with LaCT.

  • Language Modeling: The model leverages long-context sequences effectively, achieving lower validation losses at larger token indices than GLA and DeltaNet (Figure 4).

    Figure 4: LLM results showing superior long-context modeling capability.

  • Autoregressive Video Generation: LaCT remains competitive with full-attention models in validation loss and outperforms baselines such as Mamba with sliding-window attention (Figure 5).

    Figure 5: Evaluation of LaCT on autoregressive video generation.

Analysis and Future Directions

The paper highlights the benefits of large state size, advanced optimizers like Muon, and nonlinear fast weights in improving model performance. A notable finding is that the combination of larger nonlinear states and Muon test-time optimization surpasses per-token recurrence methods in language tasks (Figure 6).

Figure 6: Impact of state size scaling and test-time optimizers.

Conclusion

LaCT paves the way for efficient, scalable long-context sequence modeling by maximizing hardware utilization and simplifying integration with advanced optimization techniques. Its open-source release aims to catalyze further research and innovation in long-context modeling architectures.
