
Kinetics: Rethinking Test-Time Scaling Laws

Published 5 Jun 2025 in cs.LG and cs.CL | arXiv:2506.05333v3

Abstract: We rethink test-time scaling laws from a practical efficiency perspective, revealing that the effectiveness of smaller models is significantly overestimated. Prior work, grounded in compute-optimality, overlooks critical memory access bottlenecks introduced by inference-time strategies (e.g., Best-of-$N$, long CoTs). Our holistic analysis, spanning models from 0.6B to 32B parameters, reveals a new Kinetics Scaling Law that better guides resource allocation by incorporating both computation and memory access costs. The Kinetics Scaling Law suggests that test-time compute is more effective when used on models above a threshold than on smaller ones. A key reason is that in TTS, attention, rather than parameter count, emerges as the dominant cost factor. Motivated by this, we propose a new scaling paradigm centered on sparse attention, which lowers per-token cost and enables longer generations and more parallel samples within the same resource budget. Empirically, we show that sparse attention models consistently outperform dense counterparts, achieving gains of over 60 points in low-cost regimes and over 5 points in high-cost regimes for problem-solving accuracy on AIME, encompassing evaluations on state-of-the-art MoEs. These results suggest that sparse attention is essential, and increasingly important as more compute is invested, for realizing the full potential of test-time scaling, where, unlike training, accuracy has yet to saturate as a function of computation and continues to improve through increased generation. The code is available at https://github.com/Infini-AI-Lab/Kinetics.

Summary

  • The paper introduces the Kinetics scaling law, emphasizing attention costs over parameter counts during model inference.
  • It demonstrates that models above 14B parameters combined with sparse attention offer significant gains in efficiency and problem-solving performance.
  • Sparse Kinetics, implemented via block top-k attention, reduces memory bottlenecks and enhances throughput for scalable LLM deployment.


Introduction

The paper "Kinetics: Rthinking Test-Time Scaling Laws" (arXiv ID: (2506.05333)) presents a novel perspective on test-time scaling laws by re-evaluating them through the lens of practical efficiency, focusing on the memory access bottlenecks encountered during inference. Unlike prior approaches that primarily consider compute-optimality, this work underscores the significance of attention-related costs in the test-time compute landscape. It introduces the Kinetics scaling law, which prioritizes models with parameter sizes above a key threshold, emphasizing that attention mechanisms, rather than parameter counts, are pivotal in determining inference costs.

Test-time scaling strategies such as Best-of-$N$ and Long-CoT have gained traction for their ability to enhance the reasoning capabilities of LLMs. However, they impose substantial inference-time costs, with memory access often overshadowing pure computation. The paper shows how traditional scaling laws overestimate the benefits of smaller models, advocating instead for allocating resources to larger models and sparse attention techniques.

Figure 1: Pareto Frontier for Qwen3 series on AIME24 with Long-CoTs highlighting memory bottlenecks.

Kinetics Scaling Law

Kinetics is derived by accounting for the disproportionate growth of attention costs, relative to parameter costs, at test time, challenging the assumption that smaller models are inherently more efficient when paired with scaling strategies. A critical observation is that models above roughly 14 billion parameters use test-time compute more effectively than smaller ones, shifting the optimal strategy away from merely extending sequence lengths on small models.
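To make the compute-versus-memory distinction concrete, here is a toy per-token decoding cost model. The constants (KV-cache bytes per token and the hardware intensity ratio used to convert memory traffic into "equivalent FLOPs") are illustrative assumptions, not the paper's calibrated values: each decoded token costs roughly 2P FLOPs for the parameters plus a memory-access term that grows with the current context length.

```python
def decode_cost(params, seq_len, kv_bytes_per_token=1e5, intensity=100):
    """Toy per-token decoding cost in "equivalent FLOPs".

    params:             model parameter count
    seq_len:            tokens already held in the KV cache
    kv_bytes_per_token: KV-cache footprint per token (illustrative)
    intensity:          FLOPs charged per byte of memory traffic
                        (illustrative hardware intensity ratio)
    """
    compute = 2 * params                                # parameter FLOPs
    memory = intensity * kv_bytes_per_token * seq_len   # KV-cache reads
    return compute + memory
```

Even with these rough numbers, the attention (memory) term overtakes the parameter term once generations grow long, which is the intuition behind treating attention, not parameter count, as the dominant test-time cost.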

Figure 2: Inference cost dominance by attention, showcasing the substantial impact of attention-related computations.

The paper introduces Sparse Kinetics, a paradigm where sparse attention allows for reduced per-token costs, enabling longer and more parallel generations within the same computational budget. Sparse attention models consistently outperform dense counterparts, realizing impressive gains in problem-solving accuracy in high-cost scenarios.

Empirical results demonstrate that sparse attention improves problem-solving accuracy by over 60 points in low-cost regimes and maintains a lead of over 5 points at higher budgets. These findings underscore hardware efficiency and attention sparsification as primary factors in test-time scaling.
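The budget-reallocation argument can be illustrated with a toy calculation (the cost constants below are made up for illustration, not taken from the paper): under a fixed budget, each decoded token pays a parameter cost plus an attention cost proportional to the context built so far, so scaling the attention term down by a sparsity factor lets more tokens fit in the same budget.

```python
def tokens_within_budget(budget, param_cost, attn_cost, sparsity=1.0):
    """Count how many tokens can be decoded before a fixed budget is
    exhausted. Each token costs param_cost (parameter FLOPs) plus an
    attention term proportional to the current context length, scaled
    by a sparsity factor (1.0 = dense attention, <1.0 = sparse)."""
    spent, tokens = 0.0, 0
    while True:
        step = param_cost + sparsity * attn_cost * tokens
        if spent + step > budget:
            return tokens
        spent += step
        tokens += 1
```

The same mechanism supports more parallel samples instead of longer generations: cheaper tokens can be spent on either axis of test-time scaling.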

Sparse Attention and Implementation

Sparse attention fundamentally transforms the test-time scaling landscape. The paper proposes block top-k attention as a practical approximation that simplifies implementation without sacrificing effectiveness. This approach alleviates memory bottlenecks, enabling a substantial increase in throughput.
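A minimal NumPy sketch of the block top-k idea for a single query vector follows. The hyperparameters and the block-scoring rule (scoring each block by its mean key) are illustrative assumptions; the paper's kernel-level implementation differs, but the structure is the same: rank blocks of the KV cache, keep only the top-k blocks, and attend within them.

```python
import numpy as np

def block_topk_attention(q, K, V, block_size=4, k_blocks=2):
    """Sketch of block top-k sparse attention for one query vector.
    Keys/values are grouped into contiguous blocks; only the k_blocks
    whose mean key best matches the query are attended to."""
    n, d = K.shape
    n_blocks = n // block_size
    Kb = K[: n_blocks * block_size].reshape(n_blocks, block_size, d)
    Vb = V[: n_blocks * block_size].reshape(n_blocks, block_size, d)
    # score each block by the dot product of q with its mean key
    block_scores = Kb.mean(axis=1) @ q
    top = np.argsort(block_scores)[-k_blocks:]
    # attend only within the selected blocks
    Ks = Kb[top].reshape(-1, d)
    Vs = Vb[top].reshape(-1, d)
    scores = Ks @ q / np.sqrt(d)
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ Vs
```

Because block selection reads one summary vector per block rather than every key, the memory traffic per decoded token drops roughly in proportion to the fraction of blocks kept.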

Figure 3: Block top-k attention, illustrating the trade-off between implementation simplicity and attention effectiveness.

The empirical implementation uses FlashInfer and other optimized libraries to demonstrate practical throughput gains. Block top-k attention, while not matching oracle top-k selection, provides a tangible path toward scalable and efficient LLM inference.

Implications and Future Directions

Sparse Kinetics marks a significant shift in the deployment strategy of LLMs. By integrating sparse attention mechanisms, the paper opens avenues for advancing both inference efficiency and model architecture design. Future work could further explore dynamic sparsity patterns and adaptive resource allocation in response to task complexity, potentially leading to even greater optimization in test-time scaling.

Figure 4: Sparse attention boosts test-time scaling, evidencing substantial improvements in efficiency and accuracy.

The transition from token-centric metrics to task-level throughput highlights the broader applicability and societal utility of generative models when implemented efficiently. Co-design between hardware and algorithmic strategies stands out as a pivotal direction for future research, accelerating the march toward sustainable and scalable AI deployment.

Conclusion

The paper establishes Kinetics as a pivotal test-time scaling law, emphasizing the critical role of attention costs. By transitioning to sparse attention paradigms, this study underlines the pathway to more efficient and scalable LLM deployment. As model architectures and inference systems continue to advance, sparse attention is poised to reshape the contours of AI scalability beyond the limits of current pretraining paradigms.
