EAGLE-3: Scaling up Inference Acceleration of Large Language Models via Training-Time Test

Published 3 Mar 2025 in cs.CL | (2503.01840v3)

Abstract: The sequential nature of modern LLMs makes them expensive and slow, and speculative sampling has proven to be an effective solution to this problem. Methods like EAGLE perform autoregression at the feature level, reusing top-layer features from the target model to achieve better results than vanilla speculative sampling. A growing trend in the LLM community is scaling up training data to improve model intelligence without increasing inference costs. However, we observe that scaling up data provides limited improvements for EAGLE. We identify that this limitation arises from EAGLE's feature prediction constraints. In this paper, we introduce EAGLE-3, which abandons feature prediction in favor of direct token prediction and replaces reliance on top-layer features with multi-layer feature fusion via a technique named training-time test. These improvements significantly enhance performance and enable the draft model to fully benefit from scaling up training data. Our experiments include both chat models and reasoning models, evaluated on five tasks. The results show that EAGLE-3 achieves a speedup ratio up to 6.5x, with about 1.4x improvement over EAGLE-2. In the SGLang framework, EAGLE-3 achieves a 1.38x throughput improvement at a batch size of 64. The code is available at https://github.com/SafeAILab/EAGLE.

Summary

  • The paper introduces a novel approach by eliminating top-layer feature prediction constraints and using direct token prediction to speed up inference.
  • It leverages multi-layer feature fusion to capture richer semantic information, enhancing token accuracy and robustness.
  • Experimental results demonstrate up to a 6.5x speedup across multiple tasks, supporting efficient real-time LLM applications.

EAGLE-3: Scaling up Inference Acceleration of LLMs via Training-Time Test

Introduction

The paper "EAGLE-3: Scaling up Inference Acceleration of LLMs via Training-Time Test" presents advances in speculative sampling for accelerating LLM inference. The sequential computation of LLMs results in high inference cost and latency. EAGLE-3 introduces techniques that significantly improve performance by predicting tokens directly and fusing multi-layer features, discarding the feature-prediction constraints of earlier EAGLE versions. These methods aim to leverage larger training datasets to enhance inference acceleration without compromising model quality (2503.01840).
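As background, the verification step shared by speculative-sampling methods can be sketched as follows. This is a generic illustration of the standard accept/reject rule, not the paper's implementation: each drafted token is kept with probability min(1, p/q), where p and q are the target and draft model probabilities for that token, and the first rejection truncates the draft.

```python
import random

def verify_draft(pairs, rng=random.random):
    """Speculative-sampling verification (generic sketch).

    `pairs` is a list of (target_prob, draft_prob) for each drafted
    token, in draft order.  Each token is accepted with probability
    min(1, p/q); the first rejection truncates the draft.
    Returns the number of accepted tokens.
    """
    n = 0
    for p, q in pairs:
        if rng() < min(1.0, p / q):
            n += 1
        else:
            break
    return n
```

When the target agrees with the draft (p ≥ q), acceptance is certain; the speedup of methods like EAGLE-3 comes from raising how often that happens.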

Key Improvements in EAGLE-3

Removal of Feature Prediction Constraints: EAGLE-3 eliminates the reliance on predicting top-layer features and instead predicts tokens directly. Removing this constraint allows a more flexible draft model and enables better scaling with increased training data.

Figure 1: Scaling law evaluated on MT-bench using LLaMA-Instruct 3.1 8B as the target model, with the x-axis representing the data scale relative to ShareGPT. The new architectural designs in EAGLE-3 yield a rising scaling curve, which was not observed in previous work.
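The shift in training objective can be illustrated schematically. The sketch below is hypothetical NumPy code, not the authors' training code: it contrasts an EAGLE-style loss, which forces the draft model to regress the target's top-layer feature in addition to predicting the token, with an EAGLE-3-style loss that supervises the token alone.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D logit vector."""
    e = np.exp(x - x.max())
    return e / e.sum()

def eagle_loss(pred_feat, tgt_feat, logits, tgt_token):
    """EAGLE-style objective (sketch): L1 feature regression plus
    token cross-entropy.  The feature term constrains the draft to
    reproduce the target model's top-layer feature."""
    feat_l1 = np.abs(pred_feat - tgt_feat).mean()
    token_ce = -np.log(softmax(logits)[tgt_token])
    return feat_l1 + token_ce

def eagle3_loss(logits, tgt_token):
    """EAGLE-3-style objective (sketch): direct token prediction
    only; the feature-matching constraint is removed."""
    return -np.log(softmax(logits)[tgt_token])
```

Dropping the feature term frees the draft model to use whatever internal representation best predicts the next token, which is what lets it keep improving as training data scales.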

Multi-layer Feature Fusion: Unlike previous methods that use features from a single layer (typically the top layer), EAGLE-3 fuses low-, middle-, and high-level features of the target model. This captures richer semantic information, improving expressiveness and token-prediction accuracy.

Figure 2: Diagram of the EAGLE-3 inference pipeline, illustrating the three steps of the draft model. l, m, and h denote the low-, middle-, and high-level features of the target model, respectively; e denotes the embedding.
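A minimal sketch of the fusion step, assuming the three hidden states share a common dimension d and are combined by a learned projection w of shape (3d, d). Both names are hypothetical; the paper's exact module may differ.

```python
import numpy as np

def fuse_features(low, mid, high, w):
    """Multi-layer feature fusion (sketch): concatenate low-,
    middle-, and high-level hidden states from the target model
    and project back to the hidden size with a learned matrix `w`
    of shape [3*d, d].  The fused vector, together with the token
    embedding, is what the draft model consumes."""
    g = np.concatenate([low, mid, high], axis=-1)  # shape [..., 3*d]
    return g @ w                                   # shape [..., d]
```

The concatenate-then-project pattern keeps the draft model's input width fixed at d while still exposing it to information from multiple depths of the target network.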

Experimental Results

EAGLE-3 was evaluated with various open-source models, demonstrating superior performance across five tasks: multi-turn conversation, code generation, mathematical reasoning, instruction following, and summarization. The authors report a maximum speedup ratio of 6.5x and significant gains over earlier versions such as EAGLE-2, highlighting the effectiveness of EAGLE-3's architectural changes.

Figure 3: Speedup ratios of different methods at temperature = 0.

Acceptance Rates and Speedup: EAGLE-3 markedly improves the acceptance rate of draft tokens compared to prior methods such as EAGLE. The experiments show that acceptance rates stay consistent across draft positions, even as later draft steps consume the draft model's own earlier outputs, reflecting the robustness conferred by training-time test.

Figure 4: Acceptance rates of EAGLE and EAGLE-3 on MT-bench, with LLaMA-Instruct 3.1 8B as the target model.
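To connect acceptance rates to speedup, the classic speculative-decoding estimate of tokens generated per target forward pass is a useful back-of-envelope model. It assumes an i.i.d. per-token acceptance rate and a chain-shaped draft, whereas EAGLE-3 uses draft trees, so it does not reproduce the paper's measured 6.5x figure.

```python
def expected_tokens_per_cycle(alpha, gamma):
    """Expected tokens produced per target forward pass in chain
    speculative decoding: (1 - alpha^(gamma+1)) / (1 - alpha),
    where `alpha` is the per-token acceptance rate and `gamma`
    the draft length.  Simplified i.i.d. model, not the paper's
    tree-based setup."""
    if alpha == 1.0:
        return gamma + 1
    return (1 - alpha ** (gamma + 1)) / (1 - alpha)
```

Under this model, raising alpha from 0.5 to 0.8 at gamma = 4 roughly doubles the tokens per cycle, which is why acceptance-rate gains translate directly into wall-clock speedup.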

Implications and Future Directions

The enhancements provided by EAGLE-3 offer practical benefits in deploying LLMs for real-time applications where latency is a critical constraint. The observed scaling behavior with increased data suggests potential expansions in efficiency for even larger datasets and models. Future developments may explore further optimization avenues such as adaptive layer selection or integrating external knowledge bases to further boost LLM capabilities while minimizing computational overhead.

Conclusion

EAGLE-3 represents a significant step forward in speculative sampling methods, demonstrating notable improvements in inference acceleration for LLMs. By eschewing feature prediction constraints and adopting a multi-layer fusion approach, EAGLE-3 sets the stage for future advancements in efficient LLM deployment. This paper’s findings reveal a clear pathway towards scalable, fast, and efficient LLM inference (2503.01840).