Titanus: Enabling KV Cache Pruning and Quantization On-the-Fly for LLM Acceleration

Published 23 May 2025 in cs.AR | (2505.17787v1)

Abstract: LLMs have gained great success in various domains. Existing systems cache Key and Value within the attention block to avoid redundant computations. However, the size of key-value cache (KV cache) is unpredictable and can even be tens of times larger than the weights in the long context length scenario. In this work, we propose Titanus, a software-hardware co-design to efficiently compress the KV cache on-the-fly. We first propose the cascade pruning-quantization (CPQ) method to reduce the KV cache movement. The hierarchical quantization extension strategy is introduced to tackle the non-independent per-channel quantization issue. To further reduce KV cache movement, we transfer only the non-zero KV cache between the accelerator and off-chip memory. Moreover, we customize a two-stage design space exploration framework for the CPQ method. A novel pipeline and parallelism dataflow is designed to reduce the first token generation time. Experiments show that Titanus achieves 159.9x (49.6x) and 34.8x (29.2x) energy efficiency (throughput) compared to Nvidia A100 GPU and FlightLLM respectively. The code for Titanus is available at https://github.com/peilin-chen/Titanus-for-LLM-acceleration.


Summary

The paper presents a novel cascade pruning-quantization (CPQ) method that dynamically prunes insignificant KV cache elements and quantizes remaining data, achieving up to 159.9× energy efficiency improvement.
It leverages a Compute-in-Memory design and dedicated pruning units to reduce redundant computations, resulting in a 49.6× throughput gain over an Nvidia A100 GPU, and 34.8× energy efficiency and 29.2× throughput gains over FlightLLM.
Experimental results validate Titanus's effectiveness, showing a 58.9% reduction in KV cache movements and significant improvements in both energy efficiency and throughput for LLM inference.

Introduction

As LLMs grow in scale, the computational and memory demands of inference have become a primary optimization target. Titanus is a software-hardware co-design that prunes and quantizes the KV cache dynamically, or "on-the-fly", during inference. Its core contribution, the Cascade Pruning-Quantization (CPQ) method, is combined with a Hierarchical Quantization Extension strategy to reduce energy consumption and data movement overhead.

Algorithm Design

CPQ Compression Method

CPQ applies pruning and quantization in sequence to reduce KV cache movement. It first prunes insignificant elements of the KV cache, setting to zero those selected by a threshold. It then quantizes only the remaining non-zero elements, so both steps run as a single flow with little added computational overhead.
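The prune-then-quantize flow described above can be sketched as follows. This is a minimal illustration and not the paper's implementation: the magnitude-based pruning criterion, the asymmetric uniform quantizer, the 4-bit default, and the function names `prune`, `quantize_nonzero`, and `dequantize` are all assumptions made for the example.

```python
from typing import List, Optional, Tuple


def prune(values: List[float], threshold: float) -> List[float]:
    """Zero out elements whose magnitude is below the threshold (assumed criterion)."""
    return [v if abs(v) >= threshold else 0.0 for v in values]


def quantize_nonzero(
    values: List[float], bits: int = 4
) -> Tuple[List[Optional[int]], float, float]:
    """Uniformly quantize only the surviving non-zero elements.

    Returns (codes, scale, zero), where pruned positions are marked None.
    """
    nonzero = [v for v in values if v != 0.0]
    if not nonzero:
        return [None] * len(values), 1.0, 0.0
    lo, hi = min(nonzero), max(nonzero)
    levels = (1 << bits) - 1
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [None if v == 0.0 else round((v - lo) / scale) for v in values]
    return codes, scale, lo


def dequantize(codes: List[Optional[int]], scale: float, zero: float) -> List[float]:
    """Reconstruct values; pruned positions come back as exact zeros."""
    return [0.0 if c is None else c * scale + zero for c in codes]


if __name__ == "__main__":
    kv = [0.02, -1.5, 0.9, -0.01, 2.1, 0.3]
    pruned = prune(kv, threshold=0.1)
    codes, scale, zero = quantize_nonzero(pruned, bits=4)
    print(codes, dequantize(codes, scale, zero))
```

Because the quantization range is computed from the non-zero survivors only, the pruned elements do not widen the range, which is one plausible reason to prune before quantizing rather than after.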

Figure 1: Overview of cascade pruning-quantization method.

Hierarchical Quantization Extension Strategy

This strategy addresses the non-independent per-channel quantization issue. Per-channel quantization parameters are shared across the tokens of a channel, so each newly generated token can in principle affect them. The hierarchical extension adjusts the precision for newly generated tokens during inference without re-quantizing earlier tokens, and it keeps the memory overhead of the quantization parameters small.
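One way to picture "extend instead of re-quantize" is sketched below. This is an assumption-heavy illustration, not the procedure from the paper: it freezes the per-channel scale and zero-point computed from the initial tokens and lets the integer code range grow when a new token falls outside it. The class `ChannelQuantizer` and its methods are hypothetical names.

```python
from typing import List


class ChannelQuantizer:
    """Hypothetical per-channel quantizer with a frozen scale and zero-point."""

    def __init__(self, initial_values: List[float], base_bits: int = 4) -> None:
        lo, hi = min(initial_values), max(initial_values)
        self.base_bits = base_bits
        self.zero = lo
        self.scale = (hi - lo) / ((1 << base_bits) - 1) if hi > lo else 1.0
        self.codes = [self._encode(v) for v in initial_values]

    def _encode(self, value: float) -> int:
        return round((value - self.zero) / self.scale)

    def append(self, value: float) -> int:
        """Quantize a new token with the existing parameters.

        Codes may fall outside [0, 2**base_bits - 1]; earlier codes are untouched.
        """
        code = self._encode(value)
        self.codes.append(code)
        return code

    def bits_needed(self) -> int:
        """Bits required to cover the current code span (assumed extension level)."""
        span = max(self.codes) - min(self.codes)
        return max(self.base_bits, span.bit_length())


if __name__ == "__main__":
    q = ChannelQuantizer([0.0, 1.0, 3.0], base_bits=4)
    q.append(4.0)  # outside the initial range
    print(q.codes, q.bits_needed())
```

The design point the sketch tries to capture is that only the code width changes when an out-of-range token arrives, so previously stored codes and their parameters remain valid.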

Figure 2: Hierarchical quantization extension strategy.

Hardware Architecture

Titanus Structure

Titanus uses a Compute-in-Memory (CIM) design in which digital CIM macros hold the static weights and perform the computations on them. The architecture adds dedicated pruning and quantization units that process KV cache elements on-the-fly, exploiting sparsity and transferring only the non-zero KV cache between the accelerator and off-chip memory.
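The saving from moving only non-zero KV cache elements can be illustrated with a simple bitmap encoding. The source does not specify the transfer format, so the bitmap-plus-payload layout and the helper names `pack_nonzero`, `unpack_nonzero`, and `transfer_bits` are assumptions chosen for clarity.

```python
from typing import List, Tuple


def pack_nonzero(values: List[int]) -> Tuple[List[int], List[int]]:
    """Split a pruned vector into a 1-bit-per-element mask and its non-zero payload."""
    bitmap = [1 if v != 0 else 0 for v in values]
    payload = [v for v in values if v != 0]
    return bitmap, payload


def unpack_nonzero(bitmap: List[int], payload: List[int]) -> List[int]:
    """Rebuild the dense vector by walking the mask and consuming the payload."""
    remaining = iter(payload)
    return [next(remaining) if bit else 0 for bit in bitmap]


def transfer_bits(bitmap: List[int], payload: List[int], bits_per_value: int) -> int:
    """Bits moved under the assumed format: one mask bit per element plus the payload."""
    return len(bitmap) + len(payload) * bits_per_value


if __name__ == "__main__":
    pruned_codes = [0, 3, 0, 0, 7, 1, 0, 0]
    bitmap, payload = pack_nonzero(pruned_codes)
    dense_bits = len(pruned_codes) * 4
    print(transfer_bits(bitmap, payload, bits_per_value=4), "vs", dense_bits)
```

Under this format the compressed form is smaller than the dense form only when enough elements are pruned to pay for the mask, which is why the benefit depends on the sparsity the pruning step produces.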

Figure 3: Titanus core-level overall architecture. CE and SZ denote the computing engine and scale-zero buffer, respectively.

Computing Engine and Pruning Unit

The Computing Engine features a zero-detection mechanism to skip redundant computations on sparse data, while the Pruning Unit selectively processes non-zero KV cache elements, directly supporting the CPQ method.
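In software terms, zero detection amounts to skipping multiply-accumulate operations whose cached operand is zero. The sketch below is a behavioral model only, not a description of the hardware datapath; the function name `dot_with_zero_skip` and the returned operation count are illustrative assumptions.

```python
from typing import List, Tuple


def dot_with_zero_skip(query: List[float], key: List[float]) -> Tuple[float, int]:
    """Dot product that skips terms where the cached key element is zero.

    Returns the result and the number of multiply-accumulates actually performed.
    """
    accumulator = 0.0
    macs = 0
    for q_elem, k_elem in zip(query, key):
        if k_elem == 0.0:  # zero detection: a pruned element contributes nothing
            continue
        accumulator += q_elem * k_elem
        macs += 1
    return accumulator, macs


if __name__ == "__main__":
    query = [1.0, 2.0, 3.0, 4.0]
    pruned_key = [0.0, 0.5, 0.0, -1.0]
    result, macs = dot_with_zero_skip(query, pruned_key)
    print(result, macs)  # -3.0 with 2 of 4 multiply-accumulates performed
```

The result is identical to a dense dot product, while the work scales with the number of non-zero key elements rather than with the vector length.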

Figure 4: Computing engine design for dot-product attention with zero detection functionality.

Experimental Results

Empirical evaluation shows that Titanus achieves $159.9\times$ the energy efficiency and $49.6\times$ the throughput of an Nvidia A100 GPU, and $34.8\times$ the energy efficiency and $29.2\times$ the throughput of FlightLLM. The CPQ method reduces KV cache movement by $58.9\%$.

Figure 5: Energy efficiency (Token/J) and throughput (Token/s) improvement.

Conclusion

Titanus offers a robust framework for accelerating LLMs through strategic KV cache optimization, leveraging hardware-specific designs to maximize performance gains. The integration of algorithmic pruning and dynamic quantization represents a practical solution to the computational challenges posed by large-scale LLM inference, paving the way for more efficient deployment of LLMs across diverse applications.
