Q-Filters: Leveraging QK Geometry for Efficient KV Cache Compression

Published 4 Mar 2025 in cs.CL and cs.AI | (2503.02812v1)

Abstract: Autoregressive LLMs rely on a Key-Value (KV) Cache, which avoids re-computing past hidden states during generation, making it faster. As model sizes and context lengths grow, the KV Cache becomes a significant memory bottleneck, which calls for compression methods that limit its size during generation. In this paper, we discover surprising properties of Query (Q) and Key (K) vectors that allow us to efficiently approximate attention scores without computing the attention maps. We propose Q-Filters, a training-free KV Cache compression method that filters out less crucial Key-Value pairs based on a single context-agnostic projection. Contrarily to many alternatives, Q-Filters is compatible with FlashAttention, as it does not require direct access to attention weights. Experimental results in long-context settings demonstrate that Q-Filters is competitive with attention-based compression methods such as SnapKV in retrieval tasks while consistently outperforming efficient compression schemes such as Streaming-LLM in generation setups. Notably, Q-Filters achieves a 99% accuracy in the needle-in-a-haystack task with a x32 compression level while reducing the generation perplexity drop by up to 65% in text generation compared to Streaming-LLM.

Abstract PDF Upgrade to Chat

Summary

The paper introduces Q-Filters, a training-free method that uses SVD-based analysis of Query vectors to filter and compress KV caches in autoregressive models.
It achieves up to ×32 compression while maintaining lower perplexity in long-context language modeling without needing model retraining.
The method offers a context-agnostic approach, compatible with efficient attention mechanisms like FlashAttention, and scales across diverse benchmarks.

Q-Filters: Leveraging QK Geometry for Efficient KV Cache Compression

Introduction

The paper "Q-Filters: Leveraging Query-Key Geometry for Efficient KV Cache Compression" (2503.02812) addresses the challenge of memory bottlenecks in autoregressive LLMs, where the Key-Value (KV) Cache grows with increasing model sizes and context lengths. As LLMs evolve to handle longer contexts, the KV Cache poses a significant hurdle due to its memory demands, slowing down inference times. Q-Filters is proposed as a training-free method to compress the KV Cache without requiring access to attention weights, making it compatible with efficient attention algorithms like FlashAttention.

Figure 1: Accuracy vs Time to First Token (TTFT) tradeoff for Llama-3.1-70B-Instruct, measured on the Ruler dataset with $\times$ 8 compression.

Methodology

Q-Filters work by exploring the geometrical properties of Query (Q) and Key (K) vectors. The authors identify that the main direction in the hidden space, spanned by the principal eigenvector of $Q$ , can efficiently estimate the importance of Key-Value pairs. This direction is context-agnostic and therefore allows for a robust filtering approach across different tasks and datasets.

In practice, Q-Filters uses Singular Value Decomposition (SVD) of Query vectors from a calibration dataset to determine principal directions, which serve as the filters. During inference, Key vectors are projected onto these predefined directions to evaluate their relevance, maintaining compatibility with models and attention mechanisms that do not provide direct access to attention weights.

Figure 2: Spearman rank correlation between KV compression scoring metrics and the observed attention $S^h$ for Llama-3.2-1B, for K-norm (top) and Q-Filters (bottom).

Experimental Results

The paper demonstrates the effectiveness of Q-Filters across several benchmarks, including language modeling on the Pile dataset, needle-in-a-haystack retrieval tasks, and the Ruler dataset for long context modeling. Notably, Q-Filters achieved competitive results with up to $\times$ 32 compression levels, reducing perplexity drops significantly compared to alternatives such as Streaming-LLM.

Figure 3: Generation performance for a KV Cache size limited to 512 items for Llama-3.1-8B (top) and Llama-3.1-70B (bottom).

In memory-constrained scenarios, Q-Filters consistently results in lower perplexity, indicating better retention of crucial context information. The approach excels particularly in scenarios requiring efficient query and retrieval across very long sequences due to its context-agnostic design.

Discussion

Q-Filters offer a compelling balance between computational efficiency and performance, highlighting its utility for real-time applications requiring rapid inference along extensive prompts. The method's ability to operate without retraining the model or accessing specific attention mechanisms positions it as a pivotal contribution to KV Cache compression.

While the primary focus is on geometric considerations in Query-Key relations, future developments may explore adaptive Q-Filters generation and integration with emerging attention acceleration techniques. The scalability and broad applicability of Q-Filters suggest potential paths for deployment in commercial settings that demand versatile model efficiency without sacrificing accuracy.

Figure 4: First token latency across KV Cache compression methods of Llama-3.2-8B with a length of 64k prompt.

Conclusion

The introduction of Q-Filters marks a significant advancement in addressing the challenges posed by expanding context lengths in autoregressive models. By leveraging the geometric properties of Query and Key vectors, Q-Filters provide a context-agnostic, training-free solution for efficient KV Cache compression. This approach not only reduces memory footprint but also enhances inference speed and accuracy across diverse applications, contributing notably to the practical scalability of long-context LLMs.