PagedAttention Algorithm
- PagedAttention is an attention algorithm that segments the key-value cache into fixed-size pages, enabling efficient and scalable LLM inference.
- It employs virtual memory-inspired techniques such as on-demand allocation, copy-on-write, and reference counting to minimize memory waste and redundancy.
- The method integrates with production systems to significantly enhance throughput and reduce latency for long-context inference tasks.
PagedAttention is an attention algorithm and memory management approach for LLM inference, designed to eliminate the inefficiencies of conventional key-value (KV) caching via operating-system-inspired paging techniques. PagedAttention segments the KV cache into fixed-size blocks ("pages") that are dynamically allocated, mapped, and shared across requests, enabling near-zero memory waste, flexible KV sharing, and scalable support for long-context inference. By abstracting KV cache management to block-level operations, PagedAttention serves as a general foundation for further innovations such as per-head compression and fused attention kernel integration.
1. Virtual-Memory–Style KV Cache Paging
PagedAttention borrows from classical virtual memory management, treating each sequence’s KV cache as a collection of fixed-size logical pages (KV blocks), each holding the keys and values for consecutive tokens. Each sequence maintains a page table (block table) mapping logical pages to physical pages in GPU memory. Pages are allocated on demand rather than reserving contiguous memory for the maximal sequence length.
Physical KV blocks need not be contiguous in memory. Copy-on-write and reference counting allow multiple sequences—especially in scenarios such as beam search or parallel sampling—to share prefix blocks, avoiding redundant duplication. Allocation and reclamation of physical memory are performed in units of pages, enabling highly efficient and eviction-friendly memory usage (Kwon et al., 2023).
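As a minimal illustration of the address translation involved (Python; the names and block size here are hypothetical, not vLLM's actual API): a token's logical position resolves to a physical page and an in-page slot through the per-sequence block table.

```python
# Illustrative sketch: translating a token index into a physical KV-cache
# location via a per-sequence block table. Names are hypothetical.
BLOCK_SIZE = 16  # tokens per KV block (page)

def locate_kv(token_idx, block_table):
    """Map a token's logical position to (physical_block_id, offset)."""
    logical_block = token_idx // BLOCK_SIZE   # which page the token lives in
    offset = token_idx % BLOCK_SIZE           # slot within that page
    return block_table[logical_block], offset

# A 40-token sequence needs ceil(40/16) = 3 pages; the physical pages
# backing it need not be contiguous or ordered in GPU memory.
block_table = [7, 2, 11]
print(locate_kv(0, block_table))    # (7, 0)  -> first token, first page
print(locate_kv(39, block_table))   # (11, 7) -> last token, third page
```

Because translation happens per logical block, appending a token only ever touches the last page, and growing a sequence past a page boundary just appends one new entry to its block table.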
2. Paging Data Structures and Mechanisms
PagedAttention is underpinned by several core data structures:
- KV Block: Stores key and value vectors for tokens across layers and heads.
- Block Table: Per-sequence array indexed by logical block, storing physical block ID and occupancy.
- Reference Count Table: Tracks how many logical blocks map to each physical block for safe reclamation.
- Block Engine: Global allocator which partitions a large contiguous DRAM region into physical KV blocks.
- Free List: Manages available page IDs for allocation.
- Eviction Policy: When GPU blocks are exhausted, entire sequences or beam-groups are selected for eviction; blocks are either swapped to CPU RAM or dropped and recomputed.
These mechanisms allow both intra-request sharing (parallel sampling, beam search) and inter-request sharing (prefix-sharing), with support for fast block-level copy-on-write (Kwon et al., 2023, Rehg, 2024).
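How these structures interact can be sketched with a simplified single-pool allocator (illustrative Python; class and method names are hypothetical, and real KV data movement is elided):

```python
# Hypothetical minimal sketch of the block engine: a free list, per-block
# reference counts, and copy-on-write when a shared block must be mutated.
class BlockEngine:
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))  # free list of physical block IDs
        self.refcount = {}                   # physical block ID -> ref count

    def allocate(self):
        blk = self.free.pop()
        self.refcount[blk] = 1
        return blk

    def share(self, blk):
        # Another sequence maps the same physical block (e.g. shared prefix).
        self.refcount[blk] += 1

    def release(self, blk):
        self.refcount[blk] -= 1
        if self.refcount[blk] == 0:          # last reference: reclaim page
            del self.refcount[blk]
            self.free.append(blk)

    def copy_on_write(self, blk):
        # Writing into a block shared by >1 sequence: give the writer a
        # private copy and drop its reference to the shared block.
        if self.refcount[blk] == 1:
            return blk                       # sole owner, write in place
        new_blk = self.allocate()
        # (a real system would also copy the KV data from blk to new_blk)
        self.release(blk)
        return new_blk

engine = BlockEngine(num_blocks=4)
b = engine.allocate()
engine.share(b)                 # two sequences (e.g. two beams) share b
b2 = engine.copy_on_write(b)    # diverging beam gets a private page
assert b2 != b and engine.refcount[b] == 1
```

This mirrors the beam-search case: beams share prefix pages read-only, and only the first divergent write per page pays the copy cost.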
3. Algorithmic Workflow and Complexity
The forward attention computation is partitioned over blocks:
```
function PagedAttention_LayerForward(query q_i, block_table, block_size B, hidden_dim d):
    num_logical_blocks = ceil(i / B)
    numerator, denominator = 0, 0
    for j in 1..num_logical_blocks:
        phys_id = block_table.phys_block_id[j]
        K_j = read_from_phys_block(phys_id, role=Key)      # shape (B, d)
        scores = exp((q_i · K_j.T) / sqrt(d))              # shape (1, B)
        sum_scores = sum(scores)
        denominator += sum_scores
        V_j = read_from_phys_block(phys_id, role=Value)    # shape (B, d)
        numerator += scores · V_j                          # shape (1, d)
    o_i = numerator / denominator
    return o_i
```
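The pseudocode above can be rendered as runnable NumPy to confirm the key invariant: accumulating the softmax numerator and denominator page by page reproduces standard attention over the full, contiguous KV cache. (This sketch omits the physical-block indirection and operates on logically ordered K/V for clarity.)

```python
import numpy as np

def paged_attention(q, K, V, block_size):
    """Block-wise softmax attention: accumulate numerator/denominator per page."""
    d = q.shape[-1]
    num, den = np.zeros_like(q), 0.0
    for start in range(0, K.shape[0], block_size):   # one KV page at a time
        K_j = K[start:start + block_size]
        V_j = V[start:start + block_size]
        scores = np.exp(q @ K_j.T / np.sqrt(d))      # un-normalized weights
        den += scores.sum()
        num += scores @ V_j
    return num / den

rng = np.random.default_rng(0)
q = rng.standard_normal(64)
K = rng.standard_normal((40, 64))   # 40 cached tokens, head dim 64
V = rng.standard_normal((40, 64))

# Reference: standard softmax attention over the whole cache at once.
w = np.exp(q @ K.T / np.sqrt(64))
ref = (w / w.sum()) @ V
assert np.allclose(paged_attention(q, K, V, block_size=16), ref)
```

Note that the last page holds only 8 of 16 slots here (40 tokens, block size 16); a production kernel masks the unused slots, while this sketch simply slices short.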
Complexity:
- Attention complexity: $O(i \cdot d)$ per query token, i.e., $O(n^2 d)$ per sequence (unaltered by paging).
- PagedAttention overhead: $O(\lceil i/B \rceil)$ block-table lookups per query.
- Memory overhead per request: drops from $O(L_{\max})$ reserved slots (contiguous preallocation) to $B\lceil L/B \rceil$ slots (paged), wasting less than one block (Kwon et al., 2023).
4. Memory Efficiency, Fragmentation, and Compression
PagedAttention eliminates most fragmentation by allocating/reclaiming memory in page-sized chunks. Internal fragmentation per sequence of length $L$ and page size $B$ is bounded by one block: $W_{\text{int}}(L, B) = B\lceil L/B \rceil - L < B$.
External fragmentation across sequences is eliminated ($W_{\text{ext}} = 0$), since every physical block has the same fixed size.
PagedAttention also enables advanced compression strategies, such as in KV-Compress. By evicting contiguous KV pages per attention head, variable-rate per-head compression can be physically implemented without incurring extra fragmentation. The page-eviction analysis reduces to a retention ratio $r = n/N$, where $n$ is the number of retained KVs and $N$ the original number; overall savings scale as $(1 - r)$ of the cache footprint, reclaimed in page-sized units.
PagedAttention’s block-level granularity is key for reclaiming real GPU memory upon compression (Rehg, 2024).
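The internal-fragmentation behavior is easy to check numerically (illustrative Python; `internal_waste` is a hypothetical helper, not a library function):

```python
import math

# Worked example: internal fragmentation per sequence is B*ceil(L/B) - L,
# which is strictly less than one page (B token slots).
def internal_waste(L, B):
    return B * math.ceil(L / B) - L

B = 16
for L in (1, 100, 2048, 2049):
    w = internal_waste(L, B)
    assert w < B                  # never more than one partial page
    print(f"L={L:5d}: waste={w} token slots")
# Contrast: contiguous preallocation for max_len=2048 wastes 2048 - L slots,
# e.g. 1948 slots for L=100 versus 12 here.
```

The contrast with contiguous preallocation is the source of the 62–80% waste figures reported for conventional serving systems.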
5. Integration with vLLM, FlexAttention, and Production Systems
PagedAttention is implemented in high-throughput serving systems such as vLLM (Kwon et al., 2023) and IBM’s Foundation Model Stack (via FlexAttention) (Joshi et al., 8 Jun 2025). A centralized scheduler maintains all block tables, free pool, and reference counts. GPU workers execute SPMD code with custom CUDA kernels that respect block mappings.
When integrated via FlexAttention, PyTorch kernels gather scattered KV pages on-the-fly using the per-sequence page tables and hook functions (mask_mod, index_trans) to ensure that only the appropriate keys/values are attended to. The system supports thread block fusion, half-precision computation, and drop-in API replacement for SDPA/FlashAttention in production stacks.
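The gather step can be pictured in plain NumPy (illustrative only; this is not the actual FlexAttention kernel or hook signatures, and the pool layout is a simplifying assumption):

```python
import numpy as np

# Illustrative sketch: gathering a sequence's scattered KV pages into logical
# order via its page table, then masking unused slots in the final,
# partially filled page.
BLOCK_SIZE, HEAD_DIM, NUM_PHYS_BLOCKS = 4, 8, 6
kv_pool = np.random.default_rng(1).standard_normal(
    (NUM_PHYS_BLOCKS, BLOCK_SIZE, HEAD_DIM))   # all physical KV pages

page_table = np.array([5, 0, 3])   # logical page j -> physical page ID
seq_len = 10                        # only 10 of the 12 gathered slots valid

K = kv_pool[page_table].reshape(-1, HEAD_DIM)    # keys in logical order
valid = np.arange(K.shape[0]) < seq_len          # mask_mod-style validity mask
assert K.shape == (12, 8) and valid.sum() == 10
```

In the real integration this indirection happens inside the fused kernel rather than as a materialized gather, which is what keeps the approach memory-neutral.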
This integration has enabled context lengths up to 100k tokens with predictable linear scaling and near-zero waste on commodity hardware (Joshi et al., 8 Jun 2025).
6. Quantitative Performance Analysis
PagedAttention achieves substantial improvements in throughput and memory efficiency:
- Memory waste: existing systems reserve 62–80% of KV-cache memory that is never used; PagedAttention's waste is ~2% (at most one partially filled block per sequence).
- Throughput at a 2 ms/token latency target:
| Model | FasterTransformer | Orca (Oracle) | vLLM (PagedAttention) |
|---|---|---|---|
| OPT-13B | 1.5 req/s | 12 req/s | 33 req/s |
| OPT-66B | 0.6 req/s | 4.5 req/s | 11.2 req/s |
| OPT-175B | 0.3 req/s | 2.0 req/s | 5.5 req/s |
PagedAttention increases request throughput by 2–4× under realistic workloads (Kwon et al., 2023), with further gains from KV-Compress for long-context batch inference (Rehg, 2024).
- Latency scaling (NVIDIA L4, PyTorch FlexAttention):
| Seq. Len | Baseline (no cache) | PagedAttention | Δ Memory |
|---|---|---|---|
| 128 | 0.50 ms/tok | 0.50 ms/tok | 0.02 GB |
| 256 | 1.20 ms/tok | 0.55 ms/tok | 0.02 GB |
| 512 | 2.80 ms/tok | 0.60 ms/tok | 0.04 GB |
| 1024 | 6.40 ms/tok | 0.70 ms/tok | 0.06 GB |
| 2048 | 15.60 ms/tok | 1.00 ms/tok | 0.20 GB |
Latency with PagedAttention grows only ~2× across 128→2048 tokens (0.50→1.00 ms/tok), while baseline latency roughly doubles with each doubling of sequence length (0.50→15.60 ms/tok overall) (Joshi et al., 8 Jun 2025).
7. Implementation Considerations, Trade-Offs, and Limitations
PagedAttention introduces several trade-offs and operational constraints:
- Kernel Overhead: Each CUDA layer incurs ~20–26% higher per-layer latency compared to tuned contiguous kernels due to page-table lookups and branching, but end-to-end throughput benefits outweigh overheads.
- Block Size Selection: Larger block sizes increase GPU parallelism but also internal fragmentation; smaller blocks minimize waste but stress allocation/swap bandwidth. Empirical defaults set for 13B–175B models (Kwon et al., 2023).
- CPU Overhead: Frequent small-block swaps may saturate PCIe bandwidth; recomputation is often preferable for small contexts.
- Inference-Only Support: Current implementations do not support paged back-propagation, limiting PagedAttention to inference (Joshi et al., 8 Jun 2025).
- Power-of-Two Artifacts: Above 2048 tokens, memory usage exhibits jumps at page-size increments, imposing minor overhead (<5%) at scale (Joshi et al., 8 Jun 2025).
Overall, PagedAttention provides a general, efficient framework for scalable LLM inference, enabling fine-grained, page-level memory management, high throughput, and extensible kernel integration for advanced sampling and compression scenarios in production LLM deployment (Kwon et al., 2023, Rehg, 2024, Joshi et al., 8 Jun 2025).