
PagedAttention Algorithm

Updated 21 January 2026
  • PagedAttention is an attention algorithm that segments the key-value cache into fixed-size pages, enabling efficient and scalable LLM inference.
  • It employs virtual memory-inspired techniques such as on-demand allocation, copy-on-write, and reference counting to minimize memory waste and redundancy.
  • The method integrates with production systems to significantly enhance throughput and reduce latency for long-context inference tasks.

PagedAttention is an attention algorithm and memory management approach for LLM inference, designed to eliminate the inefficiencies of conventional key-value (KV) caching via operating-system-inspired paging techniques. PagedAttention segments the KV cache into fixed-size blocks ("pages") that are dynamically allocated, mapped, and shared across requests, enabling near-zero memory waste, flexible KV sharing, and scalable support for long-context inference. By abstracting KV cache management to block-level operations, PagedAttention serves as a general foundation for further innovations such as per-head compression and fused attention kernel integration.

1. Virtual-Memory–Style KV Cache Paging

PagedAttention borrows from classical virtual memory management, treating each sequence’s KV cache as a collection of fixed-size logical pages (KV blocks), each holding the keys and values for $B$ consecutive tokens. Each sequence maintains a page table (block table) mapping logical pages to physical pages in GPU memory. Pages are allocated on demand rather than reserving contiguous memory for the maximal sequence length.

Physical KV blocks need not be contiguous in memory. Copy-on-write and reference counting allow multiple sequences—especially in scenarios such as beam search or parallel sampling—to share prefix blocks, avoiding redundant duplication. Allocation and reclamation of physical memory are performed in units of pages, enabling highly efficient and eviction-friendly memory usage (Kwon et al., 2023).
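As an illustrative sketch (hypothetical Python, not vLLM’s actual classes), on-demand allocation with a per-sequence block table can look like this:

```python
class BlockTable:
    """Per-sequence mapping of logical KV pages to physical page IDs (sketch)."""
    def __init__(self, block_size):
        self.block_size = block_size
        self.physical_ids = []   # logical block j -> physical block ID
        self.num_tokens = 0


class PagedKVCache:
    """Toy allocator: physical pages are handed out on demand, never reserved
    up front for the maximum sequence length."""
    def __init__(self, num_physical_blocks, block_size):
        self.block_size = block_size
        self.free_list = list(range(num_physical_blocks))

    def append_token(self, table):
        # A fresh page is allocated only when the current one is full.
        if table.num_tokens % self.block_size == 0:
            if not self.free_list:
                raise MemoryError("no free KV blocks: evict or swap first")
            table.physical_ids.append(self.free_list.pop())
        table.num_tokens += 1
        return table.physical_ids[-1]   # physical page receiving this token
```

With $B=16$, a 50-token sequence occupies ceil(50/16) = 4 pages rather than a contiguous reservation sized for the maximum context, and the pages it receives need not be adjacent.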

2. Paging Data Structures and Mechanisms

PagedAttention is underpinned by several core data structures:

  • KV Block: Stores key and value vectors for $B$ tokens across layers and heads.
  • Block Table: Per-sequence array indexed by logical block, storing physical block ID and occupancy.
  • Reference Count Table: Tracks how many logical blocks map to each physical block for safe reclamation.
  • Block Engine: Global allocator which partitions a large contiguous DRAM region into physical KV blocks.
  • Free List: Manages available page IDs for allocation.
  • Eviction Policy: When GPU blocks are exhausted, entire sequences or beam-groups are selected for eviction; blocks are either swapped to CPU RAM or dropped and recomputed.

These mechanisms allow both intra-request sharing (parallel sampling, beam search) and inter-request sharing (prefix-sharing), with support for fast block-level copy-on-write (Kwon et al., 2023, Rehg, 2024).
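The sharing machinery above can be sketched in toy Python (names such as SharedPagedCache are invented for illustration, and actual page contents are omitted):

```python
class SharedPagedCache:
    """Toy copy-on-write KV paging with reference counts (illustrative only)."""
    def __init__(self, num_blocks):
        self.free_list = list(range(num_blocks))
        self.ref_count = {}          # physical block ID -> #logical mappings

    def alloc(self):
        pid = self.free_list.pop()
        self.ref_count[pid] = 1
        return pid

    def fork(self, parent_table):
        # A forked sequence (beam candidate, parallel sample) shares every
        # prefix page; only reference counts change.
        for pid in parent_table:
            self.ref_count[pid] += 1
        return list(parent_table)

    def write(self, table, j):
        # Copy-on-write: a shared page must be duplicated before mutation.
        pid = table[j]
        if self.ref_count[pid] > 1:
            self.ref_count[pid] -= 1
            table[j] = self.alloc()
            # (a real system also copies the page contents here)
        return table[j]

    def free(self, table):
        # Physical pages are reclaimed only when no logical page maps to them.
        for pid in table:
            self.ref_count[pid] -= 1
            if self.ref_count[pid] == 0:
                self.free_list.append(pid)
```

Forking is O(pages) bookkeeping with zero data movement; the copy happens only at the first divergent write, which is what makes beam search and parallel sampling cheap under this scheme.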

3. Algorithmic Workflow and Complexity

The forward attention computation is partitioned over blocks:

import math
import numpy as np

def paged_attention_forward(q_i, i, block_table, phys_blocks, B):
    """Attention output for query token i, reading K/V from non-contiguous pages.
    phys_blocks: physical block ID -> (K, V) pair, each of shape (B, d).
    Assumes full blocks; a real kernel masks the partially filled last page."""
    d = q_i.shape[-1]
    num_logical_blocks = math.ceil(i / B)
    numerator, denominator = np.zeros((1, d)), 0.0
    for j in range(num_logical_blocks):
        phys_id = block_table[j]                     # logical -> physical page
        K_j, V_j = phys_blocks[phys_id]              # each of shape (B, d)
        scores = np.exp(q_i @ K_j.T / np.sqrt(d))    # (1, B) softmax numerators
        denominator += scores.sum()
        numerator += scores @ V_j                    # (1, d)
    return numerator / denominator                   # o_i, shape (1, d)
Appending a new token writes its KV vectors into the sequence’s last block, allocating a new page if that block is full. Copy-on-write is triggered when divergent sequence paths (e.g., beam candidates) attempt to modify a shared block.

Complexity:

  • Attention complexity: $O(L^2 \cdot d \cdot H)$ per sequence (unaltered).
  • PagedAttention overhead: $O(\lceil L/B \rceil)$ page-table lookups per query.
  • Memory overhead per request: drops from $O(L_{\max} \cdot d \cdot H)$ (contiguous preallocation) to $O(L \cdot d \cdot H + B \cdot d \cdot H)$ (paged) (Kwon et al., 2023).
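A back-of-the-envelope calculation makes the memory bullet concrete; the shapes below (roughly OPT-13B scale: model dimension 5120, 40 layers, fp16) and the 500-token request length are assumptions chosen for illustration:

```python
def kv_bytes(num_tokens, d_model, n_layers, bytes_per_elt=2):
    # Keys + values (factor 2), per token, across all layers, in fp16.
    return 2 * num_tokens * d_model * n_layers * bytes_per_elt

# Assumed, illustrative shapes (roughly OPT-13B: d_model=5120, 40 layers)
d_model, n_layers, B, L_max = 5120, 40, 16, 2048
L = 500  # actual decoded length of this request

contiguous = kv_bytes(L_max, d_model, n_layers)        # reserved up front
paged = kv_bytes(-(-L // B) * B, d_model, n_layers)    # ceil(L/B) full pages

print(f"contiguous: {contiguous / 2**30:.2f} GiB")     # contiguous: 1.56 GiB
print(f"paged:      {paged / 2**30:.2f} GiB")          # paged:      0.39 GiB
print(f"waste avoided: {1 - paged / contiguous:.0%}")  # waste avoided: 75%
```

The paged figure exceeds the true requirement by at most one block per sequence, which is the source of the ~2% waste bound cited below.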

4. Memory Efficiency, Fragmentation, and Compression

PagedAttention eliminates most fragmentation by allocating/reclaiming memory in page-sized chunks. Internal fragmentation per sequence of length $L$ and page size $P$ is quantified as

$n = \left\lceil \frac{L}{P} \right\rceil$

$\text{wasted\_tokens} = n \cdot P - L$

$\text{frag\_ratio}_{\mathrm{int}} = \frac{n \cdot P - L}{n \cdot P}$

External fragmentation across $S$ sequences is:

$\text{frag\_ratio}_{\mathrm{ext}} = 1 - \frac{\sum_i L_i}{\sum_i n_i \cdot P}$
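These ratios translate directly into code; the helper functions and example lengths below are illustrative:

```python
import math

def internal_frag(L, P):
    """Pages used, wasted token slots, and internal fragmentation ratio
    for one sequence of length L at page size P."""
    n = math.ceil(L / P)
    wasted = n * P - L
    return n, wasted, wasted / (n * P)

def external_frag(lengths, P):
    """External fragmentation ratio across a batch of sequences."""
    allocated = sum(math.ceil(L / P) * P for L in lengths)
    return 1 - sum(lengths) / allocated

n, wasted, r = internal_frag(L=1000, P=16)
print(n, wasted, round(r, 4))                       # 63 8 0.0079
print(round(external_frag([1000, 37, 512], 16), 4)) # 0.0121
```

At $P=16$ the worst case wastes 15 token slots per sequence, so fragmentation stays around a percent for realistic lengths.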

PagedAttention also enables advanced compression strategies, such as in KV-Compress. By evicting contiguous KV pages per attention head, variable-rate per-head compression can be physically implemented without incurring extra fragmentation. The page-eviction analysis is:

$\text{Compression ratio per head:}\quad r_h = 1 - K_h / N_h$

where $K_h$ is the number of retained KVs for head $h$ and $N_h$ the original number. The overall savings across $H$ heads with $N$ KVs each is

$S = 1 - \frac{\sum_h K_h}{H \cdot N}$

PagedAttention’s block-level granularity is key for reclaiming real GPU memory upon compression (Rehg, 2024).
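A small sketch of the savings computation, with hypothetical per-head retention counts:

```python
def compression_savings(retained, original_per_head):
    """Per-head ratios r_h = 1 - K_h/N_h and overall S = 1 - sum(K_h)/(H*N),
    for H heads that each originally cached original_per_head KVs."""
    H = len(retained)
    r = [1 - K / original_per_head for K in retained]
    S = 1 - sum(retained) / (H * original_per_head)
    return r, S

# Hypothetical: 4 heads, 1024 cached KVs each, variable per-head retention
r, S = compression_savings(retained=[1024, 512, 256, 256],
                           original_per_head=1024)
print([round(x, 2) for x in r], round(S, 2))  # [0.0, 0.5, 0.75, 0.75] 0.5
```

Because eviction happens in whole pages per head, the savings $S$ corresponds to physical blocks actually returned to the free list, not merely logically masked entries.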

5. Integration with vLLM, FlexAttention, and Production Systems

PagedAttention is implemented in high-throughput serving systems such as vLLM (Kwon et al., 2023) and IBM’s Foundation Model Stack (via FlexAttention) (Joshi et al., 8 Jun 2025). A centralized scheduler maintains all block tables, free pool, and reference counts. GPU workers execute SPMD code with custom CUDA kernels that respect block mappings.

When integrated via FlexAttention, PyTorch kernels gather scattered KV pages on-the-fly using the per-sequence page tables and hook functions (mask_mod, index_trans) to ensure that only the appropriate keys/values are attended to. The system supports thread block fusion, half-precision computation, and drop-in API replacement for SDPA/FlashAttention in production stacks.

This integration has enabled context lengths up to 100k tokens with predictable linear scaling and near-zero waste on commodity hardware (Joshi et al., 8 Jun 2025).

6. Quantitative Performance Analysis

PagedAttention achieves substantial improvements in throughput and memory efficiency:

  • Memory waste: Existing systems reserve 62–80% of KV cache capacity that is never used; PagedAttention wastes only ~2% (≤1 block per sequence).
  • Throughput at 2 ms/token normalized latency:

Model      FasterTransformer   Orca (Oracle)   vLLM (PagedAttention)
OPT-13B    1.5 req/s           12 req/s        33 req/s
OPT-66B    0.6 req/s           4.5 req/s       11.2 req/s
OPT-175B   0.3 req/s           2.0 req/s       5.5 req/s

PagedAttention increases request throughput by 2–4× under realistic workloads, and up to 5.18× with KV-Compress for long-context batch inference.

  • Latency scaling (NVIDIA L4, PyTorch FlexAttention):

Seq. Len   Baseline (no cache)   PagedAttention   Δ Memory
128        0.50 ms/tok           0.50 ms/tok      0.02 GB
256        1.20 ms/tok           0.55 ms/tok      0.02 GB
512        2.80 ms/tok           0.60 ms/tok      0.04 GB
1024       6.40 ms/tok           0.70 ms/tok      0.06 GB
2048       15.60 ms/tok          1.00 ms/tok      0.20 GB

Latency with PagedAttention grows only about $2\times$ across 128→2048 tokens ($O(L)$ scaling), while baseline latency grows more than $30\times$ over the same range ($O(L^2)$) (Joshi et al., 8 Jun 2025).

7. Implementation Considerations, Trade-Offs, and Limitations

PagedAttention introduces several trade-offs and operational constraints:

  • Kernel Overhead: Each CUDA layer incurs ~20–26% higher per-layer latency compared to tuned contiguous kernels due to page-table lookups and branching, but end-to-end throughput benefits outweigh overheads.
  • Block Size Selection: Larger block sizes increase GPU parallelism but also internal fragmentation; smaller blocks minimize waste but stress allocation/swap bandwidth. Empirical defaults set $B=16$ for 13B–175B models (Kwon et al., 2023).
  • CPU Overhead: Frequent small-block swaps may saturate PCIe bandwidth; recomputation is often preferable for small contexts.
  • Inference-Only Support: Current implementations do not support paged back-propagation, limiting PagedAttention to inference (Joshi et al., 8 Jun 2025).
  • Power-of-Two Artifacts: Above 2048 tokens, memory usage exhibits jumps at page-size increments, imposing minor overhead (<5%) at scale (Joshi et al., 8 Jun 2025).
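The block-size trade-off in the second bullet can be made concrete with a quick sweep over assumed request lengths (illustrative values, not a benchmark):

```python
import math

def mean_internal_frag(lengths, B):
    """Average fraction of allocated KV slots left empty at block size B."""
    fracs = []
    for L in lengths:
        n = math.ceil(L / B)
        fracs.append((n * B - L) / (n * B))
    return sum(fracs) / len(fracs)

lengths = [37, 128, 500, 1000, 1777]   # assumed request lengths
for B in (8, 16, 32, 64, 128):
    print(f"B={B:3d}  mean internal frag = {mean_internal_frag(lengths, B):.1%}")
```

Fragmentation rises with $B$ while per-block kernel work and allocator traffic fall, which is the tension the $B=16$ default resolves empirically.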

Overall, PagedAttention provides a general, efficient framework for scalable LLM inference, enabling fine-grained, page-level memory management, high throughput, and extensible kernel integration for advanced sampling and compression scenarios in production LLM deployment (Kwon et al., 2023, Rehg, 2024, Joshi et al., 8 Jun 2025).
