Shared-KV Consolidator for Scalable LLM Inference
- Shared-KV Consolidators are high-performance systems that manage, compress, and share key-value caches across GPUs and nodes to accelerate transformer inference.
- They employ techniques like zero-copy DMA, CXL shared memory, and prefix-aware caching to minimize latency and boost throughput.
- Advanced synchronization, hash structures, and NUMA-aware allocation ensure memory consistency and efficient multi-node cache sharing in distributed ML workloads.
A Shared-KV Consolidator is a high-performance system for managing, compressing, and sharing key-value (KV) caches in large-scale deep learning deployments, particularly in rack-scale, multi-node, or multi-instance contexts. The consolidator enables efficient reuse, transfer, and management of KV tensors across computational workers or nodes, addressing bottlenecks in GPU memory, interconnect bandwidth, and time-to-first-token (TTFT) during transformer-based inference. Implementations such as TraCT leverage CXL shared memory and prefix-aware caching, allowing direct GPU access without intermediate networking layers, while other systems target disk, SSD, or distributed memory substrates for multi-instance sharing.
1. Architectural Foundations and Rack-Scale Design
Shared-KV Consolidators are architected to maximize prefix reuse and minimize data movement overhead in decomposed LLM inference pipelines. At rack scale, TraCT (Yoon et al., 20 Dec 2025) exemplifies this architecture through:
- Central CXL Type-3 shared memory, mapped via DAX into every server in the rack.
- Compute-heavy prefill workers and latency-critical decode workers both access the same CXL region as byte-addressable memory.
- The consolidator's pipeline:
  1. Client requests prompt prefill.
  2. Prefill worker probes a prefix-index hash table in the CXL region.
  3. Cache hit: the GPU issues a DMA read from shared CXL to retrieve the required KV blocks.
  4. Cache miss: the GPU computes K/V, writes the block to CXL via DMA, and updates the prefix index.
  5. Once all required blocks are present or generated, prefill signals the decode worker, which issues CXL→GPU DMA for all blocks and proceeds with autoregressive generation.
  6. Decoding attends fully over the shared KV cache; memory is freed and blocks are eventually evicted by LRU logic.
All KV blocks are physically stored as aligned, contiguous arrays with per-block metadata footers. The system ensures true zero-copy DMA between GPU and CXL: no NIC, host DRAM, or software bounce buffer intervention.
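The probe-hit-miss flow above can be sketched in Python. The dictionaries standing in for the CXL-resident prefix index and block store, the `prefill` helper, and the byte-string payloads are illustrative stand-ins, not TraCT's actual interfaces; real payload movement is zero-copy GPU DMA, not Python assignment.

```python
import hashlib

# Hypothetical in-memory stand-ins: real TraCT keeps the prefix index and
# KV blocks in CXL shared memory and moves payloads by zero-copy GPU DMA.
prefix_index = {}   # prefix hash -> CXL block offset
cxl_blocks = {}     # offset -> KV payload bytes
_next = [0]         # next free block offset

def prefill(prompt_blocks):
    """One pass of the prefill pipeline: probe, reuse on hit, compute on miss."""
    offsets, hits, prev = [], [], b""
    for tokens in prompt_blocks:
        prev = hashlib.sha256(prev + tokens).digest()   # prefix-preserving hash
        if prev in prefix_index:                        # hit: reuse shared block
            offsets.append(prefix_index[prev]); hits.append(True)
        else:                                           # miss: compute K/V, publish
            off = _next[0]; _next[0] += 1
            cxl_blocks[off] = b"KV(" + tokens + b")"    # placeholder for tensors
            prefix_index[prev] = off
            offsets.append(off); hits.append(False)
    # signal decode worker: it DMA-reads `offsets` from CXL and starts decoding
    return offsets, hits
```

A second request that shares a prompt prefix then hits on the shared blocks and only computes its novel suffix.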
2. Synchronization and Consistency Techniques
Non-coherent shared memory presents challenges in cross-node synchronization, metadata consistency, and cache-line state visibility. TraCT implements a two-tier synchronization mechanism:
- Tier 1: Per-node local locks (pthread_mutex analogues) restrict entry into global arbitration, bounding cross-node contention by the number of nodes.
- Tier 2: A global CXL lock array organized into slot states ({IDLE, WAITING, LOCKED}). An entrant marks its slot WAITING, spins until its slot becomes LOCKED, and releases on exit; metadata writes are made visible via explicit clflush + mfence boundaries to prevent stale data propagation.
This software arbitration strategy works without global atomics, ensuring lock-safe metadata manipulation and predictable cross-node consistency for operations on the shared hash table, allocator bitmap, and control pages.
3. Prefix-Aware KV Caching and Hash Structures
Efficient prefix-oriented cache hit logic is foundational. Shared-KV Consolidators use:
- Prefix-preserving block hashing: For a token sequence $t_1,\dots,t_n$ split into fixed-size blocks $b_1, b_2, \dots$, block hashes are defined recursively as $h_i = H(h_{i-1} \,\|\, b_i)$ (with $h_0$ empty), so each block's hash uniquely encodes its entire prefix.
- Fixed-size block indices: With block size $B$, token $t_j$ maps to block $b_{\lceil j/B \rceil}$; hashes are stored in a static, load-factor-bounded hash table alongside per-block metadata (CXL offset, ref_count, state, LRU links).
- Linear probing and eviction: On insertion, probe for an available bucket, allocate CXL storage, DMA the block in, and clflush to mark the entry READY; hits atomically update ref_count; LRU logic manages head/tail updates and eviction with cache-line-isolated metadata.
This structure enables robust prefix-aware deduplication, both for cache hits on context reuse and for eviction safety via reference counting in concurrent multi-node workloads.
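A minimal sketch of prefix-preserving hashing feeding a static, linearly probed index. The block size, table capacity, metadata tuple, and function names are invented for illustration; the real index lives in CXL memory and flushes each mutated cache line.

```python
import hashlib

BLOCK = 4          # tokens per block (hypothetical; real systems use larger blocks)
CAP = 64           # static table size, giving a bounded load factor

keys = [None] * CAP      # block hash per bucket
meta = [None] * CAP      # (cxl_offset, ref_count) per bucket

def block_hashes(tokens):
    """Prefix-preserving hashes: h_i = H(h_{i-1} || tokens of block i)."""
    hashes, prev = [], b""
    for i in range(0, len(tokens), BLOCK):
        prev = hashlib.sha256(prev + bytes(tokens[i:i + BLOCK])).digest()
        hashes.append(prev)
    return hashes

def probe(h):
    """Linear probing: first bucket that is empty or already holds h."""
    i = int.from_bytes(h[:8], "little") % CAP
    for _ in range(CAP):
        if keys[i] is None or keys[i] == h:
            return i
        i = (i + 1) % CAP
    raise RuntimeError("index full")

def lookup_or_insert(h, alloc):
    """Hit: bump ref_count and return the offset. Miss: allocate and publish."""
    i = probe(h)
    if keys[i] == h:
        off, rc = meta[i]
        meta[i] = (off, rc + 1)          # pin the block for this request
        return off, True
    keys[i], meta[i] = h, (alloc(), 1)   # real system: DMA + clflush -> READY
    return meta[i][0], False
```

Because the hash chains through all preceding blocks, two prompts that share a prefix produce identical hashes for the shared blocks and deduplicate automatically.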
4. Non-Coherent Memory Management and Allocation
Maintaining metadata-payload isolation is essential in non-coherent shared memory.
- Payloads: Large block tensors (keys, values) are accessed only via DMA, never touched by CPUs—obviating cross-cache visibility issues.
- Metadata: Clusters small control structures into cache-line-aligned control pages, always modified with clflush + mfence.
- Allocation: Global chunk allocator in CXL, bitmap-driven, with two-tier locks for mutation; nodes request whole chunks, sub-allocate blocks within their own DRAM for per-node local heaps. The CXL object store publishes shared root pointers for all global indices.
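The bitmap-driven chunk allocator might look like the following sketch; the chunk count, chunk size, and function names are chosen for illustration, and the real allocator mutates the bitmap only under the two-tier lock before flushing the affected cache line.

```python
CHUNKS = 16                      # hypothetical number of CXL chunks
CHUNK_BYTES = 2 * 1024 * 1024    # illustrative chunk granularity

bitmap = 0                       # one bit per chunk, 1 = allocated

def alloc_chunk():
    """Scan the bitmap for the first free chunk; return its CXL byte offset."""
    global bitmap
    for i in range(CHUNKS):
        if not (bitmap >> i) & 1:
            bitmap |= 1 << i            # real system: mutate under two-tier lock,
            return i * CHUNK_BYTES      # then clflush + mfence the bitmap line
    raise MemoryError("CXL region exhausted")

def free_chunk(offset):
    global bitmap
    bitmap &= ~(1 << (offset // CHUNK_BYTES))

def suballoc(chunk_offset, block_bytes, n):
    """Per-node bump allocation of n KV blocks inside an owned chunk."""
    return [chunk_offset + k * block_bytes for k in range(n)]
```

Nodes grab whole chunks through the global path and then carve out individual KV blocks locally, keeping the contended bitmap off the fast path.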
5. Performance Modeling and Quantitative Impact
Formalism for TTFT and throughput in consolidators:
Let
- $M$ = per-token KV footprint, set by model shape (layers × heads × head_dim × bytes per element, keys plus values)
- $n$ = context length (tokens)
- $f$ = fraction of blocks served from cache
- $S_{\mathrm{KV}} = M \cdot n$ = full KV cache size
- $B_{\mathrm{CXL}}$, $B_{\mathrm{RDMA}}$ = effective bandwidths of the CXL and RDMA paths
- $\ell_{\mathrm{NIC}}$ = NIC hop latency
- $T_{\mathrm{pf}}$ = time to compute the full prefill from scratch.
In a simplified model:
- RDMA-based TTFT: $\mathrm{TTFT}_{\mathrm{RDMA}} \approx (1-f)\,T_{\mathrm{pf}} + f\,S_{\mathrm{KV}}/B_{\mathrm{RDMA}} + \ell_{\mathrm{NIC}}$
- CXL-based TTFT: $\mathrm{TTFT}_{\mathrm{CXL}} \approx (1-f)\,T_{\mathrm{pf}} + f\,S_{\mathrm{KV}}/B_{\mathrm{CXL}}$
- TTFT speedup ratio: $\mathrm{TTFT}_{\mathrm{RDMA}} / \mathrm{TTFT}_{\mathrm{CXL}}$
- Peak throughput scaling: throughput grows roughly in proportion to the TTFT reduction, since prefill workers are released earlier to admit new requests.
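A short calculation makes the model concrete, under a simple TTFT decomposition: recompute the missed fraction of the prefill, and fetch the hit fraction over the interconnect. The bandwidths, cache size, hop latency, and prefill time below are illustrative assumptions, not measurements from the paper.

```python
def ttft(hit_frac, kv_bytes, bw, hop_latency, prefill_s):
    """TTFT model: recompute the missed fraction, fetch the hit fraction."""
    fetch = hit_frac * kv_bytes / bw + hop_latency
    compute = (1 - hit_frac) * prefill_s
    return fetch + compute

# Illustrative numbers (assumptions): an 8 GB KV cache, 70% reusable,
# 10 GB/s CXL vs 5 GB/s effective RDMA with a 30 us NIC hop.
f, kv = 0.7, 8e9
rdma = ttft(f, kv, bw=5e9, hop_latency=30e-6, prefill_s=2.0)
cxl = ttft(f, kv, bw=10e9, hop_latency=0.0, prefill_s=2.0)
speedup = rdma / cxl
```

Even with these modest assumptions the CXL path wins; larger bandwidth gaps and higher hit fractions push the ratio further toward the reported range.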
Empirical results: On real hardware (Niagara 2.0 CXL Type-3, 10 GB/s), TraCT reduces average TTFT by up to 9.8×, P99 prefill latency by 6.2×, and increases peak end-to-end throughput by 1.6× compared to RDMA/DRAM-based baselines; GPU compute occupancy and power both drop noticeably (Yoon et al., 20 Dec 2025).
6. Implementation Best Practices and Principles
Rigorous system design for Shared-KV Consolidation requires:
- Zero-copy DMA for all GPU-CXL transfers, with the CXL region registered as page-locked memory so no host-DRAM bounce buffer is involved.
- Offset-only addressing: Store offsets, not virtual pointers, in shared structures for full locality.
- Cache-line isolation on hot metadata (buckets, locks, indices).
- Two-tier locking and explicit flushes for cross-node synchronization.
- Static hash tables/indices to avoid dynamic heap churn.
- NUMA awareness: Pin lock-manager and metadata threads to NUMA node attached to CXL root complex.
- Eviction safety via live ref_counting and proper delayed LRU removals.
Adherence to these principles results in a robust, scalable consolidator design capable of maintaining sub-microsecond metadata operation latencies, high-bandwidth rack-wide storage, and predictable, fault-tolerant cache sharing behavior.
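The eviction-safety principle can be sketched as a ref-counted LRU in which pinned blocks are skipped and their removal deferred until the last reader releases them; the class and method names are hypothetical, not TraCT's code.

```python
from collections import OrderedDict

class SafeLRU:
    """Ref-counted LRU: blocks pinned by in-flight requests are never evicted;
    removal is delayed until the last reader releases them."""

    def __init__(self):
        self.blocks = OrderedDict()   # offset -> ref_count, oldest first

    def pin(self, off):
        self.blocks[off] = self.blocks.get(off, 0) + 1
        self.blocks.move_to_end(off)  # refresh recency on every access

    def unpin(self, off):
        self.blocks[off] -= 1

    def evict(self):
        """Evict the oldest unpinned block; skip (delay) pinned ones."""
        for off, rc in list(self.blocks.items()):
            if rc == 0:
                del self.blocks[off]
                return off
        return None                   # everything pinned: defer eviction
```

Skipping pinned entries rather than blocking on them keeps eviction non-blocking on the hot path while still guaranteeing no in-use KV block is reclaimed.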
7. Comparison with Alternative Shared-KV Management Mechanisms
TraCT is distinguished by tightly integrated CXL shared memory as the KV-transport and cache substrate. By comparison,
- Disk-based systems (e.g., Shared RAG-DCache) service multi-instance LLM inference through a disk-resident cache and proactive prefetching, but incur higher storage/retrieval latencies (Lee et al., 16 Apr 2025).
- Near-data processing (e.g. Co-KV) splits host and SSD compaction workloads for LSM-tree key-value stores, using collaborative offloading for throughput and write amplification reduction (Sun et al., 2018).
- Advanced transformers employ cross-layer fusion (e.g., FusedKV, FusedKV-Lite) or SVD-based compression (xKV, CommonKV) to consolidate or compress KV caches within/between layers, reducing memory further at the model representation level (Lin et al., 3 Dec 2025, Chang et al., 24 Mar 2025, Wang et al., 22 Aug 2025).
- Semantic sharing (e.g., SemShareKV) aligns token-level KV reuse between prompts via fuzzy matching and RoPE, targeting semantic redundancy rather than strictly shared prefix (Zhao et al., 29 Sep 2025).
- MoE/Multi-agent frameworks (PiKV, mixSGA, MoSKA, KVCOMM) deploy sharded, scheduled, or cross-context KV consolidation for distributed, heterogeneous compute environments (Liu et al., 2 Aug 2025, Rhee et al., 8 Nov 2025, Song et al., 16 Jun 2025, Ye et al., 14 Oct 2025).
Hence, Shared-KV Consolidators implemented on rack-scale CXL currently deliver the strongest reported TTFT reductions and throughput gains for high-concurrency, multi-GPU transformer deployments, and represent the most tightly coupled form of GPU/shared-memory LLM serving in contemporary distributed ML infrastructures.