
High-Throughput LLM Inference

Updated 18 February 2026
  • High-throughput LLM inference is a set of techniques that optimize token processing by combining batching strategies, dynamic scheduling, and memory management.
  • It employs queueing theoretic models and multi-bin batching to reduce latency and enhance hardware utilization for processing diverse requests.
  • Hardware-centric optimizations, including KV-cache compression and specialized parallelism, significantly boost throughput in interactive and large-context workloads.

High-throughput LLM inference refers to the design and optimization of techniques, algorithms, and systems for maximizing the number of tokens, requests, or agentic operations that can be processed per unit time during LLM inference. Achieving high throughput is critical for interactive applications, offline batch analytics, long-context document processing, and agentic schedulers, where latency and resource cost are tightly coupled to how efficiently the underlying hardware is utilized. The following sections summarize the principal techniques, control-theoretic and algorithmic frameworks, KV-cache and memory management methods, parallelism and batching strategies, and architecture-level approaches that define the current state of high-throughput LLM inference.

1. Queueing-Theoretic Foundations and Control Policies

Queueing-theoretic models formalize LLM inference as a batch-service queueing system, with Poisson arrivals and stochastic generation lengths, often with batches of fixed size $B$ and service times determined by the straggler request. For a batch of $B$ requests with independent lengths $L_j$, the batch service time is $T_{\mathrm{batch}} = \max_{j=1,\dots,B} L_j$, and steady-state throughput (tokens/sec) is $c = B/\mathbb{E}[T_{\mathrm{batch}}]$ (Guldogan et al., 2024). The upper bound, achieved under perfect parallelism, is $c_{\max} = B/\mathbb{E}[L]$.

Batching strategies that mix requests with highly variable lengths exhibit poor utilization: hardware must wait for the longest request in a batch. Throughput-optimal policies seek to minimize $\mathbb{E}[T_{\mathrm{batch}}]$ by grouping similar-length requests together. The multi-bin batching framework partitions the predicted length interval $[\ell_{\min}, \ell_{\max}]$ into $k$ bins with equi-probable boundaries; requests are assigned to bins by predicted execution length, and batches are formed only within bins. Throughput improves monotonically with $k$; as $k \to \infty$ and with perfect prediction, $c_k \to c_{\max}$. Notably, this bin-count tuning is underpinned by convexity analysis and order statistics for various service-time distributions (Guldogan et al., 2024).
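The throughput gap between mixed-length and binned batching can be checked with a small Monte Carlo sketch. This is illustrative only, not the paper's experiment: exponential request lengths and a perfect length predictor are assumptions, and sorting by predicted length stands in for the $k \to \infty$ limit of multi-bin batching.

```python
import random

def service_time(lengths, B):
    """Total time to serve requests in batches of B when the straggler gates the batch."""
    return sum(max(lengths[i:i + B]) for i in range(0, len(lengths) - B + 1, B))

random.seed(0)
B, n = 8, 80_000
lengths = [random.expovariate(1.0) for _ in range(n)]

# k = 1: a single bin, i.e. naive batching in arrival order mixes all lengths
c_naive = n / service_time(lengths, B)

# Sorting by (perfectly predicted) length approximates k -> infinity:
# every batch contains near-equal lengths, so the straggler penalty vanishes
c_binned = n / service_time(sorted(lengths), B)

# Perfect-parallelism bound c_max = B / E[L]
c_max = B / (sum(lengths) / n)
```

With exponential lengths the naive case pays the order-statistic penalty $\mathbb{E}[\max_j L_j] \approx H_B \,\mathbb{E}[L]$, so `c_naive` lands well below `c_max`, while `c_binned` approaches it.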

In single-server contexts, any work-conserving scheduling policy (one that always forms full batches whenever enough tokens are present) achieves the throughput bound dictated by the load condition $\lambda(m_p + m_d) < b/t_b$, where $m_p, m_d$ are the mean prefill and decode token counts per request, $b$ is the per-batch token budget, and $t_b$ is the per-batch processing time (Li et al., 10 Apr 2025). For multi-engine, homogeneous clusters, decentralizing to local work-conserving admission with global load balancing achieves near-linear throughput scaling, provided the load per engine stays within the individual bound.
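The load condition translates directly into a capacity check. A minimal sketch, with parameter names and numbers chosen for illustration rather than taken from the paper:

```python
def is_stable(arrival_rate, mean_prefill, mean_decode, token_budget, batch_time):
    """Work-conserving load condition: lambda * (m_p + m_d) < b / t_b."""
    offered = arrival_rate * (mean_prefill + mean_decode)  # tokens/sec offered
    capacity = token_budget / batch_time                   # tokens/sec served
    return offered < capacity

# e.g. 3 req/s with 512 prefill + 128 decode tokens against a
# 2048-token budget processed every 0.25 s (8192 tokens/sec capacity)
ok = is_stable(3.0, 512, 128, 2048, 0.25)        # 1920 < 8192
overload = is_stable(15.0, 512, 128, 2048, 0.25)  # 9600 > 8192
```

Under the theory cited above, any work-conserving policy saturates this bound, so the check is a quick first-pass sizing tool for a single engine.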

2. System-Level Batch Scheduling and Algorithmic Innovations

Modern high-throughput LLM serving systems incorporate algorithmic scheduling designs that optimize batching, chunking, and concurrency:

  • Multi-bin batching policies: Group requests by predicted execution length before batching to reduce straggler-induced stalls (Guldogan et al., 2024).
  • Chunked prefill (Sarathi-Serve): The prefill phase is broken into near-equal-sized chunks, enabling the pipeline to inject prompt processing without stalling ongoing decodes, raising throughput while meeting tight time-between-token (TBT) SLOs (Agrawal et al., 2024). Each iteration packs decode tokens and partial prefill chunks, and dynamically admits new requests up to a token budget.
  • Work-conserving mixed batching (Orca, Sarathi-Serve): These engines always co-batch up to the token budget, mixing prefill and decode tokens as permitted, proven to achieve maximal throughput in both queueing-theoretic analysis and production traces (Li et al., 10 Apr 2025).
  • Dynamic SplitFuse (DeepSpeed-FastGen): Ensures that each forward pass consists of a fixed number of tokens, splitting long prompts and fusing short requests to keep the hardware saturated, and interleaving prompt and decode progress to minimize both tail and average latency (Holmes et al., 2024).
  • Resource-aware batching for heterogeneous and SLO-mixed workloads (AccelGen): Dynamically adapts chunk sizes, batching, and queueing priorities to maximize GPU and KV-cache utilization under per-request SLOs, with both compute and memory constraints forming the batching policy's cost function (Shen et al., 17 Mar 2025).
  • Agent-level congestion control (CONCUR): For stateful, agentic LLM workloads, batch throughput may collapse due to mid-horizon KV-cache thrashing; an AIMD-style controller at the agent admission layer maintains cache pressure within thresholds, adjusting concurrency to avoid cache eviction and recomputation cycles (Chen et al., 30 Jan 2026).
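The token-budget packing that chunked prefill and mixed batching share can be sketched in a few lines. The data structures and field names below are hypothetical, not Sarathi-Serve's or Orca's API; the point is the ordering: decodes first, then partial prefill chunks up to the budget.

```python
from dataclasses import dataclass

@dataclass
class Request:
    prompt_len: int           # total prompt tokens needing prefill
    prefill_done: int = 0     # prompt tokens already prefilled
    decoding: bool = False    # True once prefill is complete

def pack_iteration(active, waiting, token_budget):
    """One scheduler step: decode tokens first, then prefill chunks up to the budget."""
    batch, budget = [], token_budget
    # 1. Every ongoing decode contributes exactly one token.
    for r in active:
        if r.decoding and budget > 0:
            batch.append((r, 1))
            budget -= 1
    # 2. Fill the remainder with (possibly partial) prefill chunks.
    for r in active + waiting:
        if not r.decoding and budget > 0:
            chunk = min(r.prompt_len - r.prefill_done, budget)
            batch.append((r, chunk))
            r.prefill_done += chunk
            budget -= chunk
            if r.prefill_done == r.prompt_len:
                r.decoding = True  # prefill complete; switch to decode
    return batch

# One step: two ongoing decodes plus a fresh 100-token prompt, 64-token budget.
active = [Request(prompt_len=10, prefill_done=10, decoding=True) for _ in range(2)]
waiting = [Request(prompt_len=100)]
batch = pack_iteration(active, waiting, token_budget=64)
# 2 decode tokens + one 62-token partial prefill chunk fill the budget exactly
```

Because the prompt is split rather than admitted whole, the two decodes make progress in the same iteration instead of stalling behind a 100-token prefill, which is the mechanism behind the TBT SLO compliance described above.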

3. Memory, KV-Cache, and Hardware Path Optimizations

Throughput in LLM inference is typically memory-bandwidth-bound, due to the rapid growth of the per-sequence key-value (KV) cache with batch size and context length.

  • Layer-wise and selective KV-cache compression (PyramidInfer): By leveraging inter-layer redundancy and recent attention consistency, PyramidInfer prunes cached keys/values in deeper layers while retaining nearly all accuracy, enabling up to 54% memory reduction and over 2× throughput improvement with negligible (<2%) quality loss (Yang et al., 2024).
  • Sparse and low-rank KV-cache (ShadowKV): Decomposes key matrices into low-rank factors and offloads values to host memory; upon decoding, only a sparse, on-the-fly selected subset of the cache is reconstructed for attention computations. This design supports up to 6× larger batch sizes and 3.04× throughput over baseline full-attention methods (Sun et al., 2024).
  • Quantized and segmental KV-cache (Intel GPU solution, FlexGen, LLM-CoOpt): Segment-based KV-cache partitioning and 4-bit/8-bit (FP8) quantization for both model weights and KV reduces memory and the minimum batch size required for full hardware saturation. FlexGen leverages an LP-formulated offloading policy to orchestrate data movement between GPU, CPU, and disk, making it the first system to run 175B-parameter models at 1 token/s on a single 16GB GPU (Wu et al., 2023, Sheng et al., 2023). LLM-CoOpt adds skip-logic for KV-cache writes and tunes per-layer FP8 scales to recover >3% throughput per technique (Kong et al., 10 Feb 2026).
  • Cache placement offload (Glinthawk, PCM): Extracting the KV-cache and attention from the accelerator tier enables nearly arbitrary scaling of batch size and sequence length via horizontal scaling of memory-rich CPUs. Glinthawk separates compute and memory tiers, achieving up to 16.3× throughput improvement on long contexts, with careful pipeline, shard, and network balancing (Hamadanian et al., 20 Jan 2025). Pervasive Context Management (PCM) in heterogeneous HPC settings persistently maintains model contexts in GPU memory across preemptions, amortizing the model load overhead and enabling close-to-linear scaling with cluster size (Phung et al., 15 Oct 2025).
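As a concrete illustration of the quantization idea above, here is a minimal symmetric 4-bit round-trip in pure Python. Real systems use per-channel or per-layer scales and packed integer storage; this sketch only shows why the memory drops 4× versus FP16 and how the error stays bounded by the quantization step.

```python
def quantize_int4(values):
    """Symmetric 4-bit quantization: floats -> ints in [-8, 7] plus one scale."""
    scale = max(abs(v) for v in values) / 7 or 1.0  # fall back to 1.0 if all zero
    q = [max(-8, min(7, round(v / scale))) for v in values]
    return q, scale

def dequantize(q, scale):
    """Reconstruct approximate floats from 4-bit codes."""
    return [x * scale for x in q]

kv = [0.12, -0.5, 0.33, 0.9, -0.07, 0.0]   # toy KV-cache values
q, s = quantize_int4(kv)
restored = dequantize(q, s)
max_err = max(abs(a - b) for a, b in zip(kv, restored))
# Each value now needs 4 bits instead of 16, and |error| <= scale / 2
```

The same bound-by-half-a-step argument explains why FP8/INT4 KV caches preserve accuracy well when scales are tuned per layer, as in the LLM-CoOpt approach cited above.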

4. Parallelism, Dynamic Scheduling, and Distributed Throughput Scaling

Distributed inference of large LLMs exploits parallelism modes (tensor, pipeline, sequence, expert) and dynamic scheduling to maximize tokens/sec:

  • Dynamic model re-sharding (Seesaw): Recognizes that prefill (prompt) and decode stages require distinct optimal parallelization—pipeline-parallel for prefill, tensor-parallel for decode. Seesaw dynamically re-shards both model weights and KV-cache, using a tiered (CPU + GPU) KV buffer to minimize the transition overhead, and “cycle-sized” transitions to maximize per-phase batch efficiency. Throughput increases of up to 1.78× over vLLM are reported (Su et al., 9 Mar 2025).
  • Temporally-disaggregated pipeline parallelism (TD-Pipe): Eliminates pipeline bubbles by grouping many prefill batches, then many decode iterations, minimizing idle periods due to phase switch, and leveraging hierarchy controllers and AI-based workload prediction for prefill/decode transition scheduling. Throughput is increased by up to 1.91× versus tensor-parallel and 2.73× versus naive pipeline parallel schemes (Zhang et al., 12 Jun 2025).
  • Shift Parallelism: Dynamically selects between sequence parallelism (high-throughput, good for high traffic) and tensor parallelism (low-latency, good for low traffic) based on the active workload, maintaining a common KV-cache layout for switching efficiency. This yields 1.5× throughput and up to 9.16× lower TTFT compared to static parallel schemes (Hidayetoglu et al., 20 Sep 2025).
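The Shift Parallelism selection logic reduces to a per-step policy switch driven by the active workload. A toy sketch, where the threshold value and mode labels are illustrative assumptions rather than the paper's parameters:

```python
def choose_parallelism(active_requests, sp_threshold=32):
    """Pick a parallelism mode per scheduling step.

    High traffic: sequence parallelism maximizes aggregate throughput.
    Low traffic:  tensor parallelism minimizes per-token latency (TTFT/TBT).
    A shared KV-cache layout is what makes this switch cheap at runtime.
    """
    return "sequence" if active_requests >= sp_threshold else "tensor"

mode_busy = choose_parallelism(128)  # heavy traffic -> throughput mode
mode_idle = choose_parallelism(4)    # light traffic -> latency mode
```

The design choice worth noting is that the switch costs nothing only because both modes read the same KV-cache layout; without that invariant, each transition would require a cache re-shard like the one Seesaw explicitly engineers around.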

5. Hardware-Centric and Platform-Specific Strategies

Hardware specialization and end-to-end system design further contribute to maximum throughput:

  • Apple Silicon (MLX, MLC-LLM, …): Evaluations on the M2 Ultra show platform-mature runtimes with high throughput (MLX: ~230 tok/s), low latency (MLC-LLM: sub-0.04 s TTFT on moderate contexts), extensive quantization support, and sophisticated prompt/KV caching. MLX dominates batch throughput; MLC-LLM excels on long, interactive sequences (Rajesh et al., 9 Oct 2025).
  • End-to-end serving frameworks (ScaleLLM): Combining highly-optimized Rust gateway stacks, connection pooling, asynchronous batching, and customized engine scheduling, ScaleLLM achieves 4.3× lower latency and 1.5× higher throughput than existing endpoints (vLLM/FireworksAI/TogetherAI) at high concurrency (Yao et al., 2024).
  • Custom hardware kernels (Intel GPU solution): Wide kernel fusion (single pass GEMMs + attention), sequence-first cache layouts, and a custom SDPA kernel combine for up to 27× throughput on Intel platforms (Wu et al., 2023).

6. Scheduling for Heterogeneous and Agentic Workloads

Adaptive scheduling methods are critical in clusters with heterogeneous accelerators and mixed agentic or offline inference patterns:

  • Deployment and instance-level optimization: Integer programming and workload-aware profiling are used to analytically search deployment/tensor partitioning configurations to optimize memory, compute, and throughput constraints (Xiong et al., 18 Apr 2025).
  • Online scheduling: Request dispatch policies minimize maximum pending workload per instance via a memory-pressure-aware cost function, yielding up to 122.5% throughput improvement in single-machine mixes and 33.6% in multi-node clusters over naive round-robin (Xiong et al., 18 Apr 2025).
  • Agentic batch concurrency (CONCUR): For workloads with ReAct-style agents, traditional batch or LRU policies often lead to middle-phase thrashing of the KV-cache. CONCUR introduces an additive-increase, multiplicative-decrease controller that admits or pauses agents to maintain high cache hit rates and optimal throughput, improving batch performance by up to 4.09× on stateful agent workloads (Chen et al., 30 Jan 2026).
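The CONCUR-style controller described above follows the classic AIMD shape. A minimal sketch, where the watermarks and step sizes are illustrative assumptions, not the paper's tuned values:

```python
def aimd_step(concurrency, cache_pressure, high_wm=0.9, low_wm=0.7,
              increase=1, decrease=0.5, floor=1):
    """One AIMD control step over admitted agent concurrency.

    cache_pressure: fraction of KV-cache capacity in use (0.0 - 1.0).
    Additive increase probes for capacity while pressure is low;
    multiplicative decrease backs off sharply to stop eviction and
    recomputation thrashing once pressure crosses the high watermark.
    """
    if cache_pressure > high_wm:
        return max(floor, int(concurrency * decrease))  # congestion: halve
    if cache_pressure < low_wm:
        return concurrency + increase                   # headroom: probe up
    return concurrency                                  # in band: hold

c = 16
c = aimd_step(c, 0.95)  # congestion -> multiplicative decrease
after_decrease = c
c = aimd_step(c, 0.50)  # headroom -> additive increase
c = aimd_step(c, 0.80)  # within the band -> hold
```

Holding concurrency inside the watermark band is what keeps the working set of agent KV-caches resident, avoiding the mid-horizon eviction/recompute cycles that collapse batch throughput.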

7. Outlook: Limitations, Open Challenges, and Research Directions

Although the current generation of high-throughput LLM inference systems delivers near-optimal hardware utilization, several limitations and open challenges remain:

  • Predictor quality: Throughput gains from multi-bin batching and SLO-based chunking are contingent on accurate request-length prediction; throughput remains robust up to moderate prediction error, but systematic misclassification can regress performance to worst-case batching (Guldogan et al., 2024, Shen et al., 17 Mar 2025).
  • KV-cache and attention scalability: Even with advanced compression (PyramidInfer, ShadowKV) or offloading regimes (Glinthawk), I/O bandwidth and per-layer redundancy remain principal bottlenecks as LLMs and input contexts scale.
  • Dynamic and heterogeneous workloads: Agentic systems, highly time-varying batch queues, and mixed-SLO application settings require sophisticated admission, scheduling, and resource allocation controllers that adaptively tune policies and load-balancing across endpoints (Chen et al., 30 Jan 2026, Shen et al., 17 Mar 2025).
  • Integration with future model architectures: As Transformer variants evolve (MoE, GQA, sparse attention), throughput-optimal scheduling techniques and memory/cache compression methods must co-adapt, especially when quantized or heavily sharded models are deployed (Kong et al., 10 Feb 2026).
  • Practical tuning and deployment: System operators are advised to begin with lightweight profiling to derive queueing and memory models specific to their workload, hardware, and prediction regime, and to adopt work-conserving, mixed batching as a throughput baseline (Li et al., 10 Apr 2025, Xiong et al., 18 Apr 2025).

Continued progress in memory hierarchies, model engineering, hardware-aware scheduling algorithms, and co-design of admission control, batching, and parallelism will define the next frontier in high-throughput LLM inference.

