
Goodput-Optimized LLM Serving

Updated 19 January 2026
  • Goodput-optimized LLM serving is defined as maximizing the rate of outputs that meet user-specified latency and quality SLOs, rather than raw throughput.
  • Architectural innovations like phase disaggregation and deadline-aware scheduling enable efficient resource isolation and improved token generation performance.
  • Dynamic batching, adaptive resource allocation, and hardware/software co-design techniques work together to optimize throughput and ensure consistent user experience in large-scale deployments.

Goodput-optimized LLM Serving refers to the class of methodologies, architectures, and algorithms that focus on maximizing the rate of useful LLM outputs (e.g., tokens or completed requests) that meet specified service-level objectives (SLOs) for latency and quality, rather than simply maximizing raw throughput. This discipline draws a formal distinction between system throughput (tokens/sec or requests/sec) and the rate at which completed outputs are delivered within user-specified latency and reliability guarantees—referred to as “goodput.” State-of-the-art research synthesizes queueing theory, resource allocation, scheduling, parallelism strategy, and deep hardware/software co-design to optimize this metric in large-scale, realistic deployments.

1. Goodput: Formal Definitions, Key Metrics, and Motivation

Goodput is defined as the maximum sustained arrival rate or completed output rate at which an LLM service can operate such that a specified fraction of requests meet both their designated SLOs (commonly Time-to-First-Token, TTFT, and Time-per-Output-Token, TPOT) (Hu et al., 6 Jun 2025, Wang et al., 2024, Mei et al., 2024, Zhu et al., 17 Jul 2025). For token-level metrics:

$$\text{Goodput} = \frac{\sum_{i=1}^{N} \mathbb{1}_i \cdot \mathrm{Output}_i}{T}$$

where $\mathbb{1}_i = 1$ if request $i$ meets its SLO constraints and $0$ otherwise, $\mathrm{Output}_i$ is the number of output tokens of request $i$, and $T$ is the measurement window (cf. (He et al., 2024)).
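As an illustration, the token-level goodput metric above can be sketched directly; the field names and SLO thresholds below are hypothetical, not taken from any of the cited systems:

```python
from dataclasses import dataclass

@dataclass
class Request:
    ttft: float          # observed time-to-first-token, seconds
    tpot: float          # observed mean time-per-output-token, seconds
    output_tokens: int   # tokens generated for this request

def goodput(requests, ttft_slo, tpot_slo, window_s):
    """Tokens from SLO-compliant requests, per second of the measurement window."""
    useful = sum(r.output_tokens for r in requests
                 if r.ttft <= ttft_slo and r.tpot <= tpot_slo)
    return useful / window_s
```

Note that a request violating either latency SLO contributes zero tokens, which is what separates goodput from raw throughput.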

Recent analysis critiques naive per-token SLOs and batch-level goodput as failing to reflect end-user experience. The “smooth goodput” framework generalizes goodput, penalizing excessive user idle time and rewarding partial completion, with tunable parameters for latency sensitivity ($\alpha$), user-consumption speed ($V$), and penalty curve $f(\cdot)$ (Wang et al., 2024):

$$\mathrm{SGP} = \frac{\sum_{r \in R}\left(n_r - \alpha f(l_r)\right)}{T}$$

with $n_r$ the number of tokens delivered for request $r$ and $l_r$ the maximal stall experienced by any token in request $r$.
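A minimal sketch of this smooth-goodput computation, assuming each request is summarized by its delivered token count and maximal stall; the quadratic penalty used as the default $f$ is only an example, not the curve from the paper:

```python
def smooth_goodput(requests, alpha, window_s, f=lambda l: l ** 2):
    """requests: iterable of (tokens_delivered, max_stall_s) pairs.

    Each request contributes its delivered tokens minus a stall penalty
    alpha * f(max_stall); the sum is normalized by the window length.
    """
    return sum(n - alpha * f(l) for n, l in requests) / window_s
```

Setting `alpha = 0` recovers plain token throughput, which makes the role of the stall penalty easy to see in experiments.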

This formalism reveals that maximizing true user-centric performance requires optimizing not just aggregate token delivery, but the rate of SLO-compliant outputs—detected via fine-grained, token-level, or streaming deadlines.

2. Architectures and Resource Partitioning for Goodput Attainment

Modern LLM serving architectures are organized to provide explicit resource isolation between the prefill (prompt-ingest) and decode (token-generation) phases, reflecting their divergent profiles: prefill is compute-intensive, while decode is memory-bandwidth-bound and latency-sensitive on a per-token basis.

Goodput-oriented systems such as DistServe (Zhong et al., 2024) and DOPD (Liao et al., 26 Nov 2025) demonstrate that phase disaggregation is critical for scaling to multi-model, high-concurrency, or mixed-length request patterns, especially as context windows grow and prompt lengths become more varied. However, in regimes where network bandwidth or memory becomes the bottleneck, strategies such as “hybrid” aggregation-disaggregation (Wang et al., 4 Aug 2025) and dynamic rebalancing of instance pools shift latency budgets between phases to maximize SLO-satisfying output.

3. Dynamic Scheduling and Batching Algorithms

Central to goodput-optimized LLM serving is the dynamic construction of batches and the scheduling of both requests and tokens. Key approaches include:

  • Adaptive Dynamic Batching: Incoming requests are grouped into real-time batches, with batch size and content selected to maximize GPU utilization while respecting per-request SLOs and memory constraints. Length-aware, SLO-aware, and resource-aware algorithms (e.g., SLO-ODBS in UELLM (He et al., 2024), multi-resource knapsack in AccelGen (Shen et al., 17 Mar 2025), bucket-based adaptive batching in BucketServe (Zheng et al., 23 Jul 2025)) reduce padding, fit batches to heterogeneous length/SLO mixes, and minimize out-of-memory incidents.
  • Deadline- or Slack-Aware Scheduling: Requests are routed and scheduled not only by FCFS or job-length, but by per-token or per-batch deadlines (PolyServe (Zhu et al., 17 Jul 2025), AccelGen (Shen et al., 17 Mar 2025)). Systems use profiling tables or runtime predictors to check admissibility (“wait-time-aware admission” (Zhu et al., 17 Jul 2025)) and exploit “load gradients” to prioritize tightest-SLO requests.
  • Phase-, Length-, and Priority-Aware Policies: Length-aware prefill dispatching (Liao et al., 26 Nov 2025, Wang et al., 4 Aug 2025) and iteration-level chunk sizing tied to SLOs (Shen et al., 17 Mar 2025) mitigate queueing and batching pathologies. SLO-based binning, dynamic chunking, and continuous chunked prefill are key to maximizing the fraction of output tokens delivered within their DSLO budgets.
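The bucketing idea behind these batchers can be sketched as follows; the bucket boundaries and token budget are illustrative values, not parameters from any of the cited systems:

```python
import bisect

def bucket_batches(requests, boundaries, max_batch_tokens):
    """Group (req_id, prompt_len) pairs into length buckets, then cut each
    bucket into batches under a token budget to limit padding and OOM risk."""
    buckets = {}
    for req in requests:
        b = bisect.bisect_left(boundaries, req[1])
        buckets.setdefault(b, []).append(req)
    batches = []
    for _, reqs in sorted(buckets.items()):
        reqs.sort(key=lambda r: r[1])   # similar lengths batch with little padding
        cur, cur_tokens = [], 0
        for req in reqs:
            if cur and cur_tokens + req[1] > max_batch_tokens:
                batches.append(cur)
                cur, cur_tokens = [], 0
            cur.append(req)
            cur_tokens += req[1]
        if cur:
            batches.append(cur)
    return batches
```

Real systems additionally fold per-request SLOs into the bucket choice; this sketch shows only the length dimension.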

The most effective schedulers dynamically route requests to the highest-utilization servers that remain SLO-safe, enabling fine-grained up- and down-scaling and higher realized goodput (Zhu et al., 17 Jul 2025).
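One way to sketch this routing rule; the server fields and the wait predictor are hypothetical stand-ins for the profiling tables and runtime predictors those systems actually use:

```python
def route(exec_est_s, deadline_s, servers, predict_wait):
    """Send the request to the busiest server that can still meet its deadline.

    servers: list of dicts with a 'utilization' field; predict_wait(server)
    estimates queueing delay on that server. Returns None when no server is
    SLO-safe (the request is then rejected or spilled to a looser tier).
    """
    safe = [s for s in servers
            if predict_wait(s) + exec_est_s <= deadline_s]
    return max(safe, key=lambda s: s["utilization"]) if safe else None
```

Preferring the busiest SLO-safe server packs load tightly, leaving other servers idle enough to be scaled down.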

4. Hardware/Software Co-Design and Systemic Optimizations

Goodput-optimized stacks leverage deep hardware–software co-design and analysis to maximize system utilization:

  • Roofline and Performance Models: Empirically calibrated roofline models predict runtime operator latency as a function of batch size and memory profile, supporting rapid simulation of deployment strategies (BestServe (Hu et al., 6 Jun 2025)). Separate GFLOPS measurements for prefill and decode guide kernel code generation and deployment (Sandwich (Zhao et al., 19 May 2025)).
  • Resource Partitioning: Non-uniform layer-to-device assignment, phase-specific kernel selection (e.g. Sandwich (Zhao et al., 19 May 2025)), and NUMA/LLC-aware core allocation are employed in both GPU- and CPU-based serving.
  • Cache/State Management: Stateful caching of key–value embeddings across multi-turn requests (Pensieve (Yu et al., 2023)) enables amortization of prefill costs and sharp reductions in recomputation, raising goodput by 1.5–2× in continuous conversational workloads.
  • Parallelism Tuning: Modern deployments embrace hybrid parallelism (tensor parallel, expert parallel, pipeline parallel), dynamically selecting degrees based on workload characteristics to balance interconnect, compute, and memory contention. LoongServe’s elastic sequence parallelism (ESP, (Wu et al., 2024)) and dynamic scaling of prefill/decoding actors (Liao et al., 26 Nov 2025) demonstrate significant throughput and goodput gains (up to 3.8–5.8× in real-world datasets).
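The roofline estimate used by such simulators reduces to taking the maximum of compute time and memory-traffic time per operator; the peak numbers in the test are placeholders, not measured hardware values:

```python
def roofline_latency_s(flops, bytes_moved, peak_flops, mem_bw_bytes):
    """An operator is compute-bound or memory-bound, whichever dominates."""
    return max(flops / peak_flops, bytes_moved / mem_bw_bytes)
```

Applied per phase, this is why prefill (high FLOPs per byte) and decode (low FLOPs per byte) call for different kernels and partitioning.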

Quantization, lossless/lossy delta compression (DeltaZip (Yao et al., 2023)), and expert/attention sparsification (OmniInfer (Wang et al., 27 Nov 2025)) further increase batch size, model packing, and effective throughput within target SLOs.

5. Cross-Workload and Multi-SLO Adaptation

A central challenge is accommodating heterogeneous and time-varying request patterns, including multi-SLO stratification and volatile arrival bursts:

  • Tiered Binning and Lazy Promotion: PolyServe (Zhu et al., 17 Jul 2025) partitions the fleet by SLO tier, enabling servers to flexibly serve both tight- and loose-SLO requests as capacity allows, with auto-scaling aligned to real workload composition.
  • Dynamic PD Ratio and Hybrid Modes: DOPD (Liao et al., 26 Nov 2025) and TaiChi (Wang et al., 4 Aug 2025) continuously forecast input/output length and arrival rates, then elastically reallocate prefill/decode resources to maintain optimal ratios under varying load. Fine-grained latency-shifting (as in TaiChi) reallocates resource “slack” across requests according to predicted risk of SLO violation, raising goodput by up to 77% in balanced SLO regimes.
  • Resilience to Phase/Network Bottlenecks: Network heterogeneity is modeled as edge-capacity limits in max-flow or MILP placement formulations (Helix (Mei et al., 2024)), enabling joint optimization of layer partition and routing for heterogeneous or geo-distributed clusters—increasing token throughput by up to 3.3× and sharply reducing prompt latency.
  • Profiling-Driven Adaptivity: Profiling of reference models together with lightweight runtime metrics supports low-overhead, dynamic matching of resources to predicted job and system properties (He et al., 2024, Hu et al., 6 Jun 2025).

The ability to maintain high SLO attainment and robustly scale across bursty and multi-modal traffic is a hallmark of advanced goodput optimization frameworks.
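A back-of-envelope version of the prefill/decode (PD) ratio calculation those systems continuously re-solve; the per-instance throughput figures in the test are assumptions for illustration only:

```python
def pd_instances(arrival_rps, mean_prompt_len, mean_output_len,
                 prefill_tok_per_s, decode_tok_per_s):
    """Instances needed per phase under forecast load; their ratio guides
    elastic reallocation between the prefill and decode pools."""
    prefill_need = arrival_rps * mean_prompt_len / prefill_tok_per_s
    decode_need = arrival_rps * mean_output_len / decode_tok_per_s
    return prefill_need, decode_need
```

When the forecast shifts (e.g., longer prompts), the ratio shifts with it, which is the signal for rebalancing instance pools.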

6. Extensions: Speculative Decoding and Edge-Server Collaboration

Optimizing speculative decoding for goodput leverages proxy/draft models to propose tokens that are then verified by the full LLM, with batch speculation length dynamically adapted to current system utilization and empirical token acceptance rate (Liu et al., 2024, Park et al., 16 May 2025). The “SmartSpec” approach dynamically picks batch- and instance-specific speculation length to maximize the expected number of accepted tokens per unit compute time, reducing average latency by up to 3.2× while never degrading under high load (Liu et al., 2024). In edge-server collaborative serving (SpecEdge (Park et al., 16 May 2025)), speculative draft tokens are generated on consumer-grade GPUs at the network edge, with server-side verification pipelined across multiple users, doubling net throughput and lowering inter-token latency without moving KV-caches over the WAN.
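The core of a SmartSpec-style decision can be sketched as choosing the speculation length that maximizes expected accepted tokens per unit of step time, under an i.i.d. per-token acceptance-rate model; the step-time function below is an assumed cost profile, not a measured one:

```python
def best_spec_len(accept_rate, step_time, max_k=8):
    """Pick the speculation length k maximizing expected accepted tokens
    per second, where step_time(k) is the draft+verify latency for k drafts."""
    def expected(k, a):
        # Expected accepted tokens for k drafts with i.i.d. acceptance rate a,
        # including the verifier's bonus token: (1 - a^(k+1)) / (1 - a).
        return (1 - a ** (k + 1)) / (1 - a) if a < 1 else k + 1
    return max(range(1, max_k + 1),
               key=lambda k: expected(k, accept_rate) / step_time(k))
```

Because `accept_rate` and `step_time` both vary with batch size and load, the optimum shrinks under congestion, which is the mechanism that keeps speculation from degrading goodput at high utilization.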

Edge/server codeployment and dynamic edge–server pipeline aware batching represent new frontiers in distributed goodput optimization.

7. Practical Guidelines, Limitations, and Future Directions

Comprehensive evaluations across real-world benchmarks converge on several general principles for practitioners:

  • Physically isolate compute-intensive and bandwidth/memory-limited phases.
  • Adopt dynamic, deadline- or SLO-aware batching and scheduling with length-aware prioritization.
  • Exploit hardware-aware profiling and simulation for fast, cost-effective deployment tuning, and incorporate runtime feedback loops for adaptivity.
  • Deploy resource- and SLO-stratified worker pools, but maintain flexible, lazy tier-promotion to absorb variable workload mixes.
  • Favor scheduling-based tail smoothing and resource allocation over output delay or token buffering, which may degrade user-perceived experience (Wang et al., 2024).

Challenges include accommodating tail-heavy input distributions (Liao et al., 26 Nov 2025), multi-tenant fairness (Wang et al., 2024), multi-objective tuning (throughput, energy, cost), and continued integration of new parallelism primitives (sparse attention, Mixture-of-Experts, edge collaboration). Speculative decoding, reinforcement learning-based offline–online hybrid scheduling (Pang et al., 14 Feb 2025), and user-aligned metric frameworks (e.g., “smooth goodput”) represent active areas of research.

Summary Table: Representative Goodput-Optimized LLM Serving Approaches

| System | Core Approach | Goodput Gains (rel. prior) |
|---|---|---|
| DistServe (Zhong et al., 2024) | Phase disaggregation, resource co-optimization | 2–4.5× |
| DOPD (Liao et al., 26 Nov 2025) | Dynamic PD-ratio adjustment | ≤1.5×, 67.5% TTFT↓ |
| BestServe (Hu et al., 6 Jun 2025) | Roofline + queueing simulation for strategy search | <15% error vs. benchmark |
| LoongServe (Wu et al., 2024) | Elastic sequence parallelism, token-granular allocation | up to 5.8× |
| BucketServe (Zheng et al., 23 Jul 2025) | Dynamic length-bucketing + adaptive batching | 1.93× request load, 3.58× throughput |
| PolyServe (Zhu et al., 17 Jul 2025) | Multi-tier SLO binning + wait-time admission | 1.23×, 92.5% of optimal |
| AccelGen (Shen et al., 17 Mar 2025) | SLO-aware chunking, multi-resource batching | up to 13.7× |
| Pensieve (Yu et al., 2023) | Stateful cache reuse, paged multi-query attention | 1.5–2× |
| ScaleLLM (Yao et al., 2024) | HW/SW end-to-end optimization, Rust gRPC gateway | 1.5× vLLM, 4.3× HF |
| SmartSpec (Liu et al., 2024) | Dynamic speculative decoding, batch/state adaptation | ≤3.2× latency↓ |
| DeltaZip (Yao et al., 2023) | Delta compression, mixed matmul-add serving | 1.5–6× |
| SpecEdge (Park et al., 16 May 2025) | Edge-server collaborative speculation | 2.22× server throughput |
| Sandwich (Zhao et al., 19 May 2025) | Prefill/decode phase-aware CPU compilation | 2.01× throughput↑ |

Optimization at all layers—from resource partitioning to algorithmic scheduling, system kernel design, and end-user observable metrics—is required for maximal goodput in deployed, multi-tenant, SLO-bound LLM serving systems.
