
Unbalanced Chunked Prefill Framework

Updated 2 February 2026
  • Unbalanced chunked prefill frameworks are system-level and algorithmic strategies that dynamically allocate prefill workloads across heterogeneous hardware under resource constraints.
  • They optimize transformer inference by adapting chunk sizes to device throughput, memory availability, and interference effects, thereby boosting throughput and lowering latency.
  • Implementations such as Cronus, TaiChi, MOM, and Sarathi demonstrate significant improvements in time-to-first-token, context extension, and pipeline efficiency.

Unbalanced chunked prefill is a set of system-level and algorithmic strategies for optimizing the prefill phase of transformer-based LLM inference, particularly under heterogeneous hardware or extreme memory constraints. Unlike balanced (uniform) chunking, which statically divides the prefill input into equally sized blocks, unbalanced chunked prefill frameworks dynamically vary chunk sizes or assign unequally partitioned work to different devices, aiming to maximize throughput, minimize latency, or adhere to resource limitations. Modern frameworks such as Cronus, TaiChi, MOM, and Sarathi implement distinct forms of this approach to address specific bottlenecks in throughput, memory, and GPU utilization (Liu et al., 22 Sep 2025, Wang et al., 4 Aug 2025, Zhang et al., 16 Apr 2025, Agrawal et al., 2023).

1. Motivation and Core Principles

The prefill phase in LLM inference—context encoding for all prompt tokens—presents significant challenges. On heterogeneous clusters, hardware disparities between high- and low-end GPUs cause resource imbalances and underutilization if naively partitioned. In extreme context-length regimes, fixed-size memory allocations for growing KV caches can cause out-of-memory errors or excessive offloading overhead.

Balanced chunking, with equal-size partitions, fails to account for devices’ heterogeneous throughput, memory scaling with context position, or interference effects between concurrent decode and prefill stages. Unbalanced chunked prefill frameworks address these deficiencies by:

  • Assigning disproportionate chunk sizes to compute devices in accordance with their throughput, memory, or interference profiles.
  • Dynamically adapting chunk sizes as prefill proceeds (“greedy” or geometric decrease).
  • Leveraging temporal overlap between the tail of prefill and the beginning of decoding to hide compute latency on disparate hardware.

These principles enable higher goodput, lower time-to-first-token (TTFT), and longer context window extensions compared to balanced chunking or traditional data/pipeline parallelism (Liu et al., 22 Sep 2025, Zhang et al., 16 Apr 2025).

2. System Designs and Algorithms

Four major system designs exemplify unbalanced chunked prefill frameworks.

Cronus: Partially Disaggregated Prefill on Heterogeneous GPU Clusters

Cronus partitions the prefill for $N$ tokens into $k$ chunks, with chunk 1 ($c_1$ tokens) assigned to the low-end GPU and the remaining chunks ($c_2, \ldots, c_k$) to high-end GPU(s). The key algorithm solves for the maximal $c_1$ such that the high-end completes its work within the time window provided by decoding $c_1$ tokens on the low-end, thus minimizing TTFT:

$$\mathrm{TTFT}(c_1) = \frac{c_1}{b_{\text{low}}} + \max\left(\frac{N-c_1}{b_{\text{high}}} - c_1 D,\; 0\right) + D$$

where $b_{\text{low}}$ and $b_{\text{high}}$ are prefill bandwidths and $D$ is the decode time per token. Empirically, $k=2$ suffices; $c_1$ is determined via convex minimization (Liu et al., 22 Sep 2025).
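This minimization can be sketched with a one-dimensional search; since $\mathrm{TTFT}(c_1)$ is convex in $c_1$, a simple scan finds the minimizer. The bandwidths and decode time below are invented for illustration, not Cronus's measured values:

```python
# Illustrative Cronus-style chunk sizing: choose c1 (tokens prefilled on the
# low-end GPU) to minimize TTFT. b_low/b_high are prefill bandwidths in
# tokens/s; d is the decode time per token in seconds. All values are made up.

def ttft(c1: int, n: int, b_low: float, b_high: float, d: float) -> float:
    """TTFT(c1) = c1/b_low + max((N - c1)/b_high - c1*D, 0) + D."""
    return c1 / b_low + max((n - c1) / b_high - c1 * d, 0.0) + d

def best_split(n: int, b_low: float, b_high: float, d: float) -> int:
    # TTFT is convex in c1, so a scan over [0, N] (or a ternary search for
    # large N) recovers the optimal first-chunk size.
    return min(range(n + 1), key=lambda c1: ttft(c1, n, b_low, b_high, d))

c1 = best_split(n=4096, b_low=2_000.0, b_high=20_000.0, d=0.02)
```

With these toy numbers the optimum lands just past the point where the high-end GPU's remaining prefill is fully hidden behind the low-end GPU's decoding of the first chunk.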

TaiChi: Unified Disaggregation–Aggregation with "Sliders" on SLOs

TaiChi unifies prefill-decode aggregation/disaggregation by instantiating two classes of instances: P-heavy ("prefill-optimized", large chunk size $c_P$) and D-heavy ("decode-optimized", small chunk size $c_D$). Scheduling is governed by:

  • A resource-reallocation function $f$ that shifts request phases between P- and D-heavy GPUs as SLO slack is detected.
  • Two schedulers: "flowing decode" (offloads decode requests to lower-priority devices when memory thresholds are exceeded or SLOs are met) and "length-aware prefill" (assigns prefill to the device with the best TTFT margin).

The system's three sliders (the P/D ratio $R_{PD}$ and the chunk sizes $c_P$ and $c_D$) allow interpolation across SLO regimes (Wang et al., 4 Aug 2025).

MOM: Memory-Efficient Unbalanced Chunks for Long Contexts

MOM's approach is to adapt the chunk size $k_i$ for each prefill chunk according to the current accumulation of KV cache and available GPU memory; early chunks are larger, shrinking as the KV cache grows:

$$k_i \leq \left\lfloor \frac{M_{\max} - W_{\text{model}} - 2 R_{i-1} d L}{I} \right\rfloor$$

where $M_{\max}$ is GPU memory, $W_{\text{model}}$ is static model memory, $R_{i-1}$ is the total number of tokens processed before chunk $i$, $d$ is the hidden dimension, $L$ is the number of layers, and $I$ is the MLP activation size. By exactly saturating available memory per chunk ("greedy"), MOM enables context extension by a factor of 2.5–3× over baseline (Zhang et al., 16 Apr 2025).

Sarathi: Chunked-Prefill with Decode-Maximal Batching

Sarathi splits prefill requests into $k = \lceil L_{\text{in}}/c \rceil$ equal-sized chunks (with $c$ sized for prefill saturation), then piggybacks up to $N-1$ decode requests into hybrid batches. The algorithmic focus is on selecting $c$ to optimize GPU utilization and minimize "bubbles" (idle stages) in pipeline-parallel microbatching (Agrawal et al., 2023).
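The chunk-splitting and decode-piggybacking step can be sketched as below. The function name, batch structure, and request IDs are hypothetical, not Sarathi's API; the sketch only shows how one long prompt becomes a sequence of hybrid batches:

```python
# Minimal sketch of Sarathi-style chunked prefill with decode piggybacking.
import math

def make_hybrid_batches(prompt_len, chunk_size, decode_ids, max_batch):
    """Split one prompt into ceil(L_in / c) prefill chunks, filling each
    hybrid batch with up to max_batch - 1 ongoing decode requests."""
    n_chunks = math.ceil(prompt_len / chunk_size)
    batches = []
    for i in range(n_chunks):
        start = i * chunk_size
        chunk = (start, min(start + chunk_size, prompt_len))  # token range
        piggyback = decode_ids[: max_batch - 1]               # decode slots
        batches.append({"prefill_chunk": chunk, "decodes": piggyback})
    return batches

batches = make_hybrid_batches(prompt_len=1000, chunk_size=256,
                              decode_ids=["r1", "r2", "r3"], max_batch=4)
# 4 chunks: (0, 256), (256, 512), (512, 768), (768, 1000)
```

Each batch thus keeps the GPU near its prefill saturation point while the piggybacked decodes make progress, which is what suppresses the decode-only (memory-bound) batches that waste compute.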

3. Theoretical and Mathematical Models

Distinct analytic frameworks underpin unbalanced chunked prefill systems.

Latency Overlap in Heterogeneous Split: The time-to-first-token and throughput for Cronus are formalized as:

  • $L_1 = c_1 / b_{\text{low}}$, $L_2 = c_2 / b_{\text{high}}$
  • Overlap gain: $\Delta = \max(0, L_2 - G)$, where $G$ is the decode time overlapped with $L_2$ (equal to $c_1 D$)
  • $\mathrm{TTFT} = L_1 + \Delta + D$
  • Throughput for long generation: $T_{\text{LLM}} \approx 1/D$

Memory Cap Algorithms: The MOM framework uses iterative (greedy or geometric) reduction of $k_i$ per chunk, formulated so that at each chunk the total memory is capped by $M_{\max}$. Analytical expressions compare balanced and unbalanced chunking for total context extension and peak per-chunk memory (Zhang et al., 16 Apr 2025).

Throughput, Goodput, SLO Satisfaction: TaiChi and related systems define goodput $G$ as the maximal request rate $\lambda$ for which empirical SLO attainment $A(\lambda)$ exceeds a fixed fraction $\beta$ for TTFT and average time-per-output-token (TPOT). Scheduling policies are formally encoded to maximize this target (Wang et al., 4 Aug 2025).
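The goodput definition reduces to a simple maximization over load-test measurements. In this sketch the attainment curve is a made-up stand-in; in practice it would come from benchmarking the serving system at each request rate:

```python
# Hedged sketch of the goodput definition: the largest request rate whose
# empirical SLO attainment A(rate) stays at or above beta.

def goodput(attainment, rates, beta=0.9):
    """Return the max rate r in `rates` with attainment(r) >= beta, else 0."""
    ok = [r for r in rates if attainment(r) >= beta]
    return max(ok) if ok else 0.0

# Toy attainment curve: SLOs hold until the system saturates around rate 8.
curve = lambda r: 1.0 if r <= 8 else max(0.0, 1.0 - 0.2 * (r - 8))
print(goodput(curve, rates=[1, 2, 4, 8, 12, 16], beta=0.9))  # → 8
```

TaiChi's sliders effectively reshape this attainment curve per SLO regime, moving the saturation knee to a higher rate.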

Bubble Elimination: Sarathi analytically reduces the bubble factor $R = B_{\text{orig}} / B_{\text{chunked}}$ by making each microbatch duration uniform through hybrid batch composition (Agrawal et al., 2023).
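The bubble effect can be made concrete with a toy pipeline simulation (the recurrence is a standard pipeline model; the stage count and microbatch durations are invented, not from the paper). A single oversized prefill microbatch stalls every downstream stage, while the same work split into uniform microbatches leaves no idle time beyond the unavoidable pipeline fill:

```python
# Toy pipeline-parallel simulation: T[s][i] = max(T[s-1][i], T[s][i-1]) + t[i]
# gives the finish time of microbatch i on stage s (each microbatch takes the
# same time on every stage, for simplicity).

def makespan(durations, stages):
    t = [[0.0] * len(durations) for _ in range(stages)]
    for s in range(stages):
        for i, d in enumerate(durations):
            prev_stage = t[s - 1][i] if s > 0 else 0.0
            prev_mb = t[s][i - 1] if i > 0 else 0.0
            t[s][i] = max(prev_stage, prev_mb) + d
    return t[-1][-1]

def bubble(durations, stages):
    # Idle time on the last stage = makespan minus its busy time and its
    # unavoidable fill delay (time until the first microbatch arrives).
    fill = (stages - 1) * durations[0]
    return makespan(durations, stages) - sum(durations) - fill

uneven = [1.0, 1.0, 1.0, 8.0]   # three decode batches + one monolithic prefill
even = [2.75, 2.75, 2.75, 2.75]  # same total work, chunked uniformly
```

Here `bubble(uneven, 4)` is large while `bubble(even, 4)` is zero, which is the homogenization effect Sarathi's hybrid batches target.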

4. Trade-offs and Empirical Results

Unbalanced chunked prefill frameworks consistently outperform balanced or naïve approaches across throughput, latency, and memory utilization metrics.

System | Key Metric | Baseline | Unbalanced Chunked Prefill | Relative Gain
Cronus | TTFT (p99, batch 1) | 200 ms (full disagg.) | 85 ms (T4+A100, LLaMA-7B) | 2.35× lower
Cronus | Throughput (tok/s, b=8) | 220 (DP, GPT-2 XL) | 420 (T4+A100, Cronus) | 1.9×
TaiChi | Goodput ($G$) | State-of-the-art | +77% (balanced TTFT/TPOT SLOs) | up to 1.77×
MOM | Max context (Llama-3.2-8B) | 155k tokens (vanilla) | 370k–400k tokens (A100, unbalanced) | 2.5×
Sarathi | Pipeline bubble reduction | Baseline = $B_{\text{orig}}$ | $B_{\text{orig}}/6.29$ (GPT-3 pipeline, 64×A100) | 6.29× reduction

Note that in Cronus, using an unbalanced two-chunk partition, throughput and latency are improved relative to both data parallelism and full prefill disaggregation, with throughput gains up to 2.1× and TTFT reductions of up to 2.3× over the state of the art (Liu et al., 22 Sep 2025). MOM demonstrates 2–3× context extension and 53% of the vanilla prefill memory at 150k tokens (Zhang et al., 16 Apr 2025). Sarathi achieves up to 10× faster decode throughput and 1.33× overall throughput on LLaMA-13B/A6000 (Agrawal et al., 2023).

5. Implementation Techniques and Best Practices

Core implementation strategies for unbalanced chunked prefill include:

  • Greedy or geometric schedule of chunk sizes, starting with maximal initial size and reducing per accumulated KV cache (MOM).
  • Assignment heuristics for chunk-to-device mapping to minimize breach of memory or SLO constraints (Cronus, TaiChi).
  • Immediate KV offloading per layer to minimize memory footprint at each chunk (MOM).
  • Piggybacking decode requests with prefill chunks to fill hybrid batches, optimizing for GEMM efficiency and warp alignment (Sarathi).
  • Regular benchmarking to identify throughput saturation points and select chunk sizes accordingly.
  • Fine-tuning "slider" parameters (TaiChi) across deployment regimes, e.g., tight TTFT vs. TPOT SLOs.

Schedulers must execute in $O(N)$ time (where $N$ is the instance count), demanding lightweight model-based predictors for chunk ETA, and must enforce constraints on the maximum in-flight prefill/decode assignments per device class (Wang et al., 4 Aug 2025).

6. Comparative Analysis and Theoretical Insights

Balanced chunking achieves simplicity, but suffers under heterogeneous or high-memory regimes because it ignores device disparities and interaction terms (e.g., prefill–decode interference). Unbalanced chunking is superior in settings where:

  • Devices differ dramatically in bandwidth, memory, or interference profiles (heterogeneous clusters).
  • The prefill/intermediate memory cost grows with context position (very long input), causing static chunk schedules to OOM.
  • TTFT and TPOT SLOs are in tension, requiring independent control per request or phase (TaiChi).
  • Pipeline bubble time dominates, requiring “batch homogenization” (Sarathi).

A plausible implication is that unbalanced chunked prefill strategies constitute a new standard for large-scale LLM serving under realistic resource constraints. As total memory, throughput, and latency objectives tighten with model and dataset scale, dynamic and adaptive unbalanced chunking will become critical.

7. Outlook and Research Directions

Continued evolution of unbalanced chunked prefill frameworks is expected along these vectors:

  • Automated chunk scheduler design leveraging reinforcement learning or continual calibration.
  • Integration with finer-grained dynamic memory tracking and fragmentation-aware allocation.
  • Deployment with next-generation multi-tier device topologies (e.g., hybrid GPU–NVM–CPU clusters).
  • Synergistic scheduling with dynamic batch compaction, prompt caching, and decode-stage KV shrinkage.

Investigating the interaction of chunk partitioning strategies with architectural innovations (e.g., RTM, flash attention, compressive KV caches) and their effect on scaling laws for context extension and system throughput presents fruitful areas for future research (Wang et al., 4 Aug 2025, Zhang et al., 16 Apr 2025).


Key references: Cronus (Liu et al., 22 Sep 2025), TaiChi (Wang et al., 4 Aug 2025), MOM (Zhang et al., 16 Apr 2025), Sarathi (Agrawal et al., 2023).
