Unbalanced Chunked Prefill Framework
- Unbalanced chunked prefill frameworks are system-level and algorithmic strategies that dynamically allocate prefill workloads across heterogeneous hardware under resource constraints.
- They optimize transformer inference by adapting chunk sizes to device throughput, memory availability, and interference effects, thereby boosting throughput and lowering latency.
- Implementations such as Cronus, TaiChi, MOM, and Sarathi demonstrate significant improvements in time-to-first-token, context extension, and pipeline efficiency.
Unbalanced chunked prefill is a set of system-level and algorithmic strategies for optimizing the prefill phase of transformer-based LLM inference, particularly under heterogeneous hardware or extreme memory constraints. Unlike balanced (uniform) chunking, which statically divides the prefill input into equally sized blocks, unbalanced chunked prefill frameworks dynamically vary chunk sizes or assign unequally partitioned work to different devices, aiming to maximize throughput, minimize latency, or adhere to resource limitations. Modern frameworks such as Cronus, TaiChi, MOM, and Sarathi implement distinct forms of this approach to address specific bottlenecks in throughput, memory, and GPU utilization (Liu et al., 22 Sep 2025, Wang et al., 4 Aug 2025, Zhang et al., 16 Apr 2025, Agrawal et al., 2023).
1. Motivation and Core Principles
The prefill phase in LLM inference—context encoding for all prompt tokens—presents significant challenges. On heterogeneous clusters, hardware disparities between high- and low-end GPUs cause resource imbalances and underutilization if naively partitioned. In extreme context-length regimes, fixed-size memory allocations for growing KV caches can cause out-of-memory errors or excessive offloading overhead.
Balanced chunking, with equal-size partitions, fails to account for devices’ heterogeneous throughput, memory scaling with context position, or interference effects between concurrent decode and prefill stages. Unbalanced chunked prefill frameworks address these deficiencies by:
- Assigning disproportionate chunk sizes to compute devices in accordance with their throughput, memory, or interference profiles.
- Dynamically adapting chunk sizes as prefill proceeds (“greedy” or geometric decrease).
- Leveraging temporal overlap between the tail of prefill and the beginning of decoding to hide compute latency on disparate hardware.
These principles enable higher goodput, lower time-to-first-token (TTFT), and longer context window extensions compared to balanced chunking or traditional data/pipeline parallelism (Liu et al., 22 Sep 2025, Zhang et al., 16 Apr 2025).
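The first of these principles, throughput-proportional chunk assignment, can be sketched in a few lines. The function below is an illustrative toy (its name and integer-rounding policy are assumptions of this sketch, not code from any of the cited systems):

```python
# Hypothetical sketch: assign unequal prefill chunks in proportion to each
# device's measured prefill bandwidth (tokens/s).
def proportional_chunks(total_tokens, bandwidths):
    """Split total_tokens across devices proportionally to bandwidths."""
    total_bw = sum(bandwidths)
    chunks = [total_tokens * bw // total_bw for bw in bandwidths]
    chunks[-1] += total_tokens - sum(chunks)  # give rounding remainder to the last device
    return chunks

# A 3:1 bandwidth gap yields a 3:1 chunk split rather than equal halves.
print(proportional_chunks(8000, [3000, 1000]))  # -> [6000, 2000]
```

A balanced scheme would hand each device 4000 tokens here, leaving the faster device idle for three quarters of the prefill window.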
2. System Designs and Algorithms
Four major system designs exemplify unbalanced chunked prefill frameworks.
Cronus: Partially Disaggregated Prefill on Heterogeneous GPU Clusters
Cronus partitions the prefill of $N$ tokens into chunks, with chunk 1 ($n_1$ tokens) assigned to the low-end GPU and the remaining chunks ($N - n_1$ tokens) to the high-end GPU(s). The key algorithm solves for the maximal $n_1$ such that the high-end completes its work within the time window provided by decoding $k$ tokens on the low-end, thus minimizing TTFT:

$$\frac{N - n_1}{B_H} \le k \, t_d,$$

where $B_L$ and $B_H$ are the low- and high-end prefill bandwidths and $t_d$ is the decode time per token. Empirically, a two-chunk partition suffices; $n_1$ is determined via convex minimization (Liu et al., 22 Sep 2025).
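A hedged numerical sketch of such an overlapped two-chunk split, assuming the two devices prefill their chunks concurrently; the symbols ($N$, $n_1$, $B_L$, $B_H$) and the closed-form split are assumptions of this toy model, not Cronus's published implementation:

```python
# Toy overlap model: both devices prefill concurrently, so the slower chunk
# gates the first token. Names and formulas are illustrative assumptions.
def ttft_two_chunk(N, n1, B_L, B_H):
    """TTFT when the low-end prefills n1 tokens while the high-end
    prefills the remaining (N - n1) tokens in parallel."""
    return max(n1 / B_L, (N - n1) / B_H)

def best_split(N, B_L, B_H):
    """Bandwidth-proportional split that equalizes both devices' prefill
    times -- the minimizer of the max() above."""
    return round(N * B_L / (B_L + B_H))

N, B_L, B_H = 8000, 1000.0, 4000.0           # tokens, tokens/s
n1 = best_split(N, B_L, B_H)
print(n1, ttft_two_chunk(N, n1, B_L, B_H))   # -> 1600 1.6 (both finish together)
```

With a 4:1 bandwidth gap, an equal split would leave TTFT gated by the slow device at 4.0 s; the unbalanced split brings it to 1.6 s.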
TaiChi: Unified Disaggregation–Aggregation with "Sliders" on SLOs
TaiChi unifies prefill-decode aggregation/disaggregation by instantiating two classes of instances: P-heavy ("prefill-optimized", large chunk size $C_P$) and D-heavy ("decode-optimized", small chunk size $C_D$). Scheduling is governed by:
- A resource-reallocation function that shifts request phases between P- and D-heavy GPUs as SLO slack is detected.
- Two schedulers: "flowing decode" (offloads decode requests to lower-priority devices when memory thresholds are exceeded or SLOs are met) and "length-aware prefill" (assigns prefill to the device with best TTFT margin).
The system's three sliders, the P/D instance ratio $\rho$ and the chunk sizes $C_P$ and $C_D$, allow interpolation across SLO regimes (Wang et al., 4 Aug 2025).
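The "length-aware prefill" policy above can be sketched as a slack-maximizing choice between P-heavy and D-heavy instances. The instance statistics and the linear ETA model below are assumptions of this sketch, not TaiChi's actual interfaces:

```python
# Illustrative "length-aware prefill": route each prefill to whichever
# instance class leaves the most TTFT slack. The (bandwidth, queue-delay)
# stats and the linear ETA model are hypothetical placeholders.
def pick_instance(prompt_len, instances, ttft_slo):
    """instances: list of (name, prefill_bw_tokens_per_s, queue_delay_s)."""
    best, best_margin = None, float("-inf")
    for name, bw, queue_delay in instances:
        eta = queue_delay + prompt_len / bw   # predicted TTFT on this instance
        margin = ttft_slo - eta               # slack against the TTFT SLO
        if margin > best_margin:
            best, best_margin = name, margin
    return best

instances = [("P-heavy", 8000, 0.40), ("D-heavy", 2000, 0.05)]
print(pick_instance(4000, instances, ttft_slo=1.0))  # long prompt  -> P-heavy
print(pick_instance(200, instances, ttft_slo=1.0))   # short prompt -> D-heavy
```

Long prompts amortize the P-heavy queue delay; short prompts are better served by the lightly loaded D-heavy instance.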
MOM: Memory-Efficient Unbalanced Chunks for Long Contexts
MOM adapts the size of each prefill chunk to the current accumulation of KV cache and the available GPU memory; early chunks are larger, shrinking as the KV cache grows. Denoting the size of chunk $i$ by $c_i$, each chunk is chosen to saturate the memory budget:

$$M_{\text{static}} + 2 \, l \, d \, (T_i + c_i) + c_i \, m_{\text{act}} \le M_{\text{GPU}},$$

where $M_{\text{GPU}}$ is total GPU memory, $M_{\text{static}}$ is static model memory, $T_i$ is the number of tokens processed before chunk $i$, $d$ is the hidden dimension, $l$ is the number of layers, and $m_{\text{act}}$ is the per-token MLP activation size (all in consistent units). By exactly saturating available memory per chunk ("greedy"), MOM enables context extension by a factor of roughly $2.5\times$ over the baseline (Zhang et al., 16 Apr 2025).
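A greedy schedule of this kind, under an assumed byte-count model (two KV entries per layer per cached token plus a per-token activation cost; all constants below are illustrative, not MOM's), might look like:

```python
# Hedged sketch of a greedy memory-capped chunk schedule. The cost model
# (2 * layers * hidden bytes-per-element per cached token, plus a transient
# per-token activation cost) mirrors the symbols in the text; all numeric
# parameters are illustrative assumptions.
def greedy_chunks(total_tokens, mem_budget, static_mem,
                  layers, hidden, act_per_token, kv_bytes=2, min_chunk=1):
    """Yield chunk sizes so model + KV cache + activations never exceed mem_budget."""
    per_cached = 2 * layers * hidden * kv_bytes   # K and V per token (bytes)
    chunks, done = [], 0
    while done < total_tokens:
        free = mem_budget - static_mem - done * per_cached
        # each new chunk token costs its own KV entry plus transient activations
        c = free // (per_cached + act_per_token)
        if c < min_chunk:
            break                                 # memory headroom exhausted
        c = min(c, total_tokens - done)
        chunks.append(c)
        done += c
    return chunks

sizes = greedy_chunks(total_tokens=200_000, mem_budget=80 * 2**30,
                      static_mem=16 * 2**30, layers=32, hidden=4096,
                      act_per_token=1 << 20)
print(sizes[0], sizes[-1])   # early chunks are large; later chunks shrink
```

When the loop breaks before `total_tokens` is reached, the residual context is exactly the portion that would require KV offloading, which is where MOM's per-layer offloading (Section 5) takes over.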
Sarathi: Chunked-Prefill with Decode-Maximal Batching
Sarathi splits prefill requests into equal-sized chunks of size $C$ (chosen large enough to saturate prefill compute), then piggybacks pending decode requests into hybrid batches. The algorithmic focus is on selecting $C$ to optimize GPU utilization and minimize "bubbles" (idle stages) in pipeline-parallel microbatching (Agrawal et al., 2023).
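A minimal sketch of decode-maximal hybrid batching under an assumed token-budget model; the function, queues, and budget parameter are hypothetical, not Sarathi's API:

```python
# One fixed-size prefill chunk per batch, topped up with as many pending
# decode requests (1 token each) as the token budget allows.
def make_hybrid_batch(prefill_queue, decode_queue, chunk_size, token_budget):
    """Return (prefill_chunk, decodes) for the next hybrid batch."""
    chunk = prefill_queue[:chunk_size] if prefill_queue else []
    room = token_budget - len(chunk)
    decodes = decode_queue[:room]      # piggyback decodes into the spare slots
    return chunk, decodes

prompt = list(range(1000))                  # 1000 queued prompt tokens
decoding = [f"req{i}" for i in range(50)]   # 50 requests in decode phase
chunk, decodes = make_hybrid_batch(prompt, decoding, chunk_size=256, token_budget=288)
print(len(chunk), len(decodes))             # -> 256 32
```

Because decode tokens ride along with an already compute-saturating prefill chunk, they add little latency while keeping the GEMMs full.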
3. Theoretical and Mathematical Models
Distinct analytic frameworks underpin unbalanced chunked prefill systems.
Latency Overlap in Heterogeneous Split: With a prefill of $N$ tokens split as $n_1$ tokens on the low-end GPU and $N - n_1$ on the high-end, prefill bandwidths $B_L$ and $B_H$, and per-token decode time $t_d$, the time-to-first-token and throughput for Cronus can be modeled as:
- $\mathrm{TTFT} \approx \max\!\left(n_1 / B_L,\; (N - n_1)/B_H\right)$, since the two chunks are prefilled concurrently.
- Overlap gain $\approx k \, t_d$, where $k$ is the number of tokens decoded on the low-end GPU during the high-end prefill window $(N - n_1)/B_H$.
- Throughput for long generation approaches the combined decode rate of both device classes, as the prefill cost amortizes over the output length.
Memory Cap Algorithms: The MOM framework uses iterative (greedy or geometric) reduction of the per-chunk size $c_i$, formulated so that at each chunk the total memory footprint is capped by the available GPU memory $M_{\text{GPU}}$. Analytical expressions compare balanced and unbalanced chunking for total context extension and peak/per-chunk memory (Zhang et al., 16 Apr 2025).
Throughput, Goodput, SLO Satisfaction: TaiChi and related systems define goodput as the maximal request rate for which empirical SLO attainment exceeds a fixed fraction for TTFT and average time-per-output-token (TPOT). Scheduling policies are formally encoded to maximize this target (Wang et al., 4 Aug 2025).
Bubble Elimination: Sarathi analytically reduces the bubble factor by making each microbatch duration uniform through hybrid batch composition (Agrawal et al., 2023).
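A toy calculation illustrates why homogenizing microbatch durations removes bubbles; the durations below are made-up numbers, not measurements from the paper:

```python
# In pipeline parallelism, a stage idles while waiting for the slowest
# microbatch each step, so per-step idle time scales with the spread of
# microbatch durations. Uniform hybrid batches shrink that spread.
def bubble_time(durations):
    """Total idle time a stage accrues against the slowest microbatch."""
    longest = max(durations)
    return sum(longest - d for d in durations)

mixed   = [4.0, 0.5, 0.5, 0.5]   # one big prefill batch + tiny decode batches
uniform = [1.375] * 4            # same total work, homogenized hybrid batches
print(bubble_time(mixed), bubble_time(uniform))  # -> 10.5 0.0
```

The total compute is identical in both schedules; only the composition of each microbatch changes, which is exactly the lever hybrid batching pulls.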
4. Trade-offs and Empirical Results
Unbalanced chunked prefill frameworks consistently outperform balanced or naïve approaches across throughput, latency, and memory utilization metrics.
| System | Key Metric | Baseline | Unbalanced Chunked Prefill | Relative Gain |
|---|---|---|---|---|
| Cronus | TTFT (p99, batch 1) | 200 ms (full disagg.) | 85 ms (T4+A100, LLaMA-7B) | 2.35× lower |
| Cronus | Throughput (tok/s, b=8) | 220 (DP, GPT-2 XL) | 420 (T4+A100) | 1.9× |
| TaiChi | Goodput (balanced TTFT/TPOT SLOs) | state-of-the-art serving systems | TaiChi sliders | up to 1.77× |
| MOM | Max context (Llama-3.2-8B) | 155k tokens (vanilla) | 370k–400k tokens (A100) | ~2.5× |
| Sarathi | Pipeline bubble time (GPT-3 pipeline, 64×A100) | baseline microbatching | hybrid decode-maximal batches | substantial reduction |
In Cronus, the unbalanced two-chunk partition improves both throughput and latency relative to data parallelism and full prefill disaggregation, with throughput gains of up to 1.9× and TTFT reductions of up to 2.35× over the state of the art (Liu et al., 22 Sep 2025). MOM demonstrates a context extension of roughly $2.5\times$ while using only a fraction of the vanilla prefill memory at $150$k tokens (Zhang et al., 16 Apr 2025). Sarathi achieves markedly faster decode throughput and higher overall throughput on LLaMA-13B/A6000 (Agrawal et al., 2023).
5. Implementation Techniques and Best Practices
Core implementation strategies for unbalanced chunked prefill include:
- Greedy or geometric schedule of chunk sizes, starting with maximal initial size and reducing per accumulated KV cache (MOM).
- Assignment heuristics for chunk-to-device mapping to minimize breach of memory or SLO constraints (Cronus, TaiChi).
- Immediate KV offloading per layer to minimize memory footprint at each chunk (MOM).
- Piggybacking decode requests with prefill chunks to fill hybrid batches, optimizing for GEMM efficiency and warp alignment (Sarathi).
- Regular benchmarking to identify throughput saturation points and select chunk sizes accordingly.
- Fine-tuning "slider" parameters (TaiChi) across deployment regimes, e.g., tight TTFT vs. TPOT SLOs.
Schedulers must execute in time that scales gracefully with the instance count, demanding lightweight model-based predictors for chunk ETA, and must enforce constraints on the maximum number of in-flight prefill/decode assignments per device class (Wang et al., 4 Aug 2025).
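One way to obtain such a lightweight chunk-ETA predictor is a per-device linear fit (setup cost plus per-token cost) over a handful of benchmark runs; the data points below are invented for illustration:

```python
# Fit a per-device linear latency model t = a + b * n from benchmark samples,
# then use it inside the scheduler to predict chunk completion times.
def fit_eta_model(samples):
    """samples: [(chunk_tokens, seconds)]; ordinary least-squares fit."""
    n = len(samples)
    sx = sum(x for x, _ in samples); sy = sum(y for _, y in samples)
    sxx = sum(x * x for x, _ in samples); sxy = sum(x * y for x, y in samples)
    b = (n * sxy - sx * sy) / (n * sxx - sx * sx)   # per-token cost (s/token)
    a = (sy - b * sx) / n                           # fixed setup cost (s)
    return lambda tokens: a + b * tokens

# Four hypothetical benchmark runs on one device class.
eta = fit_eta_model([(512, 0.07), (1024, 0.12), (2048, 0.22), (4096, 0.42)])
print(round(eta(3000), 3))   # predicted seconds for a 3000-token chunk
```

Refitting periodically (the "regular benchmarking" above) keeps the predictor honest as interference and batch composition drift.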
6. Comparative Analysis and Theoretical Insights
Balanced chunking achieves simplicity, but suffers under heterogeneous or high-memory regimes because it ignores device disparities and interaction terms (e.g., prefill–decode interference). Unbalanced chunking is superior in settings where:
- Devices differ dramatically in bandwidth, memory, or interference profiles (heterogeneous clusters).
- The prefill/intermediate memory cost grows with context position (very long input), causing static chunk schedules to OOM.
- TTFT and TPOT SLOs are in tension, requiring independent control per request or phase (TaiChi).
- Pipeline bubble time dominates, requiring “batch homogenization” (Sarathi).
A plausible implication is that unbalanced chunked prefill strategies constitute a new standard for large-scale LLM serving under realistic resource constraints. As total memory, throughput, and latency objectives tighten with model and dataset scale, dynamic and adaptive unbalanced chunking will become critical.
7. Outlook and Research Directions
Continued evolution of unbalanced chunked prefill frameworks is expected along these vectors:
- Automated chunk scheduler design leveraging reinforcement learning or continual calibration.
- Integration with finer-grained dynamic memory tracking and fragmentation-aware allocation.
- Deployment with next-generation multi-tier device topologies (e.g., hybrid GPU–NVM–CPU clusters).
- Synergistic scheduling with dynamic batch compaction, prompt caching, and decode-stage KV shrinkage.
Investigating the interaction of chunk partitioning strategies with architectural innovations (e.g., RTM, flash attention, compressive KV caches) and their effect on scaling laws for context extension and system throughput presents fruitful areas for future research (Wang et al., 4 Aug 2025, Zhang et al., 16 Apr 2025).
Key references: Cronus (Liu et al., 22 Sep 2025), TaiChi (Wang et al., 4 Aug 2025), MOM (Zhang et al., 16 Apr 2025), Sarathi (Agrawal et al., 2023).