PackInfer: Efficient Inference Packing
- PackInfer is a framework of compute- and I/O-aware packing methods that optimize inference efficiency for transformer-based LLMs and tree ensembles in resource-constrained settings.
- It utilizes workload-aware grouping and unified Q–K region packing techniques to reduce redundant computation and minimize data movement, improving GPU utilization and lowering latency.
- PackInfer extends to tree ensembles and IMC accelerators with serialization and bin-packing strategies that decrease I/O overhead and memory footprint while enhancing overall throughput.
PackInfer encompasses a set of compute- and I/O-aware packing methodologies for large-scale machine learning inference, targeting both transformer-based LLM serving and tree-ensemble deployment in resource-constrained settings. The central principle across PackInfer instantiations is elimination of redundant computation or data movement by intelligently grouping and organizing computation/data layouts based on workload heterogeneity and hardware constraints.
1. Motivation and Context
In modern inference serving—whether in transformer LLMs, tree ensembles, or in-memory compute (IMC) accelerators—the mismatch between model execution patterns and real-world request heterogeneity significantly impacts latency, throughput, and resource utilization. Key inefficiencies stem from:
- Compute imbalance: Batching variable-length sequences or requests leads to underutilization of compute units, especially under fixed-size tiling approaches (e.g., FlashAttention kernels in LLMs): short sequences occupy entire tiles while SMs idle on long-request stragglers (Ning et al., 3 Feb 2026).
- I/O imbalance: Scattered KV cache layouts and redundant prefix reads in LLMs, or random-access loading in tree ensembles stored on disk, waste bandwidth and exacerbate latency (Ning et al., 3 Feb 2026, Madhyastha et al., 2020).
- Memory bandwidth and capacity: Prefilling in LLMs with high prompt-length variance wastes FLOPs and memory via excessive padding (Zhao et al., 2024); weight-loading in IMC architectures bottlenecks end-to-end energy-delay product (EDP) (Houshmand et al., 2024).
PackInfer addresses these challenges through workload-aware grouping, layout compaction, and online adaptation to evolving token or request distributions.
2. Compute- and I/O-Aware Packing in LLM Inference
PackInfer (Ning et al., 3 Feb 2026) for transformer-based LLMs unifies compute-aware and I/O-aware grouping within a single kernel-level framework:
- Compute-Aware Grouping: Requests are partitioned into groups to maximize GPU tensor-core unit (TCU) utilization. Grouping is performed greedily to minimize per-group load imbalance, with dynamic re-grouping whenever the length drift within a group exceeds a threshold.
- Unified Q–K Region Packing: Within each group, all request sequences are offset-packed into contiguous blocks, enabling a single attention tiling free of per-request padding. Attention scores are masked to prevent cross-request leakage, reducing the total tile count relative to per-request padded tiling.
- I/O-Aware Grouping and KV Cache Layout: Shared-prefix requests are identified via group-local prefix tries; common prefix KV blocks are loaded once into a contiguous buffer, followed by request-unique suffixes. Offsets record the location of each subblock, and headroom is preallocated per request to absorb future growth.
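The shared-prefix KV packing above can be sketched minimally. The function name `pack_group_kv` and the group-wide longest-common-prefix simplification are illustrative, not part of the PackInfer API; a full implementation would use a per-group trie over block IDs and preallocate headroom per request:

```python
def common_prefix_len(a, b):
    """Length of the shared token prefix of two sequences."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def pack_group_kv(requests, block=4):
    """Store the block-aligned prefix shared by the whole group once,
    then append each request's unique suffix contiguously.
    Returns (buffer, offsets); offsets[i] is
    (prefix_start, prefix_len, suffix_start) for request i."""
    shared = min(common_prefix_len(requests[0], r) for r in requests)
    shared -= shared % block              # keep the prefix block-aligned
    buffer = list(requests[0][:shared])   # shared prefix loaded once
    offsets = []
    for r in requests:
        start = len(buffer)
        buffer.extend(r[shared:])         # request-unique suffix
        offsets.append((0, shared, start))
    return buffer, offsets
```

With three requests sharing a four-token prefix, the packed buffer holds the prefix once plus three short suffixes, rather than three full copies.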
This combination minimizes straggler effects, saturates GPU compute, and abates memory bandwidth waste. CUDA kernel launches are tailored to group-wide Q and K spans; device memory is transformed prior to launch, optimizing I/O and smoothing memory fragmentation as generation evolves (Ning et al., 3 Feb 2026).
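The compute-aware grouping heuristic can be approximated by classic longest-processing-time-first scheduling; `group_requests` and the `drift` helper are illustrative names, and PackInfer's kernel-level grouping may differ in detail:

```python
import heapq

def group_requests(lengths, num_groups):
    """Longest-first greedy grouping: assign each request (by descending
    length) to the currently lightest group, balancing per-group load."""
    heap = [(0, g) for g in range(num_groups)]   # (total_load, group_id)
    heapq.heapify(heap)
    groups = [[] for _ in range(num_groups)]
    for idx in sorted(range(len(lengths)), key=lambda i: -lengths[i]):
        load, g = heapq.heappop(heap)
        groups[g].append(idx)
        heapq.heappush(heap, (load + lengths[idx], g))
    return groups

def drift(lengths, group):
    """Length spread inside a group; re-group when it exceeds a threshold."""
    ls = [lengths[i] for i in group]
    return max(ls) - min(ls)
```

For lengths [100, 90, 10, 20, 30, 50] and two groups, the greedy pass produces two groups with equal total load, avoiding a straggler group dominated by the longest requests.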
3. Serialization and Packed Inference for Tree Ensembles
PackInfer (Madhyastha et al., 2020)—as instantiated with PACSET (Packed Serialized Trees)—applies external-memory algorithmic principles to optimize inference for large tree ensembles (random forests, gradient-boosted trees) when model size exceeds available memory:
- Interleaved Bin Packing: The top levels of trees are interleaved across disk blocks, ensuring high static locality. The ratio of block size to node size determines how many nodes fit per block (and hence packing feasibility), and the fraction of useful root/upper-level nodes per I/O is maximized.
- Statistical Collocation: Nodes in each residual tree (i.e., below the bin) are ordered by leaf cardinality (the number of samples visiting a given leaf). Weighted DFS traversal ensures frequently-used root-to-leaf paths are co-located, further maximizing useful data per I/O.
- Block-aware Layout: Nodes are packed into I/O blocks via greedy, block-aligned segmentation, with each new block seeded by the unplaced node of highest leaf cardinality.
- On-demand Loading: Deserialization during inference loads only the blocks traversed by the input—the model is never fully loaded into RAM. This technique achieves 2–6× lower latency and 50–80% fewer block loads in “larger than RAM” use-cases, with negligible DRAM footprint (Madhyastha et al., 2020).
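A toy version of the weighted-DFS collocation and block-aligned segmentation follows; the names are illustrative, and on-disk serialization is simplified to in-memory lists:

```python
def weighted_dfs(tree, weights, root=0):
    """Lay out hot root-to-leaf paths first: children are visited in
    descending order of cardinality, so frequent paths are contiguous.
    `tree` maps node -> child list; `weights` maps node -> leaf cardinality."""
    order, stack = [], [root]
    while stack:
        n = stack.pop()
        order.append(n)
        kids = sorted(tree.get(n, []), key=lambda c: weights[c])
        stack.extend(kids)   # pop() takes the heaviest child first
    return order

def pack_blocks(order, nodes_per_block):
    """Greedy block-aligned segmentation of the traversal order."""
    return [order[i:i + nodes_per_block]
            for i in range(0, len(order), nodes_per_block)]
```

On a seven-node tree whose hot path is 0→1→3, the traversal places that path at the front, so a single block load serves the most frequent inference path.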
4. Efficient Prefilling and Bin-Packing for Transformer Prompts
PackInfer-inspired prepacking (Zhao et al., 2024) for LLM prefilling reduces wasted computation arising from pad-token expansion:
- Bin-Packing Algorithm: Given prompts with lengths l_1, …, l_n and a bin capacity C (at least the longest prompt), prompts are assigned to bins such that the summed length of prompts in each bin does not exceed C, while minimizing the number of bins. First-Fit Decreasing yields a near-optimal assignment.
- Attention Mask and Position Encoding Modification: For each bin, a block-diagonal causal attention mask is synthesized to isolate each packed prompt’s tokens, and positional encodings are restarted per prompt so that model outputs are identical to unpacked execution.
- Single-pass KV Cache Construction: Model forward pass is invoked with packed bins, then outputs are “unpacked” into the original prompt-aligned KV caches.
- Efficiency Gains: Prefilling time and peak GPU memory are substantially reduced, and the attainable batch size before OOM grows, with larger gains under higher prompt-length variance. When all prompts are equal-length, the benefits vanish; for extremely long inputs the quadratic attention cost remains unavoidable (Zhao et al., 2024).
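The First-Fit Decreasing pass and per-prompt position restart can be sketched as follows; the function names are illustrative, and a real prepacking implementation operates on token tensors and attention masks rather than plain length lists:

```python
def first_fit_decreasing(lengths, capacity):
    """Pack prompt lengths into few bins of size `capacity`: sort
    descending, place each prompt into the first bin where it fits."""
    bins, loads = [], []
    for idx in sorted(range(len(lengths)), key=lambda i: -lengths[i]):
        for b, load in enumerate(loads):
            if load + lengths[idx] <= capacity:
                bins[b].append(idx)
                loads[b] += lengths[idx]
                break
        else:
            bins.append([idx])          # open a new bin
            loads.append(lengths[idx])
    return bins

def packed_positions(bins, lengths):
    """Restart position ids at 0 for each prompt inside a packed bin,
    so each prompt sees the same positions as in unpacked execution."""
    return [[p for i in b for p in range(lengths[i])] for b in bins]
```

For lengths [7, 2, 5, 3, 6, 1] and capacity 8, FFD fills three bins exactly, whereas naive padding to the maximum length would waste roughly half the tokens on padding.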
5. In-Memory Compute (IMC) Accelerators: Weight Packing and Mapping
For IMC hardware accelerators, PackInfer (Houshmand et al., 2024) denotes a hardware-software mapping strategy optimizing both compute throughput and weight-loading overhead:
- Tiling and Supertiling: Weights are partitioned into tiles and stacked into supertiles, respecting the macro array dimensions (rows × columns) of the IMC fabric.
- Column Packing and Macro Assignment: Supertile subsets are packed into 2D columns, maximizing array utilization density. Assignment of columns to macros obeys alignment and “one-tile-per-layer-per-macro” constraints via bin-packing. Folding (loop unrolling) shrinks tile footprints temporally when a direct assignment is infeasible.
- Energy-Delay Optimization: The objective is to minimize the end-to-end energy-delay product (EDP), jointly accounting for weight-loading and compute costs. The dominant overhead transitions from DRAM weight-loading to compute as on-chip macro capacity grows. For MLPerf Tiny workloads, once all weights fit on-chip, EDP reductions reach 10–100× over stacked/flattened baselines (Houshmand et al., 2024).
- Schedule Generation: PackInfer emits detailed schedules for weight loads, MAC execution, and buffer traffic, bridging network architecture and IMC fabric instantiation.
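A simplified feasibility check for column packing with folding can be sketched as below; it ignores the alignment and one-tile-per-layer-per-macro constraints for brevity, and all names are illustrative rather than drawn from the actual mapper:

```python
def assign_columns(tile_widths, macro_cols, num_macros):
    """First-fit-decreasing assignment of supertile column widths to IMC
    macros. Tiles wider than a macro are 'folded' into vertical passes of
    at most macro_cols columns (extra time steps trade for footprint).
    Returns a list of (width, macro_id) placements, or None if infeasible."""
    folded = []
    for w in tile_widths:
        while w > macro_cols:           # folding: split over-wide tiles
            folded.append(macro_cols)
            w -= macro_cols
        folded.append(w)
    free = [macro_cols] * num_macros
    placement = []
    for w in sorted(folded, reverse=True):
        for m in range(num_macros):
            if free[m] >= w:
                free[m] -= w
                placement.append((w, m))
                break
        else:
            return None                 # no macro has room, even folded
    return placement
```

A 10-column tile on 8-column macros folds into an 8-wide and a 2-wide pass; when total folded width exceeds aggregate macro capacity, the mapper must fall back to time-multiplexed weight reloading.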
6. Quantitative Performance Outcomes
| Use Case | Key Metric | Baseline | PackInfer/Prepacking | Gain |
|---|---|---|---|---|
| LLM Attention (Ning et al., 3 Feb 2026) | TBT Latency | – | 13–20% lower | 1.13–1.20× |
| LLM Attention | Throughput | – | 20% higher | 1.20× |
| LLM Prefilling (Zhao et al., 2024) | Prefill Time | 380 ms | 110 ms (Llama2-7B) | 3.5× |
| PACSET (Madhyastha et al., 2020) | SSD Latency | BFS or DFS | 2–6× lower | 2–6× |
| IMC Inference (Houshmand et al., 2024) | EDP | Stacked/Flat | 10–100× lower | 10–100× |
PackInfer approaches are consistently most beneficial under conditions of input heterogeneity, bandwidth constraints, and limited memory.
7. Limitations and Extensions
PackInfer implementations exhibit certain limitations:
- Greedy Grouping: Heuristic grouping (e.g., longest-first binning) cannot guarantee global optimality for pathological length distributions (Ning et al., 3 Feb 2026).
- Preallocated Headroom: Additional memory is reserved to limit packing churn; aggressive suffix growth may still force repacking (Ning et al., 3 Feb 2026).
- Hardware/Workload Dependency: Tree-layout/hardware-tuning (e.g., PACSET bin/block shape, IMC packing parameters) must be matched to the target system; misconfiguration can degrade performance (Madhyastha et al., 2020, Houshmand et al., 2024).
- Scope: LLM Prepacking optimizes prefilling, not autoregressive generation (dynamic cache bin-packing is open) (Zhao et al., 2024).
- Single-device Designs: Most implementations target single-GPU (LLM) or single-device IMC mapping; scalable multi-device extensions remain an open area (Ning et al., 3 Feb 2026).
Suggested extensions include multi-GPU group packing, sparsity-aware attention packing, dynamic tile-size selection per group, and integration with job-size-aware schedulers (Ning et al., 3 Feb 2026, Houshmand et al., 2024).
In conclusion, PackInfer defines a family of workload- and hardware-adaptive packing mechanisms that enable substantial improvements in inference efficiency across diverse ML settings by directly addressing compute and I/O imbalances with principled grouping, layout, and scheduling strategies (Ning et al., 3 Feb 2026, Madhyastha et al., 2020, Zhao et al., 2024, Houshmand et al., 2024).