
PackInfer: Efficient Inference Packing

Updated 10 February 2026
  • PackInfer is a framework of compute- and I/O-aware packing methods that optimize inference efficiency for transformer-based LLMs and tree ensembles in resource-constrained settings.
  • It utilizes workload-aware grouping and unified Q–K region packing techniques to reduce redundant computation and minimize data movement, improving GPU utilization and lowering latency.
  • PackInfer extends to tree ensembles and IMC accelerators with serialization and bin-packing strategies that decrease I/O overhead and memory footprint while enhancing overall throughput.

PackInfer encompasses a set of compute- and I/O-aware packing methodologies for large-scale machine learning inference, targeting both transformer-based LLM serving and tree-ensemble deployment in resource-constrained settings. The central principle across PackInfer instantiations is elimination of redundant computation or data movement by intelligently grouping and organizing computation/data layouts based on workload heterogeneity and hardware constraints.

1. Motivation and Context

In modern inference serving—whether in transformer LLMs, tree ensembles, or in-memory compute (IMC) accelerators—the mismatch between model execution patterns and real-world request heterogeneity significantly impacts latency, throughput, and resource utilization. Key inefficiencies stem from:

  • Compute imbalance: Batching variable-length sequences or requests leads to underutilization of compute units, especially when using fixed-size tiling approaches (e.g., FlashAttention kernels in LLMs), as short sequences occupy entire tiles (per-tile utilization \eta_i = L_i^2 / T^2), resulting in SM idling on long-request stragglers (Ning et al., 3 Feb 2026).
  • I/O imbalance: Scattered KV cache layouts and redundant prefix reads in LLMs, or random-access loading in tree ensembles stored on disk, waste bandwidth and exacerbate latency (Ning et al., 3 Feb 2026, Madhyastha et al., 2020).
  • Memory bandwidth and capacity: Prefilling in LLMs with high prompt-length variance wastes FLOPs and memory via excessive padding (Zhao et al., 2024); weight-loading in IMC architectures bottlenecks end-to-end energy-delay product (EDP) (Houshmand et al., 2024).

PackInfer addresses these challenges through workload-aware grouping, layout compaction, and online adaptation to evolving token or request distributions.

2. Compute- and I/O-Aware Packing in LLM Inference

PackInfer (Ning et al., 3 Feb 2026) for transformer-based LLMs unifies compute-aware and I/O-aware grouping within a single kernel-level framework:

  • Compute-Aware Grouping: Requests are partitioned into G groups to maximize GPU TCU utilization:

G = \left\lceil \frac{\sum_{i=1}^{N} L_i}{\mathcal{C}} \right\rceil

Grouping is performed greedily, minimizing per-group load imbalance and dynamically re-grouping when the length drift \Delta L exceeds a threshold.

  • Unified Q–K Region Packing: Within each group, all request sequences are offset-packed into contiguous blocks, enabling a single T \times T attention tiling free of per-request padding. Attention scores are masked to prevent cross-request leakage, reducing the tile count from G \cdot \lceil L_{\max}/T \rceil^2 to \lceil \sum_i L_i / T \rceil^2.
  • I/O-Aware Grouping and KV Cache Layout: Shared-prefix requests are identified via group-local prefix tries; common prefix KV blocks are loaded once into a contiguous buffer \mathcal{B}_g, followed by request-unique suffixes. Offsets \mathcal{O}_g[i] record the location of each subblock, and headroom \delta is preallocated per request to absorb future growth.
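The compute-aware grouping step can be sketched as a greedy longest-first heuristic; the function below is an illustrative reconstruction (the function name and the "lightest group first" balancing rule are assumptions, not the paper's code):

```python
import math

def group_requests(lengths, capacity):
    """Greedily pack request lengths into G = ceil(sum(L_i) / C) groups,
    placing each request (longest first) into the currently lightest
    group to minimize per-group load imbalance."""
    g = math.ceil(sum(lengths) / capacity)
    groups = [[] for _ in range(g)]
    loads = [0] * g
    for length in sorted(lengths, reverse=True):
        i = loads.index(min(loads))  # lightest group so far
        groups[i].append(length)
        loads[i] += length
    return groups, loads
```

Re-grouping on length drift would then amount to re-running this routine whenever observed sequence lengths diverge past the threshold.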

This combination minimizes straggler effects, saturates GPU compute, and reduces memory-bandwidth waste. CUDA kernel launches are tailored to group-wide Q and K spans; device memory layouts are rearranged prior to launch, optimizing I/O and smoothing memory fragmentation as generation evolves (Ning et al., 3 Feb 2026).
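A minimal sketch of the I/O-aware KV layout, using a longest-common-prefix scan as a stand-in for the paper's group-local prefix trie (all names and the list-based buffer are illustrative assumptions):

```python
def pack_group_kv(prompts, headroom=4):
    """Store the longest common prefix of a group's token sequences once,
    then each request's unique suffix plus preallocated headroom slots.
    Returns the shared prefix, the packed buffer, and per-request offsets."""
    # Longest common prefix across all prompts (stand-in for a prefix trie).
    prefix = []
    for toks in zip(*prompts):
        if all(t == toks[0] for t in toks):
            prefix.append(toks[0])
        else:
            break
    buffer, offsets = list(prefix), []
    for p in prompts:
        offsets.append(len(buffer))        # O_g[i]: where suffix i starts
        buffer.extend(p[len(prefix):])     # request-unique suffix
        buffer.extend([None] * headroom)   # delta: room for future tokens
    return prefix, buffer, offsets
```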

3. Serialization and Packed Inference for Tree Ensembles

PackInfer (Madhyastha et al., 2020)—as instantiated with PACSET (Packed Serialized Trees)—applies external-memory algorithmic principles to optimize inference for large tree ensembles (random forests, gradient-boosted trees) when model size exceeds available memory:

  • Interleaved Bin Packing: The top D levels of T trees are interleaved across disk blocks, ensuring high static locality. For block size B and node size S_n, the condition D \cdot T \leq \lfloor B/S_n \rfloor determines packing feasibility, and the fraction of useful root/upper-level nodes per I/O is maximized.
  • Statistical Collocation: Nodes in each residual tree (i.e., below the bin) are ordered by leaf cardinality c(\ell) (the number of samples visiting a given leaf), with c(n) = \sum_{\ell \in \mathrm{Leaves}(n)} c(\ell). Weighted DFS traversal ensures frequently used root-to-leaf paths are co-located, further maximizing use per I/O.
  • Block-aware Layout: Nodes are packed into I/O blocks via greedy, block-aligned segmentation, with each new block seeded by the unplaced node with the highest c(n).
  • On-demand Loading: Deserialization during inference loads only the blocks traversed by the input—the model is never fully loaded into RAM. This technique achieves 2–6× lower latency and 50–80% fewer block loads in "larger than RAM" use-cases, with negligible DRAM footprint (Madhyastha et al., 2020).
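The statistical-collocation ordering can be illustrated with a toy tree encoded as nested dicts; `leaf_cardinality` and `weighted_dfs_order` are hypothetical helper names, not PACSET's API:

```python
def leaf_cardinality(node):
    """c(n): sum of c(l) over leaves below n (counts stored at leaves)."""
    if not node.get("children"):
        return node["count"]
    return sum(leaf_cardinality(c) for c in node["children"])

def weighted_dfs_order(node, order=None):
    """DFS that descends into heavier (more frequently traversed) subtrees
    first, so hot root-to-leaf paths end up contiguous on disk."""
    if order is None:
        order = []
    order.append(node["id"])
    for child in sorted(node.get("children", []),
                        key=leaf_cardinality, reverse=True):
        weighted_dfs_order(child, order)
    return order
```

Serializing nodes in this order means an I/O block read for a popular path also prefetches the nodes that path is most likely to visit next.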

4. Efficient Prefilling and Bin-Packing for Transformer Prompts

PackInfer-inspired prepacking (Zhao et al., 2024) for LLM prefilling reduces wasted computation arising from pad-token expansion:

  • Bin-Packing Algorithm: Given k prompts with lengths l_1, \ldots, l_k and bin capacity m = \max_i l_i, prompts are assigned to r \leq k bins such that:

\sum_{i=1}^{k} x_{i,b}\, l_i \leq m\, y_b \quad \forall b

with x_{i,b} \in \{0,1\} and y_b \in \{0,1\}. First-Fit Decreasing achieves O(k \log k) assignment.

  • Attention Mask and Position Encoding Modification: For each bin, a block-diagonal causal attention mask is synthesized to isolate each packed prompt’s tokens, and positional encodings are restarted per prompt to ensure the model output matches unpacked inference.
  • Single-pass KV Cache Construction: The model forward pass is invoked with packed bins, then outputs are “unpacked” into the original k prompt-aligned KV caches.
  • Efficiency Gains: Prefilling time is reduced by 1.6×–3.5×, GPU memory by up to 60%, and batch-size scaling improves by up to 16× before OOM. When all prompts are equal-length, the benefits vanish; for extremely long inputs, the quadratic attention cost remains unavoidable even with streaming (Zhao et al., 2024).
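The two prefilling ingredients above can be sketched compactly: First-Fit Decreasing packing, and the per-bin block-diagonal causal mask with restarted positions (function names are illustrative, not the paper's code):

```python
def ffd_pack(lengths, capacity):
    """First-Fit Decreasing: place each prompt (longest first) into the
    first bin with room, opening a new bin when none fits."""
    bins = []
    for idx, ln in sorted(enumerate(lengths), key=lambda x: -x[1]):
        for b in bins:
            if b["used"] + ln <= capacity:
                b["items"].append(idx)
                b["used"] += ln
                break
        else:
            bins.append({"items": [idx], "used": ln})
    return bins

def bin_mask_and_positions(item_lengths):
    """Block-diagonal causal mask and restarted position ids for one bin:
    token q may attend to token k only if k <= q AND both belong to the
    same packed prompt."""
    total = sum(item_lengths)
    mask = [[False] * total for _ in range(total)]
    positions, start = [], 0
    for ln in item_lengths:
        for q in range(start, start + ln):
            for k in range(start, q + 1):   # causal, within this prompt only
                mask[q][k] = True
        positions.extend(range(ln))         # position ids restart per prompt
        start += ln
    return mask, positions
```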

5. In-Memory Compute (IMC) Accelerators: Weight Packing and Mapping

For IMC hardware accelerators, PackInfer (Houshmand et al., 2024) denotes a hardware-software mapping strategy optimizing both compute throughput and weight-loading overhead:

  • Tiling and Supertiling: Weights are partitioned into tiles (T_i, T_o, T_m) and stacked into supertiles (ST_i, ST_o, ST_m), respecting the macro array dimensions (D_i \times D_o \times D_h \times D_m).
  • Column Packing and Macro Assignment: Supertile subsets are packed into 2D columns, maximizing density \rho_{\mathrm{col}}. Assignment of columns to macros obeys alignment and “one-tile-per-layer-per-macro” constraints via bin-packing. Folding (loop unrolling) is used to shrink tile footprints temporally when assignments are infeasible.
  • Energy-Delay Optimization: Objective is

\mathrm{EDP}_{\mathrm{total}} = E_{\mathrm{total}} \times T_{\mathrm{total}} = \mathrm{EDP}_{\mathrm{compute+act}} + \mathrm{EDP}_{\mathrm{weight\_loading}}

The dominant overhead transitions from DRAM weight-loading (when D_m = 1) to compute as D_m increases. For MLPerf Tiny workloads, once all weights fit on-chip, EDP reductions reach 10×–100× over stacked/flattened baselines (Houshmand et al., 2024).

  • Schedule Generation: PackInfer emits detailed schedules for weight loads, MAC execution, and buffer traffic, bridging network architecture and IMC fabric instantiation.
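As a rough illustration of column-to-macro assignment with folding, consider a toy first-fit packer that splits a column temporally when it fits nowhere. This sketch omits the alignment and one-tile-per-layer-per-macro constraints, and all names and parameters are assumptions, not the paper's mapper:

```python
def assign_columns(column_widths, macro_width, num_macros):
    """First-fit assignment of packed weight columns to IMC macros; a
    column that fits in no macro is 'folded' (split in half, to be loaded
    in successive temporal passes) until its pieces fit."""
    free = [macro_width] * num_macros       # remaining width per macro
    placement, pending = [], sorted(column_widths, reverse=True)
    while pending:
        w = pending.pop(0)
        for m in range(num_macros):
            if w <= free[m]:
                free[m] -= w
                placement.append((w, m))
                break
        else:
            if w <= 1:
                raise ValueError("cannot place column")
            pending.extend([w // 2, w - w // 2])  # fold temporally
            pending.sort(reverse=True)
    return placement, free
```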

6. Quantitative Performance Outcomes

| Use Case | Key Metric | Baseline | PackInfer/Prepacking | Gain |
|---|---|---|---|---|
| LLM attention (Ning et al., 3 Feb 2026) | TBT latency | — | 13–20% lower | 1.13–1.20× |
| LLM attention (Ning et al., 3 Feb 2026) | Throughput | — | 20% higher | 1.20× |
| LLM prefilling (Zhao et al., 2024) | Prefill time (Llama2-7B) | 380 ms | 110 ms | 3.5× |
| PACSET (Madhyastha et al., 2020) | SSD latency | BFS or DFS layout | 2–6× lower | 2–6× |
| IMC inference (Houshmand et al., 2024) | EDP | Stacked/flattened | 10–100× lower | 10–100× |

PackInfer approaches are consistently most beneficial under conditions of input heterogeneity, bandwidth constraints, and limited memory.

7. Limitations and Extensions

PackInfer implementations exhibit certain limitations:

  • Greedy Grouping: Heuristic grouping (e.g., longest-first binning) cannot guarantee global optimality for pathological length distributions (Ning et al., 3 Feb 2026).
  • Preallocated Headroom: Additional memory is reserved to limit packing churn; aggressive suffix growth may still force repacking (Ning et al., 3 Feb 2026).
  • Hardware/Workload Dependency: Tree-layout/hardware-tuning (e.g., PACSET bin/block shape, IMC packing parameters) must be matched to the target system; misconfiguration can degrade performance (Madhyastha et al., 2020, Houshmand et al., 2024).
  • Scope: LLM Prepacking optimizes prefilling, not autoregressive generation (dynamic cache bin-packing is open) (Zhao et al., 2024).
  • Single-device Designs: Most implementations target single-GPU (LLM) or single-device IMC mapping; scalable multi-device extensions remain an open area (Ning et al., 3 Feb 2026).

Suggested extensions include multi-GPU group packing, sparsity-aware attention packing, dynamic tile-size selection per group, and integration with job-size-aware schedulers (Ning et al., 3 Feb 2026, Houshmand et al., 2024).


In conclusion, PackInfer defines a family of workload- and hardware-adaptive packing mechanisms that enable substantial improvements in inference efficiency across diverse ML settings by directly addressing compute and I/O imbalances with principled grouping, layout, and scheduling strategies (Ning et al., 3 Feb 2026, Madhyastha et al., 2020, Zhao et al., 2024, Houshmand et al., 2024).
