Document-Packing Strategies
- Document-packing strategies are a set of algorithmic techniques that efficiently arrange variable-length texts into fixed-size training batches while preserving contextual boundaries.
- They employ methods such as heuristic bin packing, concatenation, padding, and asymmetric slice-level approaches to optimize resource utilization and maintain semantic coherence.
- Empirical evaluations highlight trade-offs in perplexity, throughput, and downstream performance, guiding the selection of strategies based on metrics like packing ratio and fill rate.
Document-packing strategies are a set of algorithmic and workflow techniques with the goal of arranging, segmenting, or combining texts—usually documents or variable-length token sequences—into training or inference batches that maximize the utilization of hardware, preserve contextual integrity, or optimize learning objectives. These strategies have become essential for large-scale LLM pre-training and fine-tuning, document layout analysis, and applications requiring efficient processing of variable-size or multi-modal inputs. Document-packing interacts deeply with core issues such as context coherence, compute throughput, dataset diversity, supervised fine-tuning efficiency, and downstream reasoning or composition ability.
1. Fundamental Principles and Formalisms
Document-packing is grounded in bin-packing and sequence-alignment problems:
- Atom size (a) and maximum sequence length (L, MSL): At the core, document-packing optimizes the allocation of data atoms—token sequences of variable length not exceeding a—into bins or training windows of fixed length L (Chen et al., 2024). If a < L, atoms must be merged or padded to fill a window; if a > L, atoms are split across bins.
- Packing objective: Minimize the number of bins (for maximal throughput), maximize bin fill rate, preserve boundaries (for context), and minimize compute/memory waste or truncation.
- Key metrics: Perplexity for language modeling, packing ratio (average number of raw sequences per packed bin), total tokens processed per GPU-hour, and empty area or density for 2D layout (Chen et al., 2024, Pintea et al., 2012, Wang et al., 2024).
- Packing integrity: Preservation of contextual coherence demands that atom or document boundaries respect semantic units; misaligned packing can cause artificial context fragmentation or corruption (Chen et al., 2024, Ding et al., 2024).
Formalizations typically specify binary decision variables x_{ij} for assigning piece i to bin j under capacity and non-overlap constraints (1D/2D). A standard objective is to minimize the number of opened bins, min Σ_j y_j, subject to Σ_i ℓ_i x_{ij} ≤ L y_j and Σ_j x_{ij} = 1, where ℓ_i is the length of piece i (Ding et al., 2024); geometric and 2D variants extend this with area and overlap constraints (Zhao et al., 2024, Pintea et al., 2012).
2. Packing Algorithms: Heuristics and Optimizations
2.1 Heuristic Bin Packing
- Best-Fit-Decreasing (BFD): Documents/chunks are sorted in descending size, packed into bins whose remaining space is minimized but sufficient (classic 1D bin packing) (Ding et al., 2024).
- First-Fit-Decreasing (FFD): Chunks are sorted and greedily placed into the first available bin with capacity, fast but can sometimes create more waste (Yin et al., 28 May 2025).
- Greedy Packing: For SFT, sequences are sorted in descending order and allocated to bins maximizing fill below L, preserving conversation or document boundaries whenever possible (Wang et al., 2024). Complexity is typically O(n log n) or O(n^2); segment-tree acceleration is used in large-scale implementations (Ding et al., 2024).
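The heuristics above can be sketched in a few lines; a minimal First-Fit-Decreasing packer, assuming `lengths` are token counts and `L` is the bin capacity:

```python
def ffd_pack(lengths, L):
    """First-Fit-Decreasing: place each sequence (largest first) into the
    first bin with enough remaining capacity, opening a new bin otherwise."""
    caps = []        # remaining capacity of each open bin
    assignment = []  # indices of the sequences placed in each bin
    for i in sorted(range(len(lengths)), key=lambda i: lengths[i], reverse=True):
        for b, cap in enumerate(caps):
            if lengths[i] <= cap:
                caps[b] -= lengths[i]
                assignment[b].append(i)
                break
        else:  # no existing bin fits: open a new one
            caps.append(L - lengths[i])
            assignment.append([i])
    return assignment, caps
```

Best-Fit-Decreasing differs only in the inner loop: instead of the first fitting bin, it selects the fitting bin with the smallest remaining capacity.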
2.2 Concatenation and Padding
- Concatenation (“concat”): Documents are streamed with boundary tokens and cut into bins, often resulting in context seams but perfect fill (Chen et al., 2024).
- Padding: Atoms (documents or sequences) are terminated with a boundary token and right-padded to exactly L tokens, preserving one document per chunk but at the cost of extra padding tokens and more training steps (Chen et al., 2024).
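A minimal sketch of the concat strategy described above, assuming `EOS_ID = 2` as an illustrative boundary token id:

```python
EOS_ID = 2  # assumed boundary token id

def concat_chunks(tokenized_docs, L):
    """Stream documents separated by EOS tokens, then cut the stream into
    fixed-length bins of exactly L tokens. The trailing partial bin is
    dropped here; it could instead be padded."""
    stream = []
    for doc in tokenized_docs:
        stream.extend(doc)
        stream.append(EOS_ID)
    return [stream[i:i + L] for i in range(0, len(stream) - L + 1, L)]
```

Every bin is perfectly full, but document boundaries fall wherever the stream happens to be cut, which is exactly the "context seam" problem noted above.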
2.3 Fine-Grained and Asymmetric Packing
- SlimPack (slice-level and asymmetric): Decomposes input into small “slices,” balancing forward and backward computational loads via MILP-based partitioning, attuned to asymmetric cost profiles (backward attention costs roughly 2.5× the forward pass) (Liu et al., 30 Sep 2025). The pipeline consists of DP-balance, MicroPack MILP, and critical-path simulation.
2.4 Packing for Document Layout
- Mesh-candidate BestFit (2D): Used for image-like document layout, maintains a dynamic set of empty rectangular meshes and greedily fills with elements maximizing local area utilization, subject to containment, non-overlap, and implicit aesthetic regularity (Zhao et al., 2024). This approach favors “well-aligned” and dense layouts.
2.5 Packing for Retrieval/Sliding-Window Attention
- Window-level packing: For transformers with local attention, documents are cut into overlapping windows, batching only windows with real tokens, which substantially reduces padding overhead for variable-length documents (Hofstätter et al., 2020). In document ranking, this enables near 50% reduction in wasted computation.
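The window-level idea can be sketched as follows; `window` and `stride` are illustrative parameters, with `window - stride` tokens of overlap between consecutive windows:

```python
def make_windows(tokens, window, stride):
    """Cut one document into overlapping windows so that only windows holding
    real tokens are batched, instead of padding every document in the batch
    to the length of the longest one."""
    out = []
    start = 0
    while start < len(tokens):
        out.append(tokens[start:start + window])
        if start + window >= len(tokens):
            break  # last window reached the end of the document
        start += stride
    return out
```

Batching windows rather than whole documents means a short document contributes one window while a long one contributes many, and no window carries padding beyond its own tail.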
3. Empirical Performance, Trade-Offs, and Metrics
Empirical benchmarking has quantified the trade-offs associated with each method:
| Method | Perplexity (↓) | Throughput (tokens/GPU-h ↑) | Contextual Integrity | Padding Overhead | Truncation/Fragmentation | Downstream Task Gains |
|---|---|---|---|---|---|---|
| Concat (a=L) | Higher | Highest | Mixed contexts | None | High | Baseline |
| Padding (a=L) | Lowest | Lower (~15–45% slower) | Full document | Some | None | + PPL, + downstream |
| BFD/FFD BinPack | Lower | Near Concat | Document intact | ≤ 0.01% extra | Minimal | +4–20% on tasks |
| SlimPack | N/A | 1.15–2.8× over baselines | Sample/slice-order | N/A | Flexible | Up to 2.8× speedup |
- Source: (Chen et al., 2024, Ding et al., 2024, Liu et al., 30 Sep 2025, Wang et al., 2024).
- Setting atom size equal to the maximum sequence length (a = L = MSL) yields minimal perplexity and the best trade-off, aligning the context window to the atom and eliminating spurious concatenation (Chen et al., 2024).
- Padding always achieves lower perplexity at the expense of more training steps per epoch, while concatenation favors speed.
- In Best-fit Packing, unnecessary truncations are reduced, and sequence utilization compared with concat is essentially identical (≤+0.003% in large-scale runs) (Ding et al., 2024).
- Empirical downstream gains from optimal packing are significant: +4.7% (reading comprehension), +16.8% (context following), +9.2% (program synthesis), and hallucination reductions of up to 58.3% (Ding et al., 2024).
- SlimPack achieves up to 2.8× throughput improvement by balancing slice assignments, crucial for extreme long-context or heavy-tailed input-size distributions (Liu et al., 30 Sep 2025).
- High packing ratios and speedups are best achieved by greedy or slice-level approaches for large-scale SFT; typical wall-clock savings of 60–85% are reported (Wang et al., 2024).
4. Domain-Specific and Advanced Packing Strategies
4.1 Continual Pre-training
- Seamless Packing: Combines a sliding-window overlap mechanism for long documents with FFD-packing of short remainders; achieves up to 2 pp downstream task gains and reduces context discontinuity from 40% (concat) to <5% (Yin et al., 28 May 2025).
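A rough sketch of this hybrid scheme, under the assumption that documents longer than L are cut into overlapping L-token windows while shorter ones are bin-packed with FFD (the `overlap` width is an illustrative parameter):

```python
def seamless_pack(tokenized_docs, L, overlap):
    """Long documents (> L tokens) become overlapping L-token windows, so no
    window starts with zero context; short documents are bin-packed
    First-Fit-Decreasing into bins of capacity L."""
    sequences, shorts = [], []
    for doc in tokenized_docs:
        if len(doc) > L:
            start = 0
            while start < len(doc):
                sequences.append(doc[start:start + L])
                if start + L >= len(doc):
                    break
                start += L - overlap  # slide, keeping `overlap` tokens of context
        else:
            shorts.append(doc)
    bins, caps = [], []  # packed short docs and each bin's remaining capacity
    for doc in sorted(shorts, key=len, reverse=True):
        for b, cap in enumerate(caps):
            if len(doc) <= cap:
                bins[b] = bins[b] + doc
                caps[b] -= len(doc)
                break
        else:
            bins.append(list(doc))
            caps.append(L - len(doc))
    return sequences + bins
```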
4.2 Multi-hop Reasoning
- Packings for Cross-document Reasoning: Enables latent multi-hop capability by assembling sequences containing 4–6 documents, always with cross-document attention. Packing beyond this “sweet spot” degrades precision and increases hallucination (Prato et al., 16 Dec 2025). Epoch-wise repacking is necessary to avoid overfitting to static document groupings.
4.3 Supervised Fine-Tuning (SFT)
- Random vs. Greedy Packing in SFT: Greedy packing preserves multi-turn context integrity, yielding up to 4.5 pt improvement on GPT-evaluated metrics for large LLaMA-3-70B models on 1M+ datasets (Wang et al., 2024). Gains are muted for small models or datasets. The effective batch size, batch size × learning rate scaling, and ratio of multi-turn to single-turn examples critically mediate SFT efficiency and representation learning.
4.4 Document Layout Analysis
- Mesh-based Bin Packing for Layout Synthesis: DocLayout-YOLO uses mesh-candidate best-fit to maximize local fill rate and maintain global alignment and density, achieving best-in-class alignment (0.0009) and density (0.645) in synthetic document generation (Zhao et al., 2024).
5. Practical Implementation and Engineering Considerations
- Implementation Pipelines:
- Pre-tokenize documents, append boundary tokens (EOS/SEP), chunk to the target length, and pack with the chosen strategy.
- Apply randomized shuffling or epoch-wise repacking to maximize data coverage and avoid memorized context groupings (Chen et al., 2024, Prato et al., 16 Dec 2025).
- Cross-document attention masking is essential for logical document independence in packing; applying it makes a packed batch behave like a set of distinct samples (Ding et al., 2024).
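Such a mask can be sketched in pure Python (in practice it would be materialized as a tensor or expressed via variable-length attention kernels); position i may attend to position j only when both fall in the same packed document and j ≤ i:

```python
def block_diagonal_mask(doc_lengths):
    """Causal attention mask restricted to document blocks: entry [i][j] is
    True only when positions i and j belong to the same packed document
    and j <= i (causality)."""
    doc_id = [d for d, n in enumerate(doc_lengths) for _ in range(n)]
    T = len(doc_id)
    return [[doc_id[i] == doc_id[j] and j <= i for j in range(T)]
            for i in range(T)]
```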
- Example pseudocode (padding, a = L) (Chen et al., 2024):

```python
import random
import torch
from torch.utils.data import Dataset

EOS_ID = 2  # boundary token id (model-specific)

class PaddingDataset(Dataset):
    def __init__(self, tokenized_docs, L):
        self.chunks = []
        for doc in tokenized_docs:
            # cut each document into atoms of at most L-1 tokens
            for i in range(0, len(doc), L - 1):
                slice_ = doc[i:i + L - 1] + [EOS_ID]    # boundary token
                slice_ += [EOS_ID] * (L - len(slice_))  # right-pad to L
                self.chunks.append(torch.tensor(slice_))
        random.shuffle(self.chunks)

    def __len__(self):
        return len(self.chunks)

    def __getitem__(self, idx):
        x = self.chunks[idx]
        return x[:-1], x[1:]  # next-token prediction inputs and targets
```
- Hyperparameter Checklist:
- Atom size a chosen to balance contextual coherence against shuffling randomness
- Batch size tuned to maximize hardware use
- Position encoding that enables efficient variable context lengths (ALiBi/rotary)
- Rigorous monitoring of packing ratio and sequence fill rates
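Monitoring fill rate amounts to counting used token slots; a minimal helper, assuming each packed sequence is a list of real (non-padding) tokens and L is the bin capacity:

```python
def fill_rate(packed_sequences, L):
    """Fraction of the L * num_bins token slots occupied by real tokens;
    the remainder is padding (or truncation headroom)."""
    used = sum(len(seq) for seq in packed_sequences)
    return used / (len(packed_sequences) * L)
```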
- Limitations: An excessively large L relative to typical document size increases padding waste; fine-grained methods may incur MILP solver costs (SlimPack); and SFT requires bespoke hyperparameter tuning for effective LR/batch-size scaling.
6. Implications, Best Practices, and Future Directions
- Best Practices:
- For autoregressive LM training, set atom size equal to MSL; prefer padding for maximal accuracy, concat for throughput, and packing for balanced trade-off (Chen et al., 2024).
- For SFT, greedy packing is preferred for dialog/multi-turn; random packing is adequate for single-turn (Wang et al., 2024).
- In multi-document tasks, pack 4–6 documents per sequence and enable cross-document attention (Prato et al., 16 Dec 2025).
- For layout synthesis, mesh-candidate best-fit yields high alignment and density (Zhao et al., 2024).
- Outlook: Scaling and hybridization of fine-grained techniques (e.g., SlimPack, mesh-candidate methods), tighter integration of pipeline simulation and hardware-aware scheduling, and adaptive tuning (via auto-tuned solvers) are active research directions. For tasks requiring inter-document relations, careful balancing of pack size and attention scope is critical to avoid hallucination and maximize emergent reasoning ability.
- Key conceptual finding: Packing methods must balance resource utilization, representational coherence, and downstream generalization. There is no one-size-fits-all solution—the workload, data distribution, and objective all dictate the optimal packing strategy.
7. References
- "Refining Packing and Shuffling Strategies for Enhanced Performance in Generative LLMs" (Chen et al., 2024)
- "Fewer Truncations Improve Language Modeling" (Ding et al., 2024)
- "SlimPack: Fine-Grained Asymmetric Packing for Balanced and Efficient Variable-Length LLM Training" (Liu et al., 30 Sep 2025)
- "Packing Analysis: Packing Is More Appropriate for Large Models or Datasets in Supervised Fine-tuning" (Wang et al., 2024)
- "Improving Continual Pre-training Through Seamless Data Packing" (Yin et al., 28 May 2025)
- "Effect of Document Packing on the Latent Multi-Hop Reasoning Capabilities of LLMs" (Prato et al., 16 Dec 2025)
- "DocLayout-YOLO: Enhancing Document Layout Analysis through Diverse Synthetic Data and Global-to-Local Adaptive Perception" (Zhao et al., 2024)
- "Comparing several heuristics for a packing problem" (Pintea et al., 2012)
- "Local Self-Attention over Long Text for Efficient Document Retrieval" (Hofstätter et al., 2020)