
Document-Packing Strategies

Updated 23 December 2025
  • Document-packing strategies are a set of algorithmic techniques that efficiently arrange variable-length texts into fixed-size training batches while preserving contextual boundaries.
  • They employ methods such as heuristic bin packing, concatenation, padding, and asymmetric slice-level approaches to optimize resource utilization and maintain semantic coherence.
  • Empirical evaluations highlight trade-offs in perplexity, throughput, and downstream performance, guiding the selection of strategies based on metrics like packing ratio and fill rate.

Document-packing strategies are a set of algorithmic and workflow techniques with the goal of arranging, segmenting, or combining texts—usually documents or variable-length token sequences—into training or inference batches that maximize the utilization of hardware, preserve contextual integrity, or optimize learning objectives. These strategies have become essential for large-scale LLM pre-training and fine-tuning, document layout analysis, and applications requiring efficient processing of variable-size or multi-modal inputs. Document-packing interacts deeply with core issues such as context coherence, compute throughput, dataset diversity, supervised fine-tuning efficiency, and downstream reasoning or composition ability.

1. Fundamental Principles and Formalisms

Document-packing is grounded in bin-packing and sequence-alignment problems:

  • Atom size (a) and maximum sequence length (L, MSL): At the core, document-packing optimizes the allocation of data atoms—token sequences of variable length not exceeding L—into bins or training windows of fixed length L (Chen et al., 2024). If a < L, atoms must be merged or padded; if a > L, atoms are split across bins.
  • Packing objective: Minimize the number of bins (for maximal throughput), maximize bin fill rate, preserve boundaries (for context), and minimize compute/memory waste or truncation.
  • Key metrics: Perplexity for language modeling, packing ratio $\rho=\frac{\sum_{i=1}^{B}\ell_i}{B\,L_\text{max}}$ (with $\ell_i$ the filled length of bin $i$ over $B$ bins), total tokens processed per GPU-hour, and empty area or density for 2D layout (Chen et al., 2024, Pintea et al., 2012, Wang et al., 2024).
  • Packing integrity: Preservation of contextual coherence demands that atom or document boundaries respect semantic units; misaligned packing can cause artificial context fragmentation or corruption (Chen et al., 2024, Ding et al., 2024).
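The packing-ratio metric above is straightforward to compute; a minimal sketch, with illustrative function and argument names rather than any cited implementation:

```python
def packing_ratio(bins, L_max):
    """rho = (sum of filled bin lengths) / (B * L_max); values near 1.0
    mean little padding waste. `bins` holds packed token sequences."""
    B = len(bins)
    return sum(len(b) for b in bins) / (B * L_max)

# Three bins of capacity 8 holding 7, 8, and 5 tokens: rho = 20/24 ≈ 0.83
rho = packing_ratio([[1] * 7, [1] * 8, [1] * 5], L_max=8)
```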

Formalizations typically specify decision variables $x_{i,j}\in\{0,1\}$ for assigning piece $i$ to bin $j$ under capacity and non-overlap constraints (1D/2D), with objectives such as

$$\min_{x,y}\ \sum_j y_j \quad \text{s.t.} \quad \sum_{i} \ell(c_i)\,x_{i,j}\leq L\,y_j \ \ \forall j, \qquad \sum_{j} x_{i,j}=1 \ \ \forall i$$

(Ding et al., 2024), or their geometric and 2D variants (Zhao et al., 2024, Pintea et al., 2012).
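For intuition, the bin-minimization objective can be checked exactly on tiny instances by brute force (a sketch for illustration only; exact bin packing is NP-hard, which is why practical systems use the heuristics below):

```python
from itertools import product

def min_bins_exact(lengths, L):
    """Smallest number of bins such that each piece is assigned to exactly
    one bin and no bin load exceeds capacity L. Exponential time."""
    n = len(lengths)
    for B in range(1, n + 1):
        for assign in product(range(B), repeat=n):  # every piece-to-bin map
            loads = [0] * B
            for piece, bin_j in zip(lengths, assign):
                loads[bin_j] += piece
            if all(load <= L for load in loads):
                return B
    return 0

# pieces 5, 4, 3, 3 with capacity 8 fit into two bins: {5, 3} and {4, 3}
```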

2. Packing Algorithms: Heuristics and Optimizations

2.1 Heuristic Bin Packing

  • Best-Fit-Decreasing (BFD): Documents/chunks are sorted in descending size, packed into bins whose remaining space is minimized but sufficient (classic 1D bin packing) (Ding et al., 2024).
  • First-Fit-Decreasing (FFD): Chunks are sorted and greedily placed into the first available bin with capacity, fast but can sometimes create more waste (Yin et al., 28 May 2025).
  • Greedy Packing: For SFT, sequences are sorted in descending order and allocated to bins so as to maximize fill below $L_\text{max}$, preserving conversation or document boundaries whenever possible (Wang et al., 2024). Complexity is typically $O(N\log L)$ or $O(N^2)$; segment-tree acceleration is used in large-scale implementations (Ding et al., 2024).
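A minimal FFD sketch over pre-tokenized lengths (illustrative names; a best-fit variant would instead pick the bin with the least remaining space that still fits):

```python
def first_fit_decreasing(lengths, L_max):
    """Sort pieces descending, place each into the first bin with room,
    opening a new bin when none fits. The linear scan over bins makes this
    O(N^2) worst case; segment trees make the bin lookup logarithmic."""
    bins, loads = [], []
    for item in sorted(lengths, reverse=True):
        for j, load in enumerate(loads):
            if load + item <= L_max:
                bins[j].append(item)
                loads[j] += item
                break
        else:  # no open bin fits: open a new one
            bins.append([item])
            loads.append(item)
    return bins

# first_fit_decreasing([7, 5, 4, 3, 1], 8) -> [[7, 1], [5, 3], [4]]
```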

2.2 Concatenation and Padding

  • Concatenation (“concat”): Documents are streamed with boundary tokens and cut into bins, often resulting in context seams but perfect fill (Chen et al., 2024).
  • Padding: Each atom (document or sequence) is terminated with a boundary token and right-padded to exactly L, preserving one document per chunk at the cost of extra padding tokens and more training steps (Chen et al., 2024).
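The concatenation strategy above reduces to streaming and slicing; a minimal sketch, assuming a placeholder EOS_ID boundary token:

```python
EOS_ID = 0  # placeholder boundary token id (assumption, tokenizer-specific)

def concat_pack(tokenized_docs, L):
    """Stream documents separated by EOS and cut the stream into fixed bins
    of L tokens: every bin is perfectly full, but documents can be split
    mid-context ("seams"). The final partial bin is dropped here."""
    stream = []
    for doc in tokenized_docs:
        stream.extend(doc + [EOS_ID])
    return [stream[i:i + L] for i in range(0, len(stream) - L + 1, L)]
```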

2.3 Fine-Grained and Asymmetric Packing

  • SlimPack (slice-level and asymmetric): Decomposes input into small “slices,” balancing forward and backward computational loads via MILP-based partitioning, attuned to asymmetric cost profiles (backward attention costs roughly 2.5× the forward pass) (Liu et al., 30 Sep 2025). The pipeline consists of DP-balance, MicroPack MILP, and critical-path simulation.

2.4 Packing for Document Layout

  • Mesh-candidate BestFit (2D): Used for image-like document layout, maintains a dynamic set of empty rectangular meshes and greedily fills with elements maximizing local area utilization, subject to containment, non-overlap, and implicit aesthetic regularity (Zhao et al., 2024). This approach favors “well-aligned” and dense layouts.

2.5 Packing for Retrieval/Sliding-Window Attention

  • Window-level packing: For transformers with local attention, documents are cut into overlapping windows, batching only windows with real tokens, which substantially reduces padding overhead for variable-length documents (Hofstätter et al., 2020). In document ranking, this enables near 50% reduction in wasted computation.
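A sketch of the window cutting, with illustrative `window` and `stride` parameters (the cited work pairs this with local attention; only the batching idea is shown here):

```python
def sliding_windows(doc, window, stride):
    """Cut a document into overlapping fixed-size windows and keep only
    windows containing real tokens, instead of padding every document in
    a batch out to the longest one."""
    windows = []
    for start in range(0, max(len(doc) - window + stride, 1), stride):
        piece = doc[start:start + window]
        if piece:
            windows.append(piece)
    return windows

# a 10-token doc with window 6, stride 4 yields windows [0:6] and [4:10]
```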

3. Empirical Performance, Trade-Offs, and Metrics

Empirical benchmarking has quantified the trade-offs associated with each method:

| Method | Perplexity (↓) | Throughput (tokens/GPU-h ↑) | Contextual Integrity | Padding Overhead | Truncation/Fragmentation | Downstream Task Gains |
|---|---|---|---|---|---|---|
| Concat (a = L) | Higher | Highest | Mixed contexts | None | High | Baseline |
| Padding (a = L) | Lowest | Lower (~15–45% slower) | Full document | Some | None | + PPL, + downstream |
| BFD/FFD BinPack | Lower | Near concat | Document intact | ≤0.01% extra | Minimal | +4–20% on tasks |
| SlimPack | N/A | 1.15–2.8× over baselines | Sample/slice-order | N/A | Flexible | Up to 2.8× speedup |
  • Source: (Chen et al., 2024, Ding et al., 2024, Liu et al., 30 Sep 2025, Wang et al., 2024).
  • Setting atom size a = L (MSL) yields minimal perplexity and the best trade-off, aligning the context window to the atom and eliminating spurious concatenation (Chen et al., 2024).
  • Padding always achieves lower perplexity at the expense of additional training steps, while concatenation favors speed.
  • In Best-fit Packing, unnecessary truncations are reduced, and sequence utilization compared with concat is essentially identical (≤+0.003% in large-scale runs) (Ding et al., 2024).
  • Empirical downstream gains from optimal packing are significant: +4.7% (reading comprehension), +16.8% (context following), +9.2% (program synthesis), and hallucination reductions of up to 58.3% (Ding et al., 2024).
  • SlimPack achieves up to 2.8× throughput improvement by balancing slice assignments, crucial for extreme long-context or heavy-tailed input size distributions (Liu et al., 30 Sep 2025).
  • Packing ratio ρ and speedup S are best supported by greedy or slice-level approaches for large SFT; typical wall-clock savings of 60–85% are reported (Wang et al., 2024).

4. Domain-Specific and Advanced Packing Strategies

4.1 Continual Pre-training

  • Seamless Packing: Combines a sliding-window overlap mechanism for long documents ($r_\text{max}\approx 0.3$) with FFD-packing of short remainders; achieves up to 2 pp downstream task gains and reduces context discontinuity from 40% (concat) to <5% (Yin et al., 28 May 2025).
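The sliding-window half of this idea can be sketched as follows (an illustrative approximation, not the paper's implementation; the short remainder chunks would then be FFD-packed):

```python
def overlap_chunks(doc, L, r_max=0.3):
    """Chunk a long document into length-L pieces whose boundaries overlap
    by up to r_max * L tokens, so no chunk starts at an abrupt break."""
    overlap = int(r_max * L)
    stride = L - overlap
    return [doc[i:i + L] for i in range(0, max(len(doc) - overlap, 1), stride)]

# consecutive chunks of a 20-token doc (L=10) share their 3 boundary tokens
```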

4.2 Multi-hop Reasoning

  • Packings for Cross-document Reasoning: Enables latent multi-hop capability by assembling sequences containing 4–6 documents, always with cross-document attention. Packing beyond this “sweet spot” degrades precision and increases hallucination (Prato et al., 16 Dec 2025). Epoch-wise repacking is necessary to avoid overfitting to static document groupings.
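Epoch-wise repacking is simple to realize; a hedged sketch with illustrative names:

```python
import random

def repack_each_epoch(doc_ids, group_size, epoch_seed):
    """Reshuffle which documents share a packed sequence every epoch, so
    the model never overfits to one static grouping; group_size would sit
    in the 4-6 sweet spot."""
    rng = random.Random(epoch_seed)  # deterministic per epoch
    ids = list(doc_ids)
    rng.shuffle(ids)
    return [ids[i:i + group_size] for i in range(0, len(ids), group_size)]
```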

4.3 Supervised Fine-Tuning (SFT)

  • Random vs. Greedy Packing in SFT: Greedy packing preserves multi-turn context integrity, yielding up to 4.5 pt improvement on GPT-evaluated metrics for large LLaMA-3-70B models on 1M+ datasets (Wang et al., 2024). Gains are muted for small models or datasets. The effective batch size, batch size × learning rate scaling, and ratio of multi-turn to single-turn examples critically mediate SFT efficiency and representation learning.

4.4 Document Layout Analysis

  • Mesh-based Bin Packing for Layout Synthesis: DocLayout-YOLO uses mesh-candidate best-fit to maximize local fill rate and maintain global alignment and density, achieving best-in-class alignment (0.0009) and density (0.645) in synthetic document generation (Zhao et al., 2024).

5. Practical Implementation and Engineering Considerations

  • Implementation Pipelines:
    • Pre-tokenize documents, append boundary (EOS/SEP), chunk to targets, and pack with chosen strategy.
    • Apply randomized shuffling or epoch-wise repacking to maximize data coverage and avoid memorized context groupings (Chen et al., 2024, Prato et al., 16 Dec 2025).
    • Cross-document attention masking is essential for logical document independence in packing; with the mask enabled, a packed batch behaves equivalently to a batch of distinct samples (Ding et al., 2024).
  • Example Pseudocode (Padding, a = L) (Chen et al., 2024):

import random

import torch
from torch.utils.data import Dataset

EOS_ID = 0  # placeholder: substitute the tokenizer's actual EOS id

class PaddingDataset(Dataset):
    def __init__(self, tokenized_docs, L):
        self.chunks = []
        for doc in tokenized_docs:
            # L-1 content tokens per chunk leaves room for the EOS boundary
            for i in range(0, len(doc), L - 1):
                slice_ = doc[i:i + L - 1] + [EOS_ID]
                slice_ += [EOS_ID] * (L - len(slice_))  # right-pad to L
                self.chunks.append(torch.tensor(slice_))
        random.shuffle(self.chunks)  # chunk-level shuffle across documents

    def __len__(self):
        return len(self.chunks)

    def __getitem__(self, idx):
        x = self.chunks[idx]
        return x[:-1], x[1:]  # inputs and shifted next-token targets

  • Hyperparameter Checklist:
    • Atom size a = L for coherence and randomness
    • Batch size tuned to maximize hardware use
    • Position encoding that enables efficient variable context lengths (ALiBi/rotary)
    • Rigorous monitoring of packing ratio ρ and sequence fill rates
  • Limitations: Excessively large L relative to document size increases padding waste; fine-grained methods may incur MILP solver costs (SlimPack), and SFT requires bespoke hyperparameter tuning for effective LR/batch-size scaling.

6. Implications, Best Practices, and Future Directions

  • Best Practices:
    • For autoregressive LM training, set atom size a equal to MSL; prefer padding for maximal accuracy, concat for throughput, and packing for a balanced trade-off (Chen et al., 2024).
    • For SFT, greedy packing is preferred for dialog/multi-turn; random packing is adequate for single-turn (Wang et al., 2024).
    • In multi-document tasks, pack 4–6 documents per sequence and enable cross-document attention (Prato et al., 16 Dec 2025).
    • For layout synthesis, mesh-candidate best-fit yields high alignment and density (Zhao et al., 2024).
  • Outlook: Scaling and hybridization of fine-grained techniques (e.g., SlimPack, mesh-candidate methods), tighter integration of pipeline simulation and hardware-aware scheduling, and adaptive tuning (via auto-tuned solvers) are active research directions. For tasks requiring inter-document relations, careful balancing of pack size and attention scope is critical to avoid hallucination and maximize emergent reasoning ability.
  • Key conceptual finding: Packing methods must balance resource utilization, representational coherence, and downstream generalization. There is no one-size-fits-all solution—the workload, data distribution, and objective all dictate the optimal packing strategy.

7. References

  • "Refining Packing and Shuffling Strategies for Enhanced Performance in Generative LLMs" (Chen et al., 2024)
  • "Fewer Truncations Improve Language Modeling" (Ding et al., 2024)
  • "SlimPack: Fine-Grained Asymmetric Packing for Balanced and Efficient Variable-Length LLM Training" (Liu et al., 30 Sep 2025)
  • "Packing Analysis: Packing Is More Appropriate for Large Models or Datasets in Supervised Fine-tuning" (Wang et al., 2024)
  • "Improving Continual Pre-training Through Seamless Data Packing" (Yin et al., 28 May 2025)
  • "Effect of Document Packing on the Latent Multi-Hop Reasoning Capabilities of LLMs" (Prato et al., 16 Dec 2025)
  • "DocLayout-YOLO: Enhancing Document Layout Analysis through Diverse Synthetic Data and Global-to-Local Adaptive Perception" (Zhao et al., 2024)
  • "Comparing several heuristics for a packing problem" (Pintea et al., 2012)
  • "Local Self-Attention over Long Text for Efficient Document Retrieval" (Hofstätter et al., 2020)
