In-context Pretraining: Language Modeling Beyond Document Boundaries

Published 16 Oct 2023 in cs.CL, cs.AI, and cs.LG | (2310.10638v6)

Abstract: Large LMs are currently trained to predict tokens given document prefixes, enabling them to directly perform long-form generation and prompting-style tasks which can be reduced to document completion. Existing pretraining pipelines train LMs by concatenating random sets of short documents to create input contexts, but the prior documents provide no signal for predicting the next document. We instead present In-Context Pretraining, a new approach where LLMs are pretrained on a sequence of related documents, thereby explicitly encouraging them to read and reason across document boundaries. We can do In-Context Pretraining by simply changing the document ordering so that each context contains related documents, and directly applying existing pretraining pipelines. However, this document sorting problem is challenging. There are billions of documents and we would like the sort to maximize contextual similarity for every document without repeating any data. To do this, we introduce approximate algorithms for finding related documents with efficient nearest neighbor search and constructing coherent input contexts with a graph traversal algorithm. Our experiments show In-Context Pretraining offers a simple and scalable approach to significantly enhance LMs' performance: we see notable improvements in tasks that require more complex contextual reasoning, including in-context learning (+8%), reading comprehension (+15%), faithfulness to previous contexts (+16%), long-context reasoning (+5%), and retrieval augmentation (+9%).

Summary

The paper presents In-Context Pretraining, which improves LLM performance by training on sequences of semantically related documents, assembled with efficient nearest neighbor search and graph traversal.
It reports lower perplexity on diverse datasets, an 8% gain in in-context learning on text classification, and a 15% gain on reading comprehension tasks.
The method also improves retrieval-augmented tasks and faithfulness to the given context, pointing toward more coherent and context-aware language models.

The paper, In-Context Pretraining: Language Modeling Beyond Document Boundaries, introduces a method for pretraining LLMs that aims to improve their ability to understand and reason across document boundaries. Standard LLM pretraining concatenates randomly selected short documents to form each input context. The model still spends compute attending to the preceding documents, yet they offer no predictive cue for the documents that follow, so that compute yields no useful training signal. To address this limitation, the authors propose In-Context Pretraining, which fills each context with a sequence of related documents to provide richer context and improve overall LLM performance.
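The baseline described above can be sketched in a few lines. This is a minimal illustration, not the authors' pipeline: the helper names `pack_contexts` and `standard_order`, the `<eos>` separator token, and the choice to drop the trailing partial context are all assumptions made for the example.

```python
import random

def pack_contexts(docs, context_len, eos="<eos>"):
    """Concatenate tokenized documents in the given order and cut the
    token stream into fixed-length contexts.

    Assumption: the trailing partial context is dropped."""
    stream = []
    for tokens in docs:
        stream.extend(tokens)
        stream.append(eos)  # hypothetical separator between documents
    n_full = len(stream) // context_len
    return [stream[i * context_len:(i + 1) * context_len] for i in range(n_full)]

def standard_order(docs, seed=0):
    """Baseline ordering: a random permutation of the corpus."""
    shuffled = list(docs)
    random.Random(seed).shuffle(shuffled)
    return shuffled

# Toy corpus of three already-tokenized documents.
docs = [["a1", "a2", "a3"], ["b1", "b2"], ["c1", "c2", "c3", "c4"]]
contexts = pack_contexts(standard_order(docs), context_len=4)
for ctx in contexts:
    print(ctx)
```

Under this framing, In-Context Pretraining leaves `pack_contexts` untouched and only replaces `standard_order` with an ordering that keeps related documents adjacent.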

Methodology

In-Context Pretraining rests on placing related documents next to each other in the training sequence. The method has two components:

Efficient Nearest Neighbor Search: Documents are embedded with the Contriever model, and an approximate nearest neighbor (ANN) search over the embeddings links each document to its most similar neighbors. The result is a document graph whose edges connect semantically similar documents.
Document Graph Traversal: Ordering the documents is formulated as a maximum traveling salesman problem on this graph. A traversal algorithm builds a path that visits every document exactly once while favoring high-similarity edges, so that each input context window holds semantically coherent documents.
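The two components above can be sketched on toy data. This is a simplified illustration under stated assumptions, not the authors' implementation: exact brute-force cosine similarity stands in for ANN search, hand-written 2-d vectors stand in for Contriever embeddings, and `greedy_path` is one plausible greedy heuristic for the traversal (follow the most similar unvisited neighbor, otherwise restart from the lowest-degree unvisited document). All function names are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def knn_graph(embeddings, k):
    """Exact k-nearest-neighbor graph (stand-in for ANN search).

    Returns {doc: {neighbor: similarity}}, with edges made symmetric."""
    n = len(embeddings)
    graph = {i: {} for i in range(n)}
    for i in range(n):
        sims = [(cosine(embeddings[i], embeddings[j]), j) for j in range(n) if j != i]
        for sim, j in sorted(sims, reverse=True)[:k]:
            graph[i][j] = sim
            graph[j][i] = sim
    return graph

def greedy_path(graph):
    """Visit every document exactly once, preferring the most similar
    unvisited neighbor of the current document."""
    unvisited = set(graph)
    path = []
    current = None
    while unvisited:
        candidates = []
        if current is not None:
            candidates = [(w, j) for j, w in graph[current].items() if j in unvisited]
        if candidates:
            current = max(candidates)[1]  # highest-similarity unvisited neighbor
        else:
            # No unvisited neighbor: restart from the unvisited document with
            # the fewest unvisited neighbors (ties broken by document id).
            current = min(
                unvisited,
                key=lambda d: (sum(1 for j in graph[d] if j in unvisited), d),
            )
        path.append(current)
        unvisited.remove(current)
    return path

# Toy embeddings: documents 0 and 2 share one topic, 1 and 3 another.
embeddings = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.1, 0.9]]
path = greedy_path(knn_graph(embeddings, k=1))
print(path)
```

The resulting order keeps same-topic documents adjacent, and every document appears exactly once, so no data is repeated. At the scale of billions of documents the brute-force similarity loop would need to be replaced by a real ANN index.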

Experimental Setup and Results

The authors pretrain LLMs ranging from 0.3 to 7 billion parameters on 300 billion tokens obtained from the CommonCrawl dataset. They evaluate the proposed method across various tasks that measure different aspects of language modeling and contextual reasoning: standard language modeling, in-context learning, reading comprehension, retrieval augmentation, and handling of knowledge conflicts.

Key Findings:

Language Modeling: In-Context Pretraining consistently demonstrated lower perplexity across Wikipedia, Arxiv, and Books datasets (see Figure 1), outperforming standard pretraining and the k-NN baseline.
In-Context Learning: Evaluations on seven text classification datasets showed an average improvement of 8%. This result underscores the model’s superior ability to leverage demonstration examples.
Reading Comprehension: The methodology achieved a 15% average gain across tasks like RACE, SQuAD, and HotpotQA, showcasing enhanced complex contextual reasoning.
Retrieval-Augmentation: Performance on open-domain QA tasks improved by 9% when the model was augmented with external knowledge sources, suggesting it makes better use of retrieved passages placed in its context.
Factuality and Knowledge Conflicts: The proposed method outperformed baselines on knowledge conflict datasets like NQ-Swap and MemoTrap, highlighting improved generation fidelity to prior contexts.
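For readers unfamiliar with the language modeling metric above, perplexity is the exponential of the mean negative log-likelihood per token, so lower values mean the model assigns higher probability to the evaluation text. The sketch below uses made-up token probabilities purely for illustration; none of the numbers come from the paper, and `perplexity` is a hypothetical helper.

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp(mean negative log-likelihood per token).

    Assumes natural-log probabilities, one per predicted token."""
    if not token_logprobs:
        raise ValueError("need at least one token")
    nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(nll)

# Hypothetical per-token probabilities from two models on the same text.
uniform = [math.log(0.25)] * 8    # every token predicted with p = 0.25
confident = [math.log(0.5)] * 8   # every token predicted with p = 0.5

print(perplexity(uniform), perplexity(confident))
```

A model that predicts each token with probability 0.25 is as uncertain as a uniform choice among four options, which is why its perplexity comes out at four.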

Implications and Future Directions

These results matter for both research and practice. The gains on tasks that depend on longer and more varied contexts suggest that LLMs trained with In-Context Pretraining could be better at deep contextual comprehension, at using retrieved documents, and at staying factually consistent with the context they are given.

Future developments could explore the cross-linguistic applications of this algorithm by grouping related documents in multilingual corpora. Moreover, investigating the inherent connections within specific domains, such as code repositories or medical texts, could extend the relevance and applicability of this approach. Integrating this pretraining approach with multitask finetuning strategies could further enhance its effectiveness, particularly for instruction-based models.

In-Context Pretraining offers a promising and scalable direction that fits into existing pretraining pipelines, since it changes only the preprocessing step that orders documents. This simple change points toward more coherent and contextually aware LLMs that can understand, generate, and reason over text within and beyond document boundaries.