Training-Free Long-Context Scaling of Large Language Models
Abstract: The ability of LLMs to process and generate coherent text is markedly weakened when the number of input tokens exceeds their pretraining length. Given the expensive overhead of finetuning large-scale models with longer sequences, we propose Dual Chunk Attention (DCA), which enables Llama2 70B to support context windows of more than 100k tokens without continual training. By decomposing the attention computation for long sequences into chunk-based modules, DCA effectively captures the relative positional information of tokens within the same chunk (Intra-Chunk) and across distinct chunks (Inter-Chunk), and integrates seamlessly with Flash Attention. In addition to its impressive extrapolation capability, DCA achieves performance on practical long-context tasks that is comparable to or even better than that of finetuned models. When compared with proprietary models, our training-free 70B model attains 94% of the performance of gpt-3.5-16k, indicating it is a viable open-source alternative. All code and data used in this work are released at \url{https://github.com/HKUNLP/ChunkLlama}.
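The core idea the abstract describes, keeping every relative position the attention sees within the pretrained range by computing positions chunk-wise, can be illustrated with a toy sketch. The function below is a simplified illustration, not the released ChunkLlama implementation: exact relative distances are used inside a chunk (Intra-Chunk), while distances across chunks (Inter-Chunk) are capped so they never exceed the pretraining window. The function name and parameters (`chunk_size`, `max_pos`) are illustrative assumptions.

```python
import numpy as np

def chunked_rel_pos(seq_len: int, chunk_size: int, max_pos: int) -> np.ndarray:
    """Toy causal relative-position matrix M[i, j] for query i, key j (j <= i).

    Intra-chunk pairs keep their true distance; inter-chunk pairs are
    clamped to max_pos - 1 so no position index exceeds the pretrained range.
    """
    M = np.zeros((seq_len, seq_len), dtype=int)
    for i in range(seq_len):
        for j in range(i + 1):  # causal mask: only attend to the past
            if i // chunk_size == j // chunk_size:
                # Intra-Chunk: exact relative distance, always < chunk_size
                M[i, j] = i - j
            else:
                # Inter-Chunk: cap the distance at the pretrained window
                M[i, j] = min(i - j, max_pos - 1)
    return M
```

With `seq_len=8`, `chunk_size=4`, `max_pos=6`, the pair (7, 0) is 7 tokens apart but receives the clamped index 5, so the model never sees a position index it was not trained on. The real method additionally distinguishes successive chunks to preserve locality near chunk boundaries, a detail this sketch omits.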
- L-eval: Instituting standardized evaluation for long context language models. arXiv preprint arXiv:2307.11088, 2023.
- Anthropic. Introducing 100K Context Windows, 2023. URL https://www.anthropic.com/index/100k-context-windows.
- Clex: Continuous length extrapolation for large language models, 2023a.
- Extending context window of large language models via positional interpolation, 2023b.
- Longlora: Efficient fine-tuning of long-context large language models. arXiv:2309.12307, 2023c.
- Dissecting transformer length extrapolation via the lens of receptive field analysis, 2023.
- Generating long sequences with sparse transformers. arXiv preprint arXiv:1904.10509, 2019.
- Monotonic location attention for length generalization, 2023.
- Computer, T. Redpajama: an open dataset for training large language models, 2023. URL https://github.com/togethercomputer/RedPajama-Data.
- Dao, T. Flashattention-2: Faster attention with better parallelism and work partitioning, 2023.
- Flashattention: Fast and memory-efficient exact attention with io-awareness. In NeurIPS, 2022.
- A dataset of information-seeking questions and answers anchored in research papers. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 4599–4610, Online, June 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.naacl-main.365. URL https://aclanthology.org/2021.naacl-main.365.
- Lm-infinite: Simple on-the-fly length generalization for large language models, 2023.
- Two stones hit one bird: Bilevel positional encoding for better length extrapolation, 2024.
- Lora: Low-rank adaptation of large language models, 2021.
- Llm maybe longlm: Self-extend llm context window without tuning, 2024.
- The impact of positional encoding on length generalization in transformers, 2023.
- The NarrativeQA reading comprehension challenge. Transactions of the Association for Computational Linguistics, 6:317–328, 2018. doi: 10.1162/tacl_a_00023. URL https://aclanthology.org/Q18-1023.
- Prompted llms as chatbot modules for long open-domain conversation. In Findings of the Association for Computational Linguistics: ACL 2023. Association for Computational Linguistics, 2023. doi: 10.18653/v1/2023.findings-acl.277. URL http://dx.doi.org/10.18653/v1/2023.findings-acl.277.
- How long can open-source llms truly promise on context length?, 2023a.
- Functional interpolation for relative positions improves long context transformers, 2023b.
- Lost in the middle: How language models use long contexts, 2023a.
- Scaling laws of rope-based extrapolation, 2023b.
- LMSYS. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality, 2023. URL https://lmsys.org/blog/2023-03-30-vicuna/.
- LocalLLaMA. Dynamically scaled rope further increases performance of long context llama with zero fine-tuning, July 2023a. URL https://www.reddit.com/r/LocalLLaMA/comments/14mrgpr/dynamically_scaled_rope_further_increases/.
- LocalLLaMA. Ntk-aware scaled rope allows llama models to have extended (8k+) context size without any fine-tuning and minimal perplexity degradation., June 2023b. URL https://www.reddit.com/r/LocalLLaMA/comments/14lz7j5/ntkaware_scaled_rope_allows_llama_models_to_have/.
- Landmark attention: Random-access infinite context length for transformers. arXiv preprint arXiv:2305.16300, 2023.
- MosaicML. Introducing mpt-30b: Raising the bar for open-source foundation models, 2023a. URL www.mosaicml.com/blog/mpt-30b. Accessed: 2023-06-22.
- MosaicML. Introducing mpt-7b: A new standard for open-source, commercially usable llms, 2023b. URL www.mosaicml.com/blog/mpt-7b.
- OpenAI. Gpt-4 technical report, 2023.
- QuALITY: Question answering with long input texts, yes! In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 5336–5358, Seattle, United States, July 2022. Association for Computational Linguistics. doi: 10.18653/v1/2022.naacl-main.391. URL https://aclanthology.org/2022.naacl-main.391.
- Yarn: Efficient context window extension of large language models, 2023.
- Train short, test long: Attention with linear biases enables input length extrapolation, 2022.
- Lightning attention-2: A free lunch for handling unlimited sequence lengths in large language models. ArXiv, abs/2401.04658, 2024. URL https://api.semanticscholar.org/CorpusID:266900042.
- Compressive transformers for long-range sequence modelling. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020. OpenReview.net, 2020. URL https://openreview.net/forum?id=SylKikSYDH.
- Parallel context windows for large language models, 2023.
- The probabilistic relevance framework: Bm25 and beyond. Foundations and Trends® in Information Retrieval, 3(4):333–389, 2009.
- Code llama: Open foundation models for code, 2023.
- Procedural text mining with large language models, 2023.
- Randomized positional encodings boost length generalization of transformers, 2023.
- Pdftriage: Question answering over long, structured documents, 2023.
- Zebra: Extending context window with layerwise grouped local-global attention, 2023.
- Su, J. Rectified rotary position embeddings. https://github.com/bojone/rerope, 2023.
- Roformer: Enhanced transformer with rotary position embedding, 2022.
- A length-extrapolatable transformer, 2022.
- Stanford alpaca: An instruction-following llama model. https://github.com/tatsu-lab/stanford_alpaca, 2023.
- Together. Llama-2-7b-32k-instruct — and fine-tuning for llama-2 models with together api, 2023. URL https://together.ai/blog/llama-2-7b-32k-instruct.
- Llama: Open and efficient foundation language models, 2023a.
- Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023b.
- Focused transformer: Contrastive training for context scaling, 2023.
- Attention is all you need, 2017.
- Learning to retrieve in-context examples for large language models, 2024.
- Leveraging large language models to power chatbots for collecting user self-reported data, 2023.
- Efficient streaming language models with attention sinks, 2023.
- Effective long-context scaling of foundation models. CoRR, abs/2309.16039, 2023. doi: 10.48550/ARXIV.2309.16039. URL https://doi.org/10.48550/arXiv.2309.16039.
- Compositional exemplars for in-context learning. arXiv preprint arXiv:2302.05698, 2023.
- Linear attention via orthogonal memory. ArXiv, abs/2312.11135, 2023. URL https://api.semanticscholar.org/CorpusID:266359128.
- Soaring from 4k to 400k: Extending llm’s context with activation beacon. ArXiv, abs/2401.03462, 2024. URL https://api.semanticscholar.org/CorpusID:266844488.
- QMSum: A new benchmark for query-based multi-domain meeting summarization. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 5905–5921, Online, June 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.naacl-main.472. URL https://aclanthology.org/2021.naacl-main.472.
- Pose: Efficient context window extension of llms via positional skip-wise training, 2023.