Scaling Laws of RoPE-based Extrapolation

Published 8 Oct 2023 in cs.CL and cs.AI | arXiv:2310.05209v2

Abstract: The extrapolation capability of LLMs based on Rotary Position Embedding (RoPE) is currently a topic of considerable interest. The mainstream approach to extrapolation with such LLMs modifies RoPE by replacing 10000, the rotary base of $\theta_n = 10000^{-2n/d}$ in the original RoPE, with a larger value and providing longer fine-tuning text. In this work, we first observe that fine-tuning a RoPE-based LLM with either a smaller or a larger base within the pre-training context length can significantly enhance its extrapolation performance. We then propose \textbf{\textit{Scaling Laws of RoPE-based Extrapolation}}, a unified framework from the periodic perspective that describes how extrapolation performance depends on the base value and the tuning context length. In the process, we also explain the origin of the RoPE-based extrapolation issue through the \textbf{\textit{critical dimension for extrapolation}}. Beyond these observations and analyses, we achieve extrapolation up to 1 million context length with only 16K training length on LLaMA2 7B and 13B.
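To make the base-scaling knob concrete, the following is a minimal PyTorch sketch of RoPE with an adjustable rotary base, assuming the standard interleaved rotary formulation; the function names and the `base` argument are illustrative, not the authors' released implementation.

```python
import torch

def rope_frequencies(head_dim: int, base: float = 10000.0) -> torch.Tensor:
    """Angular frequencies theta_n = base^(-2n/d) for n = 0 .. d/2 - 1."""
    two_n = torch.arange(0, head_dim, 2, dtype=torch.float32)  # 2n = 0, 2, ..., d - 2
    return base ** (-two_n / head_dim)                          # shape: (head_dim // 2,)

def apply_rope(x: torch.Tensor, positions: torch.Tensor,
               base: float = 10000.0) -> torch.Tensor:
    """Rotate query/key features x of shape (..., seq_len, head_dim) by position.

    Changing `base` (to a smaller or larger value before fine-tuning) is the
    adjustment described in the abstract; the rest is standard RoPE.
    """
    theta = rope_frequencies(x.size(-1), base)              # (head_dim // 2,)
    angles = positions[..., None].float() * theta           # (..., seq_len, head_dim // 2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]                     # even / odd feature pairs
    rotated = torch.stack((x1 * cos - x2 * sin,
                           x1 * sin + x2 * cos), dim=-1)
    return rotated.flatten(-2)                               # back to (..., seq_len, head_dim)

# Illustrative usage: the specific base values below are placeholders, not the
# paper's reported settings. A larger base stretches the rotary periods, which
# the paper relates to extrapolation behaviour after fine-tuning at a fixed
# context length.
q = torch.randn(1, 16384, 128)                               # (batch, seq_len, head_dim)
pos = torch.arange(16384)
q_small_base = apply_rope(q, pos, base=500.0)
q_large_base = apply_rope(q, pos, base=1_000_000.0)
```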

