CoTFormer: A Chain-of-Thought Driven Architecture with Budget-Adaptive Computation Cost at Inference

Published 16 Oct 2023 in cs.CL and cs.LG | arXiv:2310.10845v2

Abstract: Scaling LLMs to larger and deeper sizes has led to significant boosts in performance. Even though the size of these models limits their application in compute-constrained environments, the race to continually develop ever larger and deeper foundational models is underway. At the same time -- regardless of the model size -- task-specific techniques continue to play a pivotal role in achieving optimal downstream performance. One of these techniques, called Chain-of-Thought (CoT), is particularly interesting since, as we point out in this work, it resembles employing a deeper transformer through re-applying the model multiple times. However, a key subtlety in computing the attention of past tokens differentiates CoT from simply applying the model several times. Based on this insight, we propose CoTFormer, a novel architecture which closely mimics CoT at the token level, allowing us to obtain significantly improved accuracies close to much larger models. While applying CoT introduces additional computation costs, we compensate for it by leveraging CoTFormer's special compatibility with token-wise variable depth. Through a compute adaptive model -- which automatically allocates the compute to tokens that need it most -- we show that it is possible to reduce the computation cost significantly without any reduction in accuracy, and with further compute cost reductions possible while maintaining a competitive accuracy.


Summary

  • The paper introduces CoTFormer, a transformer that uses intermediary token generation to simulate deeper reasoning steps.
  • Its chain-of-thought-style mechanism lets later tokens attend to earlier intermediary tokens, approaching the capacity of much deeper models.
  • Empirical results show significant perplexity reductions on OpenWebText2, demonstrating its effectiveness in resource-constrained setups.

CoTFormer: More Tokens With Attention Make Up For Less Depth

The paper "CoTFormer: More Tokens With Attention Make Up For Less Depth" by Amirkeivan Mohtashami, Matteo Pagliardini, and Martin Jaggi from EPFL introduces a novel transformer architecture aimed at enhancing model efficiency and performance. This work innovatively combines the principle of Chain-of-Thought (CoT) with the Transformer architecture to propose the CoTFormer, which achieves comparable capacity to deeper models through an implicit CoT-like mechanism.

Introduction

The study addresses the ongoing trend of developing increasingly larger foundational models built on the Transformer architecture. These models have exhibited remarkable zero-shot and few-shot learning capabilities across a wide range of tasks. Nevertheless, even very large models struggle in domains such as mathematical problem-solving. Chain-of-Thought (CoT) prompting mitigates these difficulties by eliciting step-by-step reasoning, yielding significant performance improvements. The authors draw a parallel between CoT and deep transformers, arguing that the CoT mechanism can approximate a deeper model's capacity.

Method

The CoTFormer architecture is predicated on the generation of intermediary tokens that carry intermediate reasoning steps akin to CoT. This mechanism enables the model to achieve a depth-like effect through iterative token addition and attention. The process involves multiple passes of the initial input through the model, generating new tokens at each pass and interleaving them with the existing tokens. This iterative process results in a sequence where newer tokens can attend to previous ones, mimicking the depth of a multi-layer transformer.
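The interleaving step described above can be sketched in a few lines. This is a structural illustration only: the list representation, the per-token `apply_block` stand-in, and the `n_passes` argument are assumptions for the sketch, not the paper's implementation (which runs a weight-shared transformer block stack with full attention on each pass).

```python
def cotformer_interleave(tokens, apply_block, n_passes):
    """Sketch of CoTFormer-style interleaving: each extra pass produces one
    intermediary token per existing token and interleaves it into the
    sequence, so later passes can attend to all earlier intermediary tokens.
    The sequence length doubles with every additional pass."""
    seq = list(tokens)
    for _ in range(n_passes - 1):
        outputs = [apply_block(t) for t in seq]          # one pass over the sequence
        # interleave: each token is followed by its new intermediary token
        seq = [t for pair in zip(seq, outputs) for t in pair]
    return seq

# toy "block": append a prime to mark one more pass through the model
demo = cotformer_interleave(["x1", "x2"], lambda t: t + "'", n_passes=2)
# demo is ["x1", "x1'", "x2", "x2'"]: each token followed by its intermediary
```

With `n_passes` passes the sequence grows by a factor of `2**(n_passes - 1)`, which is where the extra attention cost discussed later comes from.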

In each pass, every token contributes an additional intermediary token that can directly attend to all prior tokens, original and intermediary alike. This attention to intermediary tokens is what distinguishes CoTFormer from approaches such as Block Universal Transformers, which apply the same layers recursively but never expose intermediary states as attendable positions.
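The distinction can be made concrete with attention masks. In this sketch we assume a layout where original token i sits at position 2i and its intermediary token at 2i+1; that indexing convention is our assumption for illustration, not necessarily the paper's.

```python
def cotformer_mask(n_orig):
    """Causal attention mask over an interleaved sequence of length
    2*n_orig (original token i at position 2i, its intermediary token at
    2i+1). Every position may attend to all earlier positions, *including*
    intermediary tokens -- the property the paper highlights."""
    length = 2 * n_orig
    return [[q >= k for k in range(length)] for q in range(length)]

def block_universal_mask(n_orig):
    """Contrast: a Block Universal Transformer re-applies the block stack
    but keeps the sequence length fixed, so there are no intermediary
    positions to attend to -- attention stays over the original tokens."""
    return [[q >= k for k in range(n_orig)] for q in range(n_orig)]
```

For two original tokens, `cotformer_mask(2)` lets the second original token (position 2) attend to the first token's intermediary token (position 1), an edge that simply does not exist in the Block Universal case.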

Experiments

The empirical evaluation was conducted on two language modeling datasets: PG19 and OpenWebText2. The experiments benchmarked the CoTFormer against standard transformers and Block Universal Transformers. Training settings included a fixed number of steps, the AdamW optimizer, and a learning rate schedule.
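As a concrete instance of such a setup, a warmup-then-cosine learning-rate schedule is a common choice with AdamW and a fixed step budget. The paper states only that these three ingredients were used; the schedule shape and all values below are placeholder assumptions for the sketch.

```python
import math

def lr_at_step(step, total_steps, peak_lr, warmup_steps):
    """Illustrative warmup-then-cosine schedule: linear warmup to peak_lr,
    then cosine decay to zero over the remaining (fixed) step budget."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return 0.5 * peak_lr * (1 + math.cos(math.pi * progress))

# e.g. a 1000-step budget with 100 warmup steps and a peak of 3e-4
lr = lr_at_step(step=500, total_steps=1000, peak_lr=3e-4, warmup_steps=100)
```

In a real run this function would feed a `torch.optim.lr_scheduler.LambdaLR` (or equivalent) wrapped around the AdamW optimizer.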

The results, summarized in Table 1 and Figure 1, illustrate the efficacy of CoTFormer. Notably, a CoTFormer with 24 layers and an interleaving factor of 2 (n=2) outperformed a standard 48-layer transformer. This outcome underscores the importance of allowing attention to intermediary tokens: CoTFormer consistently outperformed the Block Universal Transformer at a given depth. For instance, CoTFormer significantly reduced perplexity on the OpenWebText2 dataset compared to models of equivalent computational depth.
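The trade-off behind the 24-layer-versus-48-layer comparison can be sketched with a rough cost model. The scalings below (attention ~ L²·d per layer, projections/MLP ~ L·d² per layer, constants dropped, sequence roughly doubling per pass) are our simplifying assumptions, not figures from the paper, but they show why shared weights save parameters while attention cost grows.

```python
def transformer_cost(n_layers, seq_len, d):
    """Very rough per-sequence compute model for a standard transformer:
    attention ~ seq_len^2 * d and projections/MLP ~ seq_len * d^2 per
    layer, with all constant factors dropped."""
    return n_layers * (seq_len**2 * d + seq_len * d**2)

def cotformer_cost(n_layers, seq_len, d, n_passes):
    """Each pass re-runs the same (weight-shared) n_layers stack, but the
    sequence roughly doubles every pass as intermediary tokens are
    interleaved, so attention cost grows across passes."""
    total, length = 0, seq_len
    for _ in range(n_passes):
        total += n_layers * (length**2 * d + length * d**2)
        length *= 2
    return total

# 24 shared layers applied twice vs. 48 distinct layers: half the
# parameters (24 vs. 48 layer weight sets), but more attention compute
# because the second pass attends over a doubled sequence.
cot = cotformer_cost(24, 256, 64, n_passes=2)
std = transformer_cost(48, 256, 64)
```

Under this model the CoTFormer configuration costs more compute than the equal-depth baseline, which matches the paper's point that memory (parameter count) is where the savings lie, and motivates the adaptive-depth mechanism discussed next.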

Implications and Future Work

The findings suggest that CoTFormer can achieve superior performance with fewer layers by leveraging the generation of intermediary tokens and their respective attention mechanisms. This approach offers practical implications, especially where memory limitations dominate and model size must be minimized without compromising performance. While the computational cost remains high due to the quadratic complexity of attention mechanisms, the significant reduction in model size could be beneficial in resource-constrained scenarios.

Future research directions include exploring depth-adaptive architectures, where the number of intermediary token passes varies among tokens, potentially further enhancing performance. The authors also suggest optimizing the efficiency of CoTFormer to better balance computational overhead and performance gains.
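Token-wise variable depth can be sketched as an early-exit loop. The router below (`needs_more_compute`) is a hypothetical stand-in: the abstract describes a compute-adaptive model that allocates passes to the tokens that need them, but the halting criterion here is purely illustrative.

```python
def adaptive_passes(tokens, needs_more_compute, max_passes):
    """Sketch of token-wise adaptive depth: after each pass, a
    (hypothetical) router decides whether the token takes another pass.
    Easy tokens exit after one pass; hard tokens use the full budget."""
    passes_used = {}
    for tok in tokens:
        p = 1
        while p < max_passes and needs_more_compute(tok, p):
            p += 1
        passes_used[tok] = p
    return passes_used

# toy router: treat longer tokens as "hard" so they keep requesting compute
usage = adaptive_passes(["the", "therefore"],
                        lambda tok, p: len(tok) > 5,
                        max_passes=4)
# usage: {"the": 1, "therefore": 4}
```

The average of `passes_used.values()` is the effective compute budget, which is how such a mechanism can cut cost without touching accuracy on tokens that genuinely need the extra passes.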

Conclusion

The CoTFormer presents an innovative method that bridges the gap between traditional deep transformers and CoT mechanisms, providing a means to build performant, yet shallower, models. By emphasizing the ability to allow attention to intermediary tokens, CoTFormer demonstrates a compelling advantage over existing architectures like Block Universal Transformers. This work offers meaningful insights into enhancing transformer models and lays the groundwork for future exploration into efficient, scalable model architectures.
