CoTFormer: A Chain-of-Thought Driven Architecture with Budget-Adaptive Computation Cost at Inference

Published 16 Oct 2023 in cs.CL and cs.LG | arXiv:2310.10845v2

Abstract: Scaling LLMs to larger and deeper sizes has led to significant boosts in performance. Even though the size of these models limits their application in compute-constrained environments, the race to continually develop ever larger and deeper foundational models is underway. At the same time -- regardless of the model size -- task-specific techniques continue to play a pivotal role in achieving optimal downstream performance. One of these techniques, called Chain-of-Thought (CoT), is particularly interesting since, as we point out in this work, it resembles employing a deeper transformer through re-applying the model multiple times. However, a key subtlety in computing the attention of past tokens differentiates CoT from simply applying the model several times. Based on this insight, we propose CoTFormer, a novel architecture which closely mimics CoT at the token level, allowing us to obtain significantly improved accuracies close to much larger models. While applying CoT introduces additional computation costs, we compensate for it by leveraging CoTFormer's special compatibility with token-wise variable depth. Through a compute adaptive model -- which automatically allocates the compute to tokens that need it most -- we show that it is possible to reduce the computation cost significantly without any reduction in accuracy, and with further compute cost reductions possible while maintaining a competitive accuracy.


Summary

  • The paper introduces CoTFormer, a transformer that uses intermediary token generation to simulate deeper reasoning steps.
  • Its chain-of-thought-style mechanism lets later tokens attend to earlier intermediary tokens, approaching the capacity of much deeper models.
  • Empirical results show significant perplexity reductions on OpenWebText2, demonstrating its effectiveness in resource-constrained setups.

CoTFormer: More Tokens With Attention Make Up For Less Depth

The paper "CoTFormer: More Tokens With Attention Make Up For Less Depth" by Amirkeivan Mohtashami, Matteo Pagliardini, and Martin Jaggi from EPFL introduces a novel transformer architecture aimed at enhancing model efficiency and performance. This work innovatively combines the principle of Chain-of-Thought (CoT) with the Transformer architecture to propose the CoTFormer, which achieves comparable capacity to deeper models through an implicit CoT-like mechanism.

Introduction

The study addresses the ongoing trend of developing increasingly larger foundational models built on the Transformer architecture. These models have exhibited remarkable zero-shot and few-shot learning capabilities across a wide range of tasks. Nevertheless, even very large models struggle in domains such as mathematical problem-solving. Chain-of-Thought (CoT) prompting mitigates these difficulties by eliciting step-by-step reasoning, yielding significant performance improvements. The authors draw a parallel between CoT and deep transformers, arguing that the CoT mechanism can approximate a deeper model's capacity.

Method

The CoTFormer architecture is predicated on the generation of intermediary tokens that carry intermediate reasoning steps akin to CoT. This mechanism enables the model to achieve a depth-like effect through iterative token addition and attention. The process involves multiple passes of the initial input through the model, generating new tokens at each pass and interleaving them with the existing tokens. This iterative process results in a sequence where newer tokens can attend to previous ones, mimicking the depth of a multi-layer transformer.
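The interleaving step described above can be sketched in a few lines. This is a structural illustration only: the list representation, the per-token `apply_block` stand-in, and the `n_passes` argument are assumptions for the sketch, not the paper's implementation (which runs a weight-shared transformer block stack with full attention on each pass).

```python
def cotformer_interleave(tokens, apply_block, n_passes):
    """Sketch of CoTFormer-style interleaving: each extra pass produces one
    intermediary token per existing token and interleaves it into the
    sequence, so later passes can attend to all earlier intermediary tokens.
    The sequence length doubles with every additional pass."""
    seq = list(tokens)
    for _ in range(n_passes - 1):
        outputs = [apply_block(t) for t in seq]          # one pass over the sequence
        # interleave: each token is followed by its new intermediary token
        seq = [t for pair in zip(seq, outputs) for t in pair]
    return seq

# toy "block": append a prime to mark one more pass through the model
demo = cotformer_interleave(["x1", "x2"], lambda t: t + "'", n_passes=2)
# demo is ["x1", "x1'", "x2", "x2'"]: each token followed by its intermediary
```

With `n_passes` passes the sequence grows by a factor of `2**(n_passes - 1)`, which is where the extra attention cost discussed later comes from.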

In each pass, every token contributes an additional intermediary token that can directly attend to all prior tokens, original and intermediary alike. This attention to intermediary tokens is what distinguishes CoTFormer from approaches such as Block Universal Transformers, which apply the same layers recursively but never expose intermediary states as attendable positions.
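The distinction can be made concrete with attention masks. In this sketch we assume a layout where original token i sits at position 2i and its intermediary token at 2i+1; that indexing convention is our assumption for illustration, not necessarily the paper's.

```python
def cotformer_mask(n_orig):
    """Causal attention mask over an interleaved sequence of length
    2*n_orig (original token i at position 2i, its intermediary token at
    2i+1). Every position may attend to all earlier positions, *including*
    intermediary tokens -- the property the paper highlights."""
    length = 2 * n_orig
    return [[q >= k for k in range(length)] for q in range(length)]

def block_universal_mask(n_orig):
    """Contrast: a Block Universal Transformer re-applies the block stack
    but keeps the sequence length fixed, so there are no intermediary
    positions to attend to -- attention stays over the original tokens."""
    return [[q >= k for k in range(n_orig)] for q in range(n_orig)]
```

For two original tokens, `cotformer_mask(2)` lets the second original token (position 2) attend to the first token's intermediary token (position 1), an edge that simply does not exist in the Block Universal case.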

Experiments

The empirical evaluation was conducted on two language modeling datasets: PG19 and OpenWebText2. The experiments benchmarked the CoTFormer against standard transformers and Block Universal Transformers. Training settings included a fixed number of steps, the AdamW optimizer, and a learning rate schedule.
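As a concrete instance of such a setup, a warmup-then-cosine learning-rate schedule is a common choice with AdamW and a fixed step budget. The paper states only that these three ingredients were used; the schedule shape and all values below are placeholder assumptions for the sketch.

```python
import math

def lr_at_step(step, total_steps, peak_lr, warmup_steps):
    """Illustrative warmup-then-cosine schedule: linear warmup to peak_lr,
    then cosine decay to zero over the remaining (fixed) step budget."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return 0.5 * peak_lr * (1 + math.cos(math.pi * progress))

# e.g. a 1000-step budget with 100 warmup steps and a peak of 3e-4
lr = lr_at_step(step=500, total_steps=1000, peak_lr=3e-4, warmup_steps=100)
```

In a real run this function would feed a `torch.optim.lr_scheduler.LambdaLR` (or equivalent) wrapped around the AdamW optimizer.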

The results, summarized in Table 1 and Figure 1, illustrate the efficacy of CoTFormer. Notably, a CoTFormer with 24 layers and an interleaving factor of 2 (n=2) outperformed a standard 48-layer transformer. This outcome underscores the importance of allowing attention to intermediary tokens: CoTFormer consistently outperformed the Block Universal Transformer at a given depth. For instance, CoTFormer significantly reduced perplexity on the OpenWebText2 dataset compared to models of equivalent computational depth.
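The trade-off behind the 24-layer-versus-48-layer comparison can be sketched with a rough cost model. The scalings below (attention ~ L²·d per layer, projections/MLP ~ L·d² per layer, constants dropped, sequence roughly doubling per pass) are our simplifying assumptions, not figures from the paper, but they show why shared weights save parameters while attention cost grows.

```python
def transformer_cost(n_layers, seq_len, d):
    """Very rough per-sequence compute model for a standard transformer:
    attention ~ seq_len^2 * d and projections/MLP ~ seq_len * d^2 per
    layer, with all constant factors dropped."""
    return n_layers * (seq_len**2 * d + seq_len * d**2)

def cotformer_cost(n_layers, seq_len, d, n_passes):
    """Each pass re-runs the same (weight-shared) n_layers stack, but the
    sequence roughly doubles every pass as intermediary tokens are
    interleaved, so attention cost grows across passes."""
    total, length = 0, seq_len
    for _ in range(n_passes):
        total += n_layers * (length**2 * d + length * d**2)
        length *= 2
    return total

# 24 shared layers applied twice vs. 48 distinct layers: half the
# parameters (24 vs. 48 layer weight sets), but more attention compute
# because the second pass attends over a doubled sequence.
cot = cotformer_cost(24, 256, 64, n_passes=2)
std = transformer_cost(48, 256, 64)
```

Under this model the CoTFormer configuration costs more compute than the equal-depth baseline, which matches the paper's point that memory (parameter count) is where the savings lie, and motivates the adaptive-depth mechanism discussed next.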

Implications and Future Work

The findings suggest that CoTFormer can achieve superior performance with fewer layers by leveraging the generation of intermediary tokens and their respective attention mechanisms. This approach offers practical implications, especially where memory limitations dominate and model size must be minimized without compromising performance. While the computational cost remains high due to the quadratic complexity of attention mechanisms, the significant reduction in model size could be beneficial in resource-constrained scenarios.

Future research directions include exploring depth-adaptive architectures, where the number of intermediary token passes varies among tokens, potentially further enhancing performance. The authors also suggest optimizing the efficiency of CoTFormer to better balance computational overhead and performance gains.
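Token-wise variable depth can be sketched as an early-exit loop. The router below (`needs_more_compute`) is a hypothetical stand-in: the abstract describes a compute-adaptive model that allocates passes to the tokens that need them, but the halting criterion here is purely illustrative.

```python
def adaptive_passes(tokens, needs_more_compute, max_passes):
    """Sketch of token-wise adaptive depth: after each pass, a
    (hypothetical) router decides whether the token takes another pass.
    Easy tokens exit after one pass; hard tokens use the full budget."""
    passes_used = {}
    for tok in tokens:
        p = 1
        while p < max_passes and needs_more_compute(tok, p):
            p += 1
        passes_used[tok] = p
    return passes_used

# toy router: treat longer tokens as "hard" so they keep requesting compute
usage = adaptive_passes(["the", "therefore"],
                        lambda tok, p: len(tok) > 5,
                        max_passes=4)
# usage: {"the": 1, "therefore": 4}
```

The average of `passes_used.values()` is the effective compute budget, which is how such a mechanism can cut cost without touching accuracy on tokens that genuinely need the extra passes.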

Conclusion

The CoTFormer presents an innovative method that bridges the gap between traditional deep transformers and CoT mechanisms, providing a means to build performant, yet shallower, models. By emphasizing the ability to allow attention to intermediary tokens, CoTFormer demonstrates a compelling advantage over existing architectures like Block Universal Transformers. This work offers meaningful insights into enhancing transformer models and lays the groundwork for future exploration into efficient, scalable model architectures.
