- The paper introduces MIDdle grAdual Stacking, demonstrating that incremental layer training improves reasoning in large language models.
- It empirically shows that models from 1B to 8B parameters achieve competitive performance on math word problems and reading comprehension tasks.
- The study reveals that preserving functional diversity via mid-layer duplication improves reasoning-oriented inductive bias while reaching pretraining perplexity comparable to baselines.
Inductive Bias of Stacking in LLM Training
The paper offers a nuanced exploration of gradual stacking, a promising training strategy for LLMs. Gradual stacking trains deep models efficiently by increasing their depth incrementally: layers from the smaller model of an earlier stage initialize the deeper model of the next stage, which can reduce both compute and training time.
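As a minimal sketch of this staged procedure (the layer tags, stage schedule, and helper names here are illustrative assumptions, not the paper's configuration or code):

```python
# Toy sketch of gradual stacking: grow a model's depth in stages, initializing
# each deeper model from the previous stage's layers. Layers are represented
# by string tags; in practice each would be a trained transformer block.

def stack_last(layers, k):
    """Classic gradual stacking: duplicate the LAST k layers onto the top."""
    return layers + layers[-k:]

def train(layers, steps):
    """Placeholder for a training loop at the current depth (no-op here)."""
    return layers

# Assumed stage schedule for illustration: depth 4 -> 6 -> 8.
model = [f"L{i}" for i in range(4)]
model = train(model, steps=1000)
model = stack_last(model, k=2)   # depth 6: top two layers copied
model = train(model, steps=1000)
model = stack_last(model, k=2)   # depth 8
print(model)
```

The point of the sketch is the initialization pattern: each growth step copies already-trained layers rather than adding randomly initialized ones.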
Introduction and Motivation
The authors introduce a variant of gradual stacking, MIDdle grAdual Stacking, which aims not only at efficiency but also at understanding the inductive bias that stacking introduces. Traditional gradual stacking has focused primarily on efficiency, replicating the last layers of a smaller model into the larger one. This can inadvertently alter the natural roles that layers assume in transformer architectures, a concern this study addresses directly.
Methodological Advancements
The MIDdle grAdual Stacking approach is designed to better preserve the functional diversity across layers by duplicating central layers rather than terminal ones. This choice also aligns the method structurally with looped models, which share parameters across layers and can theoretically emulate iterative computation.
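A hedged sketch of how middle duplication differs from last-layer duplication; the `stack_middle` helper and the depth values are assumptions for illustration, not the paper's implementation:

```python
# Illustrative sketch (not the paper's code): duplicate k central layers in
# place, so the first and last layers keep their original roles.
def stack_middle(layers, k):
    mid = len(layers) // 2
    lo = mid - k // 2        # start of the central block to copy
    hi = lo + k              # end of the central block
    return layers[:hi] + layers[lo:hi] + layers[hi:]

base = [f"L{i}" for i in range(6)]
grown = stack_middle(base, k=2)
print(grown)  # ['L0', 'L1', 'L2', 'L3', 'L2', 'L3', 'L4', 'L5']
```

Because the duplicated block is central and contiguous, repeatedly growing the model this way resembles unrolling a looped model that reuses its middle block, while the boundary layers are left untouched.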
Empirical Findings
The authors conducted extensive empirical evaluations on models ranging from 1B to 8B parameters. The results show that MIDdle grAdual Stacking is effective both for training efficiency and for downstream task performance: it matches or surpasses baseline training on downstream tasks despite reaching similar perplexity. The gains are especially pronounced on reasoning-intensive tasks such as math word problems and reading comprehension.
Implications of Inductive Bias
The inductive bias introduced by stacking appears to confer enhanced reasoning capabilities without requiring any improvement in pretraining perplexity. This suggests that the structural constraints of gradual stacking help models distill skills more effectively from the pretraining data.
Supplementary Experiments
The study introduces synthetic tasks, termed "reasoning primitives", to isolate and measure this bias. These tasks, including induction copying and variable assignment, serve as proxies for basic reasoning operations. Models trained with MIDdle grAdual Stacking consistently outperform baselines on these primitives.
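To make the flavor of such probes concrete, here is a hypothetical generator for a variable-assignment primitive; the prompt format and the `make_variable_assignment` helper are assumptions for illustration, and the paper's tasks may differ in detail:

```python
import random

def make_variable_assignment(n_vars=3, seed=0):
    """Generate one toy variable-assignment probe: a chain of assignments
    followed by a query whose answer requires following the chain."""
    rng = random.Random(seed)
    names = [chr(ord("a") + i) for i in range(n_vars)]
    value = rng.randint(0, 9)
    lines = [f"{names[0]} = {value}"]          # ground the first variable
    for prev, cur in zip(names, names[1:]):    # chain each variable to the previous
        lines.append(f"{cur} = {prev}")
    prompt = "; ".join(lines) + f"; {names[-1]} = ?"
    return prompt, str(value)

prompt, answer = make_variable_assignment(seed=7)
print(prompt, "->", answer)
```

A probe like `a = 5; b = a; c = b; c = ?` has a single-token answer but requires the model to follow the assignment chain rather than pattern-match locally, which is what makes it useful for isolating a reasoning bias.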
Future Directions
The paper posits that further research is warranted to explore the interplay between reasoning and memorization within LLMs. The connection between stacking strategies and looped models presents an opportunity for enhancing reasoning capabilities further. Understanding how specific training regimes influence the inductive biases of neural networks could lead to the development of models with advanced contextual understanding and reasoning abilities.
In conclusion, this paper presents a significant exploration of training strategies beyond efficiency, emphasizing the importance of understanding the biases they introduce. Such insights are crucial for advancing our theoretical understanding and practical deployment of LLMs in diverse applications.