Stacking Small Language Models for Generalizability

Published 21 Oct 2024 in cs.CL, cs.AI, and cs.LG | arXiv:2410.15570v1

Abstract: Recent advances show that LLMs generalize strong performance across different natural language benchmarks. However, the large size of LLMs makes training and inference expensive and impractical to run in resource-limited settings. This paper introduces a new approach called fine-tuning stacks of language models (FSLM), which involves stacking small language models (SLMs) as an alternative to LLMs. By fine-tuning each SLM to perform a specific task, this approach breaks down high level reasoning into multiple lower-level steps that specific SLMs are responsible for. As a result, FSLM allows for lower training and inference costs, and also improves model interpretability as each SLM communicates with the subsequent one through natural language. By evaluating FSLM on common natural language benchmarks, this paper highlights promising early results toward generalizable performance using FSLM as a cost-effective alternative to LLMs.

Summary

  • The paper introduces the FSLM framework, which stacks specialized small language models to mimic LLM performance while lowering computational costs.
  • It employs a modular, interpretable design using a hierarchy of Pythia models fine-tuned for distinct reasoning tasks.
  • Experimental results show improved zero-shot accuracy on benchmarks like tinyArc and tinyMMLU, highlighting its promise in resource-limited settings.

Stacking Small Language Models for Generalizability

The paper by Laurence Liang introduces Fine-Tuning Stacks of Language Models (FSLM), a framework aimed at improving the generalizability of small language models (SLMs) in resource-constrained environments. The approach is positioned as a viable alternative to LLMs, which, despite their superior performance on various natural language benchmarks, pose significant computational and financial challenges.

Core Contributions and Methodology

The FSLM framework leverages a hierarchy of small, specialized language models, each fine-tuned for a specific task, to mimic the nuanced performance of LLMs. The design is comparable to compartmentalization in human cognition: each SLM handles a distinct aspect of the reasoning process, decreasing the computational burden typically associated with LLMs. This modular approach not only enhances interpretability, since natural language is the medium of communication between layers, but also promises reduced training and inference costs.
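The layered design described above can be sketched as a simple pipeline in which each stage reads and emits natural-language text. This is a minimal illustrative sketch, not the authors' code: the stage functions here are hypothetical stand-ins for the fine-tuned SLMs in the paper.

```python
# Minimal sketch of an FSLM-style stack: each stage is a small model,
# represented here by a plain function from text to text. Because every
# intermediate result is natural language, the whole trace can be
# inspected, which is the interpretability benefit the paper highlights.
from typing import Callable, List


def run_stack(stages: List[Callable[[str], str]], prompt: str,
              trace: bool = False) -> str:
    """Pass the prompt through each specialized stage in order."""
    text = prompt
    history = [text]
    for stage in stages:
        text = stage(text)
        history.append(text)
    if trace:
        for i, step in enumerate(history):
            print(f"stage {i}: {step}")
    return text


# Hypothetical stand-in stages; in the paper each would be a fine-tuned SLM.
extract = lambda q: f"Key facts for: {q}"
reason = lambda facts: f"Reasoning over [{facts}]"
answer = lambda chain: f"Answer based on ({chain})"

result = run_stack([extract, reason, answer], "What causes tides?")
```

Swapping any stand-in for a real model only requires that it consume and produce text, which is what makes the stack modular.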

The FSLM stack discussed in the paper consists of four Pythia models, each with 160 million parameters. Experimental results on tinyBenchmarks show that FSLM stacks perform comparably to existing models of similar scale, outperforming standalone models of equivalent parameter size on certain tasks. Notably, the FSLM stack demonstrated improved zero-shot accuracy on benchmarks such as tinyArc and tinyMMLU.
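Zero-shot accuracy on multiple-choice benchmarks of this kind is typically computed by scoring each answer option and selecting the highest-scoring one. The sketch below shows that generic recipe, not the paper's exact evaluation harness; the word-overlap scorer is a toy stand-in for a language model's log-likelihood of each option.

```python
# Generic zero-shot multiple-choice evaluation: score every option for a
# question, predict the argmax, and report the fraction answered correctly.
from typing import Callable, List, Tuple

Item = Tuple[str, List[str], int]  # (question, options, index of correct option)


def zero_shot_accuracy(items: List[Item],
                       score: Callable[[str, str], float]) -> float:
    """Fraction of items where the highest-scoring option is the gold one."""
    correct = 0
    for question, options, gold in items:
        scores = [score(question, opt) for opt in options]
        pred = max(range(len(options)), key=lambda i: scores[i])
        if pred == gold:
            correct += 1
    return correct / len(items)


# Toy scorer: counts shared words; a real run would use model likelihoods.
def overlap_score(question: str, option: str) -> float:
    return len(set(question.lower().split()) & set(option.lower().split()))


items = [
    ("What pulls objects toward Earth?",
     ["gravity pulls objects", "magnets"], 0),
    ("Which gas do plants absorb?",
     ["oxygen", "plants absorb carbon dioxide"], 1),
]
acc = zero_shot_accuracy(items, overlap_score)
```

The same `zero_shot_accuracy` function works unchanged whether the scorer wraps a single model or an entire stack, which is how stacked and standalone models can be compared on equal footing.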

Implications and Future Directions

While the results suggest that the FSLM stack is a promising approach for small-scale, interpretable language processing in limited-resource environments, there is room to refine the framework further. The possibility of competitive performance at substantially reduced model sizes marks an impactful step toward democratizing access to powerful language processing tools.

Future research should explore the integration of varied pre-training methods and datasets to evaluate their impact on FSLM's performance. Moreover, elucidating the influence of decoding settings such as sampling temperature and sampling strategy on output consistency could present opportunities for tuning the framework's effectiveness. Broadening the range of benchmarks would provide wider validation of the model's capabilities.
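The effect of sampling temperature mentioned above follows from the standard softmax-with-temperature formula, sketched here in isolation (a generic formula, not anything specific to FSLM): dividing the logits by a temperature below 1 sharpens the output distribution toward determinism, while a temperature above 1 flattens it toward uniform.

```python
# Temperature-scaled softmax over a model's output logits.
# T < 1 concentrates probability on the top token; T > 1 spreads it out.
import math
from typing import List


def softmax_with_temperature(logits: List[float], temperature: float) -> List[float]:
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.5]
sharp = softmax_with_temperature(logits, 0.5)  # peaked: top token dominates
flat = softmax_with_temperature(logits, 2.0)   # flatter: more diverse samples
```

In a stacked setting, flatter distributions at one layer produce more varied text for the next layer to consume, which is why output consistency is sensitive to this setting.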

Conclusion

In summary, by focusing on the interplay between model specialization and task decomposition, this research contributes to the ongoing exploration of efficient AI implementations. FSLM's ability to maintain accuracy while reducing computational demands highlights the potential for such stacked architectures to be advantageous in diverse applications. Continued advancements in this area could lead to notable improvements in the accessibility and applicability of LLMs in practical use cases across globally distributed, compute-constrained environments. This work stands as a compelling exploration of how small, cooperating entities can collectively approach the robustness and versatility usually reserved for their larger counterparts in the field of AI.
