Slamming: Training a Speech Language Model on One GPU in a Day

Published 19 Feb 2025 in cs.LG, cs.AI, cs.CL, cs.SD, and eess.AS | arXiv:2502.15814v2

Abstract: We introduce Slam, a recipe for training high-quality Speech Language Models (SLMs) on a single academic GPU in 24 hours. We do so through empirical analysis of model initialisation and architecture, synthetic training data, preference optimisation with synthetic data and tweaking all other components. We empirically demonstrate that this training recipe also scales well with more compute getting results on par with leading SLMs in a fraction of the compute cost. We hope these insights will make SLM training and research more accessible. In the context of SLM scaling laws, our results far outperform predicted compute optimal performance, giving an optimistic view to SLM feasibility. See code, data, models, samples at - https://pages.cs.huji.ac.il/adiyoss-lab/slamming .

Summary

  • The paper introduces 'Slamming,' a set of training strategies enabling the creation of high-quality Speech Language Models (SLMs) on a single GPU within 24 hours.
  • Empirical results show Slamming achieves comparable or superior performance on measures such as sBLIMP and generative perplexity (GenPPL) by utilizing synthetic data and efficient model architectures.
  • This methodology significantly reduces computational barriers to SLM training, democratizing research and promoting more sustainable AI practices for academic labs and developers.

Analysis of "Slamming: Training a Speech Language Model on One GPU in a Day"

This paper introduces "Slam," a method for training high-quality Speech Language Models (SLMs) on a limited computational budget, specifically a single academic GPU within 24 hours. The authors systematically explore training strategies and techniques aimed at maximizing performance under this constraint, while also empirically demonstrating that the recipe scales to larger compute budgets.

Key Contributions and Methodology

  1. Training Strategy: The authors investigate the influence of various training components, including model initialization and architecture, synthetic training data, preference optimization, and hyperparameter tuning. They derive a comprehensive training recipe to maximize model performance while adhering to a fixed computational budget.
  2. Empirical Insights: The paper emphasizes that utilizing synthetic data and incorporating diverse efficiency optimizations can significantly enhance SLM performance. The proposed "Slamming" process explores model-initialization variants that leverage pre-trained text models, improving convergence and final performance. Key components include:
    • Utilizing TWIST initialization and Qwen2.5 architecture for improved performance.
    • Employing synthetic datasets generated via Text-to-Speech (TTS) methodologies.
    • Preference optimization through synthetic data to improve alignment with semantic tasks.
  3. Performance Evaluation: The study assesses SLM performance with several established metrics, including sBLIMP, Spoken Story Cloze (sSC), Topic Story Cloze (tSC), and generative perplexity (GenPPL). The authors benchmark their approach against existing state-of-the-art models, achieving comparable or superior outcomes with significantly reduced computational resources.
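The TWIST-style initialization mentioned above can be illustrated with a minimal sketch: reuse a pre-trained text LM's transformer body, but swap the text-token embedding table (and tied LM head) for a fresh one sized to the discrete speech-unit vocabulary. All module names and sizes below are illustrative assumptions, not the paper's actual code.

```python
# Hedged sketch of TWIST-style initialization: copy the transformer body of a
# text LM into a speech LM whose embeddings cover speech units instead of text.
import torch
import torch.nn as nn

class TinyCausalLM(nn.Module):
    def __init__(self, vocab_size: int, d_model: int = 64, n_layers: int = 2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.body = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.lm_head = nn.Linear(d_model, vocab_size, bias=False)
        self.lm_head.weight = self.embed.weight  # weight tying

    def forward(self, tokens):
        return self.lm_head(self.body(self.embed(tokens)))

def twist_init(text_lm: TinyCausalLM, speech_vocab_size: int) -> TinyCausalLM:
    """Build a speech LM whose transformer body is copied from the text LM."""
    d_model = text_lm.embed.embedding_dim
    speech_lm = TinyCausalLM(speech_vocab_size, d_model=d_model)
    speech_lm.body.load_state_dict(text_lm.body.state_dict())  # reuse body
    return speech_lm  # embedding and LM head stay freshly initialized

text_lm = TinyCausalLM(vocab_size=32000)  # stands in for e.g. Qwen2.5-0.5B
speech_lm = twist_init(text_lm, speech_vocab_size=500)  # e.g. 500 speech units
```

In the paper's setting the text LM would be a real pre-trained checkpoint such as Qwen2.5-0.5B and the speech vocabulary a set of discrete acoustic units; here both are toy stand-ins.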

Notable Results

  • The Slam methodology demonstrates that SLMs can be trained effectively on a highly constrained budget. Using the Qwen2.5-0.5B model initialized through TWIST, the authors report improvements across the board on established SLM benchmarks.
  • The inclusion of synthetic training data, such as sTinyStories, was found to significantly boost both modeling and generative performance, showcasing the utility of synthetic datasets in low-compute setups.
  • Performance comparisons show that Slam not only rivals but can exceed the compute-optimal performance predicted by traditional SLM scaling laws.
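The generative perplexity (GenPPL) metric used above scores the transcript of a model's generated speech under a judge text LM; the final step reduces to exponentiating the mean negative log-likelihood. A minimal sketch of that scoring step (the generation, transcription, and judge model are assumed and not shown):

```python
# Hedged sketch of the GenPPL scoring step: given per-token log-probabilities
# a judge text LM assigns to the transcript of a generated continuation,
# perplexity is exp of the mean negative log-likelihood.
import math

def perplexity(token_logprobs: list[float]) -> float:
    nll = -sum(token_logprobs) / len(token_logprobs)  # mean negative log-lik.
    return math.exp(nll)

# A uniform distribution over 10 tokens gives log-prob ln(1/10) per token,
# so perplexity comes out to 10 (up to float rounding).
uniform = [math.log(0.1)] * 5
print(perplexity(uniform))  # approximately 10
```

Lower GenPPL indicates generations whose transcripts the judge LM finds more predictable, i.e. more fluent continuations.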

Implications and Future Directions

The results presented in this paper have profound implications for making SLM training more democratized and accessible, allowing smaller academic labs to participate in cutting-edge speech model research. The open-source release of code and data facilitates further exploration and validation in broader contexts, potentially spurring innovation in the SLM field.

The findings underscore the importance of adaptive learning strategies and the growing relevance of synthetic data in training SLMs efficiently. Developers and researchers could leverage these insights to optimize training pipelines under constrained environments, promoting more sustainable AI practices.
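The preference-optimization component of the recipe is commonly realized as a DPO-style objective over chosen/rejected pairs; the paper's exact objective and data pipeline may differ, but a minimal sketch of such a loss, assuming summed per-sequence log-probabilities under the policy and a frozen reference model, is:

```python
# Hedged sketch of a DPO-style preference-optimization loss over one
# synthetic preference pair. Inputs are summed log-probabilities of the
# chosen/rejected continuations under the policy and a frozen reference.
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log sigmoid(margin)

# When the policy prefers the chosen sample more strongly than the reference
# does, the margin is positive and the loss drops below ln(2).
print(dpo_loss(-10.0, -20.0, -15.0, -15.0))
```

At a margin of zero (policy and reference agree) the loss equals ln 2; training pushes the policy to widen the chosen-vs-rejected gap relative to the reference.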

Conclusion

The paper provides a valuable contribution to the field of SLM research, challenging traditional notions of resource-intensive model training. By aligning training efficiency with advanced techniques such as preference optimization and exploiting synthetic data, the study carves out a pragmatic path for the future of speech-language modeling.

In sum, the "Slamming" approach delineated in this work not only reduces the cost of and barriers to SLM training but also paves the way for scalable, feasible exploration in the field, encouraging new norms in resource allocation for AI research.
