
Merino: Entropy-driven Design for Generative Language Models on IoT Devices

Published 28 Feb 2024 in cs.LG, cs.AI, and cs.CL | arXiv:2403.07921v3

Abstract: Generative LLMs stand as a revolutionary advancement in the modern era of AI. However, scaling down LLMs for resource-constrained hardware, such as Internet-of-Things (IoT) devices, requires non-trivial effort and domain knowledge. In this paper, we propose a novel information-entropy framework for designing mobile-friendly generative LLMs. The entire design procedure reduces to solving a mathematical programming (MP) problem, which can be done on a CPU within minutes, making it nearly zero-cost. We evaluate our designed models, termed MeRino, across fourteen downstream NLP tasks, showing competitive performance against state-of-the-art autoregressive transformer models in the mobile setting. Notably, MeRino achieves similar or better performance on both language modeling and zero-shot learning tasks compared to the 350M-parameter OPT, while being 4.9x faster on the NVIDIA Jetson Nano with a 5.5x reduction in model size.
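The abstract describes the design procedure as solving a mathematical programming problem on the CPU in minutes. The paper's exact objective and constraints are not given here, so the following is only a minimal sketch of what such an entropy-driven search could look like: it maximizes a toy entropy proxy (depth times log-width, loosely in the spirit of maximum-entropy NAS work) under a parameter budget, via exhaustive search. The proxy, the parameter-count formula, and the search ranges are all illustrative assumptions, not the paper's method.

```python
import math

def transformer_params(depth: int, width: int, vocab: int = 50257) -> int:
    """Rough decoder-only parameter count: token embeddings plus
    ~12 * width^2 weights per layer (attention + FFN with 4x expansion),
    a common back-of-envelope estimate -- an assumption, not MeRino's model."""
    return vocab * width + depth * 12 * width * width

def entropy_proxy(depth: int, width: int) -> float:
    """Toy entropy proxy: deeper and wider configurations score higher.
    Stands in for the paper's information-entropy objective."""
    return depth * math.log(width)

def search(budget: int):
    """Return (proxy, depth, width) maximizing the proxy under the budget.
    Exhaustive search over a small grid -- cheap enough to run on a CPU."""
    best = None
    for depth in range(2, 33):
        for width in range(128, 2049, 64):
            if transformer_params(depth, width) <= budget:
                cand = (entropy_proxy(depth, width), depth, width)
                best = cand if best is None else max(best, cand)
    return best

# e.g. pick the highest-entropy configuration under a ~60M-parameter budget
best = search(60_000_000)
```

Because the feasible grid is tiny, this kind of search completes in well under a second, which is consistent with the "nearly zero-cost" claim for MP-based design, though the real formulation is presumably a proper mathematical program rather than grid enumeration.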

