ReLoRA: High-Rank Training Through Low-Rank Updates

Published 11 Jul 2023 in cs.CL and cs.LG | (2307.05695v4)

Abstract: Despite the dominance and effectiveness of scaling, resulting in large networks with hundreds of billions of parameters, the necessity to train overparameterized models remains poorly understood, while training costs grow exponentially. In this paper, we explore parameter-efficient training techniques as an approach to training large neural networks. We introduce a novel method called ReLoRA, which utilizes low-rank updates to train high-rank networks. We apply ReLoRA to training transformer LLMs with up to 1.3B parameters and demonstrate comparable performance to regular neural network training. ReLoRA saves up to 5.5Gb of RAM per GPU and improves training speed by 9-40% depending on the model size and hardware setup. Our findings show the potential of parameter-efficient techniques for large-scale pre-training.

Summary

The paper introduces ReLoRA, a method that incrementally updates network parameters with low-rank approximations to achieve efficient high-rank training.
It employs periodic resets and a customized learning rate schedule to overcome training challenges and maintain performance similar to full network updates.
Experiments with transformer models show reduced GPU memory usage and faster training times, making the approach sustainable and scalable.

Introducing ReLoRA

Research in AI shows a trend toward training ever larger networks, which requires vast computational resources. This paper presents a more efficient way to train these overparameterized models: ReLoRA, a method built on LoRA (Low-Rank Adaptation). ReLoRA trains large, high-rank neural networks by applying a sequence of low-rank updates.

The Mechanics of ReLoRA

ReLoRA rests on the fact that the rank of a sum of two matrices is at most the sum of their ranks, so a sum of several low-rank updates can have a higher rank than any single one. The method starts from LoRA, a low-rank parameterization technique, and applies low-rank updates to the network parameters one after another. Each update is merged into the network's weights and the trainable low-rank parameters are reinitialized, which incrementally raises the effective rank of the total update.
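The rank argument can be checked numerically. The sketch below is an illustration, not the paper's code: it uses random matrices as a stand-in for trained updates, and the names `low_rank_update` and `accumulated_ranks` and the sizes (64 by 64, rank 4) are assumptions chosen for the example.

```python
import numpy as np

def low_rank_update(rng, dim, rank):
    """Return a random dim x dim matrix of the given rank, built as B @ A."""
    B = rng.standard_normal((dim, rank))
    A = rng.standard_normal((rank, dim))
    return B @ A

def accumulated_ranks(dim=64, rank=4, num_restarts=4, seed=0):
    """Rank of the running sum of low-rank updates after each restart."""
    rng = np.random.default_rng(seed)
    total = np.zeros((dim, dim))
    ranks = []
    for _ in range(num_restarts):
        # Each restart contributes one more rank-`rank` term to the sum.
        total += low_rank_update(rng, dim, rank)
        ranks.append(int(np.linalg.matrix_rank(total)))
    return ranks

ranks = accumulated_ranks()
print(ranks)
```

With independent random factors the sum reaches the upper bound, so the rank grows by 4 with every restart. Updates learned during real training need not be independent, so in practice the bound is a ceiling rather than a guarantee.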

ReLoRA also modifies the standard optimization procedure to accommodate this update process. At specified intervals it resets both the trainable parameters and the optimizer states, and it uses a customized learning rate schedule. These changes address the training difficulties that the restarts would otherwise introduce.
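A minimal sketch of one restart, under stated assumptions, might look as follows. The function names, the zero initialization of `B`, the full zeroing of optimizer moments, and the shape of the schedule (cosine decay with a short linear re-warm-up after each reset) are all illustrative choices, not details confirmed by this summary.

```python
import math
import numpy as np

def merge_and_reinit(W, B, A, rng, init_scale=0.01):
    """Fold the low-rank product into W, then restart the factors.

    B is reset to zeros so that B @ A == 0 and the effective weight
    W + B @ A is unchanged right after the restart (assumed here,
    following the usual LoRA initialization).
    """
    merged = W + B @ A
    new_A = rng.standard_normal(A.shape) * init_scale
    new_B = np.zeros_like(B)
    return merged, new_B, new_A

def reset_moments(moments):
    """Zero every optimizer moment estimate (a simplifying assumption)."""
    return {name: np.zeros_like(value) for name, value in moments.items()}

def jagged_lr(step, peak_lr, total_steps, reset_every, rewarm_steps):
    """Hypothetical schedule: cosine decay with a linear re-warm-up after each reset."""
    cosine = 0.5 * peak_lr * (1.0 + math.cos(math.pi * step / total_steps))
    steps_since_reset = step % reset_every
    warmup = min(1.0, (steps_since_reset + 1) / rewarm_steps)
    return cosine * warmup

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 8))
B = rng.standard_normal((8, 2))
A = rng.standard_normal((2, 8))

effective_before = W + B @ A
W, B, A = merge_and_reinit(W, B, A, rng)
effective_after = W + B @ A

moments = reset_moments({"A": np.ones((2, 8)), "B": np.ones((8, 2))})
lr_before_reset = jagged_lr(99, 1e-3, 1000, 100, 10)
lr_at_reset = jagged_lr(100, 1e-3, 1000, 100, 10)
```

The restart leaves the effective weight untouched while freeing the low-rank factors to learn a new direction. Dropping the learning rate at each reset is one plausible way to keep stale or freshly zeroed optimizer statistics from producing an oversized first step.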

Experimentation and Findings

ReLoRA was tested on transformer language models with up to 1.3 billion parameters. Although fewer parameters are trainable during most of the training process, ReLoRA achieved performance comparable to regular full-network training. According to the abstract, it saved up to 5.5Gb of RAM per GPU and improved training speed by 9-40%, depending on the model size and hardware setup.
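To give a feel for the reduction in trainable parameters, the arithmetic below compares a dense weight matrix with a low-rank factorization of the same shape. The sizes (4096 by 4096, rank 128) are hypothetical and are not taken from the paper.

```python
def dense_params(d_in, d_out):
    """Number of entries in a full d_out x d_in weight matrix."""
    return d_in * d_out

def low_rank_params(d_in, d_out, rank):
    """Number of entries in the factors B (d_out x rank) and A (rank x d_in)."""
    return rank * (d_in + d_out)

d_in, d_out, rank = 4096, 4096, 128  # hypothetical sizes for illustration
full = dense_params(d_in, d_out)
low = low_rank_params(d_in, d_out, rank)
fraction = low / full
print(full, low, fraction)
```

In this example the low-rank factors hold one sixteenth of the entries of the dense matrix, which is why gradients and optimizer states for them take correspondingly less memory.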

Sustainable and Scalable AI Training

This method offers a less costly way to train large neural networks. By combining full-rank training early on with low-rank updates afterward, ReLoRA reduces memory use and increases training speed. The benefits become more pronounced on less advanced hardware, which makes the approach usable by a broader range of AI research groups.

In conclusion, ReLoRA applies parameter-efficient techniques, best known from fine-tuning, to large-scale pre-training. As the research community continues to scale AI models, it offers a promising path to more accessible and sustainable training of large neural networks.