- The paper introduces piecewise affine operations to fully replace multiplications in transformer training.
- It applies the method to language and vision models, achieving competitive accuracy with reduced computational cost.
- The approach offers significant energy and hardware efficiency benefits, suggesting promising developments for low-power AI accelerators.
The paper under review introduces a novel approach to reducing the computational cost of neural network training by eliminating multiplications, which dominate the arithmetic cost of training. The authors propose piecewise affine approximations, a technique that replaces exact multiplications with simpler integer operations, to achieve this reduction. They demonstrate the strategy on transformer architectures, which are pivotal in both vision and language processing, without substantial degradation in model performance.
Key Contributions
The primary innovation in the paper is the substitution of standard multiplications in neural networks with piecewise affine operations. These operations add the integer bit representations of floating-point numbers, a technique inspired by prior work on deep learning without multiplications. The authors extend this idea to cover all operations within the transformer architecture, including non-linear activations, and notably achieve a training process entirely free of traditional multiplications: the forward pass, the backward pass, and the optimizer updates.
The proposed piecewise affine multiplication (PAM) approximates standard multiplication by operating implicitly in the logarithmic domain: adding the bit patterns adds the exponents exactly and the mantissas approximately. In hardware terms, this method promises significant energy and area savings, particularly beneficial for high-throughput applications like LLM training.
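The bit-pattern trick can be sketched in a few lines of NumPy. This is a minimal illustration of the general technique, not the paper's exact implementation; the function name is assumed, and for simplicity it handles only positive float32 inputs:

```python
import numpy as np

def pam_mul(a, b):
    """Approximate multiplication of positive float32 values by adding
    their IEEE-754 bit patterns as integers (a sketch, not the paper's
    exact code). Exponents add exactly; mantissas add approximately in
    log space. Subtracting the offset 0x3F800000 (the bit pattern of
    1.0) re-centres the exponent bias."""
    ia = np.asarray(a, dtype=np.float32).view(np.uint32)
    ib = np.asarray(b, dtype=np.float32).view(np.uint32)
    return float((ia + ib - np.uint32(0x3F800000)).view(np.float32))
```

The approximation is exact whenever one operand's mantissa is zero (i.e., a power of two), and its relative error is bounded by roughly 11% in the worst case; for example, `pam_mul(3.0, 2.0)` returns exactly 6.0, while `pam_mul(1.5, 1.5)` returns 2.0 instead of 2.25.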
Experimental Results
The authors conduct experiments on several benchmarks to evaluate their approach. On the IWSLT14 German-to-English translation task, they observe minimal performance impact compared to a baseline transformer using standard arithmetic. Furthermore, the technique extends successfully to vision tasks such as CIFAR-10 and ImageNet classification with a Vision Transformer (DeiT-Tiny), demonstrating its versatility.
Comparison with alternative approaches like AdderNet shows that the proposed method maintains competitive accuracy while offering the potential for better efficiency in hardware implementations. This aligns with the theoretical expectation that replacing multiplications with approximate arithmetic can reduce computational cost without significant loss of accuracy.
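For contrast, AdderNet takes a different route: rather than approximating multiplication, it replaces the dot-product similarity with a negative L1 distance. A minimal sketch (the function name and toy inputs are illustrative assumptions):

```python
import numpy as np

def adder_layer(x, W):
    """AdderNet-style layer: replaces the dot product x @ W with the
    negative L1 distance -sum_i |x_i - W_ij|, so the forward pass uses
    only subtractions, absolute values, and additions."""
    return -np.abs(x[:, None] - W).sum(axis=0)

x = np.array([1.0, 2.0])
W = np.array([[1.0, 0.0],
              [2.0, 0.0]])
# The first weight column matches x exactly, so it scores 0 (the
# maximum); the mismatched second column scores lower.
out = adder_layer(x, W)
```

The design difference matters for the present paper: AdderNet changes the similarity measure itself, whereas PAM keeps the standard multiplicative structure and only approximates each product.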
Implications and Future Directions
This research pushes the boundary of computational efficiency in neural network training, particularly for hardware-constrained environments or large-scale models. Reducing the reliance on multiplications not only lowers energy consumption but also potentially mitigates the environmental footprint associated with training LLMs.
From a theoretical perspective, the resulting networks, being compositions of piecewise affine transformations, have distinctive gradient and optimization-landscape properties. While this paper shows promising results with transformers, future research could explore the impact of such architectural changes on deeper or more complex models and on other neural architectures, such as recurrent networks or graph neural networks.
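The piecewise affine structure is easy to observe directly: with one operand fixed, the bit-pattern approximation is exactly linear between mantissa-carry boundaries, with a slope change (a "kink") at each boundary. The helper below repeats the approximation purely so the snippet is self-contained, and the particular segment boundaries shown are illustrative:

```python
import numpy as np

def pam_mul(a, b):
    # Bit-pattern addition approximation to multiplication
    # (positive float32 only), repeated here for illustration.
    ia = np.asarray(a, dtype=np.float32).view(np.uint32)
    ib = np.asarray(b, dtype=np.float32).view(np.uint32)
    return float((ia + ib - np.uint32(0x3F800000)).view(np.float32))

c = 1.25
# Between carry boundaries, x -> pam_mul(x, c) is affine:
# on [1.0, 1.75) it equals x + 0.25 (slope 1) ...
seg1 = [pam_mul(x, c) for x in (1.0, 1.25, 1.5)]  # 1.25, 1.5, 1.75
# ... while at x = 1.75 the mantissa sum carries into the exponent,
# and on [1.75, 2.0) the slope doubles: pam_mul(x, c) = 2x - 1.5.
seg2 = [pam_mul(x, c) for x in (1.75, 1.875)]     # 2.0, 2.25
```

Gradients of such a function are piecewise constant in each operand, which is precisely the kind of non-standard optimization landscape the review's theoretical remark points to.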
The successful implementation of PAM hinges significantly on hardware support, suggesting that developments in field-programmable gate arrays (FPGAs) and dedicated low-power AI accelerators could further enhance the viability of multiplication-free training. This opens an avenue for co-designing algorithms and hardware to fully exploit the benefits of reduced computational complexity.
Conclusion
The paper presents a compelling framework for training transformers without multiplications by employing piecewise affine operations, a practical and efficient approach to modern neural network training. The arithmetic simplification offers not only a means to economize computation but also a new perspective on neural network design principles. While hardware implementations have yet to fully catch up with these algorithmic advances, the potential for impactful applications across domains is significant, warranting further exploration of multiplication-free machine learning models.