Ultra Memory-Efficient On-FPGA Training of Transformers via Tensor-Compressed Optimization

Published 11 Jan 2025 in cs.LG, cs.AR, and cs.CL | (2501.06663v2)

Abstract: Transformer models have achieved state-of-the-art performance across a wide range of machine learning tasks. There is growing interest in training transformers on resource-constrained edge devices due to considerations such as privacy, domain adaptation, and on-device scientific machine learning. However, the significant computational and memory demands required for transformer training often exceed the capabilities of an edge device. Leveraging low-rank tensor compression, this paper presents the first on-FPGA accelerator for end-to-end transformer training. On the algorithm side, we present a bi-directional contraction flow for tensorized transformer training, significantly reducing the computational FLOPs and intra-layer memory costs compared to existing tensor operations. On the hardware side, we store all highly compressed model parameters and gradient information on chip, creating an on-chip-memory-only framework for each stage in training. This reduces off-chip communication and minimizes latency and energy costs. Additionally, we implement custom computing kernels for each training stage and employ intra-layer parallelism and pipelining to further enhance run-time and memory efficiency. Through experiments on transformer models within $36.7$ to $93.5$ MB using FP-32 data formats on the ATIS dataset, our tensorized FPGA accelerator could conduct single-batch end-to-end training on the AMD Alveo U50 FPGA, with a memory budget of less than $6$-MB BRAM and $22.5$-MB URAM. Compared to uncompressed training on the NVIDIA RTX 3090 GPU, our on-FPGA training achieves a memory reduction of $30\times$ to $51\times$. Our FPGA accelerator also achieves up to $3.6\times$ less energy cost per epoch compared with tensor Transformer training on an NVIDIA RTX 3090 GPU.


Summary

The paper introduces an FPGA accelerator that employs low-rank tensor compression to significantly reduce memory and computational demands during transformer training.
The innovative bi-directional tensor contraction method minimizes sequential operations, enhancing parallelism and computational efficiency.
The hardware implementation achieves a 30-51x reduction in memory usage compared to uncompressed GPU training, and up to 3.6x lower energy cost per epoch than tensor-compressed transformer training on the same GPU.


Introduction

The paper presents an FPGA accelerator tailored for training transformer models using low-rank tensor compression. This work addresses the computational and memory constraints typically associated with transformer models, enabling their training on resource-constrained devices such as FPGAs. Specifically, the research introduces a bi-directional contraction flow to minimize FLOPs and intra-layer memory requirements during tensorized transformer training. Custom computing kernels and on-chip memory storage further enhance the accelerator's efficacy. The system demonstrates significant memory and energy savings compared to traditional training methods on GPUs.

Tensorized Transformer Architecture

The accelerator leverages the inherent structure of transformers, employing tensor decompositions to reduce the dimensionality and storage requirements of the weight matrices (Figure 1). Transformer models, primarily composed of self-attention layers and feed-forward networks, benefit from the tensor-train (TT) and tensor-train-matrix (TTM) formats to compress and store parameters efficiently. These formats replace standard matrix operations with tensor contractions, which significantly decrease both on-chip memory usage and computational costs.

Figure 1: Transformer structure for classification tasks. Inter-layer activation is represented using yellow blocks, embedding tables and linear layer weights are in purple blocks, and non-linear functions are in white blocks.
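To make the storage savings of the TT and TTM formats concrete, the following sketch counts parameters for a single weight matrix. It assumes the standard definitions: a TT format with cores of shape $(r_{k-1}, d_k, r_k)$ and a TTM format with cores of shape $(r_{k-1}, m_k, n_k, r_k)$. The 768 x 768 layer, its factorization, and the uniform rank of 12 are illustrative assumptions, not values confirmed by this summary, and the helper names are hypothetical.

```python
from math import prod


def tt_params(dims, ranks):
    """Parameter count of a TT format with cores of shape (r[k], dims[k], r[k+1])."""
    assert len(ranks) == len(dims) + 1 and ranks[0] == ranks[-1] == 1
    return sum(ranks[k] * dims[k] * ranks[k + 1] for k in range(len(dims)))


def ttm_params(m, n, ranks):
    """Parameter count of a TTM format with cores of shape (r[k], m[k], n[k], r[k+1])."""
    assert len(m) == len(n) and len(ranks) == len(m) + 1
    return sum(ranks[k] * m[k] * n[k] * ranks[k + 1] for k in range(len(m)))


# Illustrative (assumed) factorization of a 768 x 768 weight matrix.
m, n, rank = (12, 8, 8), (8, 8, 12), 12

dense = prod(m) * prod(n)
# TT treats the matrix as one tensor of order 2d with modes m_1..m_d, n_1..n_d.
tt = tt_params(m + n, (1,) + (rank,) * 5 + (1,))
# TTM keeps d cores, each carrying one (m_k, n_k) pair.
ttm = ttm_params(m, n, (1, rank, rank, 1))

print(f"dense: {dense}, TT: {tt} ({dense / tt:.1f}x), TTM: {ttm} ({dense / ttm:.1f}x)")
```

Under these assumed shapes the TT format needs fewer parameters than TTM at equal rank, because each TT core carries a single mode rather than an input/output pair.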

Bi-directional Contraction Scheme

The bi-directional tensor-train (BTT) contraction method improves computational efficiency by reducing the number of sequential contraction steps required in tensor operations (Figure 2). Traditional methods adopt a right-to-left contraction sequence, which limits parallelism. BTT instead contracts the left and right tensor cores concurrently towards the middle, which reduces the number of computational stages from $2d$ to $d+1$ and lowers both computational complexity and memory cost.

Figure 2: Comparison of the computing flow of the TT-format and our modified BTT forward propagation when $d=2$. Contraction operations are represented in blue multipliers. Here white nodes represent input tensor $X$ and output tensor $Y$.
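The difference between the two contraction orders can be illustrated with a small NumPy sketch for $d=2$. This is one reading of the scheme described above, not the paper's kernel: the core shapes, sizes, and variable names are assumptions chosen for illustration, and boundary ranks of 1 are squeezed away. The sequential flow takes four dependent steps ($2d$), while the bi-directional flow merges the left and right core pairs independently (one parallel stage) and then applies two matrix products, for three stages ($d+1$).

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed sizes: output dim M = m1*m2, input dim N = n1*n2, uniform TT rank r.
m1, m2, n1, n2, r, seq_len = 4, 3, 3, 4, 2, 5

# Four TT cores for d = 2.
g1 = rng.standard_normal((m1, r))       # (m1, r1)
g2 = rng.standard_normal((r, m2, r))    # (r1, m2, r2)
g3 = rng.standard_normal((r, n1, r))    # (r2, n1, r3)
g4 = rng.standard_normal((r, n2))       # (r3, n2)

x = rng.standard_normal((n1 * n2, seq_len))  # input activations, one column per token

# Reference: rebuild the dense weight and multiply (what compression avoids storing).
w = np.einsum("ia,ajb,bkc,cl->ijkl", g1, g2, g3, g4).reshape(m1 * m2, n1 * n2)
y_dense = w @ x

# Sequential right-to-left flow: 2d = 4 dependent steps, each touching seq_len.
t = x.reshape(n1, n2, seq_len)
t = np.einsum("cl,kls->kcs", g4, t)     # step 1 -> (n1, r3, S)
t = np.einsum("bkc,kcs->bs", g3, t)     # step 2 -> (r2, S)
t = np.einsum("ajb,bs->ajs", g2, t)     # step 3 -> (r1, m2, S)
t = np.einsum("ia,ajs->ijs", g1, t)     # step 4 -> (m1, m2, S)
y_seq = t.reshape(m1 * m2, seq_len)

# Bi-directional flow: the two merges are independent of each other and of x.
left = np.einsum("ia,ajb->ijb", g1, g2).reshape(m1 * m2, r)    # stage 1 (left half)
right = np.einsum("bkc,cl->bkl", g3, g4).reshape(r, n1 * n2)   # stage 1 (right half)
h = right @ x                                                  # stage 2 -> (r2, S)
y_btt = left @ h                                               # stage 3 -> (M, S)

print(np.allclose(y_seq, y_dense), np.allclose(y_btt, y_dense))
```

In this sketch only the last two stages depend on the sequence length, and the only intermediate that scales with it has just `r` rows, which is consistent with the reduced intra-layer memory described above.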

Key Hardware Implementation

The FPGA implementation integrates a memory management strategy that optimizes on-chip memory usage by grouping tensor cores to maximize BRAM utilization (Figure 3). The accelerator supports tensor-compressed forward and backward propagation, facilitated by the innovative dataflow and kernel execution schemes. This strategy, alongside pipelining and kernel fusion, enhances the parallelism of tensor operations and reduces overall latency.

Figure 3: Configurations of a BRAM 36K block, the number of BRAM blocks in one array, and BRAM usage efficiency before and after tensor grouping.
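A rough sketch of why grouping small tensor cores can raise BRAM utilization is shown below. It assumes each 36 Kb block is used in its 1K x 36 configuration, holding 1024 FP32 words, and that an array occupies a whole number of blocks. The six core sizes are hypothetical, and the grouping policy here (packing all cores into a single array) is a simplification, not the paper's actual strategy.

```python
from math import ceil

# Assumption: 1K x 36 BRAM configuration, one 32-bit word per row.
WORDS_PER_BRAM36 = 1024

# Hypothetical tensor-core sizes in FP32 words.
core_words = [144, 1152, 1152, 1152, 1152, 144]


def blocks_needed(words):
    """Whole BRAM 36K blocks required to hold one array of the given size."""
    return ceil(words / WORDS_PER_BRAM36)


def efficiency(blocks):
    """Fraction of the allocated BRAM capacity that actually holds data."""
    return sum(core_words) / (blocks * WORDS_PER_BRAM36)


# One array per core: every core rounds up to whole blocks on its own.
separate = sum(blocks_needed(w) for w in core_words)
# All cores packed into one array: only the total rounds up.
grouped = blocks_needed(sum(core_words))

print(f"separate: {separate} blocks, {efficiency(separate):.1%} used")
print(f"grouped:  {grouped} blocks, {efficiency(grouped):.1%} used")
```

A real design also has to balance grouping against the memory ports needed for parallel access, which this sketch does not model.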

Performance Evaluation

Experiments demonstrate that the FPGA accelerator reduces memory usage by a factor of 30 to 51 compared to uncompressed training on an NVIDIA RTX 3090 GPU, and achieves up to 3.6 times lower energy cost per training epoch than tensor-compressed transformer training on the same GPU (Figure 4). These metrics underline the practical viability of training tensorized transformers on edge devices without the prohibitive energy and memory footprints of conventional approaches.

Figure 4: Computational and memory costs of TTM-based contraction, TT-based contraction and our BTT-based contraction corresponding to different sequence lengths (top) and ranks (bottom).

Conclusion

This study effectively demonstrates the potential of low-rank tensor compression for on-FPGA training of transformer models. By addressing both computational and memory constraints, the proposed accelerator bridges the gap between high-efficiency training and edge deployment, paving the way for future research in tensor-based model optimization on resource-limited platforms. The preliminary results encourage further exploration into scaling these techniques to more complex models and real-world applications.
