The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits
Abstract: Recent research, such as BitNet, is paving the way for a new era of 1-bit LLMs. In this work, we introduce a 1-bit LLM variant, namely BitNet b1.58, in which every single parameter (or weight) of the LLM is ternary {-1, 0, 1}. It matches the full-precision (i.e., FP16 or BF16) Transformer LLM with the same model size and training tokens in terms of both perplexity and end-task performance, while being significantly more cost-effective in terms of latency, memory, throughput, and energy consumption. More profoundly, the 1.58-bit LLM defines a new scaling law and recipe for training new generations of LLMs that are both high-performance and cost-effective. Furthermore, it enables a new computation paradigm and opens the door for designing specific hardware optimized for 1-bit LLMs.
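Each ternary weight carries log2(3) ≈ 1.585 bits of information, which is where the "1.58-bit" name comes from. Below is a minimal sketch of the absmean quantization function described in the paper, which scales a weight matrix by its mean absolute value and then rounds each entry into {-1, 0, +1}; the function name, epsilon value, and NumPy usage here are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def absmean_ternary_quantize(w: np.ndarray, eps: float = 1e-5):
    """Absmean quantization sketch (per the BitNet b1.58 paper):
    scale W by its mean absolute value gamma, then round and clip
    every entry to the ternary set {-1, 0, +1}."""
    gamma = np.abs(w).mean()                          # absmean scale
    w_ternary = np.clip(np.rint(w / (gamma + eps)), -1, 1)
    return w_ternary.astype(np.int8), gamma

# Each ternary weight encodes log2(3) ~ 1.585 bits, hence "b1.58".
w = np.random.randn(4, 8).astype(np.float32)
w_q, gamma = absmean_ternary_quantize(w)
assert set(np.unique(w_q)) <= {-1, 0, 1}
print(w_q, gamma, np.log2(3))
```

Because every quantized weight is -1, 0, or +1, matrix multiplication against activations reduces to additions and subtractions with no multiplications, which is the source of the latency and energy savings the abstract claims.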
References:
- PIQA: reasoning about physical commonsense in natural language. CoRR, abs/1911.11641, 2019.
- QuIP: 2-bit quantization of large language models with guarantees. CoRR, abs/2307.13304, 2023.
- BoolQ: Exploring the surprising difficulty of natural yes/no questions. CoRR, abs/1905.10044, 2019.
- Together Computer. RedPajama: an open dataset for training large language models, 2023.
- OPTQ: accurate quantization for generative pre-trained transformers. In The Eleventh International Conference on Learning Representations, 2023.
- GPipe: Efficient training of giant neural networks using pipeline parallelism. In Advances in Neural Information Processing Systems, pages 103–112, 2019.
- Mark Horowitz. 1.1 Computing’s energy problem (and what we can do about it). In 2014 IEEE International Solid-State Circuits Conference, ISSCC 2014, Digest of Technical Papers, San Francisco, CA, USA, February 9-13, 2014, pages 10–14, 2014.
- Efficient memory management for large language model serving with PagedAttention. In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles, 2023.
- AWQ: activation-aware weight quantization for LLM compression and acceleration. CoRR, abs/2306.00978, 2023.
- Can a suit of armor conduct electricity? A new dataset for open book question answering. CoRR, abs/1809.02789, 2018.
- Pointer sentinel mixture models. CoRR, abs/1609.07843, 2016.
- The LAMBADA dataset: Word prediction requiring a broad discourse context. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, ACL 2016, August 7-12, 2016, Berlin, Germany, Volume 1: Long Papers. The Association for Computer Linguistics, 2016.
- Exploring the limits of transfer learning with a unified text-to-text transformer. CoRR, abs/1910.10683, 2019.
- RoFormer: Enhanced transformer with rotary position embedding. Neurocomputing, 568:127063, 2024.
- WinoGrande: an adversarial winograd schema challenge at scale. In The Thirty-Fourth AAAI Conference on Artificial Intelligence, pages 8732–8740, 2020.
- Noam Shazeer. GLU variants improve transformer. CoRR, abs/2002.05202, 2020.
- StableLM 3B 4E1T, 2023.
- QuIP#: Even better LLM quantization with Hadamard incoherence and lattice codebooks. CoRR, abs/2402.04396, 2024.
- LLaMA: open and efficient foundation language models. CoRR, abs/2302.13971, 2023.
- Llama 2: open foundation and fine-tuned chat models. CoRR, abs/2307.09288, 2023.
- Crowdsourcing multiple choice science questions. In Leon Derczynski, Wei Xu, Alan Ritter, and Tim Baldwin, editors, Proceedings of the 3rd Workshop on Noisy User-generated Text, NUT@EMNLP 2017, Copenhagen, Denmark, September 7, 2017, pages 94–106. Association for Computational Linguistics, 2017.
- Ladder: Efficient tensor compilation on customized data format. In OSDI, 2023.
- BitNet: Scaling 1-bit transformers for large language models. CoRR, abs/2310.11453, 2023.
- SmoothQuant: accurate and efficient post-training quantization for large language models. In International Conference on Machine Learning, ICML 2023, 23-29 July 2023, Honolulu, Hawaii, USA, 2023.
- Quick and (not so) dirty: Unsupervised selection of justification sentences for multi-hop question answering. In Kentaro Inui, Jing Jiang, Vincent Ng, and Xiaojun Wan, editors, EMNLP-IJCNLP, 2019.
- HellaSwag: can a machine really finish your sentence? In Proceedings of the 57th Conference of the Association for Computational Linguistics, pages 4791–4800, 2019.
- Root mean square layer normalization. In Hanna M. Wallach, Hugo Larochelle, Alina Beygelzimer, Florence d’Alché-Buc, Emily B. Fox, and Roman Garnett, editors, Advances in Neural Information Processing Systems, pages 12360–12371, 2019.
- PokeBNN: A binary pursuit of lightweight accuracy. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12465–12475. IEEE, 2022.