TernaryLLM: Low-Bit Language Models

Updated 12 December 2025
  • TernaryLLM is a large language model with weights quantized to {-1, 0, +1}, drastically reducing memory footprint and eliminating most floating-point multiplications.
  • It leverages advanced post-training and quantization-aware training methods along with innovative packing schemes and hardware accelerators for efficient inference.
  • Empirical benchmarks show TernaryLLMs retain over 90% of baseline accuracy while achieving significant speedups and energy efficiency gains across diverse platforms.

A TernaryLLM is an LLM in which the majority of weights are quantized to a ternary alphabet, typically $\{-1, 0, +1\}$, and encoded using dense low-bit representations (e.g., 1.6 or 2 bits/weight). These models achieve a drastic reduction in memory footprint and remove most floating-point multiplications from inference, while preserving a high degree of model expressiveness and accuracy. TernaryLLMs exploit advances in post-training quantization, quantization-aware training, hardware design (CPU, GPU, FPGA, ASIC), and information-theoretically motivated schemes to realize LLM inference at orders-of-magnitude lower computational cost than full-precision or even 4-bit models.

1. Mathematical Foundations of Ternary Quantization

The core operation in TernaryLLMs is the quantization of neural weights to the ternary set $\{-1, 0, +1\}$. The forward pass of a linear or projection layer with floating-point weights $W\in\mathbb{R}^{n\times d}$ is approximated as:

$$W \approx \widetilde{W} = \alpha \cdot T$$

where $T \in \{-1,0,1\}^{n\times d}$ and $\alpha$ is a learnable or derived scaling factor (often applied per-row, per-column, or per-group) (Chen et al., 2024, Xiao et al., 21 Sep 2025, Vaidhya et al., 28 Jun 2025).

Several quantization procedures are in use:

  • Hard thresholding: $T_{ij} = \text{sign}(W_{ij})$ if $|W_{ij}| > \Delta$, $T_{ij}=0$ otherwise. The scale $\alpha$ is set to minimize the reconstruction error $\|W - \alpha T\|_F^2$ (Qiao et al., 22 Apr 2025, Yin et al., 23 Feb 2025).
  • Dual Learnable Ternarization (DLT): Both scale $\alpha$ and shift $\gamma$ parameters are learned for each group, allowing the quantized-and-reconstructed weight to be $\alpha T + \gamma$ (Chen et al., 2024).
  • Structured Trit-Plane Decomposition: Advanced schemes such as PTQTP represent every row of $W$ as a sum of two ternary planes weighted by learned scales:

$$W_i \approx \alpha_i^{(1)} T_i^{(1)} + \alpha_i^{(2)} T_i^{(2)}$$

yielding an effective storage of $2 \times \log_2 3 \approx 3.17$ bits/weight, or 1.585 bits per plane (Xiao et al., 21 Sep 2025).

  • Signed-Zero Ternary (SZT): Encodes four states (using two bits), allowing additional sign information for sub-threshold weights, improving gradient flow and information density (Uhlmann, 8 Aug 2025).
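
The hard-thresholding and two-plane ideas above can be illustrated with a minimal pure-Python sketch. It makes several assumptions that are not taken from the cited papers: the threshold is chosen as 0.7 times the mean absolute weight (a common heuristic from the ternary weight network literature), a single scale is used per vector, and the second plane is obtained greedily by ternarizing the residual of the first. The function names `ternarize`, `two_plane`, and `sq_error` are hypothetical, and this is not the PTQTP algorithm itself.

```python
def ternarize(w, delta_ratio=0.7):
    """Threshold-ternarize a list of floats; returns (alpha, trits).

    delta_ratio is an assumed heuristic, not a value from the source.
    """
    mean_abs = sum(abs(x) for x in w) / len(w)
    delta = delta_ratio * mean_abs
    # T_i = sign(w_i) if |w_i| > delta, else 0
    trits = [(1 if x > 0 else -1) if abs(x) > delta else 0 for x in w]
    kept = [abs(x) for x, t in zip(w, trits) if t != 0]
    # For fixed T, the alpha minimizing ||w - alpha*T||^2 is the mean
    # magnitude of the weights that were not zeroed out.
    alpha = sum(kept) / len(kept) if kept else 0.0
    return alpha, trits


def two_plane(w, delta_ratio=0.7):
    """Greedy two-plane fit: ternarize w, then ternarize the residual."""
    a1, t1 = ternarize(w, delta_ratio)
    residual = [x - a1 * t for x, t in zip(w, t1)]
    a2, t2 = ternarize(residual, delta_ratio)
    return (a1, t1), (a2, t2)


def sq_error(w, approx):
    """Squared reconstruction error between two equal-length lists."""
    return sum((x - y) ** 2 for x, y in zip(w, approx))


w = [0.9, -0.05, -1.1, 0.4, 0.02, -0.6]
(a1, t1), (a2, t2) = two_plane(w)
one_plane = [a1 * t for t in t1]
both_planes = [a1 * p + a2 * q for p, q in zip(t1, t2)]
print(t1, round(a1, 4))
print(sq_error(w, one_plane), sq_error(w, both_planes))
```

Because the second scale is the least-squares optimum for its plane, adding the second plane can never increase the reconstruction error in this sketch; learned or jointly optimized scales, as in the cited methods, would generally do better.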

Activations are usually left in higher precision (e.g., FP16 or INT8), as quantizing activations to ternary remains an outstanding challenge due to heavy-tailed distributions and significant dynamic range (Chen et al., 2024, Xiao et al., 21 Sep 2025).
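
To make the dynamic-range problem concrete, the following sketch applies symmetric absmax INT8 quantization to one activation vector. This particular scheme is an assumption chosen for illustration (the text only states that activations stay in FP16 or INT8), and the names `quantize_int8` and `dequantize` are hypothetical.

```python
def quantize_int8(x):
    """Symmetric absmax quantization of one activation vector to INT8."""
    max_abs = max(abs(v) for v in x)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    # Round to the nearest integer level and clamp to [-127, 127].
    q = [max(-127, min(127, round(v / scale))) for v in x]
    return scale, q


def dequantize(scale, q):
    """Map integer levels back to approximate real values."""
    return [scale * v for v in q]


# A single large outlier stretches the scale so far that the small
# entries all land on level 0, even with 255 levels available.
scale, q = quantize_int8([0.01, 0.02, -0.03, 8.0])
print(q)
```

With only three levels instead of 255, the same outlier effect is far more severe, which is one way to read the difficulty of ternary activations described above.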

2. Quantization Methodologies: Post-Training and Quantization-Aware Training

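Two principal quantization strategies are prominent: post-training quantization (PTQ), which ternarizes an already-trained model, and quantization-aware training (QAT), which trains or fine-tunes the model with ternarization in the loop.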

Knowledge distillation and fine-tuning techniques (e.g., LoTA-QAF) employ low-rank trainable adapters in the ternary domain, supporting lossless merging and integer-only inference (Chen et al., 24 May 2025).

3. Packing Schemes and Hardware Implementation

Efficiently storing and operating over ternary weights is critical for realizing the theoretical savings. Key approaches:

  • Bit-packing: Blocks of 5 ternary values ($3^5 = 243 \le 2^8 = 256$) are packed into a single 8-bit byte, yielding 1.6 bits/weight ("TQ1" scheme); using two bits per value ("TQ2") reaches 2 bits/weight. These methods are implemented both on CPUs and GPUs for fast unpacking and high memory bandwidth utilization (Vaidhya et al., 28 Jun 2025, Huang et al., 17 Sep 2025).
  • Matrix multiplication (GEMM/GEMV): Inference kernels are redesigned to exploit the ternary structure:
  • Indexing algorithms: For fixed ternary weight matrices, block-indexed GEMV algorithms reduce time and memory cost by precomputing permutation and segmentation indices, with reported speedups and memory reductions in software-only settings (Dehghankar et al., 2024).

| Packing Method | Bits/Weight | Packing Unit | Main Platform |
|---|---|---|---|
| 2-bit ("TQ2") | 2 | k=256 | CPU, GPU |
| 1.6-bit ("TQ1") | 1.6 | k=5 | CPU, FPGA, ASIC |
| PTQTP Trit-Plane | 3.17 | group=128 | GPU, FPGA, ASIC |
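
The 1.6-bit figure follows from treating five trits as one base-3 number, which always fits in a byte because $3^5 = 243 \le 256$. The sketch below shows one such encoding in pure Python. The digit order and the helper names (`pack5`, `unpack5`, `pack`, `unpack`) are assumptions for illustration and do not reproduce the byte layout of any particular "TQ1" implementation.

```python
def pack5(trits):
    """Pack exactly 5 values from {-1, 0, +1} into one integer in 0..242."""
    assert len(trits) == 5
    byte = 0
    for t in reversed(trits):
        byte = byte * 3 + (t + 1)  # map -1, 0, +1 to base-3 digits 0, 1, 2
    return byte


def unpack5(byte):
    """Inverse of pack5: recover the 5 trits from one byte."""
    out = []
    for _ in range(5):
        byte, digit = divmod(byte, 3)
        out.append(digit - 1)
    return out


def pack(trits):
    """Pack a trit sequence into bytes, zero-padding the last block."""
    padded = list(trits) + [0] * (-len(trits) % 5)
    return bytes(pack5(padded[i:i + 5]) for i in range(0, len(padded), 5))


def unpack(data, n):
    """Unpack n trits from a bytes object produced by pack."""
    out = []
    for b in data:
        out.extend(unpack5(b))
    return out[:n]


weights = [1, 0, -1, -1, 0, 1, 1]
blob = pack(weights)
print(len(blob), unpack(blob, len(weights)) == weights)
```

Production kernels typically choose a layout that can be decoded with a few integer multiplies or table lookups rather than repeated division; the round-trip property is the same.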

4. Empirical Scaling Laws and Model Behavior

Recent empirical analysis reveals that ternary models exhibit distinctly different scaling behavior from their full-precision counterparts. For ternary LLMs (TriLMs) (Vaidhya et al., 28 Jun 2025), the loss is fit by a power law in $N$ and $D$, where $N$ is the parameter count (in millions) and $D$ the number of pretraining tokens (in billions). The data exponent dominates the parameter exponent, implying that expanding the dataset, rather than the model size, yields greater returns for ternary LLMs at fixed FLOPs.

For FloatLMs, the two exponents are nearly matched.

A practical implication is that TernaryLLMs should allocate training computation towards increasing data rather than model width/depth, diverging from established scaling rules for float-precision models.

5. Accuracy-Complexity Tradeoffs and Experimental Results

Comprehensive benchmarks show that TernaryLLMs typically retain over 90% of baseline FP16 accuracy at 1.58 bits/weight, and dramatically outperform earlier binary or poorly compensated ternary/PTQ methods.

  • On Qwen3-14B, PTQTP retains more of the FP16 mathematical reasoning test accuracy than baseline 3-bit GPTQ under the same conditions (Xiao et al., 21 Sep 2025).
  • LLaMA-3-8B with QAT (DLT+OFF) matches or outperforms 2-bit quantization, reaching higher zero-shot accuracy than the best 2-bit method (Chen et al., 2024).
  • Language modeling perplexity increases by less than 0.5 PPL under BitNet-1.58 quantization (Qiao et al., 22 Apr 2025, Qiao et al., 3 Oct 2025).
  • For quantization-aware fine-tuning, LoTA-QAF enables lossless merging of ternary adapters, recovering or surpassing LoRA (16-bit) accuracy on downstream tasks (Chen et al., 24 May 2025).
  • FPGA and ASIC accelerators using optimized ternary GEMM consistently deliver end-to-end speedups and energy-efficiency gains over A100-class GPUs (Huang et al., 17 Sep 2025, Qiao et al., 22 Apr 2025, Qiao et al., 3 Oct 2025, Yin et al., 23 Feb 2025).

6. Hardware Integration and Edge Deployment

TernaryLLMs are highly amenable to deployment on resource-constrained hardware due to their uniform, low-bit arithmetic and multiplication-free operations:

  • Edge FPGAs: Engines such as TeLLMe and TerEffic store weights on-chip or in HBM, implement pipelined table-lookup matmul, and achieve higher throughput and efficiency than Jetson-class SoCs at equivalent or lower power (Qiao et al., 22 Apr 2025, Qiao et al., 3 Oct 2025, Yin et al., 23 Feb 2025).
  • ASICs: TENET-ASIC deploys a heterogeneous architecture (Sparse Ternary LUT arrays plus FP16 attention blocks), reaching faster end-to-end inference and higher energy efficiency than the NVIDIA A100 GPU, supported by custom 1.6-bit packing and decompression (Huang et al., 17 Sep 2025).
  • CPUs/GPUs: Dedicated CPU kernels and the TriRun CUDA kernel unlock prompt and decode speedups of 1.5–7.9×, with dense or sparse storage for ternary weights (Lipshitz et al., 8 Oct 2025, Vaidhya et al., 28 Jun 2025).
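
What "multiplication-free" means for a ternary layer can be shown with a small reference implementation. This is a minimal sketch under the assumption of a single scale $\alpha$ for the whole matrix; the function name `ternary_matvec` is hypothetical, and real kernels operate on packed weights with lookup tables or SIMD rather than Python loops.

```python
def ternary_matvec(T, alpha, x):
    """Compute y = alpha * (T @ x) for T with entries in {-1, 0, +1}.

    The inner loop uses only additions and subtractions; the single
    multiplication per output row applies the shared scale alpha.
    """
    y = []
    for row in T:
        acc = 0.0
        for t, xj in zip(row, x):
            if t == 1:
                acc += xj
            elif t == -1:
                acc -= xj
            # t == 0 contributes nothing and is skipped
        y.append(alpha * acc)
    return y


T = [[1, 0, -1],
     [0, 1, 1]]
print(ternary_matvec(T, 0.5, [2.0, 3.0, 4.0]))  # prints [-1.0, 3.5]
```

Zero entries cost nothing in this formulation, which is why sparse storage of ternary weights can pay off in addition to the removal of multiplies.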

7. Information-Theoretic and Theoretical Advances

TernaryLLM quantization is increasingly positioned as an information-theoretically optimal representation under resource constraints.

  • Entropy: A uniformly distributed trit carries $\log_2 3 \approx 1.585$ bits, a bound closely approached by 1.6-bit packing (Vaidhya et al., 28 Jun 2025, Uhlmann, 8 Aug 2025).
  • SZT encoding adds “signed-zero” states, recovering additional redundancy available in the unused 2-bit codeword, greatly enhancing gradient feedback for sub-threshold weights, reducing mean-squared-error in the STE, and tightening PAC–Bayes bounds (Uhlmann, 8 Aug 2025).
  • Convergence dynamics: Progressive trit-plane and DLT-decompositions are theoretically guaranteed to converge monotonically, with bounded scaling parameters (Xiao et al., 21 Sep 2025).
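
The gap between the entropy bound and practical packing schemes is easy to quantify. The short sketch below computes how much of each stored bit carries information when weights are uniformly distributed over the three states (an idealizing assumption); `packing_efficiency` is a hypothetical helper name.

```python
import math

BITS_PER_TRIT = math.log2(3)  # information content of one uniform trit


def packing_efficiency(trits_per_block, bits_per_block):
    """Fraction of stored bits that carry information for a packing scheme."""
    return trits_per_block * BITS_PER_TRIT / bits_per_block


print(round(BITS_PER_TRIT, 3))              # 1.585
print(round(packing_efficiency(5, 8), 4))   # 0.9906 for 5 trits per byte
print(round(packing_efficiency(1, 2), 4))   # 0.7925 for 2 bits per trit
```

Under this assumption, five-trits-per-byte packing wastes under 1% of its storage, while the 2-bit encoding leaves about a fifth unused, which is the slack that signed-zero style encodings aim to exploit.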

TernaryLLMs thus represent not only an engineering compromise for edge or memory-bounded deployments, but also a theoretically motivated, rigorously analyzed quantization regime.

