TernaryLLM: Low-Bit Language Models

Updated 12 December 2025

TernaryLLM is a large language model whose weights are quantized to {-1, 0, +1}, drastically reducing memory footprint and eliminating most floating-point multiplications.
It combines post-training and quantization-aware training methods with compact packing schemes and hardware accelerators for efficient inference.
Empirical benchmarks show TernaryLLMs retain over 90% of baseline accuracy while achieving speedups and energy-efficiency gains across CPUs, GPUs, FPGAs, and ASICs.

A TernaryLLM is an LLM in which the majority of weights are quantized to a ternary alphabet, typically $\{-1, 0, +1\}$, and encoded using dense representations of at most 2 bits per weight (e.g., 1.6 or 2 bits/weight). These models drastically reduce memory footprint and remove most floating-point multiplications from inference, while preserving much of the model's expressiveness and accuracy. TernaryLLMs combine advances in post-training quantization, quantization-aware training, hardware design (CPU, GPU, FPGA, ASIC), and information-theoretically motivated encoding schemes to run LLM inference at orders-of-magnitude lower computational cost than full-precision or even 4-bit models.

1. Mathematical Foundations of Ternary Quantization

The core operation in TernaryLLMs is the quantization of neural weights to the ternary set $\{-1, 0, +1\}$. In the forward pass of a linear or projection layer, the floating-point weights $W\in\mathbb{R}^{n\times d}$ are approximated as:

$W \approx \widetilde{W} = \alpha \cdot T$

where $T \in \{-1,0,1\}^{n\times d}$ and $\alpha$ is a learnable or derived scaling factor (often applied per-row, per-column, or per-group) (Chen et al., 2024, Xiao et al., 21 Sep 2025, Vaidhya et al., 28 Jun 2025).
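The factorization above is what removes most multiplications from inference. The following is a minimal sketch, assuming a per-row scale; the function name `ternary_matvec` and the toy values are illustrative and do not come from any of the cited implementations.

```python
from typing import List, Sequence

def ternary_matvec(T: Sequence[Sequence[int]],
                   alpha: Sequence[float],
                   x: Sequence[float]) -> List[float]:
    """Compute y_i = alpha_i * sum_j T_ij * x_j with additions and
    subtractions only, plus one multiply per output row for the scale."""
    y = []
    for row, a in zip(T, alpha):
        acc = 0.0
        for t, xj in zip(row, x):
            if t == 1:
                acc += xj        # +1 weight: add the activation
            elif t == -1:
                acc -= xj        # -1 weight: subtract the activation
            # 0 weight: nothing to do, the term is skipped
        y.append(a * acc)        # apply the per-row scale once
    return y

# Toy example (hypothetical values).
T = [[1, 0, -1],
     [0, 1, 1]]
alpha = [0.5, 2.0]
x = [1.0, 2.0, 3.0]
y = ternary_matvec(T, alpha, x)
```

For a layer with $n$ rows and $d$ columns this needs $n$ multiplications instead of $n \cdot d$, and zero weights cost nothing.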

Several quantization procedures are in use:

Hard thresholding: $T_{ij} = \text{sign}(W_{ij})$ if $|W_{ij}| > \Delta$, $T_{ij}=0$ otherwise. The scale $\alpha$ is set to minimize the reconstruction error $\|W - \alpha T\|_F^2$ (Qiao et al., 22 Apr 2025, Yin et al., 23 Feb 2025).
Dual Learnable Ternarization (DLT): Both scale $\alpha$ and shift $\gamma$ parameters are learned for each group, allowing the quantized-and-reconstructed weight to be $\alpha T + \gamma$ (Chen et al., 2024).
Structured Trit-Plane Decomposition: Advanced schemes such as PTQTP represent every row of $W$ as a sum of two ternary planes weighted by learned scales:

$W_{i,:} \approx \alpha^{(1)}_i T^{(1)}_{i,:} + \alpha^{(2)}_i T^{(2)}_{i,:}, \quad T^{(1)}, T^{(2)} \in \{-1,0,1\}^{n\times d}$

yielding an effective storage of $2 \times 1.585 \approx 3.17$ bits/weight, or 1.585 bits per plane (Xiao et al., 21 Sep 2025).

Signed-Zero Ternary (SZT): Encodes four states (using two bits), allowing additional sign information for sub-threshold weights, improving gradient flow and information density (Uhlmann, 8 Aug 2025).
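As a rough illustration of the first and third procedures, the sketch below ternarizes a small weight group by hard thresholding and then fits a second ternary plane to the residual. Several things here are assumptions rather than details from the cited papers: the scale is chosen to minimize squared reconstruction error (for a fixed $T$, that is the mean of $|W_{ij}|$ over the nonzero positions), the threshold uses the common heuristic of 0.7 times the mean absolute weight, and the greedy residual fit is a simplified stand-in for PTQTP's actual optimization.

```python
from typing import List, Tuple

def ternarize(w: List[float], delta: float) -> Tuple[List[int], float]:
    """Hard-threshold w into {-1, 0, +1} and return (T, alpha), where alpha
    minimizes sum((w_i - alpha * T_i)^2) for that T."""
    t = [(1 if v > 0 else -1) if abs(v) > delta else 0 for v in w]
    kept = [abs(v) for v, s in zip(w, t) if s != 0]
    alpha = sum(kept) / len(kept) if kept else 0.0
    return t, alpha

def heuristic_delta(w: List[float]) -> float:
    # Assumed threshold rule: 0.7 * mean(|w|); other choices are possible.
    return 0.7 * sum(abs(v) for v in w) / len(w)

def sq_error(w: List[float], approx: List[float]) -> float:
    return sum((a - b) ** 2 for a, b in zip(w, approx))

w = [0.9, -0.1, 0.05, -1.1, 0.4]          # hypothetical weight group

# Plane 1: plain hard-threshold ternarization.
t1, a1 = ternarize(w, heuristic_delta(w))
plane1 = [a1 * s for s in t1]

# Plane 2: ternarize what plane 1 failed to capture (greedy residual fit).
residual = [v - p for v, p in zip(w, plane1)]
t2, a2 = ternarize(residual, heuristic_delta(residual))
plane12 = [p + a2 * s for p, s in zip(plane1, t2)]

err1 = sq_error(w, plane1)     # error with one trit-plane
err2 = sq_error(w, plane12)    # error with two trit-planes
```

Because each scale is the least-squares optimum for its plane, adding the second plane can never increase the squared error; in this toy group it drops by roughly an order of magnitude.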

Activations are usually left in higher precision (e.g., FP16 or INT8), as quantizing activations to ternary remains an outstanding challenge due to heavy-tailed distributions and significant dynamic range (Chen et al., 2024, Xiao et al., 21 Sep 2025).

2. Quantization Methodologies: Post-Training and Quantization-Aware Training

Two principal quantization strategies are prominent:

Post-Training Quantization (PTQ): Applies quantization to a pretrained LLM (e.g., LLaMA, Qwen) without further gradient updates. Algorithms such as PTQTP use a monotonic, globally consistent, group-wise progressive approximation loop: alternating ridge regression updates for scale and exhaustive search for ternary assignments per group, with convergence guarantees (Xiao et al., 21 Sep 2025).
Quantization-Aware Training (QAT): Modifies the forward pass to simulate ternary weights and employs straight-through estimators for the backward pass, learning to compensate for quantization error during training (Vaidhya et al., 28 Jun 2025, Chen et al., 2024). DLT augments this process with learnable shifts to better fit asymmetric weight distributions, while Outlier-Friendly Feature Distillation (OFF) guides the quantized student toward teacher representations using cosine similarity, addressing information loss due to extreme quantization (Chen et al., 2024).
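The straight-through estimator used in QAT can be illustrated without an autograd framework. The sketch below is a toy under stated assumptions: the threshold `DELTA` and scale `ALPHA` are fixed constants (real QAT derives or learns them, and DLT additionally learns a shift), the model is a single linear layer, and the data and teacher weights are invented for the example.

```python
from typing import List, Sequence, Tuple

DELTA = 0.5   # fixed ternarization threshold (assumption for this sketch)
ALPHA = 1.0   # fixed scale (assumption; normally derived or learned)

def quantize(w: Sequence[float]) -> List[float]:
    """Forward-pass weights: alpha * T with T in {-1, 0, +1}."""
    return [ALPHA * ((1 if v > 0 else -1) if abs(v) > DELTA else 0) for v in w]

def loss_and_grad(w: Sequence[float],
                  data: Sequence[Tuple[Sequence[float], float]]):
    """Squared-error loss of the quantized model and its STE gradient."""
    q = quantize(w)
    grad = [0.0] * len(w)
    loss = 0.0
    for x, target in data:
        pred = sum(qi * xi for qi, xi in zip(q, x))
        err = pred - target
        loss += err * err
        for i, xi in enumerate(x):
            # dL/dq_i is 2*err*x_i. The quantizer has zero gradient almost
            # everywhere, so the STE passes dL/dq straight through to w.
            grad[i] += 2.0 * err * xi
    return loss, grad

def train(w: Sequence[float], data, lr: float = 0.1, steps: int = 20) -> List[float]:
    w = list(w)                      # latent full-precision weights
    for _ in range(steps):
        _, g = loss_and_grad(w, data)
        w = [wi - lr * gi for wi, gi in zip(w, g)]
    return w

teacher = [1.0, 0.0, -1.0]                      # hypothetical target weights
inputs = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
data = [(x, sum(t * xi for t, xi in zip(teacher, x))) for x in inputs]

w0 = [0.2, 0.3, -0.1]                           # all start below the threshold
loss_before, _ = loss_and_grad(w0, data)
w_trained = train(w0, data)
loss_after, _ = loss_and_grad(w_trained, data)
```

All latent weights start inside the dead zone, so the quantized model initially outputs zero. The STE still moves the latent weights until the first and third cross the threshold, at which point the ternary model matches the teacher.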

Quantization-aware fine-tuning techniques (e.g., LoTA-QAF) employ low-rank trainable adapters in the ternary domain, supporting lossless merging and integer-only inference (Chen et al., 24 May 2025).

3. Packing Schemes and Hardware Implementation

Efficiently storing and operating over ternary weights is critical for realizing the theoretical savings. Key approaches:

Bit-packing: Blocks of 5 ternary values ($3^5 = 243 \leq 256$) are packed into a single 8-bit byte, yielding 1.6 bits/weight ("TQ1" scheme); using two bits per value ("TQ2") reaches 2 bits/weight. These methods are implemented on both CPUs and GPUs for fast unpacking and high memory bandwidth utilization (Vaidhya et al., 28 Jun 2025, Huang et al., 17 Sep 2025).
Matrix multiplication (GEMM/GEMV): Inference kernels are redesigned to exploit the ternary structure:
- On CPUs (e.g., Apple Silicon), custom sparse GEMM kernels using blocked, interleaved storage, loop unrolling, and NEON vectorization deliver 5–6× speedup over default libraries (Lipshitz et al., 8 Oct 2025).
- On FPGAs/ASICs, accelerators such as TENET and TeLLMe use table-lookup engines and LUT-centric ternary matmuls, slashing the need for multipliers and reducing DRAM access via specialized weight packing (Huang et al., 17 Sep 2025, Qiao et al., 22 Apr 2025, Qiao et al., 3 Oct 2025). Dynamic N:M activation sparsity further reduces compute (Huang et al., 17 Sep 2025).
- On GPUs, TriRun offers a mixed-precision CUDA kernel (FP16 activations × INT2 weights) leveraging shared memory and pipeline parallelism, achieving up to 4.9× end-to-end throughput gains (Vaidhya et al., 28 Jun 2025).
Indexing algorithms: For fixed ternary weight matrices, block-indexed GEMV algorithms reduce inference time and memory by precomputing permutation and segmentation indices, with speedups and memory reductions reported in software-only settings (Dehghankar et al., 2024).
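The 1.6 bits/weight figure follows from treating five ternary values as one base-3 number, which always fits in a byte. The sketch below shows that idea with plain base-3 arithmetic; it is an illustration only, and the byte layout is an assumption that need not match the layout used by any of the cited kernels.

```python
import math
from typing import List

def pack5(trits: List[int]) -> int:
    """Pack exactly 5 trits from {-1, 0, +1} into one byte value (0..242)."""
    if len(trits) != 5:
        raise ValueError("expected exactly 5 trits")
    code = 0
    for t in reversed(trits):        # reverse so trits[0] is the lowest digit
        code = code * 3 + (t + 1)    # map -1, 0, +1 to base-3 digits 0, 1, 2
    return code

def unpack5(code: int) -> List[int]:
    """Inverse of pack5: recover the 5 trits from one byte value."""
    out = []
    for _ in range(5):
        out.append(code % 3 - 1)
        code //= 3
    return out

def pack_trits(trits: List[int]) -> bytes:
    """Pack any number of trits, zero-padding the last block of 5."""
    padded = trits + [0] * (-len(trits) % 5)
    return bytes(pack5(padded[i:i + 5]) for i in range(0, len(padded), 5))

def unpack_trits(data: bytes, n: int) -> List[int]:
    """Unpack the first n trits from packed bytes."""
    out: List[int] = []
    for b in data:
        out.extend(unpack5(b))
    return out[:n]

weights = [1, 0, -1, -1, 1, 0, 0, 1, -1, 1, 1, -1]   # hypothetical trits
packed = pack_trits(weights)
restored = unpack_trits(packed, len(weights))

bits_per_weight = 8 / 5          # 5 trits per 8-bit byte
entropy_bound = math.log2(3)     # information content of one uniform trit
```

Only 243 of the 256 byte values are used, so the packing sits slightly above the $\log_2 3$ bound. Production kernels typically replace the division-based decode with lookup tables or fixed-point tricks.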

Packing Method	Bits/Weight	Packing Unit	Main Platform
2-bit ("TQ2")	2	k=256	CPU, GPU
1.6-bit ("TQ1")	1.6	k=5	CPU, FPGA, ASIC
PTQTP Trit-Plane	3.17	group=128	GPU, FPGA, ASIC
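To relate the bits/weight column to memory footprint, the following back-of-the-envelope sketch computes weight storage for a hypothetical 8-billion-parameter model. The parameter count is an assumption for illustration, and the estimate ignores scales, embeddings, activations, and the KV cache.

```python
def weight_footprint_gb(num_params: float, bits_per_weight: float) -> float:
    """Weights-only storage in gigabytes (10^9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9

NUM_PARAMS = 8e9   # hypothetical model size, not taken from the source

schemes = {
    "FP16": 16.0,
    "TQ2": 2.0,
    "TQ1": 1.6,
    "PTQTP trit-plane": 3.17,
}
sizes = {name: weight_footprint_gb(NUM_PARAMS, bits) for name, bits in schemes.items()}

for name, gb in sizes.items():
    print(f"{name:>18}: {gb:.2f} GB")
```

Under these assumptions the 1.6-bit packing stores the weights in one tenth of the FP16 space.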

4. Empirical Scaling Laws and Model Behavior

Recent empirical analysis reveals that ternary models scale differently from their full-precision counterparts. For ternary LLMs (TriLMs), Vaidhya et al. (28 Jun 2025) fit a power-law scaling relation in the number of parameters (in millions) and the number of pretraining tokens (in billions). The fitted data exponent dominates the parameter exponent, implying that expanding the dataset, rather than the model size, yields greater returns for ternary LLMs at fixed FLOPs.

For FloatLMs, the two exponents are nearly matched.

A practical implication is that TernaryLLMs should allocate training computation towards increasing data rather than model width/depth, diverging from established scaling rules for float-precision models.

5. Accuracy-Complexity Tradeoffs and Experimental Results

Comprehensive benchmarks show that TernaryLLMs typically retain $>90\%$ of baseline FP16 accuracy at 1.58 bits/weight, and dramatically outperform earlier binary or poorly compensated ternary/PTQ methods.

On Qwen3-14B, PTQTP retains a larger share of FP16 mathematical-reasoning test accuracy than baseline 3-bit GPTQ under the same conditions (Xiao et al., 21 Sep 2025).
LLaMA-3-8B with QAT (DLT+OFF) matches or outperforms 2-bit quantization, with higher zero-shot accuracy than the best 2-bit method (Chen et al., 2024).
Language-modeling perplexity increases by less than 0.5 under BitNet-1.58 quantization (Qiao et al., 22 Apr 2025, Qiao et al., 3 Oct 2025).
For quantization-aware fine-tuning, LoTA-QAF enables lossless merging of ternary adapters, recovering or surpassing 16-bit LoRA accuracy on downstream tasks (Chen et al., 24 May 2025).
FPGA and ASIC accelerators using optimized ternary GEMM report end-to-end speedups and energy-efficiency gains over A100-class GPUs (Huang et al., 17 Sep 2025, Qiao et al., 22 Apr 2025, Qiao et al., 3 Oct 2025, Yin et al., 23 Feb 2025).

6. Hardware Integration and Edge Deployment

TernaryLLMs are highly amenable to deployment on resource-constrained hardware due to their uniform, low-bit arithmetic and multiplication-free operations:

Edge FPGAs: Engines such as TeLLMe and TerEffic store weights on-chip or in HBM, implement pipelined table-lookup matmul, and report higher throughput and efficiency than Jetson-class SoCs at equivalent or lower power (Qiao et al., 22 Apr 2025, Qiao et al., 3 Oct 2025, Yin et al., 23 Feb 2025).
ASICs: TENET-ASIC deploys a heterogeneous architecture (Sparse Ternary LUT arrays plus FP16 attention blocks), reporting faster end-to-end inference and higher energy efficiency than the NVIDIA A100 GPU, supported by custom 1.6-bit packing and decompression (Huang et al., 17 Sep 2025).
CPUs/GPUs: Dedicated CPU kernels and the TriRun CUDA kernel unlock prompt and decode speedups of 1.5–7.9×, with dense or sparse storage for ternary weights (Lipshitz et al., 8 Oct 2025, Vaidhya et al., 28 Jun 2025).
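The table-lookup matmul mentioned for the FPGA and ASIC engines can be sketched in software. The idea shown here is generic: for each group of `G` activations, precompute the partial sum for every possible ternary pattern once, then let every weight row fetch its partial sums by index. The group size, the base-3 indexing, and all function names are assumptions for illustration and do not describe the internal design of TeLLMe, TerEffic, or TENET.

```python
from itertools import product
from typing import List, Sequence

G = 3  # trits per lookup group (assumed); each table has 3**G = 27 entries

def encode_group(trits: Sequence[int]) -> int:
    """Base-3 index of a group of G trits (digits 0, 1, 2 for -1, 0, +1)."""
    code = 0
    for t in reversed(trits):
        code = code * 3 + (t + 1)
    return code

def build_tables(x: Sequence[float]) -> List[List[float]]:
    """For each activation group, tabulate the signed sum for all 3**G patterns."""
    assert len(x) % G == 0, "sketch assumes len(x) is a multiple of G"
    tables = []
    for s in range(0, len(x), G):
        chunk = x[s:s + G]
        table = [0.0] * (3 ** G)
        for trits in product((-1, 0, 1), repeat=G):
            table[encode_group(trits)] = sum(
                v if t == 1 else -v if t == -1 else 0.0
                for t, v in zip(trits, chunk)
            )
        tables.append(table)
    return tables

def lut_matvec(codes: Sequence[Sequence[int]],
               alpha: Sequence[float],
               x: Sequence[float]) -> List[float]:
    """y_i = alpha_i * sum over groups of table[group][codes[i][group]]."""
    tables = build_tables(x)         # built once, shared by every row
    return [a * sum(tab[c] for tab, c in zip(tables, row))
            for row, a in zip(codes, alpha)]

# Hypothetical ternary weights, pre-encoded offline into group indices.
T = [[1, 0, -1, 1, 1, 0],
     [0, -1, -1, 0, 0, 1]]
codes = [[encode_group(row[s:s + G]) for s in range(0, len(row), G)] for row in T]
alpha = [0.5, 2.0]
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = lut_matvec(codes, alpha, x)
```

Each output row then costs one lookup and one addition per group instead of `G` multiply-accumulates, and the table-building cost is amortized over all rows of the weight matrix.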

7. Information-Theoretic and Theoretical Advances

TernaryLLM quantization is increasingly positioned as an information-theoretically optimal representation under resource constraints.

Entropy: A uniformly distributed ternary symbol carries $\log_2 3 \approx 1.585$ bits/trit, a bound closely approached by 1.6-bit packing (Vaidhya et al., 28 Jun 2025, Uhlmann, 8 Aug 2025).
SZT encoding adds “signed-zero” states, recovering additional redundancy available in the unused 2-bit codeword, greatly enhancing gradient feedback for sub-threshold weights, reducing mean-squared-error in the STE, and tightening PAC–Bayes bounds (Uhlmann, 8 Aug 2025).
Convergence dynamics: Progressive trit-plane and DLT-decompositions are theoretically guaranteed to converge monotonically, with bounded scaling parameters (Xiao et al., 21 Sep 2025).
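The monotone-convergence property can be seen in a simplified setting. The sketch below alternates between an exhaustive per-weight search over $\{-1, 0, +1\}$ and a closed-form ridge update of the scale, for a single ternary plane and a single weight group. It is a reduced illustration under assumptions (one plane instead of two, an invented regularizer `lam`, and a mean-absolute-value initialization), not the PTQTP algorithm itself.

```python
from typing import List, Tuple

def objective(w: List[float], t: List[int], alpha: float, lam: float) -> float:
    """Ridge-regularized reconstruction error: ||w - alpha*t||^2 + lam*alpha^2."""
    return sum((wi - alpha * ti) ** 2 for wi, ti in zip(w, t)) + lam * alpha ** 2

def fit_group(w: List[float], iters: int = 10,
              lam: float = 1e-6) -> Tuple[List[int], float, List[float]]:
    alpha = sum(abs(v) for v in w) / len(w)     # assumed initialization
    t = [0] * len(w)
    history = []
    for _ in range(iters):
        # (a) For fixed alpha, pick the best trit for each weight by
        #     exhaustive search; ties resolve toward 0.
        t = [min((0, 1, -1), key=lambda c, v=v: (v - alpha * c) ** 2) for v in w]
        # (b) For fixed t, the ridge-regression optimum of alpha in closed form.
        num = sum(wi * ti for wi, ti in zip(w, t))
        den = sum(ti * ti for ti in t) + lam
        alpha = num / den
        history.append(objective(w, t, alpha, lam))
    return t, alpha, history

w = [0.9, -0.1, 0.05, -1.1, 0.4, 0.7, -0.6, 0.02]   # hypothetical weight group
t, alpha, history = fit_group(w)
```

Step (a) minimizes the objective over the trits with the scale held fixed, and step (b) minimizes it over the scale with the trits held fixed, so the recorded objective values cannot increase from one iteration to the next. Since there are finitely many ternary assignments, the loop settles after a few iterations.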

TernaryLLMs thus represent not only an engineering compromise for edge or memory-bounded deployments, but also a theoretically motivated, rigorously analyzed quantization regime.

References

PTQTP: Post-Training Quantization to Trit-Planes for LLMs (Xiao et al., 21 Sep 2025)
TernaryLLM: Ternarized LLM (Chen et al., 2024)
Accelerating Sparse Ternary GEMM for Quantized LLM inference on Apple Silicon (Lipshitz et al., 8 Oct 2025)
The Fourth State: Signed-Zero Ternary for Stable LLM Quantization (and More) (Uhlmann, 8 Aug 2025)
LoTA-QAF: Lossless Ternary Adaptation for Quantization-Aware Fine-Tuning (Chen et al., 24 May 2025)
TENET: An Efficient Sparsity-Aware LUT-Centric Architecture for Ternary LLM Inference On Edge (Huang et al., 17 Sep 2025)
TeLLMe: An Energy-Efficient Ternary LLM Accelerator for Prefilling and Decoding on Edge FPGAs (Qiao et al., 22 Apr 2025)
TeLLMe v2: An Efficient End-to-End Ternary LLM Prefill and Decode Accelerator with Table-Lookup Matmul on Edge FPGAs (Qiao et al., 3 Oct 2025)
TerEffic: Highly Efficient Ternary LLM Inference on FPGA (Yin et al., 23 Feb 2025)
An Efficient Matrix Multiplication Algorithm for Accelerating Inference in Binary and Ternary Neural Networks (Dehghankar et al., 2024)
Spectra 1.1: Scaling Laws and Efficient Inference for Ternary LLMs (Vaidhya et al., 28 Jun 2025)