AXELRAM: Efficient Quantized Attention via Asymmetric SRAM Architecture
Introduction and Motivation
AXELRAM proposes an SRAM-integrated architecture for attention computation that restructures the interaction between quantization and attention lookup for key-value (KV) cache acceleration in LLMs. At its core is the elimination of per-token dequantization: all operations after the write are performed in the transform domain, enabled by a design-time fixed codebook and a single orthogonal rotation per query. The architecture is motivated by the prohibitive cost of per-query dequantization in context-rich inference, especially for large models at long sequence lengths, and by the observation that prior state-of-the-art approaches such as TurboQuant are bottlenecked by redundant recomputation and dequantization.
(Figure 1)
Figure 1: AXELRAM eliminates per-token dequantization and inverse rotations by integrating quantization-aware lookup directly in SRAM, leveraging orthogonal invariance to shift all expensive computation to a one-time per-query rotation.
Principle of Fixed-Codebook Quantization
A key insight is that applying a random Hadamard rotation (FWHT) to a normalized activation vector yields per-coordinate distributions tightly matching $\mathcal{N}(0, 1/d)$ in high dimension ($d \geq 64$), so a Lloyd-Max scalar quantizer designed for this one distribution is optimal for every coordinate. Consequently, the optimal codebook depends only on $d$ and the bitwidth $b$, not on the underlying data distribution or model weights. This universality enables precomputing the quantizer offline and embedding it in a 30-byte ROM shared across all layers and heads.
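This distributional claim is straightforward to check numerically. The sketch below is an illustration (not the paper's code): it applies a sign-randomized FWHT to normalized vectors with deliberately heterogeneous coordinate scales and verifies that the rotated coordinates have mean near 0 and standard deviation near $1/\sqrt{d}$.

```python
import numpy as np

def fwht_rot(x, signs):
    """Orthonormal randomized Hadamard rotation: sign flip, FWHT, scale by 1/sqrt(d)."""
    y = (signs * x).astype(np.float64)
    h = 1
    while h < len(y):
        y = y.reshape(-1, 2 * h)
        y = np.hstack([y[:, :h] + y[:, h:], y[:, :h] - y[:, h:]]).reshape(-1)
        h *= 2
    return y / np.sqrt(len(y))

rng = np.random.default_rng(0)
d = 128
signs = rng.choice([-1.0, 1.0], size=d)
coords = []
for _ in range(200):
    v = rng.standard_normal(d) * rng.uniform(0.5, 2.0, size=d)  # heterogeneous scales
    coords.append(fwht_rot(v / np.linalg.norm(v), signs))       # rotate the unit vector
coords = np.concatenate(coords)
print(coords.mean(), coords.std(), d ** -0.5)  # std approaches 1/sqrt(d)
```

Because the rotation is orthonormal, each rotated vector keeps unit norm exactly, so the per-coordinate mean square is $1/d$ by construction; the empirical check confirms the coordinates are also well centered.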
Because orthogonal rotations preserve inner products, the architecture rotates each query once at runtime and computes all inner products against the stored (already rotated and quantized) keys via lookups in the transform domain. For a context of $T$ cached keys, this removes $T$ per-query inverse transforms and dequantizations, reducing the multiplication count from $T \times d$ to $d \times 2^b + T$:
$$
\text{Multiplications: } d \times 2^b + T \ll T \times d \quad (\text{for typical } b \leq 4)
$$
This asymmetry enables the core "quantize once, never dequantize" paradigm, delivering not only hardware simplicity but also substantial inference acceleration.
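The orthogonal-invariance property underpinning this paradigm can be verified directly. In the sketch below (an illustration assuming the rotation is a sign-randomized FWHT), rotating query and key with the same transform leaves their inner product unchanged up to floating-point error:

```python
import numpy as np

def fwht_rot(x, signs):
    """Orthonormal randomized Hadamard rotation (sign flip + FWHT / sqrt(d))."""
    y = (signs * x).astype(np.float64)
    h = 1
    while h < len(y):
        y = y.reshape(-1, 2 * h)
        y = np.hstack([y[:, :h] + y[:, h:], y[:, :h] - y[:, h:]]).reshape(-1)
        h *= 2
    return y / np.sqrt(len(y))

rng = np.random.default_rng(1)
d = 128
signs = rng.choice([-1.0, 1.0], size=d)
q, k = rng.standard_normal(d), rng.standard_normal(d)
# Inner products are preserved under the shared orthogonal rotation.
print(q @ k, fwht_rot(q, signs) @ fwht_rot(k, signs))
```

This is why keys can be stored permanently in the rotated domain: any future query only needs its own one-time rotation to be comparable against them.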
Hardware Macro Architecture
AXELRAM's architecture consists of an asymmetric path in SRAM:
- Write Path (Encode): Each incoming key is normalized, rotated by an FWHT with a layer-specific sign pattern, and quantized against the fixed codebook; only its norm and the resulting $b$-bit indices are written to SRAM. The process is multiplier-free apart from the norm computation.
- Read Path (Attention Score Computation): At inference, the write-path rotation is applied once to each query, and a lookup table of all query-coordinate/centroid products is built. Each attention score is then formed by gathering the table entries indexed by the stored key, summing, and rescaling by the original key norm, requiring only a single multiplication per key.
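The two paths above can be sketched end-to-end in software. This is a minimal illustration, not the paper's implementation: a Lloyd-trained Gaussian codebook stands in for the fixed ROM codebook, and the `encode_key`/`scores` names are hypothetical.

```python
import numpy as np

def fwht_rot(x, signs):
    """Orthonormal randomized Hadamard rotation (sign flip + FWHT / sqrt(d))."""
    y = (signs * x).astype(np.float64)
    h = 1
    while h < len(y):
        y = y.reshape(-1, 2 * h)
        y = np.hstack([y[:, :h] + y[:, h:], y[:, :h] - y[:, h:]]).reshape(-1)
        h *= 2
    return y / np.sqrt(len(y))

d, b = 128, 3
rng = np.random.default_rng(2)
signs = rng.choice([-1.0, 1.0], size=d)

# Stand-in for the fixed codebook: Lloyd iterations on N(0, 1/d) samples
# (the paper stores the exact Lloyd-Max centroids in ROM).
samples = rng.standard_normal(100_000) / np.sqrt(d)
centroids = np.quantile(samples, (np.arange(2 ** b) + 0.5) / 2 ** b)
for _ in range(25):
    nearest = np.abs(samples[:, None] - centroids).argmin(1)
    centroids = np.array([samples[nearest == i].mean() for i in range(2 ** b)])

def encode_key(k):
    """Write path: store only the key norm plus d b-bit centroid indices."""
    nrm = np.linalg.norm(k)
    r = fwht_rot(k / nrm, signs)
    return nrm, np.abs(r[:, None] - centroids).argmin(1)

def scores(q, stored):
    """Read path: one rotation and one d x 2^b product table per query,
    then gather/sum/rescale with a single multiplication per key."""
    rq = fwht_rot(q, signs)
    table = rq[:, None] * centroids[None, :]
    rows = np.arange(d)
    return np.array([nrm * table[rows, idx].sum() for nrm, idx in stored])

keys = [rng.standard_normal(d) for _ in range(64)]
q = rng.standard_normal(d)
stored = [encode_key(k) for k in keys]
approx, exact = scores(q, stored), np.array([q @ k for k in keys])
print(np.corrcoef(exact, approx)[0, 1])  # high correlation despite 3-bit keys
```

Note that the per-key work in `scores` is a gather, an addition tree, and one norm multiply; no dequantization logic appears anywhere on the read path.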
(Figure 2)
Figure 2: The SRAM macro contains a FWHT circuit, comparator-based quantization, ROM-based codebook and sign pattern, and a lookup-adder pipeline; this architecture dispenses with dequantization logic entirely.
(Figure 3)
Figure 3: Detailed depiction of the read path showing query transform, codebook table generation, parallel table lookup, addition, and norm scaling, with explicit multiplication count highlighting the computational savings.
Concretely, for $d = 128$ and $b = 3$ (3-bit codebook indices, i.e., $2^b = 8$ centroids), AXELRAM achieves over a $100\times$ reduction in per-query multiplications relative to conventional dequantize-then-multiply approaches.
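Under this counting (a $d \times 2^b$ table build per query plus one norm rescale per key, versus $d$ multiplications per key for a dequantize-and-dot baseline), the savings for a context of $T$ cached keys are simple arithmetic:

```python
d, b = 128, 3
for T in (1024, 4096, 16384):
    baseline = T * d             # full d-dim dot product per cached key
    axelram = d * 2 ** b + T     # one table build per query + one rescale per key
    print(f"T={T}: {baseline / axelram:.1f}x fewer multiplications")
```

At $T = 4096$ the ratio is exactly $524288 / 5120 = 102.4\times$, matching the figure quoted in the conclusion; it grows toward $d = 128\times$ as $T$ increases.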
Sign Pattern Sensitivity and Catastrophic Failures
Contrary to the previously accepted view that random sign patterns in the Hadamard rotation are innocuous, multi-seed evaluation across LLaMA-3.1-8B, Qwen2.5-3B, and Qwen3-8B demonstrates a sharp dependence of perplexity on the chosen pattern. In particular, Qwen2.5-3B exhibits catastrophic perplexity spikes ($\Delta \text{PPL} > 50$) for certain seeds, qualitatively more severe than the up-to-6-point accuracy variance previously reported for weight quantization (SpinQuant).
The underlying cause is traced to excessive heterogeneity in channel-wise norms, especially in the lower layers of models such as Qwen2.5-3B, which violates the uniform per-coordinate distribution assumption required for codebook optimality. Certain sign patterns concentrate most of the variance into a small set of coordinates, leaving insufficient quantization resolution per coordinate.
Implications: Single-seed evaluation, the prior best practice, can severely underestimate worst-case failures in low-bitwidth KV cache quantization. This discovery necessitates multi-seed evaluation and motivates robust mitigation strategies.
Gradient-Free Sign Pattern Optimization
To resolve sign pattern sensitivity, a calibration-based, gradient-free sign selection procedure is proposed. By sampling $C = 200$ random sign patterns per layer and selecting the one minimizing quantization MSE on a small calibration subset (8 samples), catastrophic PPL spikes are eliminated. The optimization is computationally cheap (roughly 1 s per layer), requires no additional hardware, and the resulting sign vectors occupy the same ROM module as before.
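A minimal sketch of this calibration loop follows. It is illustrative only: a uniform Gaussian-grid codebook stands in for the true Lloyd-Max centroids, the calibration keys are synthetic, and including the all-ones default among the candidates (so calibration is never worse than no calibration) is a design choice of this sketch.

```python
import numpy as np

def fwht_rot(x, signs):
    """Orthonormal randomized Hadamard rotation (sign flip + FWHT / sqrt(d))."""
    y = (signs * x).astype(np.float64)
    h = 1
    while h < len(y):
        y = y.reshape(-1, 2 * h)
        y = np.hstack([y[:, :h] + y[:, h:], y[:, :h] - y[:, h:]]).reshape(-1)
        h *= 2
    return y / np.sqrt(len(y))

d, b = 128, 3
centroids = np.linspace(-2.5, 2.5, 2 ** b) / np.sqrt(d)  # stand-in codebook

def quant_mse(keys, signs):
    """Mean quantization MSE of rotated, normalized keys under the fixed codebook."""
    err = 0.0
    for k in keys:
        r = fwht_rot(k / np.linalg.norm(k), signs)
        q = centroids[np.abs(r[:, None] - centroids).argmin(1)]
        err += np.mean((r - q) ** 2)
    return err / len(keys)

def select_sign_pattern(keys, d, num_candidates=200, seed=0):
    """Gradient-free calibration: sample candidate sign patterns (always
    including the all-ones default) and keep the one with lowest MSE."""
    rng = np.random.default_rng(seed)
    cands = np.vstack([np.ones(d),
                       rng.choice([-1.0, 1.0], size=(num_candidates, d))])
    return min(cands, key=lambda s: quant_mse(keys, s))

rng = np.random.default_rng(4)
scales = np.concatenate([np.full(8, 10.0), np.ones(d - 8)])  # heterogeneous channels
calib = [rng.standard_normal(d) * scales for _ in range(8)]  # 8 calibration samples
best = select_sign_pattern(calib, d)
print(quant_mse(calib, best), quant_mse(calib, np.ones(d)))  # calibrated vs default
```

The selected pattern is just another length-$d$ sign vector, which is why it can occupy the same ROM slot as an uncalibrated one.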
As measured empirically, optimized sign patterns reduce worst-case $\Delta$PPL from +58.43 to +0.82 (2-bit) and from +51.00 to +0.58 (3-bit) on Qwen2.5-3B, a reduction of about 99%. In LLaMA-3.1-8B, whose key norms are homogeneous, optimization has no effect, confirming the method's selectivity.
Theoretical and Practical Implications
- Bias-Variance Tradeoff with QJL: The analysis demonstrates that QJL-based unbiased estimators, as used in TurboQuant, are counterproductive in the low-bit regime ($b \leq 3$): the variance penalty incurred by sacrificing MSE-optimal codebook capacity exceeds any reduction in estimator bias, especially once the attention softmax is applied. This clarifies a recurring source of performance degradation overlooked in earlier work.
- Hardware Simplicity and Generalization: The hardware macro remains unchanged regardless of whether default or optimized sign vectors are used. Calibration-free execution is safe for homogeneous models (low norm ratio); for heterogeneous models, lightweight, one-time calibration aligns the sign vectors to the specific channel statistics.
- Model Selection Guidelines: The layer-wise key norm ratio ($\max/\min$) accurately predicts sign pattern sensitivity, and models with ratios above $5\times$ are flagged for sign calibration. For many architectures, 3-bit quantization is reliable; 2-bit remains challenging in certain cases due to inherent quantization error, not attributable to sign pattern pathology.
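The screening heuristic in the last guideline could be implemented as below. This sketch assumes one plausible reading of the layer-wise key norm ratio, namely the max/min of per-channel norms over a layer's cached keys; the function name is hypothetical.

```python
import numpy as np

def needs_sign_calibration(keys, threshold=5.0):
    """Flag a layer for sign calibration when its per-channel key norm
    ratio (max / min over channels) exceeds the threshold.
    keys: (T, d) matrix of cached keys for one layer."""
    channel_norms = np.linalg.norm(keys, axis=0)
    return bool(channel_norms.max() / channel_norms.min() > threshold)

rng = np.random.default_rng(3)
homog = rng.standard_normal((512, 128))                      # homogeneous channels
heter = homog * np.concatenate([np.full(8, 10.0),            # 8 dominant channels,
                                np.ones(120)])               # Qwen-like heterogeneity
print(needs_sign_calibration(homog), needs_sign_calibration(heter))
```

A check like this runs once per layer over a handful of cached keys, so it adds negligible cost to deployment while routing only the heterogeneous layers through calibration.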
Potential for Future AI and Hardware Systems
AXELRAM generalizes the product quantization paradigm, but replaces data-dependent codebook adaptation with a universal, mathematically optimal configuration, streamlining hardware deployment for LLM inference. The architecture paves the way for SRAM-integrated, per-query-optimized attention pipelines in low-power, high-throughput environments, including edge and near-memory inference accelerators.
Adoption of multi-seed and sign-calibration methodologies is likely to become standard practice as model diversity grows and quantization error tolerance shrinks in deployment scenarios. Hardware-provisioned storage for sign patterns supports fast reconfiguration as new models emerge. Future directions include integrated support for higher-rank transformations, systematic tradeoff studies of alternative bit allocations, and silicon validation with on-chip power and throughput characterization.
Conclusion
AXELRAM introduces an SRAM-based asymmetric attention computation architecture that removes KV-cache dequantization from the inference path, yielding an exact $102.4\times$ reduction in multiplication cost at a typical operating regime. The work exposes the previously underappreciated risk of catastrophic sign pattern sensitivity in rotation-based cache quantization and provides a robust, near-zero-cost, gradient-free calibration algorithm. All design principles and simulation code are released for reproducibility, setting new guidelines for hardware-software co-design of efficient Transformer inference.