IsoQuant: Hardware-Aligned SO(4) Isoclinic Rotations for LLM KV Cache Compression

Published 30 Mar 2026 in cs.LG and cs.CL | (2603.28430v1)

Abstract: Orthogonal feature decorrelation is effective for low-bit online vector quantization, but dense random orthogonal transforms incur prohibitive $O(d^2)$ storage and compute. RotorQuant reduces this cost with blockwise $3$D Clifford rotors, yet the resulting $3$D partition is poorly aligned with modern hardware and offers limited local mixing. We propose \textbf{IsoQuant}, a blockwise rotation framework based on quaternion algebra and the isoclinic decomposition of $SO(4)$. It represents each $4$D block as a quaternion and applies a closed-form transform $T(v)=q_L v \overline{q_R}$. This yields two main variants: \emph{IsoQuant-Full}, which realizes the full $SO(4)$ rotation, and \emph{IsoQuant-Fast}, which keeps only one isoclinic factor for lower cost; the framework also admits a lightweight $2$D special case. At $d=128$, IsoQuant-Full reduces forward rotation cost from about $2{,}408$ FMAs in RotorQuant to $1{,}024$, while IsoQuant-Fast further reduces it to $512$. Across $18$ fused CUDA settings with $d \in \{128,256,512\}$, bit widths $\{2,3,4\}$, and FP16/FP32 execution, IsoQuant achieves mean kernel-level speedups of about $4.5\times$--$4.7\times$ over RotorQuant while maintaining comparable reconstruction MSE, with peak speedups above $6\times$. Current validation is limited to the stage-1 quantize--dequantize path on synthetic normalized vectors; end-to-end KV-cache evaluation remains future work.

Authors (1)

Summary

  • The paper introduces IsoQuant, a novel method that employs quaternion-based SO(4) rotations to replace 3D Clifford blocks for improved hardware alignment in KV cache compression.
  • The paper demonstrates significant efficiency gains, reducing forward-rotation FMAs at $d=128$ from about 2,408 (RotorQuant) to 1,024 (IsoQuant-Full) or 512 (IsoQuant-Fast), and achieving mean kernel-level speedups of about $4.5\times$ to $4.7\times$ over previous methods.
  • The paper maintains quantization fidelity with near-identical MSE reconstruction errors while enabling efficient SIMD packing and seamless integration into two-stage quantization pipelines.

IsoQuant: Hardware-Aligned SO(4) Isoclinic Rotations for LLM KV Cache Compression

Motivation and Problem Setting

The key-value (KV) cache is a critical memory bottleneck in the deployment and inference of LLMs with extended context lengths, making cache compression essential. State-of-the-art online vector quantization pipelines leverage orthogonal feature decorrelation, transforming inputs prior to scalar quantization to improve rate-distortion behavior. Conventional methods such as TurboQuant employ dense orthogonal matrices, which incur $O(d^2)$ storage and computational complexity, severely limiting their use in latency-sensitive scenarios.

RotorQuant alleviates this cost by decomposing the rotation into blockwise 3D Clifford rotors, yielding linear scaling in $d$. However, 3D block structures are fundamentally misaligned with typical hardware, since transformer head widths are powers of two (e.g., $128 = 3 \cdot 42 + 2$, leaving an irregular two-element tail). This misalignment leads to irregular memory access patterns, inefficient kernel execution, and limited local subspace mixing, with only three degrees of rotational freedom per block.

Quaternionic SO(4) Blockwise Rotations

IsoQuant addresses these system and expressivity limitations by introducing a blockwise $SO(4)$ rotational decorrelation based on quaternion algebra and the isoclinic decomposition of the Lie group $SO(4)$. Each block of four contiguous features is mapped to a quaternion in $\mathbb{H}$, enabling compact, closed-form rotational transforms. The core mechanism is a "sandwich product" $T(v) = q_L v \overline{q_R}$, parameterized by a pair of unit quaternions $(q_L, q_R)$ per block. This encapsulates the full six degrees of freedom of a four-dimensional rotation, as guaranteed by the isoclinic decomposition $\mathfrak{so}(4) \cong \mathfrak{su}(2)_L \oplus \mathfrak{su}(2)_R$.
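The sandwich product can be sketched directly in a few lines of numpy. This is an illustrative reimplementation from the formula above, not the paper's code; the function names are my own. For unit $q_L, q_R$, the map is norm-preserving, since $|q_L v \overline{q_R}| = |q_L|\,|v|\,|q_R| = |v|$.

```python
import numpy as np

def quat_mul(a, b):
    # Hamilton product of quaternions given as (w, x, y, z)
    w1, x1, y1, z1 = a
    w2, x2, y2, z2 = b
    return np.array([
        w1*w2 - x1*x2 - y1*y2 - z1*z2,
        w1*x2 + x1*w2 + y1*z2 - z1*y2,
        w1*y2 - x1*z2 + y1*w2 + z1*x2,
        w1*z2 + x1*y2 - y1*x2 + z1*w2,
    ])

def quat_conj(q):
    # Quaternion conjugate: negate the vector part
    return q * np.array([1.0, -1.0, -1.0, -1.0])

def iso_rotate(v, qL, qR):
    # Sandwich product T(v) = qL * v * conj(qR);
    # v is a 4D feature block viewed as a quaternion
    return quat_mul(quat_mul(qL, v), quat_conj(qR))

rng = np.random.default_rng(0)
qL = rng.normal(size=4); qL /= np.linalg.norm(qL)
qR = rng.normal(size=4); qR /= np.linalg.norm(qR)
v = rng.normal(size=4)
out = iso_rotate(v, qL, qR)
# The transform is orthogonal: the block's norm is preserved
assert np.isclose(np.linalg.norm(out), np.linalg.norm(v))
```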

Two principal variants are developed:

  • IsoQuant-Full: Utilizes both left and right unit quaternion factors, maximizing mixing capacity per block and preserving full SO(4) expressivity.
  • IsoQuant-Fast: Restricts each block to a single left-sided quaternion, reducing parameterization and arithmetic cost at the expense of expressivity (a single left-isoclinic factor rather than the full SO(4)).

A lightweight 2D planar case is also described for scenarios prioritizing minimum compute over mixing quality.
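The Fast variant admits a particularly hardware-friendly realization: left multiplication by a fixed quaternion is a single dense $4\times 4$ matvec. The sketch below (my own illustrative construction, assuming the standard matrix form of the Hamilton product) shows that for a unit quaternion this matrix is orthogonal, i.e., one 4D rotation in 16 FMAs.

```python
import numpy as np

def left_matrix(q):
    # 4x4 matrix realizing left quaternion multiplication v -> q * v,
    # i.e., the single-factor (Fast-style) transform as one matvec
    w, x, y, z = q
    return np.array([
        [w, -x, -y, -z],
        [x,  w, -z,  y],
        [y,  z,  w, -x],
        [z, -y,  x,  w],
    ])

rng = np.random.default_rng(1)
q = rng.normal(size=4)
q /= np.linalg.norm(q)          # unit quaternion
M = left_matrix(q)
# For unit q, M is orthogonal: M^T M = I
assert np.allclose(M.T @ M, np.eye(4), atol=1e-12)
```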

Hardware and Systems Alignment

The choice of 4D blocks directly aligns with common transformer head sizes (d{128,256,512}d\in\{128,256,512\}), avoiding the irregular tails and padding necessitated by 3D block schemes. Four-wide blocks allow for efficient SIMD packing (e.g., float4), regular memory access, and natural kernel fusion for both FP16 and FP32 representations, which leads to highly efficient fused CUDA implementations.

The IsoQuant forward rotation in the 4D full setting requires only $1{,}024$ fused multiply-adds (FMAs) for $d=128$, a substantial reduction compared to RotorQuant's roughly $2{,}408$ FMAs. The IsoQuant-Fast variant further reduces this to $512$ FMAs.
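One plausible accounting that reproduces these figures (an assumption on my part, since the paper's exact operation count is not broken down here) is 16 FMAs per Hamilton product, two products per block for the full sandwich and one for the Fast variant:

```python
# Hypothetical FMA accounting consistent with the reported numbers;
# assumes one Hamilton product costs 16 FMAs per 4D block.
d = 128
blocks = d // 4                 # 32 blocks of width 4
full_fmas = blocks * 2 * 16     # left and right products: 1024
fast_fmas = blocks * 1 * 16     # single left product: 512
print(full_fmas, fast_fmas)
```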

Probabilistic and Information-Theoretic Intuition

IsoQuant leverages random or learnable blockwise rotations to isotropize feature energy locally. Under Haar-random SO(4) block rotations, coordinate energies are evenly redistributed, and the marginal distribution of each coordinate becomes more concentrated near zero (as opposed to 2D arcsine concentrations near boundaries). This statistical property produces improved quantization performance, as scalar quantizers benefit from low occurrences of extreme values, thereby minimizing block quantization errors.
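This isotropization effect is easy to reproduce empirically. The sketch below is a generic stand-in, sampling Haar-random 4D rotations via QR decomposition rather than the paper's quaternion pairs; the variable names are illustrative. Blocks with highly skewed per-coordinate energy come out with nearly uniform coordinate variances.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000
# Anisotropic 4D blocks: coordinate 0 carries most of the energy
x = rng.normal(size=(n, 4)) * np.array([3.0, 1.0, 0.5, 0.1])

y = np.empty_like(x)
for i in range(n):
    # Haar-random 4x4 orthogonal matrix (QR of a Gaussian, sign-corrected)
    Q, R = np.linalg.qr(rng.normal(size=(4, 4)))
    y[i] = (Q * np.sign(np.diag(R))) @ x[i]

# Energy is redistributed roughly evenly across the four coordinates
print(x.var(axis=0))   # strongly anisotropic
print(y.var(axis=0))   # close to uniform
```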

Empirical Assessment

Experiments were carried out on synthetic normalized vectors across 18 fused-CUDA configurations spanning head widths $d \in \{128, 256, 512\}$, bit widths $\{2, 3, 4\}$, and FP16/FP32 precision, at a fixed batch size.

IsoQuant-Full, IsoQuant-Fast, and the 2D variant achieve consistent kernel-level speedups over RotorQuant, with mean factors of about $4.5\times$ to $4.7\times$, while maintaining near-identical MSE reconstruction errors across all settings. Maximum observed speedups exceed $6\times$ in low-bit and medium-width regimes. No tradeoff in quantization quality was observed; in some settings IsoQuant produced marginally lower MSE.

Module-level throughput gains were even higher, but these are attributed in part to the efficiency of the prototype implementation and should be interpreted as implementation-dependent.

Methodological and Practical Implications

IsoQuant’s design demonstrates strong practical suitability for real-world LLM pipelines due to:

  • Arithmetic efficiency and clean vectorization: All core operations map directly onto hardware primitives with minimal boundary handling.
  • Expressivity and rate-distortion tradeoff: Full SO(4) mixing per block yields decorrelation performance commensurate with dense orthogonal transforms, but at a fraction of the computational cost.
  • Compatibility with residual correction: IsoQuant seamlessly substitutes the stage-1 decorrelation transform in two-stage quantization systems (e.g., TurboQuant, RotorQuant), maintaining compatibility with QJL-style residual estimators.
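The stage-1 substitution described above can be sketched end to end: rotate blockwise, scalar-quantize, dequantize, and invert the rotation. This is a toy numpy mock-up, not the paper's pipeline; the uniform quantizer and the random orthogonal blocks (standing in for the quaternion pairs) are my own simplifications.

```python
import numpy as np

rng = np.random.default_rng(0)

def quantize(v, bits):
    # Simple uniform scalar quantizer with a per-vector range
    levels = 2 ** bits - 1
    lo, hi = v.min(), v.max()
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = np.round((v - lo) / scale).astype(np.int32)
    return codes, lo, scale

def dequantize(codes, lo, scale):
    return codes * scale + lo

d = 128
x = rng.normal(size=d)
x /= np.linalg.norm(x)                     # normalized input vector

# Blockwise 4D rotations (random orthogonal stand-ins for (qL, qR) pairs)
Rs = []
for _ in range(d // 4):
    Q, R = np.linalg.qr(rng.normal(size=(4, 4)))
    Rs.append(Q * np.sign(np.diag(R)))

rot = np.empty_like(x)
for b, Q in enumerate(Rs):
    rot[4*b:4*b+4] = Q @ x[4*b:4*b+4]      # stage-1 decorrelation

codes, lo, scale = quantize(rot, bits=4)   # low-bit scalar quantization
deq = dequantize(codes, lo, scale)

rec = np.empty_like(x)
for b, Q in enumerate(Rs):
    rec[4*b:4*b+4] = Q.T @ deq[4*b:4*b+4]  # inverse rotation on decode

mse = np.mean((rec - x) ** 2)
print(mse)                                 # small reconstruction error
```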

From a theoretical perspective, the quaternion-pair parameterization covers the full SO(4) rotation group (uniquely up to a simultaneous sign flip of the pair), enabling smoothly interpolated or learned adaptive block rotations.

Limitations and Future Directions

IsoQuant, while substantially improving blockwise decorrelation, leaves global cross-block correlations unaddressed. The empirical evaluation focuses on synthetic data and does not include end-to-end LLM KV-cache validation or downstream metrics. Exploring hierarchical mixing, learned parameterizations, and coupling with stage-2 residual correction within actual model deployments remains necessary future work.

Key research questions include:

  • The impact of learned vs. random block rotations in production workloads
  • Strategies for global cross-block mixing without undermining hardware alignment
  • Integration of IsoQuant in hierarchical or recurrent quantization architectures

Conclusion

IsoQuant presents a mathematically rigorous and hardware-aligned mechanism for blockwise orthogonal feature decorrelation, replacing 3D Clifford-based rotations with $SO(4)$ quaternion constructions tailored to transformer model architectures. Empirical evidence demonstrates substantial gains in computational efficiency, with mean kernel-level speedups of about $4.5\times$ to $4.7\times$ and peaks above $6\times$ over RotorQuant, without sacrificing quantization fidelity. The design is directly compatible with two-stage quantization pipelines and future extensions that incorporate residual correction and more expressive mixing. Consequently, IsoQuant offers a compelling new primitive for scalable, high-throughput LLM KV-cache compression with strong theoretical and system-level motivation.
