- The paper introduces IsoQuant, a novel method that employs quaternion-based SO(4) rotations to replace 3D Clifford blocks for improved hardware alignment in KV cache compression.
- The paper demonstrates significant efficiency gains, reducing fused multiply-add (FMA) operations by up to 50% and achieving kernel-level speedups exceeding 4.5× compared to previous methods.
- The paper maintains quantization fidelity with near-identical MSE reconstruction errors while enabling efficient SIMD packing and seamless integration into two-stage quantization pipelines.
IsoQuant: Hardware-Aligned SO(4) Isoclinic Rotations for LLM KV Cache Compression
Motivation and Problem Setting
The key-value (KV) cache is a critical memory bottleneck when serving LLMs at extended context lengths, making cache compression essential. State-of-the-art online vector quantization pipelines apply an orthogonal decorrelating transform to features before scalar quantization to improve rate-distortion behavior. Conventional methods such as TurboQuant employ dense orthogonal matrices, which incur O(d²) storage and compute, severely limiting their use in latency-sensitive inference.
RotorQuant alleviates this cost by decomposing the rotation into blockwise 3D Clifford rotors, yielding linear scaling in d. However, 3D block structures are fundamentally misaligned with typical hardware, since transformer head widths are powers of two: the mismatch forces irregular memory access patterns, inefficient kernel execution, and limited local subspace mixing, with only three rotational degrees of freedom per block.
Quaternionic SO(4) Blockwise Rotations
IsoQuant addresses these system and expressivity limitations by introducing a blockwise SO(4) rotational decorrelation based on quaternion algebra and the isoclinic decomposition of the Lie group SO(4). Each block of four contiguous features is mapped to a quaternion in H, enabling compact, closed-form rotational transforms. The core mechanism is a "sandwich product" T(v) = qL v qR, parameterized by a pair of unit quaternions (qL, qR) per block. This encapsulates the full six degrees of freedom of a four-dimensional rotation, as guaranteed by the isoclinic decomposition so(4) ≅ su(2)L ⊕ su(2)R.
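The sandwich product above can be written as an explicit 4×4 orthogonal matrix, since left and right Hamilton multiplication are each linear maps on the four block coordinates. A minimal NumPy sketch (function names are illustrative, not from the paper):

```python
import numpy as np

def quat_left(q):
    """4x4 matrix of left Hamilton multiplication by q = (w, x, y, z)."""
    w, x, y, z = q
    return np.array([[w, -x, -y, -z],
                     [x,  w, -z,  y],
                     [y,  z,  w, -x],
                     [z, -y,  x,  w]])

def quat_right(q):
    """4x4 matrix of right Hamilton multiplication (v -> v * q)."""
    w, x, y, z = q
    return np.array([[w, -x, -y, -z],
                     [x,  w,  z, -y],
                     [y, -z,  w,  x],
                     [z,  y, -x,  w]])

def iso_rotation(qL, qR):
    """SO(4) matrix realizing the sandwich product T(v) = qL * v * qR."""
    return quat_right(qR) @ quat_left(qL)

rng = np.random.default_rng(0)
qL, qR = (q / np.linalg.norm(q) for q in rng.standard_normal((2, 4)))
T = iso_rotation(qL, qR)  # orthogonal with determinant +1
```

For unit quaternions both factor matrices are orthogonal with determinant +1, so their product is a proper 4D rotation, consistent with the su(2)L ⊕ su(2)R decomposition.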
Two principal variants are developed:
- IsoQuant-Full: Utilizes both left and right unit quaternion factors, maximizing mixing capacity per block and preserving full SO(4) expressivity.
- IsoQuant-Fast: Restricts each block to a single left-sided quaternion, reducing parameterization and arithmetic cost at the expense of expressivity (a three-parameter subgroup, su(2)L, rather than the full six-parameter SO(4)).
A lightweight 2D planar case is also described for scenarios prioritizing minimum compute over mixing quality.
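The Full/Fast tradeoff can be sketched directly in terms of Hamilton products, assuming (as the arithmetic counts later in this summary suggest) that each product costs 16 multiply-adds per 4-wide block. The function names below are illustrative:

```python
import numpy as np

def qmul(p, q):
    """Hamilton product of quaternions stored as (w, x, y, z): 16 multiply-adds."""
    pw, px, py, pz = p
    qw, qx, qy, qz = q
    return np.array([pw*qw - px*qx - py*qy - pz*qz,
                     pw*qx + px*qw + py*qz - pz*qy,
                     pw*qy - px*qz + py*qw + pz*qx,
                     pw*qz + px*qy - py*qx + pz*qw])

def rotate_full(v, qL, qR):
    """IsoQuant-Full: both factors, two Hamilton products (~32 FMAs per block)."""
    return qmul(qmul(qL, v), qR)

def rotate_fast(v, qL):
    """IsoQuant-Fast: left factor only, one Hamilton product (~16 FMAs per block)."""
    return qmul(qL, v)

rng = np.random.default_rng(1)
qL, qR = (q / np.linalg.norm(q) for q in rng.standard_normal((2, 4)))
v = rng.standard_normal(4)  # one 4-wide feature block
```

Both variants preserve the block norm, since multiplication by a unit quaternion is an isometry of H.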
Hardware and Systems Alignment
The choice of 4D blocks directly aligns with common transformer head sizes (d∈{128,256,512}), avoiding the irregular tails and padding necessitated by 3D block schemes. Four-wide blocks allow for efficient SIMD packing (e.g., float4), regular memory access, and natural kernel fusion for both FP16 and FP32 representations, which leads to highly efficient fused CUDA implementations.
The IsoQuant forward rotation in the 4D full setting requires only 1024 fused multiply-adds (FMAs) for d=128, i.e. 32 FMAs per 4-wide block, a substantial reduction relative to RotorQuant's per-block rotor arithmetic. The IsoQuant-Fast variant roughly halves this again by dropping the right-hand quaternion factor.
IsoQuant leverages random or learnable blockwise rotations to isotropize feature energy locally. Under Haar-random SO(4) block rotations, coordinate energies are evenly redistributed, and the marginal distribution of each coordinate becomes more concentrated near zero (as opposed to 2D arcsine concentrations near boundaries). This statistical property produces improved quantization performance, as scalar quantizers benefit from low occurrences of extreme values, thereby minimizing block quantization errors.
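The energy-isotropization claim is easy to check numerically. The Monte Carlo sketch below (my own illustration, not the paper's experiment) starts from a maximally anisotropic block with all energy in one coordinate and verifies that Haar-random SO(4) rotations spread the expected energy evenly across the four coordinates:

```python
import numpy as np

rng = np.random.default_rng(0)

def haar_so4(rng):
    """Haar-random rotation in SO(4) via QR of a Gaussian matrix."""
    Q, R = np.linalg.qr(rng.standard_normal((4, 4)))
    Q = Q * np.sign(np.diag(R))   # sign fix gives the Haar measure on O(4)
    if np.linalg.det(Q) < 0:      # flip one column to land in SO(4)
        Q[:, 0] = -Q[:, 0]
    return Q

# Worst-case anisotropy: all block energy sits in one coordinate.
v = np.array([2.0, 0.0, 0.0, 0.0])
trials = 20000
energy = np.zeros(4)
for _ in range(trials):
    energy += (haar_so4(rng) @ v) ** 2
energy /= trials   # expected per-coordinate energy: |v|^2 / 4 = 1.0
```

Each averaged coordinate energy converges to |v|²/4, i.e. the rotation redistributes energy uniformly, which is exactly the property that keeps extreme values rare at the scalar quantizer's input.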
Empirical Assessment
Experiments were carried out on synthetic normalized vectors across 18 configurations spanning the head widths d∈{128,256,512}, multiple bit-widths, and FP16/FP32 precisions, at a fixed batch size, using fused CUDA kernels.
IsoQuant-Full, IsoQuant-Fast, and the 2D variant all achieve consistent kernel-level speedups over RotorQuant while maintaining near-identical MSE reconstruction errors across all settings; maximum observed speedups exceed 4.5× in low-bit, medium-width regimes. No quantization-quality tradeoff was observed, and in some settings IsoQuant produced marginally lower MSE.
Module-level throughput gains were even higher, but since they partly reflect the efficiency of the prototype implementation, they are interpreted as implementation-dependent.
Methodological and Practical Implications
IsoQuant’s design demonstrates strong practical suitability for real-world LLM pipelines due to:
- Arithmetic efficiency and clean vectorization: All core operations map directly onto hardware primitives with minimal boundary handling.
- Expressivity and rate-distortion tradeoff: Full SO(4) mixing per block yields decorrelation performance commensurate with dense orthogonal transforms, but at a fraction of the computational cost.
- Compatibility with residual correction: IsoQuant seamlessly substitutes the stage-1 decorrelation transform in two-stage quantization systems (e.g., TurboQuant, RotorQuant), maintaining compatibility with QJL-style residual estimators.
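A schematic of the stage-1 substitution described above, using generic random block rotations in place of IsoQuant's quaternion parameterization and a plain uniform scalar quantizer (the stage-2 residual correction is omitted; all names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def random_block_rotation(rng):
    """Haar-random 4x4 rotation standing in for one IsoQuant block."""
    Q, R = np.linalg.qr(rng.standard_normal((4, 4)))
    return Q * np.sign(np.diag(R))

def quantize(x, bits):
    """Symmetric uniform scalar quantizer with one scale per vector."""
    scale = np.abs(x).max() / (2 ** (bits - 1) - 1)
    return np.round(x / scale).astype(np.int8), scale

d = 128
x = rng.standard_normal(d)
blocks = [random_block_rotation(rng) for _ in range(d // 4)]

# Stage 1: blockwise orthogonal decorrelation (IsoQuant's role in the pipeline).
rot = np.concatenate([B @ x[4*i:4*i + 4] for i, B in enumerate(blocks)])
codes, scale = quantize(rot, bits=4)

# Decode: dequantize, then undo the rotations with their transposes.
recon = np.concatenate([B.T @ (codes[4*i:4*i + 4] * scale)
                        for i, B in enumerate(blocks)])
mse = np.mean((x - recon) ** 2)
```

Because the block rotations are orthogonal, the transform is exactly invertible and the end-to-end error is just the scalar quantization error, which is what lets any stage-2 residual estimator be layered on top unchanged.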
From a theoretical perspective, the quaternion-pair parameterization covers the SO(4) rotation group with minimal redundancy (unique up to a simultaneous sign flip of the pair), opening the door to smoothly interpolated or learned adaptive block rotations.
Limitations and Future Directions
IsoQuant, while substantially improving blockwise decorrelation, leaves global cross-block correlations unaddressed. The empirical evaluation focuses on synthetic data and does not include end-to-end LLM KV-cache validation or downstream metrics. Exploring hierarchical mixing, learned parameterizations, and coupling with stage-2 residual correction within actual model deployments remains necessary future work.
Key research questions include:
- The impact of learned vs. random block rotations in production workloads
- Strategies for global cross-block mixing without undermining hardware alignment
- Integration of IsoQuant in hierarchical or recurrent quantization architectures
Conclusion
IsoQuant presents a mathematically rigorous and hardware-aligned mechanism for blockwise orthogonal feature decorrelation, replacing 3D Clifford-based rotations with SO(4) quaternion constructions tailored to transformer architectures. Empirical evidence demonstrates substantial gains in computational efficiency, with kernel-level speedups exceeding 4.5× over RotorQuant, without sacrificing quantization fidelity. The design is directly compatible with two-stage quantization pipelines and with future extensions that incorporate residual correction and more expressive mixing. IsoQuant thus offers a compelling new primitive for scalable, high-throughput LLM KV-cache compression with strong theoretical and system-level motivation.