PolarQuant: Optimal Gaussian Weight Quantization via Hadamard Rotation for LLM Compression
Abstract: We present PolarQuant, a post-training weight quantization method for LLMs that exploits the distributional structure of neural network weights to achieve near-lossless compression. PolarQuant operates in three stages: (1) block-wise normalization to the unit hypersphere, (2) Walsh-Hadamard rotation to transform coordinates into approximately Gaussian random variables, and (3) quantization with centroids matched to the Gaussian distribution. Our ablation reveals that Hadamard rotation alone accounts for 98% of the quality improvement, reducing Qwen3.5-9B perplexity from 6.90 (absmax Q5) to 6.40 (Δ = +0.03 from FP16), making it practically lossless without any calibration data. Furthermore, PolarQuant functions as an effective preprocessing step for downstream INT4 quantizers: PolarQuant Q5 dequantized and re-quantized by torchao INT4 achieves perplexity 6.56 versus 6.68 for direct absmax INT4, while maintaining 43.1 tok/s throughput at 6.5 GB VRAM. Code and models are publicly available.
Explain it Like I'm 14
Overview in Simple Terms
This paper introduces PolarQuant, a way to shrink LLMs so they fit on smaller computers without hurting their performance much. It focuses on compressing the model’s weights (the numbers the model learns) after training, so the model runs fast and uses less memory but still gives good answers.
What Questions Did the Paper Try to Answer?
In plain language, the paper asks:
- Can we shrink a big LLM’s weights to 4–5 bits per number (very small) while keeping its quality almost the same as the original?
- Is there a simple, fast method to do this that doesn’t need extra training data?
- Can this method also make other popular 4-bit methods work better?
How Did They Do It? (Methods Explained)
To understand the method, imagine you have a long list of numbers (the model’s weights) you want to store using fewer digits:
- Split into chunks:
- They chop the long list of weights into small blocks (like groups of 128 numbers).
- Think of each block as a bundle of values that will be handled together.
- Make all blocks “the same size”:
- Each block is scaled so its overall size (length) becomes 1. This is like stretching or shrinking each block so they’re all comparable.
- Mix the numbers evenly with a Hadamard rotation:
- A “Hadamard rotation” is a special, super-fast way to mix the numbers in a block so that the energy (or information) is spread out more evenly.
- Analogy: Imagine you have a few huge values and many tiny ones. After this mixing, the big values get spread out, so there are fewer “outliers.” The numbers start to follow a bell curve (a “Gaussian” or normal distribution), which is easier to compress well.
- Round smartly using a Gaussian-aware codebook:
- “Quantization” means rounding numbers to a small set of allowed values.
- Because the mixed numbers now follow a bell-curve shape, the best places to put these allowed values are known. The paper uses Lloyd–Max centroids, which is just a fancy way of saying “place the rounding points where they reduce error the most for a bell-curve.”
- Then they store the small integer codes plus one short number per block (the block’s original size) so they can reverse the process later.
Reversing the process (to run the model):
- Look up the stored code to get a value, undo the mixing (the Hadamard is its own inverse), and scale back to the original size. This adds only a few seconds at load time and has no slowdown during use.
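The full round trip described above can be sketched in a few lines of NumPy. This is an illustrative toy, not the paper's implementation: the uniform 32-level table stands in for the real Lloyd–Max codebook, and all function and variable names are ours.

```python
import numpy as np

def hadamard(d):
    """Sylvester construction scaled by 1/sqrt(d): orthonormal and self-inverse."""
    H = np.array([[1.0]])
    while H.shape[0] < d:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(d)

def quantize_block(w, centroids):
    """Normalize the block, rotate it, and code each coordinate by its nearest centroid."""
    d = len(w)
    norm = np.linalg.norm(w)                     # the one scalar stored per block
    z = np.sqrt(d) * (hadamard(d) @ (w / norm))  # coordinates now look roughly N(0, 1)
    codes = np.abs(z[:, None] - centroids[None, :]).argmin(axis=1)
    return codes, norm

def dequantize_block(codes, norm, centroids, d):
    """Exact inverse: look up centroids, undo the rotation (H is its own inverse), rescale."""
    z_hat = centroids[codes]
    return norm * (hadamard(d) @ (z_hat / np.sqrt(d)))

rng = np.random.default_rng(0)
w = rng.standard_normal(128)               # a toy 128-element weight block
centroids = np.linspace(-3.0, 3.0, 32)     # placeholder 5-bit codebook
codes, norm = quantize_block(w, centroids)
w_hat = dequantize_block(codes, norm, centroids, len(w))
```

Even with this crude uniform codebook the round trip is already close; swapping in Gaussian-optimal centroids tightens it further.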
Key ideas in everyday terms:
- Quantization = rounding to a small set of levels.
- Hadamard rotation = super-fast “blender” that spreads out big spikes.
- Gaussian/bell curve = the most common, well-understood distribution, which lets us choose the smartest rounding levels.
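The Lloyd–Max "place the rounding points smartly" step can be sketched as a short alternating loop. This is a sample-based toy (the paper works with the exact bell-curve density, and the function name here is ours):

```python
import numpy as np

def lloyd_max(samples, n_levels, iters=50):
    """Alternate the two optimality conditions: region boundaries at centroid
    midpoints, centroids at the mean of their region (estimated from samples)."""
    c = np.quantile(samples, (np.arange(n_levels) + 0.5) / n_levels)  # warm start
    for _ in range(iters):
        edges = (c[:-1] + c[1:]) / 2            # Voronoi boundaries
        bins = np.digitize(samples, edges)      # assign each sample to a region
        c = np.array([samples[bins == k].mean() for k in range(n_levels)])
    return c

rng = np.random.default_rng(1)
x = rng.standard_normal(100_000)  # stands in for rotated, bell-curved weights
centroids = lloyd_max(x, 8)       # an 8-level (3-bit) Gaussian-matched codebook
```

With 8 levels the fitted centroids land near the classic Gaussian Lloyd–Max values (roughly ±0.25, ±0.76, ±1.34, ±2.15), which cluster levels where the bell curve has the most mass.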
Main Results and Why They Matter
Here are the highlights, explained:
- Near-lossless quality at 5 bits (Q5):
- On a 9B-parameter model (Qwen3.5-9B), PolarQuant’s 5-bit version achieves a perplexity of about 6.39, which is only +0.02 worse than full precision (6.37). In practice, that’s almost no quality loss.
- Perplexity is a standard score for LLMs; lower is better. A tiny increase means the model is still very accurate.
- The rotation is the main hero:
- The Hadamard rotation (the “mixing” step) does about 98% of the work. Just doing the rotation and then normal rounding already gets very close to full precision.
- The fancy optimal rounding (Lloyd–Max) helps a little more, but not much at 5 bits because there are already enough rounding levels.
- Works as a booster for popular 4-bit methods:
- If you first apply PolarQuant Q5, then dequantize and re-quantize to standard INT4 (a common 4-bit format), you get better quality than going directly to INT4.
- Example: PolarQuant Q5 then INT4 gives perplexity 6.56 vs 6.68 for direct INT4, with the same speed and memory.
- Speed stays high: about 43 tokens/second on a powerful NVIDIA GPU at around 6.5 GB VRAM.
- Runs on laptops too:
- On a Mac mini (Apple Silicon), a 9B model can run at about 20 tokens/second using around 4.8 GB memory in a 4-bit setup. That’s impressive for small devices.
Why this matters:
- It lets big models fit on consumer GPUs and even some laptops, with little to no loss in quality.
- It’s simple, fast, and doesn’t need extra training data.
Why It Works (Intuition)
- Before: Model weights often have outliers (a few very large values). Simple “absmax” quantization wastes precision on these rare extremes.
- After the Hadamard rotation: The big values get spread out. The numbers look like a bell curve with fewer outliers. That shape is perfect for efficient quantization.
- Because the numbers are now well-behaved, even simple 4–5 bit rounding does a great job.
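A tiny experiment makes the "blender" intuition concrete: a block with one huge spike becomes a set of modest, bell-curve-like values after the rotation. This is illustrative only; `hadamard` here is the standard Sylvester construction, not code from the paper.

```python
import numpy as np

def hadamard(d):
    """Orthonormal Walsh-Hadamard matrix for d a power of 2."""
    H = np.array([[1.0]])
    while H.shape[0] < d:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(d)

rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.1, 128)   # mostly tiny weights...
w[7] = 10.0                     # ...plus one extreme outlier
z = hadamard(128) @ w           # rotation: same total energy, redistributed

print(np.abs(w).max())          # 10.0 -- the block is dominated by the spike
print(np.abs(z).max())          # far below 10: the spike's energy is shared by all 128 coordinates
```

Because the rotation is orthogonal, no information is lost: applying `hadamard(128)` again recovers `w` exactly, which is why dequantization can undo the mixing.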
Implications and Possible Impact
- Easier deployment: Developers can run large models on cheaper hardware without big quality drops.
- Plug-and-play: PolarQuant can be used as a quick preprocessing step and works with other tools like INT4 quantizers.
- Practical benefits: No need for calibration data. Minimal load-time cost, zero runtime slowdown.
Limitations and future ideas:
- The method assumes the rotated numbers look like a bell curve. That’s true in tests here, but might vary across models.
- It currently treats blocks separately; future work might use smarter strategies across blocks or try vector quantization.
- It may also help with activations or caches, not just weights.
In short: PolarQuant shows that a simple, fast “mix-then-round-smartly” approach can shrink big LLMs to fit small devices while keeping their quality almost unchanged.
Knowledge Gaps
Knowledge gaps, limitations, and open questions
Below is a concrete, actionable list of what remains uncertain or unexplored in the paper, to guide future research.
- Generalization across models: Validate PolarQuant on diverse architectures and scales (e.g., Llama/Mistral/GPT-NeoX, 7B–70B+, dense vs. MoE), and across layer types (embeddings, LM head, norms, biases, routing/gating).
- Task coverage beyond perplexity: Assess zero-shot and reasoning benchmarks (e.g., MMLU, GSM8K, BIG-bench, HELM suites) and generation quality (toxicity, factuality, long-context) to confirm that perplexity gains translate to task performance.
- Lower-bit regimes: Systematically evaluate Q2–Q4 settings to quantify when Lloyd–Max centroids matter most, and clarify the performance cliff observed with Q3 preprocessing before INT4 re-quantization.
- Theoretical guarantees at finite block sizes: Provide non-asymptotic bounds on Gaussianity and error for finite d (e.g., d=64/128), rather than relying on asymptotic spherical CLT arguments.
- Distributional fit diagnostics: Report per-layer and per-block goodness-of-fit (e.g., KS/CvM statistics, QQ plots) of rotated coordinates to $\mathcal{N}(0,1)$, and identify layers where Gaussianity breaks down.
- Block size sensitivity: Explore how block size (d=32/64/128/256) affects Gaussian approximation, compression efficacy, norm overhead, and speed; include non-power-of-two cases and padding strategies.
- Handling small/irregular tensors: Specify and evaluate procedures for layers with dimensions < d or shapes not divisible by d (padding, smaller transforms, or fallback paths) and their impact on quality.
- Alternative rotations: Compare Hadamard to other orthogonal transforms (random orthogonal matrices, randomized Hadamard transforms built from a Rademacher diagonal (RHT), DCT, learned rotations in the style of SpinQuant) for quality vs. cost trade-offs.
- Inter-block correlations: Investigate whether modeling correlations across blocks (e.g., vector/product quantization over concatenated blocks or learned multi-block transforms) yields additional gains.
- Direct INT4/4-bit inference without re-quantization: Develop operators/kernels to run directly on PolarQuant’s codebooks (e.g., table lookups with fused WHT) to avoid the dequantize-to-BF16 then re-quantize step.
- Cascaded quantization theory: Formalize when and why an upstream Q5 “denoising” stage improves downstream INT4 (group-wise absmax) and derive conditions for optimal intermediate bit-width in cascades.
- Group-size dependence in downstream INT4: Evaluate how torchao’s group size (e.g., 32/64/128/256) interacts with PolarQuant preprocessing and quantify the joint optimum.
- AWQ interplay and order: Systematically study ordering (AWQ→PolarQuant vs. PolarQuant→AWQ), per-layer scaling choices, calibration set size/selection sensitivity, and why PolarQuant+AWQ sometimes underperforms uniform Q5.
- Layerwise sensitivity and mixed-bit allocation: Provide a principled, data-driven mixed-bit policy post-rotation (including for MoE gates, attention projections, and output heads), beyond the anecdotal mixed-bit findings.
- Robustness to non-Gaussian/heavy-tailed layers: Identify layers where Hadamard rotation does not Gaussianize well (e.g., structured outliers), and test robust or mixture-of-Gaussians codebooks for those cases.
- Lloyd–Max quantizer details: Report full centroid tables and MSE improvements at b=4 and b=5, with reproducible scripts; clarify the generality of the stated 54% MSE reduction (given explicitly only for b=3).
- Normalization ablations: Isolate the effect of block normalization vs. rotation alone; quantify how each step contributes to error reduction across bits and layers.
- Numerical and implementation details: Address dequantization load-time inconsistency (4 s vs. ~8 s) and profile across GPUs/CPUs; provide kernel-level benchmarks for FWHT vs. GEMM-based Hadamard on different hardware.
- Precision and storage trade-offs: Explore storing per-block norms in lower precision (e.g., FP8/INT16) and its effect on accuracy; assess metadata overhead at small block sizes and for small layers.
- Memory/runtime on constrained devices: Quantify end-to-end latency, peak memory, and energy impact on edge hardware beyond a single Apple Silicon device, including mobile NPUs if feasible.
- Long-context and streaming effects: Evaluate stability and error propagation in long-sequence inference and streaming generation, where small weight errors may accumulate differently.
- Compatibility with finetuning/QAT/LoRA: Test whether PolarQuant (and its rotation) supports post-quantization finetuning, adapters (LoRA/LoRA+), or quantization-aware training without degrading gains.
- Safety and calibration drift: Examine whether rotation/quantization alters safety behaviors or calibration (e.g., confidence distributions), particularly on instruction-tuned models.
- KV-cache and activation quantization: Extend and evaluate the method for activations and KV caches within the same framework to quantify whole-pipeline memory/perf benefits.
- Determinism and reproducibility: Provide seeds, exact libraries/versions, and scripts to reproduce all reported results (some NF4 values are marked “~”); ensure fair, apples-to-apples baselines.
- Sensitivity to block boundary placement: Investigate whether different block partitionings (e.g., aligned to channels/rows/columns) affect performance and whether learned/blockwise permutations help.
- Hardware-intrinsics support: Explore integer-only or LUT-friendly implementations of Gaussian centroids on common accelerators (TensorCores/AMX/ANE), and the impact on throughput if centroids are non-uniform.
- Security/robustness edge cases: Test behavior under adversarial or distribution-shift scenarios where weight statistics differ (e.g., after domain-specific finetunes) to see if Gaussianity assumptions persist.
Practical Applications
Immediate Applications
Below are concrete, deployable use cases that leverage PolarQuant’s findings (block-wise normalization, Hadamard rotation, Gaussian-matched quantization) with known performance characteristics (e.g., near-lossless Q5, improved downstream INT4, zero runtime overhead, ~4–8s load-time dequantization).
- Edge/On‑Device LLM Inference on Consumer Hardware (software, education, healthcare, robotics)
- Use case: Run 7–9B LLMs on laptops/desktops with 6–8 GB of free memory (e.g., Mac mini M4 achieves 19.7 tok/s at ~4.8 GB with Q4; consumer NVIDIA GPUs run INT4 at ~6.5 GB).
- Impact: Private/offline assistants, classroom tutors, secure note/summary generation in clinics, and robot command interpretation without cloud reliance.
- Potential tools/workflows:
- A “PolarQuant packager” CLI that converts FP16 checkpoints into PQ‑Q5 or PQ‑Q4 artifacts and exports loaders for PyTorch/torchao and MLX.
- Hugging Face model cards with pre‑quantized PolarQuant weights and one‑click loaders.
- Assumptions/dependencies: Device must support a short load-time dequant (Hadamard + table lookup); results validated on Qwen3.5‑9B; performance may vary on other architectures and lower VRAM devices.
- Cost‑Neutral Quality Gains for Cloud INT4 Serving (software, finance, enterprise IT)
- Use case: Preprocess weights with PolarQuant Q5 and then re‑quantize to INT4 (e.g., torchao) for inference; improves perplexity (e.g., 6.56 vs 6.68) at the same speed and memory.
- Impact: Better response quality within existing INT4 memory/latency budgets; potentially higher model acceptance for customer‑facing or compliance‑critical use.
- Potential tools/workflows:
- CI/CD step that automatically produces “PQ‑Q5‑>INT4” artifacts during model release.
- A serving-side loader that prioritizes PQ‑preprocessed weights when deploying INT4 endpoints.
- Assumptions/dependencies: Benefits were shown with torchao’s group‑wise absmax INT4; other INT4 kernels may yield different gains; downstream INT4 recalibration should be consistent with PQ preprocessing.
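For concreteness, the downstream stage in that cascade is ordinary group-wise absmax INT4, sketched below with group size 128 to mirror the torchao setting described in the paper. Function names and the stand-in weights are ours, not torchao's API.

```python
import numpy as np

def absmax_int4_groupwise(w, group=128):
    """Group-wise absmax INT4: one scale per group, integer codes in [-7, 7]."""
    w = w.reshape(-1, group)
    scale = np.abs(w).max(axis=1, keepdims=True) / 7.0   # absmax sets the range
    codes = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
    return codes, scale

def dequant(codes, scale):
    return (codes * scale).reshape(-1)

# In the cascade, w_pq would be the BF16 weights recovered from a PolarQuant Q5
# artifact; here a Gaussian vector stands in for those "denoised" weights.
rng = np.random.default_rng(0)
w_pq = rng.standard_normal(1024)
codes, scale = absmax_int4_groupwise(w_pq)
w_int4 = dequant(codes, scale)
```

The paper's claim is that feeding PolarQuant-preprocessed weights (rather than raw FP16 weights) into this stage yields lower perplexity at identical speed and memory.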
- Near‑Lossless Compressed Model Distribution (software/MLOps)
- Use case: Ship PQ‑Q5 as a storage/distribution format (5.1–5.2 bpw + tiny overhead) and dequantize to FP16 on load (~4–8s); observed ΔPPL ~+0.02 from FP16 on Qwen3.5‑9B.
- Impact: Smaller artifacts, lower bandwidth costs, faster replication across regions, while preserving FP16 runtime speed and behavior.
- Potential tools/workflows:
- Artifact registries storing PQ‑Q5 weights and auto‑dequantizing on first load (with caching).
- Integration into GGUF/ggml‑style ecosystems to add a “PQ‑Q5” weight format.
- Assumptions/dependencies: Dequantization requires a one‑time load-time transform; minimal storage overhead per 128‑element block (fp16 norm).
- Privacy‑Preserving On‑Prem/On‑Device AI (healthcare, finance, government)
- Use case: Deploy 9B‑scale LLMs locally to keep PHI/PII on device for summarization, triage, or report generation, leveraging high‑quality INT4/PQ‑Q5 inference.
- Impact: Reduces regulatory exposure and network dependency; supports data residency and auditability.
- Potential tools/workflows:
- “Secure quantization pipelines” that produce cryptographically signed PQ artifacts and maintain lineage for audits.
- Assumptions/dependencies: Domain‑specific validation is needed to confirm that minor quantization deltas do not affect clinical/financial decision quality.
- Robotics & Embedded Systems with Tight Memory Budgets (robotics, manufacturing)
- Use case: Onboard LLM components for instruction following, task planning, or dialog on devices with limited VRAM.
- Impact: Fewer offboard calls; lower latency; improved robustness in constrained networks.
- Potential tools/workflows:
- ROS-compatible nodes that load PQ‑compressed weights and expose lightweight NLP services.
- Assumptions/dependencies: Compute budget must tolerate occasional load‑time overhead; task performance should be validated under quantized conditions; VRAM must meet ~5–6 GB thresholds for 9B models (or smaller models for tighter budgets).
- Academic Baselines and Ablations (academia)
- Use case: Reproducible, calibration‑free baseline for weight quantization that isolates the effect of rotation vs. centroid placement (Hadamard explains ~98% of gains at Q5).
- Impact: Cleaner ablation studies; fairer comparisons across quantizers; fast prototyping of cascade quantization ideas.
- Potential tools/workflows:
- Open-source notebooks to replicate paper ablations and extend to other architectures.
- Assumptions/dependencies: Results currently centered on Qwen3.5‑9B; generalization to diverse architectures should be empirically confirmed.
- “Quantization‑as‑a‑Switch” in Serving Frameworks (software/platforms)
- Use case: Toggle PQ preprocessing and INT4 re‑quantization via a config flag in PyTorch AO, vLLM, MLX, or Triton pipelines.
- Impact: Rapid A/B testing of quality/perf trade‑offs; safer rollouts.
- Potential tools/workflows:
- A runtime flag (e.g., --enable-polar-quant) with automatic integration into model loaders.
- Assumptions/dependencies: Requires light engineering effort in serving stacks to include Hadamard transforms and centroid lookups at load.
- Energy/Cost Efficiency Improvements (energy, operations)
- Use case: Replace FP16 with INT4 (improved by PQ preprocessing) where quality thresholds allow; in edge devices, use PQ‑Q4/Q5 for sustained battery and thermal benefits.
- Impact: Lower energy per token; higher consolidation per GPU in inference clusters; cheaper edge deployments.
- Potential tools/workflows:
- Metering dashboards that report energy/token alongside PPL deltas when enabling PQ pipelines.
- Assumptions/dependencies: Energy gains depend on INT4 kernels/hardware; quality‑energy thresholds must be set per application.
- Curriculum and Training Tools (education)
- Use case: On‑device student assistants/tutors that work offline in classrooms with constrained resources.
- Impact: Low‑cost access to AI assistance without cloud dependence.
- Potential tools/workflows:
- School IT deployment kits with pre‑quantized PQ models and simplified installers for Mac/Windows.
- Assumptions/dependencies: Content safety and curriculum alignment must be validated; device specs must meet VRAM and CPU/GPU requirements.
Long‑Term Applications
These require further research, standardization, or scaling beyond the paper’s scope.
- Hardware/Compiler Co‑Design with Fused Hadamard Ops (semiconductors, systems software)
- Vision: Add fast Walsh–Hadamard (and inverse) primitives to inference hardware and compilers, enabling near‑zero‑overhead PQ loading and potentially in‑memory transforms.
- Potential products:
- GPU/TPU kernels and compiler passes that fuse Hadamard with dequantization steps.
- Dependencies: Vendor adoption; benchmarking across diverse workloads; ISA/compiler support.
- Standardized “Gaussian‑Rotated Quantization” Formats (software, model hubs)
- Vision: A community standard (e.g., PQ5) in GGUF/SAFETensors for Gaussian‑matched codebooks + block norms + rotation metadata.
- Potential products:
- Model hub validators and metadata schemas; compatibility layers in vLLM/Transformers/ggml.
- Dependencies: Consensus on metadata and block sizes; broad testing across model families (Mistral/Llama/MoE).
- End‑to‑End Training with Rotation‑Aware or Learned Rotations (academia, software)
- Vision: Integrate rotation (fixed or learned) into training or post‑training fine‑tuning to further stabilize distributions and enable more aggressive low‑bit quantization (e.g., Q3/Q2).
- Potential products:
- Rotation‑aware training recipes; plugins that learn per‑layer rotations (cf. SpinQuant) with PQ‑friendly constraints.
- Dependencies: Training compute and data; validation that benefits persist across tasks; robust graph surgery tools.
- Activation and KV Cache Quantization via the Same Framework (software, systems)
- Vision: Extend polar quantization to activations/KV caches for end‑to‑end memory reduction during long‑context inference.
- Potential products:
- Unified “polar quantizer” for weights + KV that can tune bit‑allocation policies across components.
- Dependencies: Stability under dynamic workloads; latency constraints for online transforms; integration with caching layers.
- Inter‑Block Correlation Exploitation and Vector Quantization (academia)
- Vision: Go beyond independent blocks by modeling inter‑block structure; explore vector quantization with Gaussian codebooks to further reduce distortion at low bits.
- Potential products:
- Research prototypes and libraries with adaptive block sizes and vector codebooks.
- Dependencies: Complexity/benefit trade‑offs; hardware‑friendly implementations.
- Cascaded Quantization Design Patterns and Auto‑Tuning (software/MLOps)
- Vision: Generalize the “PQ‑Q5 -> INT4” principle into a library that auto‑selects intermediate bit widths and quantizers to maximize downstream quality at fixed budgets.
- Potential products:
- Auto‑tuning services that benchmark cascade options and emit deployment‑ready artifacts.
- Dependencies: Benchmark suites and KPIs per domain; standard APIs for pluggable quantizers.
- Multimodal and Cross‑Domain Compression (healthcare imaging, robotics perception, media)
- Vision: Apply rotation + distribution‑matched quantization to vision backbones, audio, and multimodal encoders/decoders.
- Potential products:
- Compressed multimodal foundation models for edge devices (e.g., hospital rooms, robots, vehicles).
- Dependencies: Empirical validation of Gaussianity after rotation in non‑text modalities; task‑level metrics.
- Policy and Sustainability Frameworks (policy, energy)
- Vision: Include quantization‑efficiency disclosures in model cards; guidelines that encourage deployment of compressed models where feasible.
- Potential products:
- “Green AI” scorecards incorporating bpw, energy/token, and quality deltas for audit and procurement.
- Dependencies: Community standards; robust measurement methodologies; stakeholder buy‑in.
- Consumer‑Grade Fully Offline Assistants at Larger Scales (consumer devices)
- Vision: Push 9–13B+ LLMs into laptops/tablets/phones with near‑chat quality via stronger rotation‑aware INT4/INT3 methods.
- Potential products:
- Device OEM integrations with PQ‑enabled on‑device assistants preloaded.
- Dependencies: Further bit‑reductions without quality loss; efficient kernels on mobile/ARM; thermal constraints.
- Security and Model Handling (enterprise IT)
- Vision: Secure transport and at‑rest encryption for PQ‑compressed weights; smaller artifacts reduce attack surface during distribution/updates.
- Potential products:
- Enterprise model vaults with PQ support, provenance tracking, and integrity checks.
- Dependencies: Integration with existing security tooling; standardized PQ metadata; organizational processes.
Cross‑Cutting Assumptions and Dependencies
- Distributional assumption: The effectiveness of Hadamard rotation hinges on weight blocks approximating Gaussian after normalization and rotation; while strongly supported at d=128 in the paper, some architectures may deviate.
- Scope of evidence: Main results are on Qwen3.5‑9B; replication on broader families (Llama, Mistral, MoE variants) is advisable for production rollouts.
- Downstream quantizer dependency: Documented INT4 gains use torchao’s group‑wise absmax; results can vary with different group sizes or kernels.
- Bit‑width selection in cascades: Intermediate Q5 preserves enough signal for downstream INT4; Q3 may be too lossy (paper’s “double quantization paradox”).
- Operational constraints: A one‑time 4–8s dequantization at load; zero runtime overhead. Pipelines must tolerate this during warm‑up.
- Safety and compliance: In regulated domains (healthcare, finance), task‑specific validation is required to confirm quantization does not degrade safety‑critical behavior.
Glossary
- Absmax quantization: A linear quantization scheme that maps values using the absolute maximum as scale, assuming a uniform range and often misallocating levels to tails. "The simplest and most widely deployed quantization scheme is absmax (absolute maximum) quantization~\cite{jacob2018quantization}, which linearly maps values in $[-m, m]$ to integer codes, where $m = \max_i |w_i|$."
- Activation-Aware Weight Quantization (AWQ): A method that protects influential channels by scaling based on activation statistics from calibration data. "AWQ: Activation-Aware Weight Quantization"
- argmin: The argument (value) that minimizes a given function; used to pick the nearest centroid during quantization. "Quantize each element to the nearest Lloyd--Max centroid."
- BF16: A 16-bit floating-point format (bfloat16) used as an intermediate precision for efficiency. "$W \xrightarrow{\text{PolarQuant Q5}} \hat{W}_{\text{PQ}} \xrightarrow{\text{dequant BF16}} \hat{W}_{\text{BF16}} \xrightarrow{\text{torchao INT4}} \hat{W}_{\text{INT4}}$"
- Codebook: The set of quantization levels (centroids) used to represent values; entries can be wasted on rare outliers under mismatched assumptions. "it wastes precious codebook entries on rarely occurring outlier magnitudes"
- Dequantization: The process of reconstructing real-valued weights from quantized codes using stored scales and centroids. "Dequantization is the exact inverse: look up centroids from codes, apply the inverse Hadamard rotation (since $H^{-1} = H$), and scale by the stored norm."
- Fast Walsh--Hadamard transform (FWHT): An O(d log d) algorithm to apply the Hadamard transform efficiently. "25$\times$ faster execution than a naive fast Walsh--Hadamard transform (FWHT) implementation"
- Gaussianity: The property of having a Gaussian (normal) distribution; here, the rotated and normalized coordinates approach normality. "Gaussianity of Rotated Coordinates"
- GGUF: A model file format commonly used in LLM tooling ecosystems. "PolarQuant is compatible with any downstream quantizer (torchao, GGUF, MLX)"
- Graph surgery: Modifying the computation graph to insert or absorb transformations without changing outputs. "requiring graph surgery to absorb rotations into adjacent linear layers"
- Group-wise absmax: Quantization where a shared scale per group (e.g., 128 elements) is set by the group’s maximum absolute value. "torchao INT4~\cite{torchao2024} with group-wise absmax quantization (group size 128)"
- Hadamard rotation: An orthogonal rotation using the Hadamard matrix to spread energy and homogenize values, aiding quantization. "Hadamard rotation alone reduces quantization error by 98\%"
- Hessian: The matrix of second-order partial derivatives; used to approximate curvature for error compensation in quantization. "GPTQ~\cite{frantar2023gptq} performs layer-wise quantization using approximate second-order (Hessian) information"
- i.i.d.: Independent and identically distributed, indicating identical distribution and independence across variables. "well-approximated by i.i.d.\ standard normal random variables"
- Information-theoretic lower bounds: Fundamental limits on performance (e.g., distortion-rate) dictated by information theory. "TurboQuant proves information-theoretic lower bounds and achieves near-optimal distortion rates."
- INT4: A 4-bit integer data type used for compact weight/activation storage and fast inference. "PolarQuant Q5 dequantized and re-quantized by torchao INT4 achieves perplexity 6.56 versus 6.68 for direct absmax INT4"
- Kolmogorov--Smirnov statistic: A nonparametric measure of distributional difference used to assess Gaussian fit. "the Kolmogorov--Smirnov statistic between the empirical distribution of rotated LLM weight coordinates and $\mathcal{N}(0,1)$ is typically below 0.01."
- KV cache: The key-value memory in transformer attention used to speed up autoregressive inference. "normalizing and rotating KV cache vectors via Hadamard transforms"
- Lattice codebooks: Structured sets of quantization levels arranged as lattice points to improve low-bit quantization. "lattice codebooks for 2-bit quantization"
- Lloyd--Max algorithm: An iterative procedure that computes the MSE-optimal scalar quantizer for a given distribution. "The Lloyd--Max algorithm~\cite{lloyd1982least, max1960quantizing} computes the MSE-optimal scalar quantizer for a given source distribution."
- Lloyd--Max centroids: Quantization levels obtained from the Lloyd–Max algorithm that minimize mean squared error for the source. "Lloyd--Max centroids provide only a marginal additional gain at Q5"
- LLM: A neural network trained on extensive text data to perform language tasks. "a post-training weight quantization method for LLMs"
- MLX: Apple’s array framework for machine learning on Apple Silicon. "we evaluate PolarQuant on Apple Silicon using the MLX framework~\cite{mlx2024}"
- MSE (mean squared error): The average squared difference between original and reconstructed values; the principal distortion metric here. "The Lloyd--Max algorithm~\cite{lloyd1982least, max1960quantizing} computes the MSE-optimal scalar quantizer for a given source distribution."
- NormalFloat (NF4): A quantization data type with levels uniformly spaced in the standard normal’s quantile domain. "Dettmers et al.~\cite{dettmers2024qlora} introduced NormalFloat, a data type with quantization levels uniformly spaced in the quantile domain of the standard normal distribution."
- Optimal Brain Surgeon framework: A second-order pruning/compensation method used here to reduce quantization error after column-wise quantization. "compensating the error in remaining columns via the optimal brain surgeon framework"
- Per-channel scaling: Applying individual scale factors to channels to protect sensitive weights during quantization. "AWQ computes per-channel scaling factors from calibration activations"
- Perplexity: A standard language-model evaluation metric; lower is better. "reducing Qwen3.5-9B perplexity from 6.90 (absmax Q5) to 6.40 ($\Delta = +0.03$ from FP16)"
- Post-training quantization: Quantizing a trained model without further training, often using calibration data or distributional assumptions. "a post-training weight quantization method for LLMs"
- QuaRot: A method applying rotations to remove outliers and enable 4-bit inference without quality loss. "QuaRot~\cite{ashkboos2024quarot} demonstrated that applying Hadamard rotations to hidden states, activations, and KV cache removes outliers without changing model output"
- QuIP: A technique that applies incoherence processing to improve quantization bounds via random rotations. "QuIP~\cite{tseng2024quip_original} introduced incoherence processing for weight quantization"
- QuIP#: An extension of QuIP that uses randomized Hadamard transforms and lattice codebooks for enhanced low-bit quantization. "QuIP#~\cite{chee2024quip} extended this with randomized Hadamard transforms (RHT) and lattice codebooks for 2-bit quantization."
- Rademacher: A random variable taking ±1 with equal probability; appears in characterizing coordinate distributions.
- Randomized Hadamard Transform (RHT): A randomized variant of the Hadamard transform used to induce incoherence and improve quantization. "randomized Hadamard transforms (RHT)"
- Self-inverse (matrix): A matrix equal to its own inverse; applying it twice yields the identity. "The Hadamard matrix is deterministic and self-inverse."
- Torchao: PyTorch Architecture Optimization tooling providing quantization backends like INT4. "torchao INT4~\cite{torchao2024}"
- TurboQuant: A method that applies polar quantization online with near-optimal distortion-rate trade-offs. "TurboQuant~\cite{ashkboos2025turboquant} applies the polar quantization framework to KV cache compression during inference"
- Voronoi region: The set of points nearest to a given centroid; partitions the space for scalar quantization. "each centroid $c_i$ equals the conditional expectation $\mathbb{E}[X \mid X \in V_i]$, where $V_i$ is the Voronoi region of $c_i$"
- Walsh--Hadamard matrix: An orthogonal, self-inverse matrix with ±1 entries used for fast, structured rotations. "The Walsh--Hadamard matrix of order $2^k$ is defined recursively: $H_1 = [1]$, $H_{2d} = \begin{pmatrix} H_d & H_d \\ H_d & -H_d \end{pmatrix}$"