Clustering-Based Memory Compression
- Clustering-based memory compression is a method that groups similar high-dimensional objects to optimize storage and maintain reconstruction fidelity.
- The strategy involves preprocessing data, applying k-means or its variants, and replacing parameters with averaged, centroid-derived, or low-rank representations.
- Empirical benchmarks demonstrate that this approach effectively balances memory efficiency and task performance across diverse domains like LLMs, embedding tables, and databases.
A clustering-based memory compression strategy groups high-dimensional objects (e.g., neural memories, weight vectors, tokens, binary patterns, matrix blocks) by similarity and merges or replaces them in a manner that dramatically reduces memory footprint while controlling reconstruction fidelity. This approach leverages unsupervised clustering—typically k-means or its differentiable variants—to produce shared centroids, averaged representations, or low-rank factorizations. Clustering-driven compression is now a core paradigm for model weights, user memories in LLMs, embedding tables, binary neural kernels, matrices for SVD, and even Kolmogorov-Arnold Networks (KANs). It provides a systematic basis for balancing efficiency and task performance across diverse machine learning domains.
1. Clustering Principles and Formal Objectives
The foundational principle is to replace individual parameters or data objects (which are often highly redundant or semantically similar) with grouped representations that minimize within-cluster distortion under some metric. The general k-means clustering objective for $N$ objects $\{x_i\}_{i=1}^{N}$ is

$$\min_{\{c_k\},\,\pi} \sum_{i=1}^{N} \left\| x_i - c_{\pi(i)} \right\|^2,$$

where $c_1, \dots, c_K$ are centroid vectors, $\pi(i) \in \{1, \dots, K\}$ assigns each object to a cluster, and $K$ is the number of clusters. The practical goal is to choose $K$ and the merge/compression operation within each cluster to optimize storage, computational cost, and reconstruction or generation quality (Bohdal et al., 24 Jan 2026, Tsang et al., 2022, Cho et al., 2021).
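The objective can be made concrete with a minimal pure-Python sketch of Lloyd's k-means that returns the within-cluster distortion. This is illustrative only; the cited works use optimized, mini-batch, or differentiable variants.

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Minimal Lloyd's k-means. Returns (centroids, assignment, objective),
    where objective is the within-cluster squared distortion."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    assign = [0] * len(points)
    for _ in range(iters):
        # Assignment step: map each object to its nearest centroid.
        for i, x in enumerate(points):
            assign[i] = min(range(k),
                            key=lambda j: sum((a - b) ** 2
                                              for a, b in zip(x, centroids[j])))
        # Update step: recompute each centroid as its cluster mean.
        for j in range(k):
            members = [points[i] for i in range(len(points)) if assign[i] == j]
            if members:
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    objective = sum(sum((a - b) ** 2 for a, b in zip(x, centroids[assign[i]]))
                    for i, x in enumerate(points))
    return centroids, assign, objective
```

On two well-separated point groups, the recovered assignment separates the groups and the objective approaches zero.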
In specialized settings—such as block-wise cluster encoding in columnar databases (Jayanth, 2016), matrix concatenation under SVD error constraints (Shamrai, 12 Jan 2026), or assignment of binary patterns based on Hamming distance (Silfa et al., 2022)—the clustering objective and structure are adapted, but the unifying mechanism remains group-wise minimization or preservation of salient content.
2. Algorithms and Compression Pipelines
Compression is a multi-stage process, typified by these steps:
- Preprocessing and Representation: Objects (memories, matrices, weights) are encoded or flattened into suitable vectors for clustering. For embedding tables, cluster assignments and codebook centroids are employed (Tsang et al., 2022). Models such as KANs rely on meta-learners that shape coefficients to lie on a low-dimensional manifold for better clusterability (Raffel et al., 21 Oct 2025).
- Clustering: Standard, mini-batch, or differentiable k-means is performed over object vectors. In fast clustering for spatial images, a linear-time agglomeration exploits grid adjacency and local nearest neighbor graphs for scalable partitioning (Thirion et al., 2015). Dynamic expert clustering in MoE LLMs uses fused parameter-and-activation similarity metrics (Zhu et al., 27 Sep 2025).
- Within-Cluster Merging or Replacement:
- Averaging: Token-wise or object-wise averaging is calculated for each cluster (as in clustering-based memory blocks for personalized LLM prompts) (Bohdal et al., 24 Jan 2026).
- Codebook Assignment: Each object is replaced by its centroid's entry in a codebook (embedding tables (Tsang et al., 2022), model weights (Liao et al., 17 Mar 2025), KAN coefficients (Raffel et al., 21 Oct 2025)).
- Low-rank/Structured Factorization: Clusters are compressed by shared bases and low-rank residuals (MoE experts (Zhu et al., 27 Sep 2025), concatenated matrix blocks (Shamrai, 12 Jan 2026)).
- Pattern Mapping: In BNNs, rare bit-patterns are mapped to frequent centroids under a bounded distortion (Hamming distance) (Silfa et al., 2022).
- Output and Use: The compressed representations—clustered memory blocks, codebook+indices, fused kernels, low-rank bases—replace or augment the originals for inference, generation, or statistical modeling.
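In the codebook-assignment case, the pipeline above reduces to storing one small index per object plus a shared centroid table, with reconstruction by lookup. A minimal sketch (function names are illustrative, not taken from any cited implementation):

```python
def codebook_compress(vectors, centroids):
    """Codebook assignment: store one small index per vector
    plus the shared centroid table."""
    return [min(range(len(centroids)),
                key=lambda k: sum((a - b) ** 2
                                  for a, b in zip(x, centroids[k])))
            for x in vectors]

def codebook_decompress(indices, centroids):
    """Reconstruction: each vector is replaced by its cluster's centroid."""
    return [centroids[j] for j in indices]
```

For example, two near-origin vectors collapse onto one centroid index while a distant vector maps to another, so the table of full vectors is replaced by indices into a much smaller codebook.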
3. Computational Complexity and Trade-offs
The computational profile is determined by object dimension, the number of clusters, and the merge operation. For k-means, complexity is $O(TNKD)$, where $T$ is the number of iterations, $N$ the number of objects, $K$ the number of clusters, and $D$ the dimension. Fast clustering on spatial grids is linear in $N$ due to adjacency-based nearest-neighbor graphs (Thirion et al., 2015). Block-size-optimized cluster encoding uses a dynamic programming pass over run-length summaries for each candidate block size (Jayanth, 2016).
Merging and codebook assignment typically add minimal overhead. On-device clustering of LLM memories (N=8, D_m=128, D_e=2048, K=4) costs <50 ms per pass (Bohdal et al., 24 Jan 2026). DBMS cluster encoding and codebook-based model weight clustering (ClusComp (Liao et al., 17 Mar 2025)) maintain query and inference throughput at parity versus baselines. Compression-aware matrix clustering enables explicit control of SVD error upper bounds and supports scalable incremental updates (Shamrai, 12 Jan 2026).
Trade-offs are governed by $K$, context or token budgets, and allowable distortion. A larger $K$ yields finer-grained recovery and less information loss but more memory tokens or codebook overhead; a smaller $K$ boosts compression but may cause a performance drop (see Fig. 3 and tables in (Bohdal et al., 24 Jan 2026, Tsang et al., 2022, Liao et al., 17 Mar 2025, Raffel et al., 21 Oct 2025)).
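Under the simplifying assumptions of float32 storage and packed $\lceil \log_2 K \rceil$-bit indices, the $K$ trade-off can be computed directly. The sweep below is a back-of-the-envelope model, not a reproduction of any cited benchmark:

```python
import math

def compression_ratio(n, d, k, float_bytes=4):
    """Footprint of n d-dim float vectors vs. a k-entry codebook
    plus n packed ceil(log2 k)-bit indices."""
    original = n * d * float_bytes
    index_bits = max(1, math.ceil(math.log2(k)))
    compressed = k * d * float_bytes + math.ceil(n * index_bits / 8)
    return original / compressed

# Sweep K: the ratio shrinks as the codebook and indices grow,
# while reconstruction fidelity improves.
for k in (4, 16, 64, 256):
    print(k, round(compression_ratio(n=100_000, d=128, k=k), 1))
```

The monotone decline of the ratio in $K$ is what produces the "rapid gains then diminishing returns" curves discussed in Section 6.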
4. Performance Metrics and Empirical Results
Robust quantitative evaluation is central to clustering-based compression. Metrics include ROUGE-L for LLM generation (Bohdal et al., 24 Jan 2026), logistic regression and ICA accuracy for imaging (Thirion et al., 2015), perplexity and task accuracy for LLMs (Cho et al., 2021, Liao et al., 17 Mar 2025, Zhu et al., 27 Sep 2025), memory reduction ratios, and click-through accuracy for recommendation systems (Tsang et al., 2022).
| Strategy | Compression Ratio | Quality Δ | Use Case |
|---|---|---|---|
| Token-clustered memory | 2× | +0.19 ROUGE-L | On-device LLMs (Bohdal et al., 24 Jan 2026) |
| Fast spatial clustering | 10–20× | +2–5% accuracy | Brain images (Thirion et al., 2015) |
| BSO cluster encoding | 1.15–1.25× | None | DB columns (Jayanth, 2016) |
| CCE for embeddings | 16–64× | <1% accuracy | Recsys (Tsang et al., 2022) |
| ClusComp blocks | 4–34× | 0–2% (PPL, acc) | LLM weights (Liao et al., 17 Mar 2025) |
| MoE expert clusters | 5–8× | <2% GLUE/Wiki103 | Sparse LLMs (Zhu et al., 27 Sep 2025) |
| MetaCluster for KANs | 32–80× | <1% accuracy | KANs (Raffel et al., 21 Oct 2025) |
| eDKM for train-time | 130× (mem) | <2% accuracy | LLM fine-tune (Cho et al., 2023) |
| BNN kernel clustering | 1.32× | <0.1% accuracy | BNNs/ImageNet (Silfa et al., 2022) |
Clustering-based strategies consistently outperform simple concatenation, mean-pooling, or uniform quantization at matched memory budgets. For example, clustering of on-device LLM memory halves context tokens and improves ROUGE-L vs. naive baselines (Bohdal et al., 24 Jan 2026). CCE achieves substantial memory reduction with <0.5% accuracy loss on huge embedding tables (Tsang et al., 2022). For ultra-large LLMs, ClusComp achieves low-bit compression (down to 1 bit), surpassing quantization-based GPTQ/AWQ (Liao et al., 17 Mar 2025). MetaCluster compresses KAN parameter storage heavily without task loss (Raffel et al., 21 Oct 2025). eDKM enables train-time clustering on 7B LLMs with 130× memory savings (Cho et al., 2023).
5. Domain-Adapted Clustering and Special Cases
Clustering-based memory compression is context-sensitive and often requires domain-aware objective choices and structural adaptations.
- Structured Data: For spatial images, clustering leverages lattice neighborhood, ensuring clusters reflect anatomical structure and enables linear-time partitioning (Thirion et al., 2015).
- Database Columns: Block-size-optimized (BSO) cluster encoding determines block size that maximizes compressible all-equal blocks, employing run-length dynamic programming analytics (Jayanth, 2016).
- Matrix Collections: Joint clustering leads to compression-aware SVD grouping across matrix blocks subject to rigorous error certificates (Shamrai, 12 Jan 2026).
- BNN Kernels: Constrained nearest-centroid mapping in binary/Hamming space enables aggressive pattern clustering and Huffman compression in hardware (Silfa et al., 2022).
- Expert Networks: Dynamic expert regrouping in MoE models uses combined parameter and activation similarity, with intra-cluster low-rank adapters for hierarchical routing and quantized storage (Zhu et al., 27 Sep 2025).
- Neural Architectures: MetaCluster leverages a meta-learner to project Kolmogorov-Arnold coefficients onto a clusterable manifold, followed by k-means and codebook replacement (Raffel et al., 21 Oct 2025).
These adaptations ensure clustering exploits latent structure, preserves crucial signal, and achieves coherence across highly heterogeneous domains.
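The BNN kernel case can be sketched as a bounded-distortion nearest-centroid lookup in Hamming space: rare bit-patterns are remapped to frequent centroids only when the distortion bound permits. The patterns, centroids, and distance bound below are illustrative:

```python
def hamming(a, b):
    """Hamming distance between two bit-patterns stored as ints."""
    return bin(a ^ b).count("1")

def map_patterns(patterns, centroids, max_dist):
    """Map each pattern to its nearest frequent centroid if within
    max_dist bit flips; otherwise keep the pattern verbatim."""
    mapped = []
    for p in patterns:
        best = min(centroids, key=lambda c: hamming(p, c))
        mapped.append(best if hamming(p, best) <= max_dist else p)
    return mapped
```

A pattern one bit away from a centroid is absorbed into it, while a pattern far from every centroid survives unchanged, so reconstruction distortion stays bounded by `max_dist`.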
6. Guidelines, Limitations, and Practical Deployment
Selecting $K$, token/bit budgets, and context-specific compression targets is critical. Trade-off curves generally show rapid gains for $K$ up to about $8$ (for token blocks, embedding tables, weight clusters), after which diminishing returns set in. Clustering costs are marginal for inference but can add 5–10% latency during compression passes; efficient pipeline design is vital for edge and on-device deployment (Bohdal et al., 24 Jan 2026, Tsang et al., 2022).
Specialized clustering implementations—differentiable layers (Cho et al., 2021), memory marshaling/sharding (Cho et al., 2023), fast graph-based grouping (Thirion et al., 2015), incremental matrix SVD tracking (Shamrai, 12 Jan 2026), hardware acceleration for decoding (Silfa et al., 2022)—address scalability and overhead concerns.
Common limitations:
- Fixed code assignments post-clustering restrict dynamic adaptation (Liao et al., 17 Mar 2025).
- Lookup indices and codebooks introduce indirection and require careful storage layout.
- Meta-learner for geometry shaping (KANs) adds training hyperparameter complexity (Raffel et al., 21 Oct 2025).
- Some approaches demand explicit error thresholding, which may require validation tuning (Shamrai, 12 Jan 2026).
A plausible implication is that further research into adaptive, domain-general clustering with error and cost guarantees could extend these gains, especially in online learning and streaming contexts.
7. Reference Implementations and Empirical Benchmarks
Reference implementations span popular packages (SciPy/Scikit-learn for graph clustering (Thirion et al., 2015), PyTorch hooks for memory-efficient DKM (Cho et al., 2023), hardware microkernel enhancements for BNNs (Silfa et al., 2022)) and open-source releases for LLM clustering pipelines and codebook methods (Liao et al., 17 Mar 2025, Bohdal et al., 24 Jan 2026).
Empirical benchmarks demonstrate end-to-end speedups (e.g., for ICA on large-scale datasets (Thirion et al., 2015)), substantial parameter reductions (MetaCluster (Raffel et al., 21 Oct 2025)), and real-world deployment viability on edge/mobile hardware. Compression strategies are tailored to and validated on datasets including LaMP (personalized memory tasks), OASIS and HCP fMRI (medical imaging), WikiText-103 and GLUE (LLMs), C4/MMLU (zero-shot reasoning), and ImageNet (BNNs).
In summary, clustering-based memory compression is a unifying technical framework for reducing model and data memory costs, with empirical superiority and strong theoretical support across domains ranging from structured images to transformers and database systems.