
Memory Mosaics: Coordinated Memory Systems

Updated 3 February 2026
  • Memory mosaics are systems composed of distributed memory units that collaboratively store and retrieve data using associative mechanisms.
  • They achieve superior performance in few-shot learning and long-context tasks by employing hierarchical architectures and adaptive kernel scheduling, outperforming traditional transformers.
  • Their design emphasizes interpretability through predictive disentanglement and transparent memory outputs, benefiting both neural sequence models and hardware memory management.

Memory mosaics are a class of systems, models, and mechanisms in which a network (or array) of memory units works jointly to achieve complex storage, retrieval, computation, or inference objectives. The concept has evolved independently in several advanced contexts, including associative-memory-based neural architectures for sequence modeling, schemes for efficient memory management in high-performance hardware, theoretical frameworks for memory effects in physical systems, and empirical characterizations of memorization phenomena in large neural networks. Although the implementation details and system scales vary widely, central to all manifestations of a memory mosaic is the coordination and composition of multiple memories or memory fragments: either as physically distinct memory units collaborating over time or as distributed syntactic/statistical fragments contributing to shared recall or inference.

1. Associative-Memory Networks and Architectural Principles

Memory mosaics, in the context of sequence modeling, refer to architectures composed of parallel and serial networks of associative memories, each storing key–value pairs and performing content-based retrieval via kernel regression mechanisms. Formally, each associative memory maintains a set of stored pairs $(k_i, v_i) \in \mathbb{R}^d \times \mathbb{R}^d$; on input $k$, it returns

f(k;{(ki,vi)})=i=1nwivi,wherewi=exp(βkki)j=1nexp(βkkj)f(k; \{(k_i, v_i)\}) = \sum_{i=1}^n w_i v_i, \quad \text{where} \quad w_i = \frac{\exp(\beta k^\top k_i)}{\sum_{j=1}^n \exp(\beta k^\top k_j)}

with $\beta$ controlling the sharpness of content addressability. In these networks, multiple heads (modules) operate in parallel, each learning its own key and value extractors to specialize in aspects of the prediction task. Outputs from these heads are linearly or nonlinearly combined in a glue layer. During inference, the memory buffers are updated in real-time with new (key, value) pairs, enabling in-context learning—i.e., adaptation to new patterns within a prompt. Stacked memory mosaic blocks, with or without "persistent" memory layers (trained but not context-updated), can match or outperform transformers at language modeling and few-shot tasks, exhibiting compositional and transparent "predictive disentanglement," wherein each head specializes on substructure in the task (Zhang et al., 2024, Zhang et al., 4 Jul 2025).
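The retrieval rule above can be sketched directly in NumPy. The following is a minimal illustration with toy dimensions and one-hot keys, not the paper's implementation:

```python
import numpy as np

def associative_recall(k, keys, values, beta=20.0):
    """Kernel-regression readout: f(k) = sum_i w_i v_i, with softmax
    weights w_i proportional to exp(beta * k . k_i)."""
    scores = beta * (keys @ k)            # similarity of query to each stored key
    w = np.exp(scores - scores.max())     # numerically stable softmax
    w /= w.sum()
    return w @ values

# Toy memory: four one-hot keys, each bound to a 2-d value.
keys = np.eye(4)
values = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]])

# A sharp kernel (large beta) makes retrieval near-exact.
out = associative_recall(keys[1], keys, values)
# out is numerically indistinguishable from values[1] = [0, 1]
```

Appending a new `(key, value)` row to `keys`/`values` is all that "writing" to such a memory requires, which is why the buffers can be updated in real time during inference.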

2. Scaling Properties and Comparative Performance

Scalability is a critical test for memory mosaic architectures. On medium- to large-scale language modeling benchmarks (GPT-2 to Llama-8B), Memory Mosaics v2 demonstrate that hierarchical (three-level: short-term, long-term, persistent) associative memory structures maintain or exceed the performance of traditional transformers on persistent-knowledge benchmarks. For storage and manipulation of new knowledge—such as needle-in-haystack document queries or few-shot classification on unseen tasks—they outperform size- and data-matched transformers, even when the latter are scaled up by up to 8× in pretraining data. These advantages are attributed to several modifications: adaptive scheduling of the kernel bandwidth parameter $\beta(n)$ as a function of memory size, gated time-variant key extractors providing position-invariance, and explicit separation of memories by timescale. The result is substantial gains in context extrapolation and in-context learning (e.g., on anonymous-label or long-context few-shot classification tasks, Memory Mosaics v2 exceed transformer accuracy by >10 percentage points) (Zhang et al., 4 Jul 2025).
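One of the cited modifications, size-dependent bandwidth scheduling, can be illustrated with a toy memory whose kernel sharpens as it fills. The logarithmic schedule below is an assumption chosen for illustration, not the schedule used in the paper:

```python
import math
import numpy as np

class AdaptiveMemory:
    """Associative memory whose kernel sharpness beta(n) grows with the
    number n of stored pairs. The schedule beta(n) = beta0 * log(1 + n)
    is an illustrative choice only."""

    def __init__(self, beta0=1.0):
        self.keys, self.values = [], []
        self.beta0 = beta0

    def write(self, k, v):
        self.keys.append(np.asarray(k))
        self.values.append(np.asarray(v))

    def beta(self):
        # Sharper content addressing as the memory grows.
        return self.beta0 * math.log(1 + len(self.keys))

    def read(self, k):
        K, V = np.stack(self.keys), np.stack(self.values)
        s = self.beta() * (K @ np.asarray(k))
        w = np.exp(s - s.max())
        return (w / w.sum()) @ V

mem = AdaptiveMemory(beta0=5.0)
for i in range(3):
    e = np.zeros(3); e[i] = 1.0
    mem.write(e, e)                  # identity binding on one-hot keys
b_small = mem.beta()
for i in range(3):                   # further writes enlarge the memory
    e = np.zeros(3); e[i] = 1.0
    mem.write(e, e)
b_large = mem.beta()
# The kernel sharpens as the memory fills: b_large > b_small.
```

The intuition matches classical kernel regression, where the bandwidth should shrink (sharpness grow) as more samples become available.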

3. Interpretability and Predictive Disentanglement

A defining aspect of memory mosaics is their mechanistic transparency. Each head's contextual memory outputs are explicit conditional-expectation estimates: $y^{(h)}_T = E[v^{(h)}_T \mid k^{(h)}_T]$. Trained models display "predictive disentanglement": heads autonomously specialize in predicting distinct latent factors or output features, without the need for explicit disentanglement regularization. Visualization of learned weight matrices reveals near-block-diagonal organization, and attention-score plots reflect flat, interpretable context retrieval, free from positional encoding bias. This interpretability is in contrast to standard transformers, whose attention mechanisms can entangle positional and content-based cues in less transparent fashion (Zhang et al., 2024, Zhang et al., 4 Jul 2025).

4. Memory Mosaics in Hardware and System Design

In high-performance GPU memory management, the term "memory mosaics" denotes techniques that coordinate small and large page allocations to optimize the trade-off between address-translation efficiency (translation lookaside buffer reach) and demand-paging overhead. The Mosaic memory manager leverages the tendency of GPU applications to allocate memory en masse, ensuring that contiguous virtual regions are maintained in contiguous physical frames. This enables in-place coalescing and splintering (merging and splitting) of memory pages without data migration and provides transparent support for both efficient translation and fine-grained data movement. The approach is application-transparent and relies on hardware–software cooperation, leading to reductions in page-walk rates, lower latency, and improved overall throughput compared to previous GPU-MMU baselines. Broader implications include the possibility of generalizing such mosaic policies to CPUs and other accelerator architectures (Ausavarungnirun et al., 2018).
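The coalescing/splintering idea can be modeled in a few lines. The class and constants below are a toy illustration, not Mosaic's actual interface; the key invariant is that a contiguous virtual region stays in contiguous physical frames, so page promotion never copies data:

```python
SMALL_PAGE = 4 * 1024
LARGE_PAGE = 2 * 1024 * 1024
FRAMES_PER_LARGE = LARGE_PAGE // SMALL_PAGE   # 512 small frames per large page

class MosaicRegion:
    """Contiguous virtual region backed by contiguous physical frames, so
    promotion to a large page changes only the mapping, never the data."""

    def __init__(self, phys_base):
        self.phys_base = phys_base
        self.resident = set()     # indices of demand-paged-in small frames
        self.large = False

    def touch(self, idx):
        """Demand-page one small frame in (a no-op once coalesced)."""
        if not self.large:
            self.resident.add(idx)

    def coalesce(self):
        """In-place promotion: allowed once every frame is resident, because
        contiguity was preserved at allocation time -- no migration needed."""
        if len(self.resident) == FRAMES_PER_LARGE:
            self.large = True
        return self.large

    def splinter(self):
        """In-place demotion back to small pages, e.g. for fine-grained
        eviction or data movement."""
        self.large = False

region = MosaicRegion(phys_base=0)
for i in range(FRAMES_PER_LARGE):
    region.touch(i)
promoted = region.coalesce()      # True: all 512 frames resident
```

Large-page mappings extend TLB reach (one entry covers 2 MB instead of 4 KB), while splintering restores small-page granularity when fine-grained paging is needed.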

5. Statistical and Physical “Mosaic Memory” Phenomena

The mosaic memory effect in LLMs is an empirical observation that LLMs can memorize a target sequence even in the absence of exact duplicate training instances, provided multiple "fuzzy" variants (with a fraction of tokens replaced) are present in the training set. For a reference sequence $X_{\mathrm{ref}}$, the injection of $n_{\mathrm{dup}}$ fuzzy variants each at Hamming distance $R$ from $X_{\mathrm{ref}}$ can lead to strong membership inference performance (AUC), even for large $R$. For example, with $n_{\mathrm{dup}}=10$ and $R=32$ (i.e., replacing one-third of 100 tokens per variant), AUC remains much higher than the single-duplicate baseline. The dominant mechanism is syntactic: memorization correlates with overlap in substrings, not with semantic similarity. This has significant privacy ramifications, as standard deduplication thresholds (e.g., removal of ≥50-token matches) fail to block this effect, and real-world data often includes many fuzzy duplicates, thus escaping current privacy protections (Shilov et al., 2024).
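The fuzzy-duplicate construction is easy to reproduce. The sketch below generates variants at an exact Hamming distance from a reference token sequence; the integer token vocabulary and sizes are arbitrary placeholders:

```python
import random

def fuzzy_variants(ref, n_dup, R, vocab, seed=0):
    """Return n_dup copies of `ref`, each with exactly R token positions
    replaced, i.e. at Hamming distance R from the reference."""
    rng = random.Random(seed)
    out = []
    for _ in range(n_dup):
        v = list(ref)
        for i in rng.sample(range(len(ref)), R):
            # Replace with a different token so the position really differs.
            v[i] = rng.choice([t for t in vocab if t != v[i]])
        out.append(v)
    return out

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

ref = list(range(100))                 # a 100-token reference sequence
variants = fuzzy_variants(ref, n_dup=10, R=32, vocab=list(range(1000)))
# Each variant shares only 68 of 100 tokens with the reference, yet ten such
# variants can suffice for the reference itself to become memorized.
```

Note that no pair of these sequences contains a long exact match, which is why substring-based deduplication thresholds miss them.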

In glassy and jammed condensed-matter systems, a memory mosaic arises when clusters of constituent particles (modeled as two-state hysterons) interact during periodic driving. Unlike independent hysterons, interacting mosaics allow for the encoding of multiple, potentially overlapping memories (e.g., of different driving amplitudes), as well as long training times and multi-periodic orbits. The background-subtracted return-map signal $S(\gamma)$ reveals plateau heights and cusps corresponding to the strength and number of encoded memories, with the functional form determined by the interaction strengths within the mosaic. Such mosaics contrast with the purely transient, noninteracting memories found in suspensions or other memory-forming systems (Lindeman et al., 2021).
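A minimal interacting-hysteron sketch (thresholds and couplings invented for illustration) shows the basic memory-writing mechanism: a state configured at one drive amplitude persists after the drive returns to zero, encoding that amplitude.

```python
import numpy as np

def relax(states, gamma, g_up, g_down, J):
    """Flip two-state hysterons (+1/-1) one at a time until stable at drive
    gamma. Hysteron i switches up when gamma exceeds its interaction-shifted
    upper threshold g_up[i] - (J @ states)[i], and down below the shifted
    lower threshold. The pairwise coupling J is the 'mosaic' ingredient."""
    states = states.copy()
    unstable = True
    while unstable:
        unstable = False
        shift = J @ states
        for i in range(len(states)):
            if states[i] == -1 and gamma >= g_up[i] - shift[i]:
                states[i] = +1; unstable = True; break
            if states[i] == +1 and gamma <= g_down[i] - shift[i]:
                states[i] = -1; unstable = True; break
    return states

g_up = np.array([1.0, 2.0, 3.0])
g_down = np.array([-1.0, -2.0, -3.0])
J = 0.1 * (np.ones((3, 3)) - np.eye(3))    # weak ferromagnetic coupling

s0 = -np.ones(3)
s_driven = relax(s0, 2.5, g_up, g_down, J)   # drive to amplitude 2.5
s_rest = relax(s_driven, 0.0, g_up, g_down, J)
# The first two hysterons flipped up at amplitude 2.5 and stay up at rest:
# the amplitude is encoded in which hysterons are 'on'.
```

With nonzero couplings, which hysterons flip depends on the states of the others, which is what allows multiple overlapping amplitudes to be stored in one ensemble.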

6. Memory Mosaics in Neuromorphic and Diffusion-based Systems

In neuromorphic hardware, the Mosaics framework denotes a strategy for trading spatial memory resources for temporal reuse. Rather than dedicating an independent crossbar for each layer of a deep net, a single (or small set of) crossbar(s) is reused across multiple time steps of a recurrent neural network, increasing expressive depth without increasing footprint. This approach also confers resilience: for similar total parameter counts, RNN mosaics are more robust to device noise, parasitics, and weight perturbations than feed-forward CNNs or MLPs; they also yield large energy savings. Design guidelines advocate for tuning the number of time steps to optimally balance expressivity and resilience, and exploiting inherent attractor dynamics for robustness (Bennett et al., 2020).
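The space-for-time trade can be made concrete: a depth-L feed-forward stack dedicates one crossbar per layer, while a recurrent mosaic reaches the same effective depth by reusing a single crossbar for T = L steps. The toy NumPy model below (arbitrary sizes and activation) illustrates the footprint difference:

```python
import numpy as np

def feedforward(x, weight_stack):
    """Depth-L network: one dedicated crossbar (weight matrix) per layer."""
    for W in weight_stack:
        x = np.tanh(W @ x)
    return x

def recurrent_mosaic(x, W, T):
    """One physical crossbar reused for T time steps: the same effective
    depth with 1/T of the spatial memory footprint."""
    for _ in range(T):
        x = np.tanh(W @ x)
    return x

d, L = 64, 8
ff_params = L * d * d               # eight crossbars
mosaic_params = d * d               # one crossbar, reused eight times

rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(d, d))
x = rng.normal(size=d)
y = recurrent_mosaic(x, W, T=L)     # depth-8 computation, depth-1 footprint
```

When the reused weights are tied across steps (as here), the recurrent pass computes exactly what a feed-forward stack with L identical layers would, at one-eighth the weight storage.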

For diffusion-based LLMs, Mosaic frameworks refer to memory-efficient inference systems that eliminate system fragmentation and support extreme long-context generation. By combining global memory planning (reserving and scheduling entire memory requirements up front), mask-only kernel strategies (avoiding unnecessary computation for unmasked tokens), and dynamic chunking optimized by heuristic search, these memory mosaics achieve substantial improvements in peak-to-average memory ratio and sequence length supported at inference, without loss of accuracy or incurring major latency costs (Zheng et al., 10 Jan 2026).
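A toy version of the planning-plus-chunking idea is sketched below. The cost model, candidate sizes, and function are invented for illustration; the real system's optimizer searches over much richer configurations:

```python
def choose_chunk(total_tokens, mem_per_token, mem_budget,
                 candidates=(64, 128, 256, 512, 1024)):
    """Global planning: reserve the full sequence's memory up front, so no
    fragmentation can occur mid-generation. Dynamic chunking: among chunk
    sizes whose transient working set fits the remaining headroom, pick the
    largest to minimize the number of decoding passes."""
    reserved = total_tokens * mem_per_token        # up-front reservation
    headroom = mem_budget - reserved
    feasible = [c for c in candidates if c * mem_per_token <= headroom]
    return max(feasible) if feasible else None     # None: plan infeasible

chunk = choose_chunk(total_tokens=4096, mem_per_token=1, mem_budget=4096 + 300)
# Headroom is 300 units after reservation, so 256 is the largest feasible
# chunk size.
```

Because the full requirement is reserved before generation starts, peak usage is known in advance, which is what keeps the peak-to-average memory ratio low for very long sequences.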

7. Limitations and Open Directions

Although memory mosaics offer clear improvements in interpretability, sample efficiency, few-shot generalization, context extrapolation, memory resilience, and hardware resource utilization, they also introduce new challenges. Large-scale associative memories require sublinear query algorithms (e.g., locality-sensitive hashing, approximate kernel regression) to be practical for very large contexts. Parameterization (e.g., optimal kernel bandwidth scheduling), hierarchy design, and integration with gradient-free learning or hybrid discrete-continuous memory banks are ongoing areas of research. In privacy contexts, mosaic memory undermines the sufficiency of naive deduplication strategies, elevating the need for more sophisticated privacy-preserving training methods. Theoretically, no predictive model currently describes or bounds the aggregation of syntactic overlaps into memorization in deep networks; similarly, the bounds on the number and fidelity of encoded memories in physical mosaics remain an open topic (Zhang et al., 2024, Zhang et al., 4 Jul 2025, Shilov et al., 2024, Lindeman et al., 2021).
