
Engram-Nine: 9-gram Memory Modules

Updated 27 January 2026
  • Engram-Nine is a conditional memory module that stores up to 9-gram suffixes using static, O(1) lookup to complement dynamic MoE in Transformers.
  • It employs a dual-tier routing mechanism with collision-free hot tiers and standard cold tiers to balance retrieval precision with implicit regularization.
  • Integration into Transformer architectures leverages optimized scaling laws and adaptive gating to accelerate pattern reconstruction and improve overall efficiency.

Engram-Nine refers to a class of conditional memory modules for LLMs that extend the Engram approach to retrieval of up to 9-gram suffixes. Engram modules instantiate a static, $\mathcal{O}(1)$ lookup-based memory for storing and conditionally injecting representations of fixed string patterns (such as n-grams) into a Transformer architecture. The goal of such modules is to complement dynamic, compute-intensive conditional computation (as in Mixture-of-Experts, MoE) with a memory mechanism optimized for rapid access to repetitive or lexicalized patterns, thereby improving efficiency and specialization within very large models (Cheng et al., 12 Jan 2026; Lin, 23 Jan 2026).

1. Architectural Principles and Retrieval Mechanism

Engram-Nine is distinguished by its support for suffixes (n-grams) up to order $N=9$. For each token position $t$, trailing compressed n-gram suffixes $g_{t,n}=(x'_{t-n+1}, \ldots, x'_t)$, with $n=2,\ldots,9$, are computed, where $x'$ are token IDs after a surjective vocabulary compression that collapses semantically equivalent subwords, reducing $|V|$ by approximately 23% (Cheng et al., 12 Jan 2026). Retrieval from the conditional memory proceeds in parallel over $K$ hash heads per order, producing embedding vectors $e_{t,n,k}$ from tables $E_{n,k}$:

$$z_{t,n,k} = \hat{h}_{n,k}(g_{t,n}) \bmod M_{n,k}$$

where $\hat{h}_{n,k}$ is a lightweight multiplicative-XOR hash, and $M_{n,k}$ is prime-sized. Embeddings are concatenated across all heads and orders:

$$e_t = \big\|_{n=2}^{N} \big\|_{k=1}^{K} e_{t,n,k} \in \mathbb{R}^{d_\mathrm{mem}}$$

with $d_\mathrm{mem} = \sum_{n,k} d_n$, where $d_n$ is the embedding dimension per head/order.

Engram-Nine advances the memory scale: in plausible configurations, $K=8$ heads and $d_\mathrm{mem}=2560$ entail an allocation of $\sim$10 billion parameters for the static memory. The addressing is $\mathcal{O}(1)$ and entirely deterministic.
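The retrieval pipeline above can be sketched in a few lines of numpy. This is an illustrative toy, not the papers' implementation: the table sizes, seeds, and the exact multiplicative-XOR hash constants are assumptions, and the dimensions are scaled down from the plausible $K=8$, $d_\mathrm{mem}=2560$ configuration.

```python
# Toy sketch of Engram-style multi-head hashed n-gram lookup.
# All constants (PRIMES, D_N, hash seeds) are illustrative assumptions.
import numpy as np

K = 2          # hash heads per order (scaled down from K=8)
N_MAX = 4      # suffix orders 2..N_MAX (Engram-Nine uses N=9)
D_N = 8        # embedding dim per head/order
PRIMES = [1009, 2003]   # prime table sizes M_{n,k}, one per head

rng = np.random.default_rng(0)
# one embedding table E_{n,k} per (order, head) pair
tables = {(n, k): rng.standard_normal((PRIMES[k], D_N))
          for n in range(2, N_MAX + 1) for k in range(K)}

def mul_xor_hash(gram, seed):
    """Lightweight multiplicative-XOR hash over token IDs (assumed form)."""
    h = seed
    for tok in gram:
        h = ((h * 0x9E3779B1) ^ tok) & 0xFFFFFFFF
    return h

def engram_lookup(tokens, t):
    """Concatenate embeddings over all suffix orders and heads at position t."""
    parts = []
    for n in range(2, N_MAX + 1):
        gram = tuple(tokens[max(0, t - n + 1): t + 1])  # trailing n-gram suffix
        for k in range(K):
            z = mul_xor_hash(gram, seed=n * 31 + k) % PRIMES[k]  # O(1) address
            parts.append(tables[(n, k)][z])
    return np.concatenate(parts)   # e_t in R^{d_mem}

tokens = [5, 17, 3, 42, 3, 17]
e_t = engram_lookup(tokens, t=5)
print(e_t.shape)   # d_mem = (N_MAX - 1) * K * D_N = 3 * 2 * 8 = 48
```

Note that the address depends only on the token IDs, never on the hidden state, which is what later enables the prefetching described in Section 7.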

Recent extensions such as the collision-free Engram-Nine architecture introduce a two-tier routing system. The "hot tier" handles the most frequent n-grams via a minimal perfect hash function (MPHF), providing collision-free lookup, while a standard multi-head hash ("cold tier") is retained for the remainder (Lin, 23 Jan 2026):

  • Hot tier: for $g \in S$ (the set of most frequent n-grams), retrieve $e^\mathrm{hot}_g = E_\mathrm{hot}[h(g)]$.
  • Cold tier: for $g \notin S$, retrieve $e^\mathrm{cold}_g$ as in the standard hashed scheme.

Both tiers generate vectors of matched shape, maintaining architectural compatibility.
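A minimal sketch of the dual-tier dispatch, assuming a plain Python dict as a stand-in for the MPHF over the hot set $S$ (a real MPHF is more compact but functionally equivalent for lookup); the cold path reuses an ordinary hashed table:

```python
# Hedged sketch of hot/cold dual-tier routing; the dict stands in for
# the minimal perfect hash, and all sizes are toy-scale assumptions.
import numpy as np

rng = np.random.default_rng(1)
D = 8
M_COLD = 101                         # prime-sized cold table
hot_set = {(3, 42), (42, 3)}         # S: most frequent n-grams (toy)
hot_index = {g: i for i, g in enumerate(sorted(hot_set))}  # collision-free map
E_hot = rng.standard_normal((len(hot_index), D))
E_cold = rng.standard_normal((M_COLD, D))

def lookup(gram):
    if gram in hot_index:                    # hot tier: exact, collision-free
        return E_hot[hot_index[gram]]
    return E_cold[hash(gram) % M_COLD]       # cold tier: collisions possible

v_hot = lookup((3, 42))
v_cold = lookup((7, 7))
print(v_hot.shape == v_cold.shape)           # matched shapes across tiers
```

Because both branches emit vectors of the same dimension, downstream projections do not need to know which tier served a given n-gram.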

2. Sparsity Allocation, Scaling Laws, and Iso-Parameter Design

Parameter allocation is governed by a U-shaped scaling law that arises when partitioning the total parameter budget $P_\mathrm{tot}$ between active (FLOPs-proportional) parameters $P_\mathrm{act}$ and inactive (lookup-based) parameters $P_\mathrm{sparse}=P_\mathrm{tot}-P_\mathrm{act}$ (Cheng et al., 12 Jan 2026). Defining $\rho$ as the fraction of $P_\mathrm{sparse}$ allocated to MoE experts:

$$P_\mathrm{MoE}^{\mathrm{sparse}} = \rho\,P_\mathrm{sparse}, \quad P_\mathrm{Engram} = (1-\rho)\,P_\mathrm{sparse}$$

Validation loss as a function of $\rho$, $L(\rho)$, displays a pronounced minimum at $\rho \approx 0.75$–$0.8$, and is suboptimal for pure MoE ($\rho \to 1$) or pure lookup ($\rho \to 0$). This constrains Engram-Nine's optimal configuration under iso-FLOPs and iso-parameter constraints.
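The budget split is simple enough to state in code. Below, the quadratic stand-in for $L(\rho)$ is purely illustrative of the reported U shape (minimum placed at an assumed $\rho^\* = 0.78$, inside the reported 0.75–0.8 band); it is not a fit to the papers' data, and the budget figures are made up:

```python
# Sketch of the iso-parameter budget partition; loss curve is a toy proxy.
def split_sparse_budget(p_total, p_active, rho):
    """Partition the inactive budget between MoE experts and Engram tables."""
    p_sparse = p_total - p_active
    return rho * p_sparse, (1 - rho) * p_sparse   # (P_MoE^sparse, P_Engram)

def toy_loss(rho, rho_star=0.78):
    """Illustrative U-shaped proxy with minimum at an assumed rho* = 0.78."""
    return (rho - rho_star) ** 2

# assumed budgets: 40B total, 4B active -> 36B sparse
p_moe, p_engram = split_sparse_budget(p_total=40e9, p_active=4e9, rho=0.75)
print(p_moe, p_engram)           # 27e9 for MoE experts, 9e9 for Engram tables

best = min((i / 100 for i in range(101)), key=toy_loss)
print(best)                      # 0.78, i.e. neither pure MoE nor pure lookup
```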

Scaling the memory tables for higher $N$ leads to an "infinite-memory regime," with validation loss decreasing according to a power law as the number of slots increases. This enables Engram-Nine to allocate an arbitrarily large static memory pool without incurring additional computational cost per forward pass.

3. Transformer Integration and Fusion Mechanisms

Within a standard Transformer backbone, Engram-Nine modules are typically inserted at early layers, commonly layer 2 and the midpoint layer (e.g., layer 15 for $L=30$). In a multi-branch ("mHC") residual setup, each branch receives its own key projection, but value projections may be shared. The overall Engram injection protocol per layer involves:

  1. Retrieval of $e_t$ as above.
  2. Computation of an adaptive gating scalar

$$\alpha = \sigma\left(\frac{\operatorname{RMSNorm}(h^{(\ell)}) \cdot \operatorname{RMSNorm}(W_K e)}{\sqrt{d}}\right)$$

  3. Value projection and fusion:

$$v = W_V e; \quad U = \alpha \odot v; \quad Y = \operatorname{SiLU}\!\left(\operatorname{Conv1D}(\operatorname{RMSNorm}(U))\right) + U$$

  4. The output $Y$ fuses residually into $h^{(\ell)}$, followed by standard attention and MoE layers (Cheng et al., 12 Jan 2026).
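The steps above can be sketched in numpy. The weights here are random stand-ins, and the Conv1D is assumed to be a depthwise, width-3 causal convolution; the papers do not pin down the exact operator, so treat this as one plausible reading rather than the reference implementation:

```python
# Hedged numpy sketch of the gated injection protocol (steps 2-4 above).
import numpy as np

rng = np.random.default_rng(2)
T, d, d_mem = 6, 16, 48
h = rng.standard_normal((T, d))          # hidden states h^(l)
e = rng.standard_normal((T, d_mem))      # retrieved Engram embeddings e_t
W_K = rng.standard_normal((d_mem, d)) / np.sqrt(d_mem)
W_V = rng.standard_normal((d_mem, d)) / np.sqrt(d_mem)
w_conv = rng.standard_normal((3, d)) * 0.1   # assumed depthwise conv kernel

def rmsnorm(x, eps=1e-6):
    return x / np.sqrt((x ** 2).mean(-1, keepdims=True) + eps)

def silu(x):
    return x / (1 + np.exp(-x))

# step 2: per-position scalar gate from normalized state and key projection
alpha = 1 / (1 + np.exp(-(rmsnorm(h) * rmsnorm(e @ W_K)).sum(-1) / np.sqrt(d)))

# step 3: value projection, gating, and convolutional fusion
v = e @ W_V
U = alpha[:, None] * v
Un = rmsnorm(U)
conv = np.zeros_like(Un)
for tau in range(3):                     # causal depthwise conv, width 3
    shifted = np.roll(Un, tau, axis=0)
    shifted[:tau] = 0.0                  # zero out wrapped-around rows
    conv += w_conv[tau] * shifted
Y = silu(conv) + U                       # step 4: Y fuses residually into h

print(Y.shape)   # (6, 16), same shape as the hidden states
```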

Collision-free hot tiers (MPHF routing) induce no changes to downstream interfaces: both hot and cold paths maintain the same dimensional alignment, facilitating ablations and iso-parameter comparisons (Lin, 23 Jan 2026).

4. Training Dynamics, Route Stratification, and Gating Issues

Route-stratified evaluation—partitioning loss by hot/cold lookup route—uncovers distinct phases during Engram-Nine training (Lin, 23 Jan 2026):

  • Early training: Hot n-grams (handled by the hot tier, or by frequently colliding keys in the standard hash) have lower loss.
  • Later: A "hot-to-cold advantage flip" occurs, with less-frequently accessed n-grams (cold) exhibiting lower final loss.

Empirically, collision-free Engram-Nine variants precipitate an earlier flip (e.g., at 2000–2750 steps) compared to collision-prone hash tables (≈3000 steps), with quantitative differences in the hot-cold loss gap post-flip ($+0.10$ to $+0.17$ versus $+0.07$ to $+0.08$).

Crucially, the gating mechanism displays a persistent mismatch: the trained gate $\alpha_t$ repeatedly assigns higher weights to hot positions, even after these become harder (have higher loss). This reveals an allocation instability whereby the gate's early preference is "locked in" and fails to reassign credit late in training. Bucketed diagnostics confirm that high-$\alpha$ tokens concentrate on hot n-grams, yet suffer the highest average losses, contradicting the intention of the gating function (Lin, 23 Jan 2026).
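Route-stratified monitoring of this kind is easy to implement. A minimal sketch, under the assumption that per-token losses and a hot/cold route label are available each logging step (the papers do not specify the exact bookkeeping):

```python
# Sketch of route-stratified loss diagnostics: bucket per-token losses by
# lookup route and locate the hot-to-cold advantage flip.
import numpy as np

def hot_cold_gap(losses, is_hot):
    """Mean hot-route loss minus mean cold-route loss.
    Negative early in training (hot n-grams easier); positive post-flip."""
    losses = np.asarray(losses)
    is_hot = np.asarray(is_hot, dtype=bool)
    return losses[is_hot].mean() - losses[~is_hot].mean()

def flip_step(gap_per_step):
    """First logged step at which the hot-cold gap turns positive."""
    for step, gap in enumerate(gap_per_step):
        if gap > 0:
            return step
    return None   # no flip observed yet

# toy trajectory: gap rises linearly from -0.2 toward +0.295
gaps = [(s - 40) / 200 for s in range(100)]
print(flip_step(gaps))   # 41: first step with a positive gap
```

Tracking the same buckets for the gate values $\alpha_t$ (not shown) is what exposes the gate-mismatch described above.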

5. Collision Effects, Implicit Regularization, and Ablation Evidence

Hash collisions—inevitable in the original multi-head hashed lookup—induce a robust regularization effect (Lin, 23 Jan 2026). Empirically:

  • Collision-prone models delay the hot-to-cold flip, and post-flip cold advantage is less severe.
  • When multiple frequent n-grams share an embedding slot, the model averages their representations, a mechanism functionally akin to dropout and clustering regularization.
  • Increasing the hot-tier size (thus shrinking the cold table and reducing collisions) advances the flip, indicating that reduced collisions reduce implicit regularization.
  • Iso-parameter and iso-table-size controls confirm that removing collisions via MPHF does not decrease validation loss and may exacerbate overfitting on the most frequent n-grams.

A plausible implication is that hashing-induced collision noise acts as a regularizer, and eliminating it naively (e.g., with a hot-tier extension) does not yield improved generalization.

6. Mechanistic Analysis and Empirical Benchmarking

Mechanistic probes deployed on Engram-Nine architectures reveal:

  • Early-layer static reconstruction is accelerated: using LogitLens (KL divergence between layerwise and final logits), predictions converge 2–3 layers earlier compared to MoE baselines (Cheng et al., 12 Jan 2026).
  • Effective depth, via CKA similarity, increases by 5–7 layers relative to the baseline, signifying that static memory offloads shallow pattern recovery, deepening the dynamic reasoning stack.
  • N-gram lookup modules free attention capacity for long-context tokens. Expected benchmark improvements reported with $N=3$ extend or slightly increase for Engram-Nine (extrapolated):

| Task | MoE-27B | Engram-Nine | Δ |
|------|---------|-------------|---|
| MMLU Acc. | 57.4 | ∼60.8 | +3.4 |
| CMMLU Acc. | 57.9 | ∼61.9 | +4.0 |
| BBH EM | 50.9 | ∼56.0 | +5.1 |
| ARC-Chall Acc. | 70.1 | ∼73.8 | +3.7 |
| HumanEval Pass@1 | 37.8 | ∼40.8 | +3.0 |
| MATH EM | 28.3 | ∼30.7 | +2.4 |
| RULER MQ NIAH Acc. | 84.2 | ∼97.0 | +12.8 |

These results indicate that increased n-gram order and memory scale do not degrade, and most likely improve, retrieval and generalization performance (Cheng et al., 12 Jan 2026).

7. Efficiency, Hardware Implications, and Design Recommendations

Deterministic addressing in Engram-Nine enables precomputation and prefetching of lookup indices, allowing embedding rows to be streamed from host RAM (over PCIe), incurring minimal overhead. This is enabled by the independence of lookup indices from the hidden state and their full determination by the input token sequence. Empirical throughput data confirm overheads under 3% even when Engram memory is offloaded far beyond GPU HBM limits, aided by cache-friendly Zipfian access patterns (Cheng et al., 12 Jan 2026).
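Because the indices depend only on the token sequence, they can all be computed before the forward pass begins. A minimal sketch of this precomputation step (the hash form and orders are assumptions carried over from earlier; a real system would hand the index list to an asynchronous host-RAM prefetcher):

```python
# Sketch of index precomputation enabled by deterministic addressing:
# every lookup index is known from the tokens alone, ahead of the forward pass.
def precompute_indices(tokens, n_orders=(2, 3), table_size=101):
    """Return (position, order, slot) triples for every trailing suffix."""
    def h(gram):
        acc = 0
        for tok in gram:
            acc = ((acc * 0x9E3779B1) ^ tok) & 0xFFFFFFFF  # assumed hash form
        return acc % table_size
    indices = []
    for t in range(len(tokens)):
        for n in n_orders:
            gram = tuple(tokens[max(0, t - n + 1): t + 1])
            indices.append((t, n, h(gram)))
    return indices

idx = precompute_indices([5, 17, 3, 42])
print(len(idx))   # 4 positions x 2 orders = 8 prefetchable rows
```

Streaming only these rows over PCIe, rather than resident tables, is what keeps the reported overhead under 3% even when the memory exceeds HBM capacity.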

Given the training dynamics observed, practical deployment recommendations include:

  • Retain collisions at moderate scale: implicit regularization is more robust than precision-optimized hot tiers.
  • Monitor route-stratified loss and gating behavior to diagnose regime shifts ("flip" events) and gate-mismatch.
  • If further tuning is attempted, consider enriching gate signals with collision degree or n-gram frequency, or introducing mechanisms (e.g., EMA resets) to allow late-stage credit reassignment (Lin, 23 Jan 2026).

References

  • "Conditional Memory via Scalable Lookup: A New Axis of Sparsity for LLMs" (Cheng et al., 12 Jan 2026)
  • "A Collision-Free Hot-Tier Extension for Engram-Style Conditional Memory: A Controlled Study of Training Dynamics" (Lin, 23 Jan 2026)
