Collaborative Memory Transformer (CoMeT)
- The paper introduces CoMeT, a dual-memory Transformer extension that achieves linear time complexity and constant GPU memory usage for ultra-long sequences.
- It uses a gated global memory and a FIFO temporary memory to retain both long-term context and recent high-resolution details with minimal parameter overhead.
- Empirical evaluations show significant speedups and competitive performance on benchmarks, highlighting its efficiency over traditional full-attention mechanisms.
The Collaborative Memory Transformer (CoMeT) is an architectural extension for Transformers that enables efficient and high-fidelity processing of sequences with arbitrarily long contexts. CoMeT addresses the fundamental limitations of the standard Transformer—namely, quadratic time complexity and an indefinitely growing key-value (KV) cache—by introducing a dual-memory system and chunkwise processing. The architecture supports constant memory usage and linear time complexity, while preserving both long-term and short-term context information. CoMeT is implemented as a parameter-efficient plug-in module, allowing integration into existing pre-trained Transformer models with minimal fine-tuning requirements (Zhao et al., 2 Feb 2026).
1. Motivations and Objectives
Traditional Transformer models incur O(N²) time and O(N) space complexity for a sequence of length N, primarily due to the global self-attention mechanism and the requirement to store all key-value pairs in the cache for context retention. This makes processing extremely long contexts, such as tens or hundreds of thousands of tokens, intractable on commodity hardware. While prior finite-state or recurrent approaches deliver O(N) time and O(1) space, they typically suffer from a lack of explicit gating (resulting in catastrophic forgetting of salient information) or from an inability to retain fine-grained recent details.
CoMeT’s primary objectives are:
- To process arbitrarily long input sequences with constant GPU memory and linear time complexity.
- To retain both persistent long-term memory and recent, high-resolution context.
- To provide a parameter-efficient, minimally invasive module—enabling rapid and effective adaptation of pre-trained LLMs to ultra-long-range tasks.
2. Architectural Composition
CoMeT processes sequences as consecutive fixed-size chunks of length C and operates at the granularity of Transformer layers. At each layer index i and chunk index τ, the architecture introduces two sets of auxiliary memory tokens:
- Global memory Gⁱ_τ: encodes persistent, salient historical information using a compact per-layer state Sⁱ_τ.
- Temporary memory Tⁱ_τ: maintains a FIFO queue of high-resolution representations for recent chunks.
The input to each chunk is formed by prepending Gⁱ_τ and Tⁱ_τ to the chunk’s hidden states Hⁱ_τ, interleaving compression tokens Cⁱ_τ, and appending readout tokens Rⁱ_τ. All tokens interact through causal self-attention.
2.1 Dual-Memory Design
Global Memory
Each layer maintains a persistent global memory state Sⁱ_τ, transformed into memory tokens Gⁱ_τ via a residual low-rank adapter (RLA):

Gⁱ_τ = RLA(Sⁱ_τ)

At each chunk, the readout tokens Rⁱ⁺¹_τ are RMS-normalized to yield candidate states S̃ⁱ_{τ+1} = RMSNorm(Rⁱ⁺¹_τ). The update of the persistent state employs a gating mechanism:

g = σ(W_g [Sⁱ_τ; S̃ⁱ_{τ+1}]),  Sⁱ_{τ+1} = g ⊙ Sⁱ_τ + (1 − g) ⊙ S̃ⁱ_{τ+1}

where W_g is a learned gating matrix, σ is the sigmoid function, and ⊙ denotes elementwise multiplication.
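The gated state update can be sketched in NumPy; the tensor shapes and the RMSNorm epsilon below are illustrative assumptions, not the paper's exact implementation:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def update_global_memory(S, R_new, W_g):
    """Gated update of the persistent state S from new readout tokens.

    Assumed shapes: S and R_new are (m, d) memory states;
    W_g is a (d, 2*d) gating matrix.
    """
    # RMS-normalize the readout tokens to form the candidate state
    S_cand = R_new / np.sqrt(np.mean(R_new**2, axis=-1, keepdims=True) + 1e-6)
    # Gate computed from the concatenation [S; S_cand]
    g = sigmoid(np.concatenate([S, S_cand], axis=-1) @ W_g.T)
    # Convex combination: g preserves the old state, (1 - g) admits the new
    return g * S + (1.0 - g) * S_cand
```

Because the gate is a convex combination, salient entries of the old state can be protected (g ≈ 1) while stale entries are overwritten (g ≈ 0).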
Temporary Memory
Temporary memory is a per-layer FIFO queue of fixed capacity M. For each new chunk, entries are derived from RMS-normalized compression tokens passed through the RLA adapter:

Tⁱ_{τ+1} ← enqueue(Tⁱ_τ, RLA(RMSNorm(Cⁱ⁺¹_τ)))
This queue structure provides high-resolution retention of the most recent context, mitigating catastrophic forgetting of near-term events.
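A minimal sketch of the FIFO behavior using Python's `collections.deque` (the eviction policy is as described above; the type of the stored entries is an assumption):

```python
from collections import deque

class TemporaryMemory:
    """Per-layer FIFO queue of compressed chunk representations.

    Capacity M bounds the queue; once full, enqueueing a new chunk
    evicts the oldest one, giving a rolling high-resolution window
    over the most recent context.
    """
    def __init__(self, capacity):
        # deque(maxlen=...) drops the oldest entry automatically
        self.queue = deque(maxlen=capacity)

    def enqueue(self, compressed_chunk):
        self.queue.append(compressed_chunk)

    def tokens(self):
        # Concatenated view served to attention as part of the soft prompt
        return list(self.queue)
```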
2.2 Attention Mechanism and Dynamic Soft Prompt
At each chunk, the queries formed from the chunk tokens attend to the keys and values resulting from the concatenation concat(Gⁱ_τ, Tⁱ_τ, interleave(Hⁱ_τ, Cⁱ_τ), Rⁱ_τ). The global and temporary memories thus act as adaptive, content-based “soft prompts” conditioning the chunk’s processing.
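As a hedged illustration, the per-chunk input assembly might look as follows; the grouping factor between hidden states and compression tokens is an assumed hyperparameter not specified in the text:

```python
def assemble_chunk_input(G, T, H, C, R, group=4):
    """Assemble the per-chunk attention input: [G | T | interleave(H, C) | R].

    H is the list of chunk hidden states; one compression token from C is
    interleaved after every `group` hidden states (the grouping factor is
    an illustrative assumption). The result then attends causally.
    """
    interleaved = []
    for idx, c_tok in enumerate(C):
        interleaved.extend(H[idx * group:(idx + 1) * group])
        interleaved.append(c_tok)
    # Trailing hidden states without an associated compression token
    interleaved.extend(H[len(C) * group:])
    return list(G) + list(T) + interleaved + list(R)
```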
2.3 Processing Sequence
CoMeT’s forward pass is organized as follows:
```
Initialize Sⁱ₀ ← 0 for all layers i
Initialize empty FIFO queue Tⁱ₀ for each layer i
for each chunk τ = 1…⌈N/C⌉ do
    H⁰_τ ← embedding(tokens_τ)
    C⁰_τ, R⁰_τ ← ∅
    for each layer i = 1…L do
        # Prepend memories, interleave tokens
        inputⁱ_τ ← concat(Gⁱ_τ, Tⁱ_τ, interleave(Hⁱ_τ, Cⁱ_τ), Rⁱ_τ)
        # Transformer layer update
        Hⁱ⁺¹_τ, Cⁱ⁺¹_τ, Rⁱ⁺¹_τ ← TransformerLayerⁱ(inputⁱ_τ)
        # Global memory update (gated)
        S̃ⁱ_{τ+1} ← RMSNorm(Rⁱ⁺¹_τ)
        g ← σ(W_g [Sⁱ_τ; S̃ⁱ_{τ+1}])
        Sⁱ_{τ+1} ← g ⊙ Sⁱ_τ + (1 − g) ⊙ S̃ⁱ_{τ+1}
        # New global memory tokens
        Gⁱ_{τ+1} ← RLA(Sⁱ_{τ+1})
        # Temporary memory update
        enqueue(Tⁱ_{τ+1}, RLA(RMSNorm(Cⁱ⁺¹_τ)))
        if size(Tⁱ_{τ+1}) > M: dequeue oldest entry
    end for
end for
```
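The control flow of this forward pass can be condensed into a minimal, framework-agnostic Python sketch; `layer_fn`, `update_global`, and `update_temp` are caller-supplied stand-ins (hypothetical signatures) for the real Transformer layer and memory updates:

```python
def run_comet(chunks, n_layers, layer_fn, update_global, update_temp):
    """Minimal sketch of CoMeT's chunkwise forward pass.

    chunks: iterable of embedded hidden states, one entry per chunk.
    layer_fn(i, S_i, T_i, H) -> (H_next, C_out, R_out)  # stand-in layer
    update_global(S_i, R_out) -> S_i_next               # gated state update
    update_temp(T_i, C_out) -> T_i_next                 # FIFO enqueue/evict
    """
    S = [None] * n_layers               # persistent global states, zero-init
    T = [[] for _ in range(n_layers)]   # per-layer FIFO queues
    for H in chunks:
        for i in range(n_layers):
            H, C_out, R_out = layer_fn(i, S[i], T[i], H)
            S[i] = update_global(S[i], R_out)
            T[i] = update_temp(T[i], C_out)
    return H
```

Note that only the fixed-size S and T persist across chunks; each chunk's hidden states are discarded after processing, which is what keeps peak memory constant.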
3. Complexity and Theoretical Analysis
Computational Characteristics
Whereas full-attention Transformers require O(N²) time and O(N) space for sequences of length N, CoMeT processes tokens in chunked batches of size C, with fixed memory sizes:
- Per-chunk attention cost: O((C + m)²), where m is the fixed number of auxiliary memory tokens; this reduces to O(C²) when m ≪ C.
- Total sequence time: O(N·C) for fixed C, i.e., linear in N.
- Peak space usage: O(1) with respect to N.
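These scaling claims can be checked with a back-of-the-envelope cost model (constants omitted; treating m as a single count of all auxiliary memory tokens is an illustrative simplification):

```python
def attention_cost(n):
    # Full self-attention scales with the square of the sequence length
    return n * n

def comet_cost(N, C, m):
    """Total attention cost for CoMeT: ceil(N / C) chunks, each attending
    over C chunk tokens plus m fixed memory/auxiliary tokens."""
    chunks = -(-N // C)  # ceiling division
    return chunks * attention_cost(C + m)
```

Doubling N doubles CoMeT's total cost (linear in N) but quadruples full attention's, while CoMeT's per-chunk working set, and hence peak memory, stays fixed at C + m tokens.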
Comparison with Related Methods
Many recurrent-Transformer and finite-state variants achieve similar asymptotic complexity, but typically lack robust separation between preserving salient long-term facts and short-term details. CoMeT’s dual-memory architecture—with both gating (for persistent state protection) and a FIFO rolling window—addresses this gap (Zhao et al., 2 Feb 2026).
4. Training Procedures and Integration
Plug-In Architecture and Fine-Tuning
Integration of CoMeT into a pre-existing Transformer requires introducing:
- RLA adapters per layer (with O(d·r) parameters at low rank r ≪ d, for model dimension d),
- A gating matrix W_g per layer.
Only these additional weights (a small fraction of total model parameters) are trained during supervised fine-tuning, with the main parameters of the backbone left frozen. Fine-tuning converges after three epochs over chunked long-context data (e.g., on 32k-token chunks).
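A rough parameter-count sketch, assuming each RLA factorizes into a down- and an up-projection, that there are two adapters per layer (one each for the global and temporary memory paths), and that W_g maps the concatenated 2d-dimensional input back to d dimensions; all of these structural details are assumptions for illustration:

```python
def rla_param_count(d_model, rank):
    """One residual low-rank adapter: a (d_model x rank) down-projection
    plus a (rank x d_model) up-projection; biases omitted for simplicity."""
    return 2 * d_model * rank

def comet_extra_params(d_model, rank, n_layers, adapters_per_layer=2):
    # Per layer: the RLA adapters plus a gating matrix W_g of
    # shape (d_model, 2 * d_model).
    per_layer = adapters_per_layer * rla_param_count(d_model, rank)
    per_layer += 2 * d_model * d_model
    return n_layers * per_layer
```

Under these assumptions the added weights are dominated by the gating matrices, and the low-rank factor keeps the adapter contribution negligible relative to the frozen backbone.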
Efficient Layer-Level Pipeline Parallelism
Conventional parallelism, where each GPU processes entire chunks in sequence, results in substantial idle time and under-utilization. CoMeT introduces layer-level pipeline parallelism:
- After completing layer i on chunk τ, a worker immediately transmits the updated Sⁱ and Tⁱ to the next worker, which can process layer i for chunk τ+1 in parallel.
- This strategy empirically yields a substantial speedup (on 128k-token contexts with 16 GPUs) over naive chunkwise pipelining.
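The benefit can be illustrated with an idealized makespan calculation, assuming uniform per-layer, per-chunk compute time and negligible communication cost (both simplifications):

```python
def serial_makespan(n_chunks, n_stages, t=1.0):
    # Naive scheme: each chunk traverses all pipeline stages
    # before the next chunk starts, so stages mostly sit idle.
    return n_chunks * n_stages * t

def pipelined_makespan(n_chunks, n_stages, t=1.0):
    # Layer-level pipelining: a stage starts the next chunk as soon as
    # the updated memories arrive (classic pipeline fill + drain).
    return (n_stages + n_chunks - 1) * t
```

With 16 chunks over 4 stages, the pipeline finishes in 19 time units rather than 64, and the advantage grows with the number of chunks, i.e., with context length.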
5. Empirical Evaluation
Passkey Retrieval at Scale
In the “passkey retrieval” experiment, where a numeric target is hidden within 1 million distractor tokens and training occurs on 32k-token contexts, CoMeT retrieves the passkey with high accuracy irrespective of its position. This is accomplished with faster inference and smaller memory usage compared to a full-attention baseline on the same input length.
SCROLLS Benchmark Performance
On the SCROLLS benchmark (7 long-context tasks), models with CoMeT and 32k training context lengths demonstrate:
- Higher average score (40.10) than all other plug-in (“compression-based” and “finite-state”) baselines (next best: SWA, 38.24).
- On summarization (GovReport, SummScreen), performance is on par with a fully fine-tuned full-attention baseline (e.g., 62.5 ROUGE-1 vs. 61.0).
Real-World Agent and User-Behavior QA
- On user-behavior QA (UQA) involving e-commerce clickstream data (4k memory), CoMeT exceeds xRAG (retrieval-augmented baseline) by 2.7 percentage points and greatly outperforms a truncated 4k full-attention model by 27 percentage points.
- In the long-horizon agent task (128k token trajectories), layer-level pipelined training with CoMeT is faster than naive pipelining and attains full-attention performance (Terminal-Bench) despite constant memory usage.
6. Implementation Guidelines and Future Directions
Practical Integration Steps
- Insert RLA adapters and gating matrices at every Transformer layer.
- Initialize the global memory state Sⁱ₀ to zero in each layer.
- Set up the per-layer FIFO temporary memory queue with chosen capacity M.
- Modify the forward pass to prepend the Gⁱ and Tⁱ tokens, interleave the Cⁱ tokens, and append the Rⁱ tokens as specified.
- Fine-tune solely the RLA and gate parameters using chunked long-context data.
Limitations and Extensions
CoMeT currently maintains exclusively intrinsic model memory and does not incorporate episodic memory, test-time training, or connectivity to external knowledge (such as RAG retrieval or notebook memory). Future research priorities include:
- Dynamic adjustment of memory capacities based on information salience.
- Hierarchical or topic-sensitive gating strategies.
- Integration with external retrieval or episodic augmentation modules.
- Adapting the architecture for multimodal contexts, including video+text scenarios.
7. Summary and Positioning
CoMeT presents a principled methodology for efficient long-context modeling in Transformer-based architectures, unifying a gated global memory and a FIFO temporary memory to achieve linear time complexity, constant memory requirements, and empirical robustness across both synthetic and real-world scenarios. Its plug-and-play character and minimal fine-tuning regime facilitate practical adoption in existing models facing the challenges of ultra-long-range sequence processing (Zhao et al., 2 Feb 2026).