Collaborative Memory Transformer (CoMeT)
- The paper introduces CoMeT, a dual-memory Transformer extension that achieves linear time complexity and constant GPU memory usage for ultra-long sequences.
- It uses a gated global memory and a FIFO temporary memory to retain both long-term context and recent high-resolution details with minimal parameter overhead.
- Empirical evaluations show significant speedups and competitive performance on benchmarks, highlighting its efficiency over traditional full-attention mechanisms.
The Collaborative Memory Transformer (CoMeT) is an architectural extension for Transformers that enables efficient and high-fidelity processing of sequences with arbitrarily long contexts. CoMeT addresses the fundamental limitations of the standard Transformer—namely, quadratic time complexity and an indefinitely growing key-value (KV) cache—by introducing a dual-memory system and chunkwise processing. The architecture supports constant memory usage and linear time complexity, while preserving both long-term and short-term context information. CoMeT is implemented as a parameter-efficient plug-in module, allowing integration into existing pre-trained Transformer models with minimal fine-tuning requirements (Zhao et al., 2 Feb 2026).
1. Motivations and Objectives
Traditional Transformer models incur O(N²) time and O(N) space complexity for a sequence of length N, primarily due to the global self-attention mechanism and the requirement to store all key-value pairs in the cache for context retention. This makes processing extremely long contexts, such as tens or hundreds of thousands of tokens, intractable on commodity hardware. While prior finite-state or recurrent approaches deliver O(N) time and O(1) space, they typically suffer from a lack of explicit gating (resulting in catastrophic forgetting of salient information) or from an inability to retain fine-grained recent details.
CoMeT’s primary objectives are:
- To process arbitrarily long input sequences with constant GPU memory and linear time complexity.
- To retain both persistent long-term memory and recent, high-resolution context.
- To provide a parameter-efficient, minimally invasive module—enabling rapid and effective adaptation of pre-trained LLMs to ultra-long-range tasks.
2. Architectural Composition
CoMeT processes sequences as consecutive fixed-size chunks of length C and operates at the granularity of Transformer layers. At each layer index i and chunk index τ, the architecture introduces two sets of auxiliary memory tokens:
- Global memory Gⁱ_τ: encodes persistent, salient historical information using a compact per-layer state Sⁱ_τ.
- Temporary memory Tⁱ_τ: maintains a FIFO queue of high-resolution representations for recent chunks.
The input to each chunk is formed by prepending Gⁱ_τ and Tⁱ_τ to the chunk’s hidden states Hⁱ_τ, interleaving compression tokens Cⁱ_τ, and appending readout tokens Rⁱ_τ. All tokens interact through causal self-attention.
2.1 Dual-Memory Design
Global Memory
Each layer maintains a persistent global memory state Sⁱ_τ, transformed into memory tokens Gⁱ_τ via a residual low-rank adapter (RLA):

Gⁱ_τ = RLA(Sⁱ_τ)

At each chunk, the readout tokens Rⁱ⁺¹_τ are RMS-normalized to yield candidate states S̃ⁱ_{τ+1} = RMSNorm(Rⁱ⁺¹_τ). The update of the persistent state employs a gating mechanism:

g = σ(W_g [Sⁱ_τ; S̃ⁱ_{τ+1}]),  Sⁱ_{τ+1} = g ⊙ Sⁱ_τ + (1 − g) ⊙ S̃ⁱ_{τ+1}

where W_g is a learned gating matrix, σ is the sigmoid function, and ⊙ denotes elementwise multiplication.
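The gated state update can be sketched in NumPy; the tensor shapes and the RMSNorm epsilon below are illustrative assumptions, not the paper's exact implementation:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def update_global_memory(S, R_new, W_g):
    """Gated update of the persistent state S from new readout tokens.

    Assumed shapes: S and R_new are (m, d) memory states;
    W_g is a (d, 2*d) gating matrix.
    """
    # RMS-normalize the readout tokens to form the candidate state
    S_cand = R_new / np.sqrt(np.mean(R_new**2, axis=-1, keepdims=True) + 1e-6)
    # Gate computed from the concatenation [S; S_cand]
    g = sigmoid(np.concatenate([S, S_cand], axis=-1) @ W_g.T)
    # Convex combination: g preserves the old state, (1 - g) admits the new
    return g * S + (1.0 - g) * S_cand
```

Because the gate is a convex combination, salient entries of the old state can be protected (g ≈ 1) while stale entries are overwritten (g ≈ 0).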
Temporary Memory
Temporary memory is a per-layer FIFO queue of fixed capacity M. For each new chunk, entries are derived from RMS-normalized compression tokens passed through the RLA adapter:

Tⁱ_{τ+1} ← enqueue(Tⁱ_τ, RLA(RMSNorm(Cⁱ⁺¹_τ)))
This queue structure provides high-resolution retention of the most recent context, mitigating catastrophic forgetting of near-term events.
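A minimal sketch of the FIFO behavior using Python's `collections.deque` (the eviction policy is as described above; the type of the stored entries is an assumption):

```python
from collections import deque

class TemporaryMemory:
    """Per-layer FIFO queue of compressed chunk representations.

    Capacity M bounds the queue; once full, enqueueing a new chunk
    evicts the oldest one, giving a rolling high-resolution window
    over the most recent context.
    """
    def __init__(self, capacity):
        # deque(maxlen=...) drops the oldest entry automatically
        self.queue = deque(maxlen=capacity)

    def enqueue(self, compressed_chunk):
        self.queue.append(compressed_chunk)

    def tokens(self):
        # Concatenated view served to attention as part of the soft prompt
        return list(self.queue)
```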
2.2 Attention Mechanism and Dynamic Soft Prompt
At each chunk, the queries formed from the chunk tokens attend to the keys and values resulting from the concatenation concat(Gⁱ_τ, Tⁱ_τ, interleave(Hⁱ_τ, Cⁱ_τ), Rⁱ_τ). The global and temporary memories thus act as adaptive, content-based “soft prompts” conditioning the chunk’s processing.
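As a hedged illustration, the per-chunk input assembly might look as follows; the grouping factor between hidden states and compression tokens is an assumed hyperparameter not specified in the text:

```python
def assemble_chunk_input(G, T, H, C, R, group=4):
    """Assemble the per-chunk attention input: [G | T | interleave(H, C) | R].

    H is the list of chunk hidden states; one compression token from C is
    interleaved after every `group` hidden states (the grouping factor is
    an illustrative assumption). The result then attends causally.
    """
    interleaved = []
    for idx, c_tok in enumerate(C):
        interleaved.extend(H[idx * group:(idx + 1) * group])
        interleaved.append(c_tok)
    # Trailing hidden states without an associated compression token
    interleaved.extend(H[len(C) * group:])
    return list(G) + list(T) + interleaved + list(R)
```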
2.3 Processing Sequence
CoMeT’s forward pass is organized as follows:
```
Initialize Sⁱ₀ ← 0 for all layers i
Initialize empty FIFO queue Tⁱ₀ for each layer i
for each chunk τ = 1…⌈N/C⌉ do
    H⁰_τ ← embedding(tokens_τ)
    C⁰_τ, R⁰_τ ← ∅
    for each layer i = 1…L do
        # Prepend memories, interleave tokens
        inputⁱ_τ ← concat(Gⁱ_τ, Tⁱ_τ, interleave(Hⁱ_τ, Cⁱ_τ), Rⁱ_τ)
        # Transformer layer update
        Hⁱ⁺¹_τ, Cⁱ⁺¹_τ, Rⁱ⁺¹_τ ← TransformerLayerⁱ(inputⁱ_τ)
        # Global memory update (gated)
        S̃ⁱ_{τ+1} ← RMSNorm(Rⁱ⁺¹_τ)
        g ← σ(W_g [Sⁱ_τ; S̃ⁱ_{τ+1}])
        Sⁱ_{τ+1} ← g ⊙ Sⁱ_τ + (1 − g) ⊙ S̃ⁱ_{τ+1}
        # New global memory tokens
        Gⁱ_{τ+1} ← RLA(Sⁱ_{τ+1})
        # Temporary memory update
        enqueue(Tⁱ_{τ+1}, RLA(RMSNorm(Cⁱ⁺¹_τ)))
        if size(Tⁱ_{τ+1}) > M: dequeue oldest entry
    end for
end for
```
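The control flow of this forward pass can be condensed into a minimal, framework-agnostic Python sketch; `layer_fn`, `update_global`, and `update_temp` are caller-supplied stand-ins (hypothetical signatures) for the real Transformer layer and memory updates:

```python
def run_comet(chunks, n_layers, layer_fn, update_global, update_temp):
    """Minimal sketch of CoMeT's chunkwise forward pass.

    chunks: iterable of embedded hidden states, one entry per chunk.
    layer_fn(i, S_i, T_i, H) -> (H_next, C_out, R_out)  # stand-in layer
    update_global(S_i, R_out) -> S_i_next               # gated state update
    update_temp(T_i, C_out) -> T_i_next                 # FIFO enqueue/evict
    """
    S = [None] * n_layers               # persistent global states, zero-init
    T = [[] for _ in range(n_layers)]   # per-layer FIFO queues
    for H in chunks:
        for i in range(n_layers):
            H, C_out, R_out = layer_fn(i, S[i], T[i], H)
            S[i] = update_global(S[i], R_out)
            T[i] = update_temp(T[i], C_out)
    return H
```

Note that only the fixed-size S and T persist across chunks; each chunk's hidden states are discarded after processing, which is what keeps peak memory constant.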
3. Complexity and Theoretical Analysis
Computational Characteristics
Whereas full-attention Transformers require O(N²) time and O(N) space for sequences of length N, CoMeT processes tokens in chunked batches of size C, with fixed memory sizes:
- Per-chunk attention cost: O((C + m)²), where m is the fixed number of auxiliary memory tokens; this reduces to O(C²) when m ≪ C.
- Total sequence time: O(N·C) for fixed C, i.e., linear in N.
- Peak space usage: O(1) with respect to N.
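These scaling claims can be checked with a back-of-the-envelope cost model (constants omitted; treating m as a single count of all auxiliary memory tokens is an illustrative simplification):

```python
def attention_cost(n):
    # Full self-attention scales with the square of the sequence length
    return n * n

def comet_cost(N, C, m):
    """Total attention cost for CoMeT: ceil(N / C) chunks, each attending
    over C chunk tokens plus m fixed memory/auxiliary tokens."""
    chunks = -(-N // C)  # ceiling division
    return chunks * attention_cost(C + m)
```

Doubling N doubles CoMeT's total cost (linear in N) but quadruples full attention's, while CoMeT's per-chunk working set, and hence peak memory, stays fixed at C + m tokens.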
Comparison with Related Methods
Many recurrent-Transformer and finite-state variants achieve similar asymptotic complexity, but typically lack robust separation between preserving salient long-term facts and short-term details. CoMeT’s dual-memory architecture—with both gating (for persistent state protection) and a FIFO rolling window—addresses this gap (Zhao et al., 2 Feb 2026).
4. Training Procedures and Integration
Plug-In Architecture and Fine-Tuning
Integration of CoMeT into a pre-existing Transformer requires introducing:
- RLA adapters per layer (with O(d·r) parameters at low rank r ≪ d, for model dimension d),
- A gating matrix W_g per layer.
Only these additional weights (a small fraction of total model parameters) are trained during supervised fine-tuning, with the main parameters of the backbone left frozen. Fine-tuning converges after three epochs over chunked long-context data (e.g., on 32k-token chunks).
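A rough parameter-count sketch, assuming each RLA factorizes into a down- and an up-projection, that there are two adapters per layer (one each for the global and temporary memory paths), and that W_g maps the concatenated 2d-dimensional input back to d dimensions; all of these structural details are assumptions for illustration:

```python
def rla_param_count(d_model, rank):
    """One residual low-rank adapter: a (d_model x rank) down-projection
    plus a (rank x d_model) up-projection; biases omitted for simplicity."""
    return 2 * d_model * rank

def comet_extra_params(d_model, rank, n_layers, adapters_per_layer=2):
    # Per layer: the RLA adapters plus a gating matrix W_g of
    # shape (d_model, 2 * d_model).
    per_layer = adapters_per_layer * rla_param_count(d_model, rank)
    per_layer += 2 * d_model * d_model
    return n_layers * per_layer
```

Under these assumptions the added weights are dominated by the gating matrices, and the low-rank factor keeps the adapter contribution negligible relative to the frozen backbone.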
Efficient Layer-Level Pipeline Parallelism
Conventional parallelism, where each GPU processes entire chunks in sequence, results in substantial idle time and under-utilization. CoMeT introduces layer-level pipeline parallelism:
- After completing layer i on chunk τ, a worker immediately transmits the updated Sⁱ and Tⁱ to the next worker, which can process layer i for chunk τ+1 in parallel.
- This strategy empirically yields a substantial speedup (on 128k-token contexts with 16 GPUs) over naive chunkwise pipelining.
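The benefit can be illustrated with an idealized makespan calculation, assuming uniform per-layer, per-chunk compute time and negligible communication cost (both simplifications):

```python
def serial_makespan(n_chunks, n_stages, t=1.0):
    # Naive scheme: each chunk traverses all pipeline stages
    # before the next chunk starts, so stages mostly sit idle.
    return n_chunks * n_stages * t

def pipelined_makespan(n_chunks, n_stages, t=1.0):
    # Layer-level pipelining: a stage starts the next chunk as soon as
    # the updated memories arrive (classic pipeline fill + drain).
    return (n_stages + n_chunks - 1) * t
```

With 16 chunks over 4 stages, the pipeline finishes in 19 time units rather than 64, and the advantage grows with the number of chunks, i.e., with context length.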
5. Empirical Evaluation
Passkey Retrieval at Scale
In the “passkey retrieval” experiment, where a numeric target is hidden within 1 million distractor tokens and training occurs on 32k-token contexts, CoMeT retrieves the passkey with high accuracy irrespective of its position. This is accomplished with faster inference and smaller memory usage compared to a full-attention baseline on the same input length.
SCROLLS Benchmark Performance
On the SCROLLS benchmark (7 long-context tasks), models with CoMeT and 32k training context lengths demonstrate:
- Higher average score (40.10) than all other plug-in (“compression-based” and “finite-state”) baselines (next best: SWA, 38.24).
- On summarization (GovReport, SummScreen), performance is on par with a fully fine-tuned full-attention baseline (e.g., 62.5 ROUGE-1 vs. 61.0).
Real-World Agent and User-Behavior QA
- On user-behavior QA (UQA) involving e-commerce clickstream data (4k memory), CoMeT exceeds xRAG (retrieval-augmented baseline) by 2.7 percentage points and greatly outperforms a truncated 4k full-attention model by 27 percentage points.
- In the long-horizon agent task (128k token trajectories), layer-level pipelined training with CoMeT is faster than naive pipelining and attains full-attention performance (Terminal-Bench) despite constant memory usage.
6. Implementation Guidelines and Future Directions
Practical Integration Steps
- Insert RLA adapters and gating matrices at every Transformer layer.
- Initialize the global memory state Sⁱ₀ to zero in each layer.
- Set up the per-layer FIFO temporary memory queue with chosen capacity M.
- Modify the forward pass to prepend the Gⁱ and Tⁱ tokens, interleave the Cⁱ tokens, and append the Rⁱ tokens as specified.
- Fine-tune solely the RLA and gate parameters using chunked long-context data.
Limitations and Extensions
CoMeT currently maintains exclusively intrinsic model memory and does not incorporate episodic memory, test-time training, or connectivity to external knowledge (such as RAG retrieval or notebook memory). Future research priorities include:
- Dynamic adjustment of memory capacities based on information salience.
- Hierarchical or topic-sensitive gating strategies.
- Integration with external retrieval or episodic augmentation modules.
- Adapting the architecture for multimodal contexts, including video+text scenarios.
7. Summary and Positioning
CoMeT presents a principled methodology for efficient long-context modeling in Transformer-based architectures, unifying a gated global memory and a FIFO temporary memory to achieve linear time complexity, constant memory requirements, and empirical robustness across both synthetic and real-world scenarios. Its plug-and-play character and minimal fine-tuning regime facilitate practical adoption in existing models facing the challenges of ultra-long-range sequence processing (Zhao et al., 2 Feb 2026).