
Query-driven Temporal Memory Module

Updated 15 November 2025
  • Query-driven Temporal Memory modules are architectures that boost temporal reasoning by enabling selective memory access via explicit queries.
  • They utilize structured storage and ANN retrieval to efficiently handle long-range dependencies in applications like knowledge graphs and video segmentation.
  • Empirical results show QTM offers significant gains in tasks such as temporal QA, video saliency, and semantic segmentation through task-adaptive memory updates.

Query-driven Temporal Memory (QTM) refers to a family of architectural modules across contemporary machine learning subfields, designed to enhance temporal reasoning and information retention under task-driven constraints. QTM architectures operationalize selective temporal memory access and update, guided by explicit neural or symbolic queries, for improved reasoning, temporal credit assignment, and sequence modeling. The concept appears in recent developments spanning temporal knowledge graph augmentation for LLMs (Tan et al., 15 Oct 2025), differentiable memory for video saliency and segment understanding (Lin et al., 13 Nov 2025), multimodal foundation models (Diao et al., 9 Feb 2025), and memory-augmented semantic segmentation (Wang et al., 2021). Despite domain specificity, these modules share unifying principles of task-adaptive memory indexing, query-driven retrieval and update, and structured temporal constraint enforcement.

1. Core Principles and Motivation

QTM modules address the inability of standard neural networks and foundation models to process extended sequences or multi-hop temporal dependencies due to context window limitations, uniform attention, or memory bottlenecks. The defining features are:

  • Query-driven access: Memory interaction is guided by explicit queries—ranging from natural language questions to transformer-learned queries or segment-specific vectors—rather than uniform or content-agnostic access.
  • Temporal selectivity: Only segments, events, or memory traces relevant to the query and current temporal context are accessed or updated, enabling efficient use of limited memory resources.
  • Structured storage: Memories are typically structured as pools, buffers, or sets of key-value pairs, indexed by embeddings, time, type, and query-relevance.
  • Integration with task pipelines: QTM modules are interleaved with key stages in reasoning, segmentation, or retrieval pipelines, allowing both recall of prior solutions and continual adaptation from new evidence.

This approach enables improved temporal faithfulness, synchronization across multiple entities or modalities, and the effective reuse of reasoning traces or segment features in long-range or multi-hop inference tasks.

2. Architectures and Data Structures

2.1 Temporal Reasoning with Structured Experience Memory

In temporal knowledge graph-based reasoning for LLMs (Tan et al., 15 Oct 2025), QTM, referred to as "Experience Memory," maintains a global pool $E_{\text{pool}}$ of entries with rich annotations:

| Field | Content | Notes |
| --- | --- | --- |
| $q_j$ | (Sub-)question text | Literal question string |
| $I_j$ | Indicator (triple/quadruple template with time variables) | Encodes temporal type/variables |
| $a_j$ | Verified answer (entity/timestamp) | Singular value |
| $P_j$ | Grounding temporal path (sequence of TKG quadruples) | Supports forward-chaining checks |
| $t_j$ | Temporal-type label (e.g., beforeLast, afterFirst) | Categorizes operator |
| $[e_{q_j}, e_{I_j}]$ | Dense vector embeddings for question and indicator | For efficient ANN retrieval |
| $\text{hit\_count}_j$ | Usage frequency for freshness ranking | Promotes frequently helpful hits |

Both embeddings are indexed for joint Approximate Nearest Neighbor (ANN) retrieval, filtered by temporal type and re-ranked by hybrid similarity. A fixed-size Least Recently Used (LRU) buffer caches the most active entries for rapid access.
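A minimal Python sketch of one such entry and its fixed-size LRU cache. The field names mirror the table above but are illustrative; the actual data structures in Tan et al. may differ.

```python
from collections import OrderedDict
from dataclasses import dataclass, field

@dataclass
class ExperienceEntry:
    # Fields mirror the experience-memory table; names are illustrative.
    question: str            # q_j: literal (sub-)question text
    indicator: str           # I_j: triple/quadruple template with time variables
    answer: str              # a_j: verified entity or timestamp
    path: list               # P_j: grounding sequence of TKG quadruples
    temporal_type: str       # t_j: operator label, e.g. "afterFirst"
    q_emb: list = field(default_factory=list)  # e_{q_j}
    i_emb: list = field(default_factory=list)  # e_{I_j}
    hit_count: int = 0       # usage frequency for re-ranking

class LRUCache:
    """Fixed-size cache holding the most recently used memory entries."""
    def __init__(self, capacity: int = 200):
        self.capacity = capacity
        self._store: OrderedDict[str, ExperienceEntry] = OrderedDict()

    def get(self, key: str):
        if key not in self._store:
            return None
        self._store.move_to_end(key)          # mark as most recently used
        return self._store[key]

    def put(self, key: str, entry: ExperienceEntry):
        if key in self._store:
            self._store.move_to_end(key)
        self._store[key] = entry
        if len(self._store) > self.capacity:
            self._store.popitem(last=False)   # evict least recently used
```

The LRU layer only caches hot entries; the full pool behind it remains available through ANN search.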

2.2 Neural Query Memory for Video and Multimodal Understanding

In visual and multimodal tasks, QTM implementations replace prompt or key-value memory banks with dual-stacks of learned query vectors:

  • Frame-level queries ($Q_f \in \mathbb{R}^{N_f \times d}$): Sparse vectors probing immediate frame features for saliency or importance.
  • Video-level queries ($Q_v \in \mathbb{R}^{N_v \times d}$): Compact, trainable memory slots that accumulate and refine temporal context, propagated over time through learnable updates (Lin et al., 13 Nov 2025).

A memory encoder extracts per-frame features, which are cross-attended with video queries to blend new evidence with persistent context. This approach enables full differentiability, prompt-free operation, and eliminates the need for large key/value external memories.

  • Segment Selector and Memory Buffer: For multimodal foundation models (Diao et al., 9 Feb 2025), QTM comprises a segment selection engine (driven by query/feature similarity), a fixed-capacity buffer $M \in \mathbb{R}^{C \times d_m}$ for relevant features, and a gated updater (GRU or additive) for temporal persistence.

2.3 Memory Attention in Video Segmentation

In sequence labeling tasks such as semantic segmentation, the QTM block maintains a finite memory of deep feature maps from the last $T$ video frames (Wang et al., 2021). These maps are aggregated using self-attention between the query (current frame) and the memory bank (recent past), enabling temporal context integration without optical flow estimation.
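A minimal numpy sketch of such a rolling frame memory. Single feature vectors stand in for full per-frame feature maps, and scaled dot-product attention replaces the paper's full attention block; both are simplifications for illustration.

```python
import numpy as np
from collections import deque

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

class FrameMemory:
    """Rolling buffer of the last T per-frame features (no gating)."""
    def __init__(self, T: int = 4):
        self.buffer = deque(maxlen=T)   # old frames drop off automatically

    def attend(self, query_feat: np.ndarray) -> np.ndarray:
        """Aggregate memory via attention between current frame and buffer."""
        if not self.buffer:
            return query_feat
        mem = np.stack(self.buffer)                           # (T', d)
        scores = softmax(mem @ query_feat / np.sqrt(len(query_feat)))
        context = scores @ mem                                # weighted sum of past frames
        return query_feat + context                           # fuse with current features

    def push(self, feat: np.ndarray):
        self.buffer.append(feat)
```

The `deque(maxlen=T)` captures the "simple rolling window; no gating" update noted later in the implementation table.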

3. Retrieval, Update, and Temporal Constraint Algorithms

3.1 Memory Read (Retrieval)

Memory recall is performed via embedding-based ANN search over the memory pool or buffer, restricted by query context (e.g., temporal type, learned similarity). For experience memory:

$$\{e_j\}_{j=1}^{K} = \mathrm{ANN}\bigl(\{e_{q_j}, e_{I_j}\},\; \mathrm{filter}: t_j = t,\; K\bigr)$$

$$\text{Score}_j = \alpha_{\mathrm{sim}} \cos\left(e_q, e_{q_j}\right) + \alpha_{\mathrm{hit}}\,\text{hit\_count}_j \quad (\alpha_{\mathrm{sim}} + \alpha_{\mathrm{hit}} = 1)$$

The top $W_{\mathrm{exp}}$ high-scoring entries are used as exemplars for the current pipeline stage (classification, decomposition, etc.). Retrieval is type-restricted to enforce operator-awareness and context alignment.
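A sketch of this type-filtered retrieval and hybrid re-ranking. Brute-force cosine search stands in for a real ANN index, and the hit counts are normalized to keep the two score terms on a comparable scale (the normalization is an assumption, not stated in the source).

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def retrieve(query_emb, entries, temporal_type, K=10, W_exp=3,
             alpha_sim=0.6, alpha_hit=0.4):
    """Type-filtered retrieval with hybrid similarity/hit-count re-ranking.
    Brute-force cosine search stands in for a real ANN index."""
    # 1. Restrict candidates to the query's temporal type (operator-awareness).
    pool = [e for e in entries if e["temporal_type"] == temporal_type]
    # 2. ANN step: top-K candidates by embedding similarity.
    pool.sort(key=lambda e: cosine(query_emb, e["q_emb"]), reverse=True)
    candidates = pool[:K]
    # 3. Re-rank by hybrid score (hit counts normalized to [0, 1]).
    max_hits = max((e["hit_count"] for e in candidates), default=0) or 1
    def score(e):
        return (alpha_sim * cosine(query_emb, e["q_emb"])
                + alpha_hit * e["hit_count"] / max_hits)
    candidates.sort(key=score, reverse=True)
    return candidates[:W_exp]
```

The weights `alpha_sim=0.6, alpha_hit=0.4` follow the implementation table in Section 6.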

3.2 Memory Write (Update)

After successful inference, the new experience is committed to the memory pool:

$$E_{\text{pool}} \leftarrow E_{\text{pool}} \cup \bigl\{q, I, a, P, t, [e_q, e_I]\bigr\}$$

For neural QTM (in video/multimodal applications), video-level queries are updated via:

$$Q_v(t+1) = Q_v(t) + \mathrm{FFN}\bigl[\mathrm{SA}\bigl(\mathrm{CA}(Q_v(t), F_m(t), F_m(t))\bigr)\bigr]$$

where $F_m(t)$ is the encoded memory feature, CA is cross-attention, SA is self-attention, and FFN is a feedforward MLP.
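The update above can be sketched in numpy with single-head attention and fixed random projection weights. Both are simplifications: a trained module would learn multi-head projections and typically include layer normalization.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # illustrative query dimension

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    """Single-head scaled dot-product attention."""
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

# Illustrative fixed weights; a real module would learn these.
W1 = rng.normal(size=(d, 4 * d))
W2 = rng.normal(size=(4 * d, d))

def ffn(x):
    return np.maximum(x @ W1, 0) @ W2   # two-layer MLP with ReLU

def update_video_queries(Q_v, F_m):
    """Q_v(t+1) = Q_v(t) + FFN[SA(CA(Q_v, F_m, F_m))]."""
    blended = attention(Q_v, F_m, F_m)              # CA: queries read memory features
    refined = attention(blended, blended, blended)  # SA over the query set
    return Q_v + ffn(refined)                       # residual write-back
```

The residual form means the query slots persist across frames while being refined by each new batch of memory features.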

3.3 Temporal Constraints and Path Validation

When reusing prior memory, QTM enforces structured validity:

  • Local monotonicity:

    $$t_1 \leq t_2 \leq \ldots \leq t_l$$

    along a path $P = [(e_1, r_1, e_2, t_1), \ldots, (e_l, r_l, e_{l+1}, t_l)]$

  • Global monotonicity across segments:

    $$\max_{(\cdot,\cdot,\cdot,t) \in P_i} t \;\leq\; \min_{(\cdot,\cdot,\cdot,t) \in P_{i+1}} t$$

  • Indicator-driven co-constraint ensures that all logical requirements of the prior experience are implied by the current query's constraints.

Memory entries failing these checks are not reused, preserving both efficiency and correctness in temporal multi-hop or multi-entity problems.
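These checks reduce to simple comparisons over path timestamps. A minimal sketch, assuming paths are lists of (subject, relation, object, time) quadruples:

```python
def locally_monotone(path):
    """Check t_1 <= t_2 <= ... <= t_l along one grounding path."""
    times = [t for (_, _, _, t) in path]
    return all(a <= b for a, b in zip(times, times[1:]))

def globally_monotone(segments):
    """Check max time of P_i <= min time of P_{i+1} across segments."""
    for p_i, p_next in zip(segments, segments[1:]):
        if max(t for (_, _, _, t) in p_i) > min(t for (_, _, _, t) in p_next):
            return False
    return True

def reusable(segments):
    """A cached path is reused only if every temporal check passes."""
    return (all(locally_monotone(p) for p in segments)
            and globally_monotone(segments))
```

Indicator-driven co-constraints would sit on top of these structural checks and are omitted here.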

4. Integration with Upstream Pipelines

4.1 Temporal Reasoning Pipelines

In temporal KG reasoning, QTM is accessed at crucial decision points in the Tree of Time (ToT) decomposition: temporal-type classification, hierarchical decomposition, seed-entity selection, and toolkit selection. At each node, an API $\mathrm{MemoryLookupAndTest}(q_i, I_i, E_{\text{pool}})$ determines if assignment and evidence from memory can satisfy all current constraints. If so, full external retrieval and reasoning are bypassed.
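A schematic of this gating step. The `retrieve_fn` and `validate_fn` callbacks are hypothetical stand-ins for the actual retrieval and constraint-checking routines:

```python
def memory_lookup_and_test(query, indicator, pool, retrieve_fn, validate_fn):
    """Gate at a ToT node: reuse a memory hit if it satisfies all current
    constraints; otherwise fall through to full retrieval and reasoning."""
    for entry in retrieve_fn(query, indicator, pool):
        if validate_fn(entry, query):              # temporal/indicator checks
            entry["hit_count"] = entry.get("hit_count", 0) + 1
            return entry["answer"]                 # bypass external retrieval
    return None                                    # caller runs the full pipeline
```

Returning `None` signals the caller to run the full decomposition and retrieval path, after which the new result would be written back to the pool.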

4.2 Video and Multimodal Inference

QTM replaces manual prompts and external transformer-memories with two sets of learned queries per frame, generating embeddings fed to a decoder (e.g., mask decoder (Lin et al., 13 Nov 2025)). All updates are handled via end-to-end backpropagation from the task objective, and long-range temporal dependencies are captured through propagation of video-level query vectors without reliance on extensive external memory.

In multimodal systems, QTM is a plug-in front end to existing encoders: it filters, buffers, and propagates only the most query-relevant frames and segments to the downstream MFM, using iterative scoring (distinctiveness, query similarity) and a small trainable GRU or additive memory step (Diao et al., 9 Feb 2025).
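A minimal sketch of such a selector and buffer. An EMA-style additive write stands in for the learned GRU gate; the capacity, dimensions, and blending rule are illustrative.

```python
import numpy as np

class SegmentMemory:
    """Fixed-capacity buffer M (C x d) of query-relevant segment features,
    with an additive (EMA-style) stand-in for a learned GRU gate."""
    def __init__(self, capacity: int = 16, dim: int = 8, beta: float = 0.7):
        self.capacity = capacity
        self.beta = beta                       # persistence weight
        self.M = np.zeros((capacity, dim))
        self.filled = 0

    def select(self, segments: np.ndarray, query: np.ndarray, k: int = 2):
        """Score segments by cosine similarity to the query; keep top-k."""
        sims = segments @ query / (
            np.linalg.norm(segments, axis=1) * np.linalg.norm(query) + 1e-9)
        return segments[np.argsort(sims)[::-1][:k]]

    def update(self, selected: np.ndarray):
        """Write selected features into the buffer, blending with old slots."""
        for feat in selected:
            slot = self.filled % self.capacity
            self.M[slot] = self.beta * self.M[slot] + (1 - self.beta) * feat
            self.filled += 1
```

Only the `select` output reaches the downstream model, which is what yields the input-size reduction noted in Section 6.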

4.3 Semantic Segmentation

The TMA (QTM) block in video segmentation sits after the backbone encoder, aggregating memory-attended features from past frames and fusing them channel-wise with current features for pixel-level classification (Wang et al., 2021). The memory bank updates roll forward sequentially, with no learned gating or explicit memory overwrite.

5. Experimental Results and Empirical Impact

QTM modules confer measurable improvements in temporal reasoning, sequence modeling, and saliency detection:

  • Temporal QA and KG Reasoning: On the MultiTQ benchmark (GPT-4o-mini backbone), QTM delivers a +4.4% absolute gain in overall Hits@1 (64.2% vs. 59.8%), with multi-entity/multi-hop lifts of up to +14.2% (26.1 → 40.3 for multi-entity questions) (Tan et al., 15 Oct 2025). Reducing memory capacity to roughly 100 entries had negligible impact; below that, performance begins to degrade.
  • Multimodal Video Tasks: In MFM-based pipelines, QTM improved performance across nine SOTA models:
    • AVQA: up to +4.69% (DG-SCT) accuracy gains on MUSIC-AVQA v2.0
    • Video captioning: +6.82 CIDEr (Git on MSR-VTT)
    • Video-text retrieval: up to +2.8 Recall@10 (VINDLU) (Diao et al., 9 Feb 2025)
    • Ablation studies confirmed that both visual and audio memory branches are critical; omitting either results in significant accuracy loss.
  • RGB-D Salient Video Object Detection: QTM in SAM-DAQ consistently outperforms prior methods across three benchmarks, with gains attributed to prompt-free learning and joint query/memory embeddings (Lin et al., 13 Nov 2025).
  • Semantic Segmentation: TMA networks (QTM) reach 80.3% mIoU on Cityscapes and 76.5% on CamVid with ResNet-50 backbone, competitive with costly optical flow-based methods (Wang et al., 2021).

6. Implementation and Practical Considerations

Common hyperparameters and recipes include:

| Domain | Buffer Size / Capacity | Query Dim. | Retrieval Candidates | Notable Weights / Tricks |
| --- | --- | --- | --- | --- |
| Temporal KG/LLM | 200 entries (LRU) | varies | $K = 10$–$20$ | $\alpha_{\mathrm{sim}} = 0.6$, $\alpha_{\mathrm{hit}} = 0.4$ |
| Video Saliency (SAM-DAQ) | $N_f = 30$, $N_v = 8$ | $d = 64$ | All | proj. dropout = 0.1, $W_{\mathrm{exp}} = 10$ |
| Multimodal FM | $C = 16$–$32$ | — | $k = 8$–$12$ | InfoNCE and downstream loss |
| Segmentation | $T = 4$ frames | — | — | Simple rolling window; no gating |
  • Insertion Point: QTM modules can often be retrofitted to state-of-the-art models as shallow pre-encoders, with only light retraining required.
  • Overhead: QTM incurs 5–15% runtime increase in multimodal applications, offset by a 6–12× input size reduction via selective segment retention.
  • Differentiability: Modern QTM instantiations are fully differentiable and trained end-to-end from task-level loss, without explicit memory supervision.
  • Pluggability: Templates for segment selection, buffer update, and memory readout can be reused across domains with minimal adaptation.

Memory size and selection thresholds represent the main speed/accuracy trade-off axis; reducing the number of stored entries or queries below optimal values degrades retrieval quality or temporal coverage.

7. Relation to Broader Memory and Attention Mechanisms

QTM occupies a distinctive space between general neural memory systems (e.g., Neural Turing Machines, Transformer global attention) and fixed-window or optical-flow frameworks. Its task-driven, query-dependent retrieval and strong temporal structure enforcement differentiate it from content-agnostic memory or generic attention blocks. All contemporary deployments enforce explicit semantic or temporal constraints at retrieval, and training is optimized for faithfulness and task-efficient reuse, rather than coverage or raw capacity.

A plausible implication is that QTM provides a template for robust, generalizable selective long-horizon memory not only in reasoning or perception, but across any domain where temporal or sequential generalization must be achieved with fixed model resources.


In summary, Query-driven Temporal Memory modules operationalize efficient, context-aware temporal recall and update across language, vision, and multimodal domains, underpinning significant advances in LLM temporal reasoning, memory-augmented video processing, and context-efficient multimodal learning.
