Temporal Graph Networks Overview
- Temporal Graph Networks (TGNs) are deep learning architectures that model dynamic graphs by capturing temporal interactions and evolving node states.
- They employ recurrent memory updates, time encoding, and attention-based aggregation to process sequential graph events for tasks like link prediction and recommendations.
- Innovations such as adaptive neighborhood sampling and efficient training protocols enable TGNs to achieve state-of-the-art performance in domains ranging from social networks to finance.
Temporal Graph Networks (TGNs) are deep learning architectures designed to model dynamic graphs, where nodes, edges, and their attributes evolve in continuous time. TGNs formalize dynamic relational systems as sequences of discrete-time or continuous-time interaction events, enabling node-level and edge-level predictions that respect temporal order and context. Established in foundational work by Rossi et al. ("Temporal Graph Networks for Deep Learning on Dynamic Graphs", 2020), TGNs combine recurrent node memory, time encoding, and graph-based aggregation to address both structural and temporal complexity in diverse domains including social networks, recommender systems, and time-resolved financial platforms.
1. Formulation of Temporal Graphs and the TGN Framework
Dynamic graphs are typically represented in continuous-time event format, where each event is a tuple $(u, v, t, e_{uv}(t))$, with nodes $u, v$ interacting at time $t$ with edge features $e_{uv}(t)$. For each node $i$, TGNs maintain a time-indexed memory $s_i(t)$ that summarizes its historical context. Upon an interaction, a raw message is constructed, typically by concatenating pre-event memories, time encodings (such as sinusoidal or learned linear forms), and edge features: $m_u(t) = \big[\, s_u(t^-) \,\|\, s_v(t^-) \,\|\, \mathrm{TE}(t - t_u^-) \,\|\, e_{uv}(t) \,\big]$, where $s_u(t^-)$ denotes the memory of $u$ just before the event, $t_u^-$ is the time of $u$'s previous event, and $\mathrm{TE}(\cdot)$ denotes time encoding. Aggregated messages are written into memory via recurrent update cells (GRU, RNN). Embeddings are computed either directly from node memory or by aggregating over temporally-aware neighborhoods using schemes such as attention or sum/GCN aggregators (Rossi et al., 2020, Verma et al., 2023, Kim et al., 2024).
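The message-then-memory-update pipeline can be sketched as follows. This is a minimal illustration, not the trained model: the time-encoding frequencies are an assumed geometric schedule, and `gru_like_update` replaces the learned GRU gates with a fixed convex combination to show only the shape of the recurrence.

```python
import math

def time_encoding(dt, dim=4):
    # Sinusoidal time encoding of the gap dt; the frequency schedule
    # (powers of 10) is an illustrative assumption, not a trained choice.
    return [math.cos(dt / (10.0 ** (2 * i / dim))) for i in range(dim)]

def raw_message(mem_u, mem_v, dt, edge_feat):
    # m_u(t) = [ s_u(t-) || s_v(t-) || TE(t - t_u-) || e_uv(t) ]
    return mem_u + mem_v + time_encoding(dt) + edge_feat

def gru_like_update(memory, message, alpha=0.5):
    # Stand-in for a learned GRU cell: a fixed-gate convex combination of
    # the old memory and the leading message coordinates (zip truncates).
    return [alpha * s + (1 - alpha) * m for s, m in zip(memory, message)]

mem_u, mem_v = [0.1, 0.2], [0.3, 0.4]
msg = raw_message(mem_u, mem_v, dt=5.0, edge_feat=[1.0])
mem_u = gru_like_update(mem_u, msg)
print(len(msg))  # 2 + 2 + 4 + 1 = 9
```

A real implementation batches messages per destination node and applies one GRU step per batch; the point here is only that memory changes at event times and stays fixed in between.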
2. Core Modules: Memory, Neighborhood Sampling, and Aggregation
TGNs are distinguished by three interacting modules: memory update, neighborhood sampling, and aggregation functions.
- Memory update: Nodes use event-driven recurrent modules (typically GRUs) to maintain hidden states, only updating at event times and remaining constant otherwise; this efficiently encodes long-term node and edge state (Rossi et al., 2020, Yang et al., 2024).
- Neighborhood sampling: For embedding, nodes sample $k$-hop temporal neighborhoods, most often the most recent neighbors sorted by timestamp (most-recent sampling), which empirical evidence shows outperforms uniform or random selection (Yang et al., 2024). Sampling complexity is minimized via temporal-CSR data formats.
- Aggregation: Node embeddings are produced by aggregating sampled neighbor memories—with attention aggregators consistently yielding superior predictive metrics compared to MLP-mixer or simple sum (Yang et al., 2024, Verma et al., 2023, Kim et al., 2024). Attention scores can be computed through scalar MLPs over concatenated memory and edge features, followed by learned output projections.
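The sampling and aggregation steps above can be sketched together. This is a simplified, unbatched illustration with events treated as directed tuples `(u, v, t)`: most-recent sampling is a sort-and-truncate over the node's history, and the aggregator is plain scaled dot-product attention rather than the learned multi-head projections used in practice.

```python
import math

def most_recent_neighbors(events, node, t, k=2):
    # Keep the k most recent neighbors of `node` strictly before time t.
    hist = [(ts, v) for (u, v, ts) in events if u == node and ts < t]
    hist.sort(key=lambda x: x[0], reverse=True)
    return hist[:k]

def attention_aggregate(query, neighbor_feats):
    # Scaled dot-product attention over neighbor features; a real TGN
    # attention aggregator adds learned query/key/value projections.
    d = math.sqrt(len(query))
    scores = [sum(q * f for q, f in zip(query, feat)) / d
              for feat in neighbor_feats]
    mx = max(scores)
    weights = [math.exp(s - mx) for s in scores]
    z = sum(weights)
    weights = [w / z for w in weights]
    dim = len(neighbor_feats[0])
    return [sum(weights[i] * neighbor_feats[i][j]
                for i in range(len(weights))) for j in range(dim)]

events = [(0, 1, 1.0), (0, 2, 2.0), (0, 3, 3.0)]
nbrs = most_recent_neighbors(events, node=0, t=4.0, k=2)
print(nbrs)  # [(3.0, 3), (2.0, 2)]
```

A temporal-CSR layout replaces the linear scan in `most_recent_neighbors` with an index into pre-sorted per-node event arrays, which is where the sampling speedups come from.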
3. Temporal Encoding and Inductive Bias
Time encoding is fundamental in TGNs, injecting recency sensitivity and temporal semantics into the learning pipeline. Two common implementations are sinusoidal encoding, $\mathrm{TE}(\Delta t) = [\cos(\omega_1 \Delta t), \sin(\omega_1 \Delta t), \ldots, \cos(\omega_{d/2} \Delta t), \sin(\omega_{d/2} \Delta t)]$ with fixed frequencies $\omega_i$, and learned linear encoding, $\mathrm{TE}(\Delta t) = \mathbf{w}\,\Delta t + \mathbf{b}$ with trainable parameters $\mathbf{w}, \mathbf{b}$ (Kim et al., 2024). These encodings are concatenated into every message, enabling the model to assess both time gaps and event frequency within dynamic contexts.
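Both encoding forms are a few lines each. In this sketch the sinusoidal frequencies follow an assumed geometric schedule, and the "learned" parameters of the linear form are supplied by hand rather than trained.

```python
import math

def sinusoidal_te(dt, dim=6):
    # Fixed sinusoidal encoding of a time gap dt; the geometric frequency
    # schedule (powers of 10) is an illustrative assumption.
    half = dim // 2
    freqs = [1.0 / (10.0 ** i) for i in range(half)]
    return [f(w * dt) for w in freqs for f in (math.cos, math.sin)]

def learned_linear_te(dt, w, b):
    # Learned linear encoding TE(dt) = w * dt + b; w and b would be
    # trainable parameters in a real model.
    return [wi * dt + bi for wi, bi in zip(w, b)]

print(sinusoidal_te(0.0))  # [1.0, 0.0, 1.0, 0.0, 1.0, 0.0]
```

At $\Delta t = 0$ the sinusoidal encoding is a fixed pattern of ones and zeros, which is how the model can tell a fresh interaction from a stale one regardless of absolute time.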
4. Expressivity, Limitations, and Enhancements
Standard TGNs are expressive but possess limitations. Notable results (Souza et al., 2022, Tjandra et al., 2024):
- Expressivity hinges on injectivity of aggregation and memory update functions. Mean pooling or proportion-invariant (softmax) attention is non-injective, failing to distinguish certain temporal computation trees.
- TGNs cannot represent moving averages, persistent forecasting, or general autoregressive models over messages unless source-target identification (ID encoding) is directly incorporated into message construction (as in TGNv2) (Tjandra et al., 2024). Augmenting messages with sender/receiver IDs enables exact representation and recall of ordered interaction histories.
- Injective TGNs can match the temporal Weisfeiler–Leman (WL)-test upper bound in distinguishability for anonymous graphs; relative positional encodings further subsume walk-aggregate TGNs (Souza et al., 2022).
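The effect of source-target identification can be shown with a minimal sketch: two events whose endpoint memories coincide produce identical plain messages, while appending ID encodings restores distinguishability. One-hot vectors stand in here for TGNv2's learned ID encodings.

```python
def one_hot(i, n):
    # Stand-in for a learned node-ID encoding.
    return [1.0 if j == i else 0.0 for j in range(n)]

# Nodes 0 and 1 happen to carry identical memories:
mem = {0: [0.5], 1: [0.5], 2: [0.0]}
te, ef = [1.0], [0.0]  # shared time encoding and edge features

plain_a = mem[0] + mem[2] + te + ef  # message for event (0, 2)
plain_b = mem[1] + mem[2] + te + ef  # message for event (1, 2)

with_id_a = plain_a + one_hot(0, 3) + one_hot(2, 3)
with_id_b = plain_b + one_hot(1, 3) + one_hot(2, 3)

print(plain_a == plain_b)      # True: indistinguishable without IDs
print(with_id_a == with_id_b)  # False: IDs restore distinguishability
```

This is why memory-only message functions cannot recall *which* partner a node interacted with once memories collide, and why ID-augmented messages can represent ordered interaction histories.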
Trajectory encoding TGNs (TETGN) leverage learnable, auto-expandable node IDs as positional features and perform message passing over these IDs, combining them with standard TGN streams via multi-head attention. This unifies the strengths of anonymous and non-anonymous models for both transductive and inductive settings (Xiong et al., 2025).
5. Implementation, Training Protocols, and Efficiency Techniques
TGNs operate in mini-batch chronological order, preserving temporal causality. Training objectives vary: link prediction is optimized by a negative-sample binary cross-entropy, while recommendation tasks often use Bayesian Personalized Ranking (BPR) loss (Kim et al., 2024).
- Optimization typically uses Adam, with learning rates and hidden dimensions set by empirical ablation (Kim et al., 2024, Verma et al., 2023).
- Graph sampling and CSR conversion can become bottlenecks; transformer-based TGNs (TF-TGN) amortize these with highly parallel C++/OpenMP routines and integrate flash-attention kernels, yielding substantial training acceleration on massive graphs (Huang et al., 2024).
- Continuous training approaches (HAL) convert idle computation due to sparse supervision into productive learning by inserting history-averaged pseudo-labels per batch, reducing gradient variance and accelerating convergence without modifying the architecture (Panyshev et al., 2025).
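The training protocol of Section 5 (chronological mini-batches with negative-sampled binary cross-entropy) can be sketched as below. The scores are hand-set stand-ins for the model's edge scores, and the uniform negative sampler is a simplifying assumption; real pipelines also roll the memory forward between batches.

```python
import math
import random

def bce_with_negatives(pos_score, neg_score):
    # Binary cross-entropy over one positive edge and one sampled negative.
    sig = lambda x: 1.0 / (1.0 + math.exp(-x))
    return -(math.log(sig(pos_score)) + math.log(1.0 - sig(neg_score)))

events = [(0, 1, 1.0), (1, 2, 2.0), (0, 2, 3.0), (2, 3, 4.0)]
batch_size = 2
random.seed(0)

for start in range(0, len(events), batch_size):
    batch = events[start:start + batch_size]  # chronological order preserved
    for (u, v, t) in batch:
        neg = random.randrange(4)        # uniform negative node (assumption)
        pos_score, neg_score = 1.0, -1.0 # stand-ins for model edge scores
        loss = bce_with_negatives(pos_score, neg_score)
```

Chronological batching matters because shuffling events would let the memory of a node "see the future", leaking information into earlier predictions; BPR loss for recommendation replaces `bce_with_negatives` with a pairwise ranking term.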
6. Empirical Results, Ablation Studies, and Design Guidelines
Quantitative benchmarks demonstrate TGNs achieve state-of-the-art performance. Example: on MovieLens, TGN attains Recall@20 of $0.2211$ (+15.4% over DyRep), and RetailRocket Recall@20 of $0.3610$ (+61.2% over SASRec) (Kim et al., 2024). Attention aggregation modules consistently yield the highest recall and precision metrics; memory (GRU-based) modules are critical for long-term dependency tracking, and single-layer TGN architectures suffice when memory is present. Ablations confirm that removing memory or attention results in substantial performance degradation (up to 4 points in AP/AUC) (Verma et al., 2023, Rossi et al., 2020). Most-recent neighbor sampling plus attention aggregation and dynamic/static memory selection based on dataset repetition yields optimal runtime-accuracy tradeoffs (Yang et al., 2024).
7. Extensions: Adaptive Neighborhoods, Explanation, and Evaluation Metrics
Recent advances include adaptive neighborhood selection, plug-and-play modules (SEAN), and volatility-aware training:
- SEAN introduces representative neighbor selection (semantic + occurrence-aware attention, diversity penalties) and temporal-aware aggregation (pruned LSTM, outdated-information decay), yielding robust gains in link prediction and AUROC—especially under noisy or expanded-hop conditions (Zhang et al., 2024).
- Built-in explanation frameworks (TGIB) exploit information bottleneck theory by stochastically gating event subgraphs based on relevance, producing explanations and predictions end-to-end; TGIB achieves higher sparsity-accuracy scores and subsecond explanation latency compared to post-hoc explainers (Seo et al., 2024).
- Traditional instance-based evaluation metrics (AP/AUC) fail to capture volatility clustering in temporal errors. Volatility Cluster Statistic (VCS) and its regularized training lead to more temporally uniform error distributions, an important axis for model selection in latency-sensitive settings (Su et al., 2024).
References
- Rossi et al., "Temporal Graph Networks for Deep Learning on Dynamic Graphs", 2020
- Kim et al., "A Temporal Graph Network Framework for Dynamic Recommendation", 2024
- Yang et al., "Towards Ideal Temporal Graph Neural Networks", 2024
- Seo et al., "Self-Explainable Temporal Graph Networks based on Graph Information Bottleneck", 2024
- Souza et al., "Provably expressive temporal graph networks", 2022
- Huang et al., "Retrofitting Temporal Graph Neural Networks with Transformer", 2024
- Zhang et al., "Towards Adaptive Neighborhood for Advancing Temporal Interaction Graph Modeling", 2024
- Panyshev et al., "Never Skip a Batch: Continuous Training of Temporal GNNs via Adaptive Pseudo-Supervision", 2025
- Xiong et al., "Trajectory Encoding Temporal Graph Networks", 2025
- Su et al., "Temporal-Aware Evaluation and Learning for Temporal Graph Neural Networks", 2024
- Tjandra et al., "Enhancing the Expressivity of Temporal Graph Networks through Source-Target Identification", 2024