Spatio-Temporal Bidirectional Scanning

Updated 7 February 2026
  • Spatio-temporal bidirectional scanning mechanisms are deep learning architectures that jointly capture spatial structures and temporal dependencies using forward and backward context.
  • They employ spatial scanning techniques such as graph convolutions, attention, and convolutions to extract local features, which are fused with bidirectional temporal models for comprehensive data representation.
  • Applications in traffic forecasting, sensor imputation, video anomaly detection, and human motion estimation have demonstrated state-of-the-art performance and improved context aggregation.

A spatio-temporal bidirectional scanning mechanism is a class of deep learning architecture that jointly models spatial and temporal dependencies by “scanning” input data both along spatial structures (such as graphs or grids) and across time, explicitly in both forward and backward temporal directions. Such mechanisms underpin recent advances in sequence modeling for structured spatio-temporal domains, including multimodal traffic forecasting, sensor data imputation, anomaly detection in video, human motion estimation, and infrared target detection. The bidirectional component enables information to propagate from both past and future contexts at each step or node, while the spatial component leverages structural locality via convolution, graph attention, or sequential state-space traversals.

1. Architectural Principles and General Formulation

In spatio-temporal bidirectional scanning, input data is organized with explicit spatial structure (e.g., a graph, spatial grid, or skeleton) and a temporal sequence. Let $X \in \mathbb{R}^{T \times N \times F}$ represent $T$ time steps, $N$ spatial locations, and $F$ features per location. The mechanism interleaves:

  • Spatial scanning: Each frame (time slice) is scanned across its spatial structure using methods such as graph convolution, attention, or permuted sequences (e.g., joint orderings), producing intermediate spatial representations $O_S(X^t)$.
  • Bidirectional temporal scanning: Sequences of spatial embeddings $\{O_S(X^t)\}_{t=1}^T$ are processed by temporal models capable of capturing both past and future contexts. This is implemented using two parallel scanning pathways: a forward temporal model (e.g., RNN, TCN, SSM), and a backward model operating in the reverse temporal order. Their outputs are fused, typically by addition or concatenation.
  • Fusion: The final per-position representations encode both local and global spatial dependencies, and complete temporal context from both directions.

This methodology is instantiated in multiple modalities, with the spatial operator tailored to the application domain (e.g., graph sparse attention in traffic multimodal systems (Zhang et al., 2024), convolutional encoders for video (Lee et al., 2018), graph attention for sensor networks (Wang et al., 2023)).
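The interleaving described above can be sketched in a few lines of NumPy. This is a schematic illustration, not any cited paper's implementation: the spatial operator is a stand-in mean-over-neighbors propagation, and the temporal scan is a simple first-order recurrence standing in for an RNN/TCN/SSM.

```python
import numpy as np

def spatial_scan(frame, adj):
    # Stand-in spatial operator: average each node's features with its
    # graph neighbours (one step of normalised graph propagation).
    deg = adj.sum(axis=1, keepdims=True)
    return (adj @ frame) / np.maximum(deg, 1)

def temporal_scan(seq, alpha=0.5, reverse=False):
    # Simple recurrence h_t = alpha * h_{t-1} + (1 - alpha) * x_t,
    # run forward or (time-reversed) backward over the sequence.
    if reverse:
        seq = seq[::-1]
    h, out = np.zeros_like(seq[0]), []
    for x in seq:
        h = alpha * h + (1 - alpha) * x
        out.append(h)
    out = np.stack(out)
    return out[::-1] if reverse else out

def bidirectional_st_scan(X, adj):
    # X: (T, N, F) -- T time steps, N nodes, F features per node.
    S = np.stack([spatial_scan(Xt, adj) for Xt in X])  # spatial scanning
    fwd = temporal_scan(S)                 # past -> future context
    bwd = temporal_scan(S, reverse=True)   # future -> past context
    return fwd + bwd                       # additive fusion

T, N, F = 8, 4, 3
adj = np.ones((N, N))                      # fully connected toy graph
X = np.random.default_rng(0).normal(size=(T, N, F))
Y = bidirectional_st_scan(X, adj)
assert Y.shape == (T, N, F)
```

Each output position in `Y` mixes neighborhood-aggregated features with recurrent context from both temporal directions, mirroring the three-stage decomposition above.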

2. Spatial Scanning Mechanisms

Spatial scanning is essential for modeling spatial dependencies within each time slice:

  • Graph Sparse Attention (GSA): For multimodal traffic data, a joint block-diagonal adjacency matrix $G_M$ encodes spatial neighborhoods within each mode. Self-attention scores $A_M$ are computed as

$$A_M = \frac{QK^T}{\sqrt{d}}$$

and fused with $G_M$ via elementwise product and softmax row normalization. The resulting $G_A$ is used in a two-layer GCN to extract local spatial features. To capture global context, “Top-U” sparse attention keeps only the largest $U$ affinities per node, mitigating the “long-tail” problem in vanilla self-attention. The sum of local and global outputs forms the spatial latent $O_S(X)$ (Zhang et al., 2024).

  • Graph Attention Networks (GAT): For sensor grids, spatial attention computes learned coefficients over each node's neighbors, thus capturing spatial correlation structure at each step (Wang et al., 2023).
  • Convolutional Encoders: Video and sensor autoencoders employ 2D convolutional filters over space-time patches to extract local joint features, spanning several sensors and contiguous time steps (Asadi et al., 2019, Lee et al., 2018).

Some approaches (e.g., human pose estimation) alternate global scans in a canonical joint ordering with “local” scans along anatomically relevant paths defined by permutation matrices (Huang et al., 2024).
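The "Top-U" sparsification step described above can be illustrated with a minimal NumPy sketch. The shapes are hypothetical, and the fusion with the adjacency matrix $G_M$ and the subsequent GCN layers are omitted; this shows only the row-wise Top-U masking and softmax normalization.

```python
import numpy as np

def top_u_attention(Q, K, U):
    # Scaled dot-product affinities: A = Q K^T / sqrt(d).
    d = Q.shape[-1]
    A = Q @ K.T / np.sqrt(d)
    # Keep only the U largest affinities per row; mask the rest
    # to -inf so they vanish under softmax ("Top-U" sparsification).
    thresh = np.sort(A, axis=1)[:, -U][:, None]
    A = np.where(A >= thresh, A, -np.inf)
    # Row-wise softmax normalisation.
    A = np.exp(A - A.max(axis=1, keepdims=True))
    return A / A.sum(axis=1, keepdims=True)

rng = np.random.default_rng(1)
Q = rng.normal(size=(6, 4))   # 6 nodes, feature dimension d = 4
K = rng.normal(size=(6, 4))
W = top_u_attention(Q, K, U=2)
assert np.all((W > 0).sum(axis=1) == 2)   # exactly U nonzeros per row
assert np.allclose(W.sum(axis=1), 1.0)    # rows sum to 1
```

Keeping only the top-$U$ entries per row concentrates attention mass on the strongest affinities, which is the stated remedy for the long-tail problem of dense self-attention.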

3. Bidirectional Temporal Scanning Strategies

Bidirectional temporal scanning is realized via various architectures:

  • Bidirectional Temporal Convolutional Networks (BiTCN): Two stacked temporal convolutional modules are run, one in the forward direction and one in reverse. For each spatial embedding sequence $X$, the forward output $H^F(X)$ and backward output $H^B(X)$ are summed:

$$O_{BT}(X) = H^F(X) + H^B(X)$$

Each output position thus aggregates receptive-field context from both the past and future, up to the maximal dilation (Zhang et al., 2024).

  • Bidirectional RNNs (BiLSTM, BiGRU): Parallel forward and backward recurrent units scan the temporal sequence, producing fused per-step embeddings $[h_t^f ; h_t^b]$. This effectively doubles the context captured at each position (Asadi et al., 2019, Wang et al., 2023).
  • Bidirectional State Space Models (SSM): In PoseMamba, discrete-time state-space models perform forward and backward scans on temporally or spatially ordered input sequences. The two state streams are concatenated and projected back to the output domain, yielding bidirectional receptive fields with linear complexity in sequence length (Huang et al., 2024).
  • Bidirectional ConvLSTM: For video generation and anomaly detection, ConvLSTM cells process past frames in forward order and future frames in reverse, then merge their hidden and cell states to inform reconstruction or prediction (Lee et al., 2018).

Bidirectionality provides each time step with access to context on both sides, enriching temporal feature correlations and improving model expressivity.
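The additive fusion $O_{BT}(X) = H^F(X) + H^B(X)$ can be demonstrated with a single causal 1-D convolution in NumPy, as a schematic stand-in for a full dilated TCN stack: the backward pathway simply time-reverses the input, applies the same causal filter, and reverses the result back.

```python
import numpy as np

def causal_conv(x, w):
    # Causal 1-D convolution: y_t depends only on x_t, x_{t-1}, ...
    k = len(w)
    x_pad = np.concatenate([np.zeros(k - 1), x])  # left-pad with zeros
    return np.array([x_pad[t:t + k] @ w[::-1] for t in range(len(x))])

def bitcn(x, w):
    # Forward pathway H^F: causal conv over x as-is (past context).
    h_f = causal_conv(x, w)
    # Backward pathway H^B: reverse time, apply the same causal conv,
    # then reverse back (future context).
    h_b = causal_conv(x[::-1], w)[::-1]
    return h_f + h_b   # O_BT(x) = H^F(x) + H^B(x)

x = np.arange(6, dtype=float)
w = np.array([0.5, 0.5])      # simple 2-tap averaging filter
y = bitcn(x, w)
assert y.shape == x.shape
```

With a 2-tap filter, every interior output position blends its own value with both the previous and the next time step, which is exactly the two-sided receptive field the bullet describes.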

4. Integration, Fusion, and Modularity

Integration of spatial and temporal scans follows architecture-specific strategies:

  • Shared vs. Unique Encoders: In multimodal systems, a “share–unique” module factorizes temporal modeling into a shared BiTCN across all modes, capturing inter-modal correlation, and unique BiTCNs per mode for intra-modal patterns. Outputs are concatenated and summed (Zhang et al., 2024).
  • Global and Local Fusion: PoseMamba fuses bidirectional global and local spatial–temporal SSM scans. Four bidirectional scans (forward/backward in space, forward/backward in time) are performed on both canonical and permuted (local) orders; their outputs are combined via linear projections and summed (Huang et al., 2024).
  • Feature Concatenation: BIRD concatenates the original CNN feature, backward-aggregated, and forward-aggregated features before final prediction (Luo et al., 21 Aug 2025).
  • Residual Connections: Autoencoder designs employ residual links passing spatial feature maps from encoder convolution to the decoder’s fully connected prediction, stabilizing training (Asadi et al., 2019).

The modularity of these designs permits adaptation to multiple kinds of spatial structure, feature types, and domain-specific inductive biases.
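The fusion options recurring in these designs (addition, concatenation, and concatenation followed by a learned projection) can be compared in a small sketch. The projection matrix here is random, standing in for a trained parameter.

```python
import numpy as np

rng = np.random.default_rng(2)
h_f = rng.normal(size=(10, 8))   # forward-scan features  (T=10, d=8)
h_b = rng.normal(size=(10, 8))   # backward-scan features

# 1. Additive fusion: keeps the dimensionality, no extra parameters.
fused_add = h_f + h_b                             # (10, 8)

# 2. Concatenative fusion: doubles the feature dimension.
fused_cat = np.concatenate([h_f, h_b], axis=-1)   # (10, 16)

# 3. Concatenation + learned projection back to the model width
#    (W would be trained; random here as a placeholder).
W = rng.normal(size=(16, 8))
fused_proj = fused_cat @ W                        # (10, 8)

assert fused_add.shape == (10, 8)
assert fused_cat.shape == (10, 16)
assert fused_proj.shape == (10, 8)
```

Addition is parameter-free but forces the two streams into a shared subspace; concatenation preserves them separately at the cost of width; the learned projection trades parameters for a tunable mixing of both, which is why the choice remains architecture-dependent.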

5. Application Domains and Empirical Findings

Bidirectional spatio-temporal scanning mechanisms have demonstrated empirical success in:

| Domain | Architecture and Mechanism | Reported Performance |
|---|---|---|
| Multimodal traffic forecasting | GSA + BiTCN, share–unique module | State-of-the-art on three real datasets (Zhang et al., 2024) |
| Sensor imputation | CNN + BiLSTM autoencoder | MAE = 6.8, RMSE = 13.0 for Bay Area highways (Asadi et al., 2019) |
| Human pose estimation | BiGL spatio-temporal SSM (PoseMamba) | State-of-the-art accuracy at reduced computational cost (Huang et al., 2024) |
| Video anomaly detection | Bidirectional ConvLSTM generator | Competitive results on benchmark datasets (Lee et al., 2018) |
| Infrared small target detection | Bidirectional recursive propagation (BIRD) | >3× inference speedup, improved mAP/F1 (Luo et al., 21 Aug 2025) |
| Traffic data imputation | GAT + BiGRU/BiRNN (ST-GIN) | Outperforms all benchmark methods (Wang et al., 2023) |

Across these domains, bidirectional temporal models consistently improve performance, attributed to their ability to model long-range dependencies and leverage global context, especially when local observations are ambiguous or missing.

6. Advantages and Generalizations

  • Enhanced Context Aggregation: Bidirectional scanning ensures that each position “hears” both prior and subsequent context, critical for smoothing, disambiguation, and reconstructing corrupted or occluded signals (Zhang et al., 2024, Asadi et al., 2019, Luo et al., 21 Aug 2025).
  • Efficient Joint Optimization: Many-to-many prediction enables simultaneous computation and joint supervision across entire clips, improving temporal consistency and computational efficiency, since each frame is processed only twice (Luo et al., 21 Aug 2025).
  • Structural Flexibility: The modular separation of spatial and temporal scanning allows replacement or augmentation of each component for specific spatial structures (e.g., grid, graph, skeleton) and custom temporal ranges.
  • Broader Applicability: The methodology generalizes to any multi-node, multi-time series system where both graph priors and sequential dependency are present, including air quality monitoring and epidemiological forecasting (Zhang et al., 2024).

A plausible implication is that as sequence modeling requirements continue to scale both in length and structural complexity, bidirectional spatio-temporal mechanisms will become foundational for scalable, expressive, and domain-adaptable sequence modeling systems.

7. Limitations and Open Directions

While these mechanisms provide powerful expressivity, several limitations are noted:

  • Memory & Computational Complexity: Naive attention-based implementations may suffer from quadratic scaling; thus, sparse or SSM-based operators are preferred for longer-range modeling.
  • Structural Inductive Bias: Effectiveness depends on properly specifying or learning spatial structure; domain knowledge or learnable adjacency matrices are crucial.
  • Fusion Strategies: The optimal method for combining multiple bidirectional outputs (additive, concatenative, learned projection) remains architecture-dependent and may require tuning for different tasks (Zhang et al., 2024, Huang et al., 2024).

Continued research is focused on integrating these mechanisms with uncertainty quantification (Wang et al., 2023), adversarial learning (Lee et al., 2018), and efficient hardware deployment for large-scale dynamic systems.
