Spatio-Temporal Connected Components

Updated 20 January 2026

Spatio-Temporal Connected Components (SCC) are methods that partition complex data into maximally connected, non-overlapping regions across spatial and temporal dimensions, ensuring causality.
They integrate classical topological connectivity with semantic similarity by leveraging techniques like cubical complexes and cosine-thresholded token clustering, enhancing both tracking and compression.
SCC methods facilitate applications in dynamic image analysis and video token compression by maintaining comprehensive semantic coverage while balancing computational efficiency and accuracy.

Spatio-Temporal Connected Components (SCC) are a family of methods for identifying, tracking, and compactly representing entity connectivity as it evolves across both spatial and temporal domains. In contemporary computational contexts, SCC bridges classical topological connected-component analysis in time-varying data (e.g., binary image sequences) with advanced semantic clustering strategies for token-based representation in video models. The unifying concept is the partition of complex spatio-temporal or high-dimensional data into maximally internally connected and externally disconnected regions, under constraints that may reflect geometric, temporal, or semantic structure.

1. Formal Modeling of Spatio-Temporal Connectivity

The mathematical formulation of SCC depends on the data modality. For classical image sequences, each frame $I_t=(D,B_t)$ decomposes into a spatial domain $D\subset\mathbb{Z}^2$ with foreground $B_t$ (e.g., points of interest or object mask). Components are defined using adjacency—typically 8-adjacency for foreground and 4-adjacency for background. These yield a cubical complex $Q(I_t)$ and a time-indexed sequence $\{Q_0, \ldots, Q_T\}$ .
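To make the adjacency convention concrete, the following is a minimal sketch of foreground component extraction under 8-adjacency for a single binary frame. The function name `foreground_components` and the list-of-lists frame encoding (1 for foreground, 0 for background) are illustrative assumptions, not part of the cited work, and the sketch does not build the cubical complex itself.

```python
from collections import deque

# 8-adjacency offsets: horizontal, vertical, and diagonal neighbours.
NEIGHBOURS_8 = [(dr, dc) for dr in (-1, 0, 1) for dc in (-1, 0, 1) if (dr, dc) != (0, 0)]


def foreground_components(frame):
    """Return the 8-connected foreground components of a binary frame.

    `frame` is a list of rows with 1 = foreground and 0 = background
    (an assumed encoding). Each component is a sorted list of (row, col).
    """
    rows, cols = len(frame), len(frame[0])
    seen = set()
    components = []
    for r in range(rows):
        for c in range(cols):
            if frame[r][c] != 1 or (r, c) in seen:
                continue
            # Breadth-first flood fill from an unvisited foreground pixel.
            queue = deque([(r, c)])
            seen.add((r, c))
            component = []
            while queue:
                y, x = queue.popleft()
                component.append((y, x))
                for dy, dx in NEIGHBOURS_8:
                    ny, nx = y + dy, x + dx
                    inside = 0 <= ny < rows and 0 <= nx < cols
                    if inside and frame[ny][nx] == 1 and (ny, nx) not in seen:
                        seen.add((ny, nx))
                        queue.append((ny, nx))
            components.append(sorted(component))
    return components


frame = [
    [1, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 1],
    [0, 0, 0, 1],
]
components = foreground_components(frame)
print(len(components))  # 2: the diagonal pair counts as one component
```

Under 4-adjacency the two diagonal pixels in the top-left corner would be separate components, which is why the choice of adjacency for foreground and background matters.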

For token-based representations in video large language models (video LLMs), as in LLaVA-Scissor, the initial object of interest is a token set

$\mathbf{K} = \{\mathbf{k}_1, \ldots, \mathbf{k}_N\} \subset \mathbb{R}^{d}$

with pairwise semantic affinity measured via cosine similarity: $S_{ij} = \frac{\mathbf{k}_i \cdot \mathbf{k}_j}{\|\mathbf{k}_i\|\;\|\mathbf{k}_j\|}$. Defining adjacency with a threshold $\tau$ leads to a graph $G=(V,E)$ in which SCCs are the usual connected components: $A_{ij} = \begin{cases} 1 & S_{ij}>\tau \\ 0 & \text{otherwise} \end{cases}$, producing a partition $\{C_1, \ldots, C_M\}$ of all tokens.
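As an illustration of the thresholded graph, the sketch below builds the adjacency matrix $A$ from cosine similarities and labels its connected components with a depth-first search. The names `cosine` and `threshold_components` are hypothetical, tokens are assumed to be non-zero vectors, and the exact $O(N^2)$ comparison is used here for clarity rather than any approximation.

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def threshold_components(tokens, tau):
    """Label each token with the index of its connected component.

    Two tokens are adjacent when their cosine similarity exceeds tau.
    """
    n = len(tokens)
    adjacency = [
        [i != j and cosine(tokens[i], tokens[j]) > tau for j in range(n)]
        for i in range(n)
    ]
    labels = [-1] * n
    current = 0
    for start in range(n):
        if labels[start] != -1:
            continue
        # Depth-first search over the thresholded graph.
        labels[start] = current
        stack = [start]
        while stack:
            i = stack.pop()
            for j in range(n):
                if adjacency[i][j] and labels[j] == -1:
                    labels[j] = current
                    stack.append(j)
        current += 1
    return labels


tokens = [[1.0, 0.0], [0.99, 0.1], [0.0, 1.0], [0.1, 0.99]]
labels = threshold_components(tokens, tau=0.9)
print(labels)  # [0, 0, 1, 1]
```

Because membership is defined by reachability, two tokens can share a component without being directly similar to each other, as long as a chain of above-threshold pairs links them.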

In both modalities, temporal dynamics are encoded by stacking spatial structures along a temporal axis, creating a (space × time) cell complex. Temporal edges link homologous entities across consecutive times, constraining paths to respect causality.

2. Algorithmic Approaches for SCC Detection and Tracking

2.1 Topological SCC Algorithms in Image Sequences

A spatio-temporal cell complex $SQ[S]$ is constructed by stacking the spatial cubical complexes from each frame and connecting them with temporal cells whenever spatial overlap persists. The fundamental object is the spatio-temporal path: a chain traversing connected foreground (or background) pixels, with at most one temporal move per time step and no backtracking in time. This enforces a strict causality constraint that standard persistent homology does not satisfy.

The SCC algorithm proceeds as follows:

Step through the space-time filtration, treating vertex appearances as births of new components and edge insertions (spatial or temporal) as possible merges.
Use representative-maps to record current component representatives and update spatio-temporal paths.
Maintain a barcode of component lifetimes $[i_\text{birth}, i_\text{death})$.
The outer loop is $O(m)$ in the number of filtration steps, but each merge may induce $O(m)$ remappings, giving $O(m^2)$ overall cost (Gonzalez-Diaz et al., 2018).
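The birth and merge bookkeeping in these steps can be sketched with a standard union-find barcode for 0-dimensional features. This is a simplified stand-in and not the algorithm of Gonzalez-Diaz et al.: it uses the usual elder rule (the older component survives a merge), does not maintain representative maps or spatio-temporal paths, and does not enforce the causality constraint. The filtration encoding and the name `component_barcode` are assumptions made for illustration.

```python
def component_barcode(filtration):
    """Compute half-open component lifetimes [birth, death) over a filtration.

    `filtration` is a list of steps, each ("vertex", v) or ("edge", u, v),
    in insertion order (an assumed encoding). Returns (birth, death) pairs
    sorted by birth; death is None for components alive at the end.
    """
    parent = {}
    birth = {}
    bars = []

    def find(v):
        # Follow parent links to the root, shortening the path on the way.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    for step, cell in enumerate(filtration):
        if cell[0] == "vertex":
            v = cell[1]
            parent[v] = v
            birth[v] = step  # a new vertex is the birth of a component
        else:
            ru, rv = find(cell[1]), find(cell[2])
            if ru == rv:
                continue  # edge inside one component: no merge
            if birth[ru] > birth[rv]:
                ru, rv = rv, ru  # make ru the older root
            parent[rv] = ru
            bars.append((birth[rv], step))  # the younger component dies

    roots = {find(v) for v in parent}
    bars.extend((birth[r], None) for r in roots)
    return sorted(bars, key=lambda bar: bar[0])


# Vertices are (frame index, pixel name) pairs; the first edge is temporal.
filtration = [
    ("vertex", (0, "p")),
    ("vertex", (0, "q")),
    ("vertex", (1, "p")),
    ("edge", (0, "p"), (1, "p")),
    ("edge", (0, "q"), (1, "p")),
]
bars = component_barcode(filtration)
print(bars)  # [(0, None), (1, 4), (2, 3)]
```

With path shortening the union-find operations here are close to constant time each, so this simplified version avoids the per-merge remapping cost described above, at the price of not tracking the paths themselves.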

2.2 Semantic SCC for Video Token Compression

In the LLaVA-Scissor pipeline, the SCC framework is adapted for efficient token compression in video-LMMs (Sun et al., 27 Jun 2025). The methodology decomposes into two primary algorithmic steps:

Spatial SCC (per-frame): Compute token cosine similarities, threshold at $\tau$, discover connected components using an approximate Union-Find (path compression and rank), and average tokens within each component.
Temporal SCC (across frames): Concatenate compressed tokens from all frames, apply the SCC connectivity again to merge redundancies across time, perform an optional refinement merge by re-assigning original tokens to their nearest reduced tokens, and re-average.

Efficiency is achieved by restricting the full $O(N^2)$ connectivity calculation to a subsampled set of $N' = \min\{N, \lceil\log N/\epsilon^2\rceil\}$ representatives (with error tolerance $\epsilon$), providing $O(N\log N)$ runtime.
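A short worked example of the representative budget may help. The sketch below evaluates $N'$ under two assumptions that the text does not settle: the logarithm is natural, and the expression is read as $(\log N)/\epsilon^2$. The function name `representative_budget` is hypothetical.

```python
import math


def representative_budget(n, eps):
    """Number of sampled representatives N' = min(N, ceil(log(N) / eps^2)).

    Assumes a natural logarithm; the source does not state the base.
    """
    return min(n, math.ceil(math.log(n) / eps ** 2))


print(representative_budget(10_000, 0.1))  # 922
print(representative_budget(100, 0.1))     # 100 (the cap at N applies)
```

For small inputs or small $\epsilon$ the cap at $N$ dominates, so the sampling only reduces work once $N$ is large relative to $\log N/\epsilon^2$.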

Schematic pseudocode for the two-step process:

for each frame i:
    extract tokens t_i ∈ R^{m×d}
    adjacency A^{(i)} via cosine > τ
    clusters = approx-SCC(A^{(i)})
    t'_i[j] = average tokens in clusters[j]

stacked_tokens = concat(t'_1, ..., t'_n)
adjacency = cosine > τ among stacked_tokens
final_clusters = approx-SCC(adjacency)
refinement_merge(...)
return compressed set of tokens
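A runnable approximation of the pseudocode above is sketched below. It is not the LLaVA-Scissor implementation: it uses an exact pairwise comparison in place of the approximate sampled variant, it assumes the refinement step assigns each original token to its most cosine-similar reduced token, and the names `DisjointSet`, `compress`, and `two_step_scc` are illustrative.

```python
import math


class DisjointSet:
    """Union-Find with union by rank and path halving (a form of path compression)."""

    def __init__(self, n):
        self.parent = list(range(n))
        self.rank = [0] * n

    def find(self, i):
        while self.parent[i] != i:
            self.parent[i] = self.parent[self.parent[i]]
            i = self.parent[i]
        return i

    def union(self, i, j):
        ri, rj = self.find(i), self.find(j)
        if ri == rj:
            return
        if self.rank[ri] < self.rank[rj]:
            ri, rj = rj, ri
        self.parent[rj] = ri  # attach the shallower tree under the deeper one
        if self.rank[ri] == self.rank[rj]:
            self.rank[ri] += 1


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def mean(vectors):
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def compress(vectors, tau):
    """Average the tokens inside each connected component of the cosine > tau graph."""
    sets = DisjointSet(len(vectors))
    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            if cosine(vectors[i], vectors[j]) > tau:
                sets.union(i, j)
    groups = {}
    for i in range(len(vectors)):
        groups.setdefault(sets.find(i), []).append(vectors[i])
    return [mean(group) for group in groups.values()]


def two_step_scc(frames, tau):
    # Step 1: spatial SCC within each frame.
    per_frame = [compress(tokens, tau) for tokens in frames]
    # Step 2: temporal SCC over the concatenated per-frame tokens.
    stacked = [tok for frame in per_frame for tok in frame]
    reduced = compress(stacked, tau)
    # Refinement: reassign original tokens to the nearest reduced token, re-average.
    buckets = [[] for _ in reduced]
    for tok in (t for frame in frames for t in frame):
        best = max(range(len(reduced)), key=lambda k: cosine(tok, reduced[k]))
        buckets[best].append(tok)
    return [mean(b) if b else r for b, r in zip(buckets, reduced)]


frames = [
    [[1.0, 0.0], [0.98, 0.05], [0.0, 1.0]],
    [[1.0, 0.02], [0.03, 1.0]],
]
compressed = two_step_scc(frames, tau=0.9)
print(len(compressed))  # 2 tokens remain out of 5
```

In this toy input the five tokens point in two directions, so both the spatial and temporal passes collapse them to two representatives; lowering `tau` merges more aggressively, raising it retains more tokens.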

3. Theoretical and Empirical Properties

SCC partitions yield non-overlapping, maximally internally connected clusters that robustly represent distinct semantic or physical entities. Properties include:

Provable partition: each input entity or token is assigned to exactly one component.
Causality: in the topological variant, spatio-temporal paths utilize at most one temporal transition per time step, with no backtracking, distinguishing this approach from classical persistent homology.
Complexity-accuracy tradeoff: For semantic SCC, sampling-based approximations allow scalability to large $N$ with minimal loss of coverage, provided $\epsilon$ is small.

In LLaVA-Scissor, the similarity threshold $\tau$ directly controls the retained token budget; in practice, $\tau\in[0.80,0.99]$ yields $5\%$ to $50\%$ of input tokens, and empirical accuracy (benchmark “semantic coverage”) remains $>98\%$ even at a retention ratio of $r\sim 0.05$ (Sun et al., 27 Jun 2025). Ablation studies show spatial SCC alone is dominated by the full pipeline, and random/uniform sampling performs strictly worse.

For topological SCC, birth-death barcodes quantitatively capture object persistence even as components merge or split, providing a robust tool for time-varying scientific imaging or tracking.

4. Practical Applications and Implementation Considerations

Video-LMM Token Compression

In video LLMs, SCC has been proposed as a training-free means of reducing the token set passed to the multimodal transformer backbone, cutting both prefill and decode FLOPs while retaining core semantics. Standard visual encoders (e.g., SigLIP, CLIP) with $d\approx768$ generate per-frame features, on which SCC is applied. The pipeline is compatible with architectural choices such as LLaVA-Scissor, yielding empirical gains across VideoQA and long-video benchmarks (2% higher top-1 at constant budget versus other compression baselines).

Topological Tracking in Image Analysis

Classical SCC approaches underpin topological data analysis tasks such as component tracking in image sequences. The direct algorithm of Gonzalez-Diaz et al. (2018) computes spatio-temporal barcodes (the lifespans of all connected components) in $O(m^2)$ time, providing fine-grained structural insight for domains including cell tracking, defect analysis, and dynamical systems. The formalism extends to higher-dimensional feature tracking (e.g., tunnels, cavities) via stacked $n$D cubical complexes.

Implementation often leverages dense adjacency matrices, neighborhood lists for sampled points, and Union-Find data structures for merge detection. Memory is bounded by the number of nodes and edges, yielding $O(m)$ usage in practical settings.

5. Comparative Advantages and Limitations

Compared to attention-score-based or naive sampling approaches, SCC possesses several advantages (Sun et al., 27 Jun 2025):

Comprehensive semantic coverage: By clustering all tokens exceeding the similarity threshold, SCC ensures no semantic region is skipped, avoiding the high-saliency bias of attention-ranked pruning.
Global connectivity: SCC links entities by global semantic affinity, allowing representation of objects moving non-locally through space and time; local neighborhood-based techniques cannot guarantee this.
Explicit redundancy control: The threshold $\tau$ provides direct control over how redundant tokens are merged, a property not present in scalar-attention-based methods.
Computational efficiency: Sampling-based approximations in SCC improve scalability over quadratic-complexity attention mechanisms, while maintaining provable partition guarantees.

Potential limitations arise from the approximation in connectivity detection when $N$ is large, with possible omission of rare cross-component edges if the sampling budget is too small; however, empirical studies report stable partitions below $\epsilon=0.05$. The two-step strategy outperforms both spatial-only and temporal-only SCC, suggesting the importance of capturing both axes of variability.

6. Extensions and Generalizations

The SCC formalism naturally extends beyond 0-dimensional (component) tracking. Higher-dimensional topological features, such as tunnels ($d=1$ homology) or voids ($d=2$), can be tracked by augmenting the cell complex to include (space × time) cells of higher dimension and defining spatio-temporal $d$-paths that respect similar causality constraints:

For each temporal transition, limit the use of higher-dimensional connecting cells to one per step.
Use a filtration ordering that interleaves spatial and temporal cells, pairing births and deaths of $d$-dimensional cycles via an extension of the incremental barcode algorithm (Gonzalez-Diaz et al., 2018).

This suggests possible future applications of SCC in multidimensional video, dynamic scene understanding, and scientific computing contexts where topological events of arbitrary dimension must be tracked over time.

References:

"LLaVA-Scissor: Token Compression with Semantic Connected Components for Video LLMs" (Sun et al., 27 Jun 2025)
"Topological Tracking of Connected Components in Image Sequences" (Gonzalez-Diaz et al., 2018)
