
TCAM: Track and Caption Any Motion

Updated 19 December 2025
  • TCAM is a motion-centric video understanding framework that fuses dense point tracking, vision–language contrastive learning, and spatial grounding to autonomously describe and localize motion.
  • It leverages a novel Motion-Field Attention mechanism and multi-head cross-attention to integrate temporal and spatial features, ensuring accurate captioning even under occlusion and rapid movement.
  • TCAM achieves state-of-the-art results on benchmarks like MeViS and HC-STVG by combining joint global and spatial losses with robust motion trajectory analysis.

Track and Caption Any Motion (TCAM) is a motion-centric video understanding framework that autonomously discovers and spatially grounds multiple motion patterns in video, describing each with natural language in a completely query-free manner. TCAM fuses dense point tracking, vision–language contrastive learning, and spatial grounding via a novel Motion-Field Attention mechanism. By leveraging motion dynamics instead of static appearance, it addresses video understanding tasks in the presence of occlusion, camouflage, and rapid movement, achieving state-of-the-art results on retrieval and grounding benchmarks (Galoaa et al., 11 Dec 2025).

1. System Architecture and Algorithmic Components

TCAM processes video inputs through parallel appearance and motion pipelines and unifies their representations for motion discovery and description grounding.

  • Input Representation: For a video $\mathbf V = \{I_t\}_{t=1}^T$, appearance features are extracted by a frozen CLIP-ViT per frame:

$$\mathbf f_t^{vis} = \mathrm{CLIP\text{-}ViT}(I_t), \qquad \mathbf e_{\rm video} = \mathrm{TemporalPool}(\{\mathbf f_t^{vis}\})$$

yielding a 512-dimensional video embedding.

  • Motion Trajectories: Dense 2D point trajectories are obtained with CoTracker3, represented as $\{p_{j,t}\}$ for $N_t$ tracks.
  • Motion-Field Attention (MFA): MFA fuses each trajectory $\{p_{j,t}\}$ with the visual features into a 512-D motion descriptor $\mathbf m_j$. This involves three MLPs for per-frame position, velocity, and occlusion encoding, followed by a “spatial” transformer (jointly attending over tracks and visual context at each time step), a “temporal” transformer (aggregating features over the $T$ steps of each track), and a linear CLIP-space projection:

$$\mathbf m_j = \mathrm{MFA}(\{p_{j,t}\}, \{\mathbf f_t^{vis}\})$$

  • Text Bank and Caption Embedding: All unique ground-truth descriptions $d_k$ are embedded with the frozen text encoder as $\mathrm{CLIP_{text}}(d_k)$.
  • Inference Workflow: TCAM enables query-free discovery by computing cosine similarities between $\mathbf e_{\rm video}$ and the text-bank embeddings, applying percentile thresholding, and performing cross-attention-based spatial grounding.

The overall pipeline aligns the temporal visual signal to a large text bank, spatially grounds selected descriptions via cross-attention with motion descriptors, and produces per-trajectory (and optionally per-pixel) semantic masks.
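The shape of this two-branch pipeline can be sketched as follows. This is a minimal illustration under stated assumptions: mean pooling stands in for TemporalPool (not specified in this summary), random arrays stand in for frozen CLIP-ViT outputs, and the spatial/temporal transformer stacks are reduced to temporal mean pooling; all weights and helper names are illustrative, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
T, N, d = 8, 4, 256                     # frames, tracks, MFA hidden dim (d=256)

# --- Appearance branch: stand-ins for frozen per-frame CLIP-ViT features ---
frame_feats = rng.standard_normal((T, 512))
e_video = frame_feats.mean(axis=0)      # TemporalPool assumed to be mean pooling
e_video /= np.linalg.norm(e_video)      # unit norm for cosine similarity

# --- Motion branch: encode each track's position/velocity/occlusion ---
def mlp(x, w1, w2):
    return np.maximum(x @ w1, 0) @ w2   # tiny two-layer MLP, random weights

pos = rng.standard_normal((N, T, 2))                 # p_{j,t}
vel = np.diff(pos, axis=1, prepend=pos[:, :1])       # finite-difference velocity
occ = rng.integers(0, 2, (N, T, 1)).astype(float)    # occlusion flags

Wp = rng.standard_normal((2, d)), rng.standard_normal((d, d))
Wv = rng.standard_normal((2, d)), rng.standard_normal((d, d))
Wo = rng.standard_normal((1, d)), rng.standard_normal((d, d))
tokens = mlp(pos, *Wp) + mlp(vel, *Wv) + mlp(occ, *Wo)   # (N, T, d)

# The spatial/temporal transformer stacks are reduced to temporal mean pooling.
track_feat = tokens.mean(axis=1)                     # (N, d)

# Linear projection into the shared 512-D CLIP space.
m = track_feat @ rng.standard_normal((d, 512))       # motion descriptors m_j
print(e_video.shape, m.shape)                        # (512,) (4, 512)
```

The key structural point is that both branches end in the same 512-D CLIP space, so video embeddings, text embeddings, and motion descriptors are directly comparable by cosine similarity.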

2. Training Objectives and Vision–Language Alignment

TCAM is trained end-to-end using a joint global and spatial loss:

  • Global Video–Text Alignment Loss ($\mathcal{L}_{\mathrm{global}}$):

This multi-positive InfoNCE contrastive objective aligns each video with its full set of correct descriptions:

$$\mathcal L_{\mathrm{global}} = -\frac{1}{B} \sum_{b=1}^B \log \frac{ \sum_{k\in\mathcal P_b} \exp( s_{bk}/\tau ) }{ \sum_{k=1}^{N_b} \exp (s_{bk}/\tau) }$$

where $s_{bk}$ denotes the cosine similarity between the $b$-th video and $k$-th text embedding, $\mathcal P_b$ is the index set of descriptions matching video $b$, and $\tau=0.1$.

  • Fine-Grained Spatial Grounding Loss ($\mathcal{L}_{\mathrm{spatial}}$): applied to the per-track relevance scores $\mathbf r_i$ produced by cross-attention, it combines three terms:

    • Diversity: $\mathcal L_{\mathrm{diversity}} = \max(0,\, 0.1 - \mathrm{std}(\mathbf r_i))$
    • Sparsity: $\mathcal L_{\mathrm{sparsity}} = 0.01\, \|\mathbf r_i\|_1$
    • Alignment: a margin ranking loss between positive (inside-mask) and negative (outside-mask) tracks:

    $$\mathcal L_{\mathrm{alignment}} = \frac{1}{|\mathcal T^+|\,|\mathcal T^-|} \sum_{j\in\mathcal T^+} \sum_{j'\in\mathcal T^-} \max(0,\, \gamma - r_{i,j} + r_{i,j'}), \quad \gamma=0.2$$

  • Total Loss: Combined as

$$\mathcal{L}_\text{total} = \lambda_1\,\mathcal{L}_\text{global} + \lambda_2\,\mathcal{L}_\text{spatial}$$

with $\lambda_1=0.9$, $\lambda_2=0.1$.
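The objectives above translate directly into code. The sketch below implements the multi-positive InfoNCE, the three spatial terms, and the weighted total on synthetic inputs; the batch construction and the positive/negative track indices are illustrative placeholders, not the paper's data pipeline.

```python
import numpy as np

def multi_positive_infonce(S, pos_mask, tau=0.1):
    # S: (B, N) cosine similarities between videos and text-bank entries.
    # pos_mask: boolean (B, N), True where text k correctly describes video b.
    exp_s = np.exp(S / tau)
    pos = (exp_s * pos_mask).sum(axis=1)
    return -np.mean(np.log(pos / exp_s.sum(axis=1)))

def spatial_losses(r, pos_idx, neg_idx, gamma=0.2):
    # r: (N_t,) relevance scores of each track for one description.
    diversity = max(0.0, 0.1 - r.std())
    sparsity = 0.01 * np.abs(r).sum()
    # Margin ranking: inside-mask tracks should outscore outside-mask tracks.
    diffs = gamma - r[pos_idx][:, None] + r[neg_idx][None, :]
    alignment = np.maximum(0.0, diffs).mean()
    return diversity + sparsity + alignment

rng = np.random.default_rng(0)
S = rng.uniform(-1, 1, (4, 10))                      # 4 videos, 10 bank texts
pos_mask = np.zeros_like(S, dtype=bool)
pos_mask[np.arange(4), [0, 2, 5, 7]] = True          # positives (may be several per video)
r = rng.uniform(0, 1, 16)                            # relevance over 16 tracks

L_global = multi_positive_infonce(S, pos_mask)
L_spatial = spatial_losses(r, pos_idx=np.arange(8), neg_idx=np.arange(8, 16))
L_total = 0.9 * L_global + 0.1 * L_spatial           # lambda_1 = 0.9, lambda_2 = 0.1
print(L_total > 0)
```

Note that the multi-positive numerator sums over *all* matching descriptions, so a video with five valid captions is not penalized for ranking any one of them below another.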

3. Query-Free Motion Discovery and Multi-Expression Handling

During inference, TCAM autonomously discovers multiple salient motions and their language descriptions without any external queries.

  • Text Bank Utilization: A large text bank, precomputed from datasets (e.g., MeViS, YouTube-VOS, ActivityNet), captures the diversity of possible natural language expressions.
  • Description Selection: The system computes similarities $s_k$ between $\mathbf e_{\rm video}$ and all text-bank embeddings, then either thresholds them (e.g., at the 70th percentile) or gathers top-scoring descriptions via multi-head cross-attention. The multi-head mechanism discovers $N_d \approx 4$–$5$ relevant expressions per video.
  • Spatial Grounding: Each retrieved description is grounded to specific motion tracks by cross-attention:

$$\mathbf r_i = \mathrm{CrossAttention}(\mathbf e_i^{text}, \{\mathbf m_j\})$$

Output tracks may be rendered as masks for per-frame, per-description region segmentation.
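The retrieve-then-ground loop can be sketched as below. Random unit vectors stand in for the learned embeddings, and single-head dot-product attention stands in for the paper's multi-head cross-attention; the threshold and bank size are the examples quoted above.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 512
e_video = rng.standard_normal(d)
e_video /= np.linalg.norm(e_video)                   # video embedding
text_bank = rng.standard_normal((1000, d))
text_bank /= np.linalg.norm(text_bank, axis=1, keepdims=True)
m = rng.standard_normal((576, d))                    # motion descriptors, N_t = 576

# 1. Score every bank entry against the video embedding.
sims = text_bank @ e_video

# 2. Percentile thresholding (70th percentile, as described above).
keep = np.where(sims >= np.percentile(sims, 70))[0]

# 3. Ground each surviving description over the tracks with attention
#    (one head shown for brevity; TCAM uses multi-head cross-attention).
def attention_scores(q, keys):
    logits = keys @ q / np.sqrt(q.size)
    e = np.exp(logits - logits.max())
    return e / e.sum()                               # r_i: per-track relevance

r = attention_scores(text_bank[keep[0]], m)          # grounding for one description
print(keep.size, r.shape)                            # 300 descriptions kept, r over 576 tracks
```

Tracks with high $r_i$ entries can then be rasterized into the per-frame masks described above.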

4. Training Protocol, Datasets, and Model Capacity

TCAM is trained and evaluated primarily on the MeViS dataset (2,006 videos, 28.6K motion expressions, 443K segmentation masks), with additional evaluation on HC-STVG.

  • Optimization: AdamW with initial learning rate $10^{-4}$, OneCycleLR scheduling, and $\tau=0.1$, trained for 500 epochs with PyTorch DDP on 4 NVIDIA V100 GPUs.
  • Model Dimensions: MFA hidden dimension $d=256$ with 16 attention heads; $N_t = 24\times24 = 576$ dense point tracks per video.
  • Loss Weighting: Spatial grounding loss weight $\lambda_2=0.1$.

Larger text banks increase discovery diversity and coverage, with 150K expressions yielding 68.5% coverage and 84.7% precision.
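For intuition about the schedule, the sketch below reproduces the *shape* of a one-cycle learning-rate curve in plain NumPy: warm-up to the peak rate, then cosine annealing toward a small floor. The parameter names and defaults mirror PyTorch's OneCycleLR but are assumptions; the paper's exact settings beyond the $10^{-4}$ peak are not given in this summary.

```python
import numpy as np

def one_cycle_lr(step, total_steps, max_lr=1e-4, pct_start=0.3,
                 div_factor=25.0, final_div_factor=1e4):
    # Cosine-annealed one-cycle schedule: illustrative, not the paper's config.
    warm = int(pct_start * total_steps)
    if step < warm:                      # warm-up: max_lr/div_factor -> max_lr
        t, lo, hi = step / warm, max_lr / div_factor, max_lr
    else:                                # anneal: max_lr -> max_lr/final_div_factor
        t = (step - warm) / (total_steps - warm)
        lo, hi = max_lr, max_lr / final_div_factor
    return lo + (hi - lo) * (1 - np.cos(np.pi * t)) / 2

total = 500                              # one step per epoch, 500 epochs as above
lrs = [one_cycle_lr(s, total) for s in range(total)]
print(f"{lrs[0]:.1e} -> {max(lrs):.1e} -> {lrs[-1]:.1e}")
```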

5. Experimental Results and Benchmarks

TCAM attains high performance on both retrieval and spatial grounding metrics. Key results include:

| Setting / Metric | Value | Dataset |
|---|---|---|
| Video→Text R@1 (no query) | 58.4% | MeViS |
| Spatial grounding $J\!+\!F$ | 64.9% | MeViS |
| Expressions discovered per video | 4.8 | MeViS |
| Coverage | 68.5% | MeViS (adaptive) |
| Precision | 84.7% | MeViS (adaptive) |
| m_vIoU | 42.3% | HC-STVG |
| m_tIoU | 52.8% | HC-STVG |
| vIoU@0.5 | 38.7% | HC-STVG |
Ablation studies emphasize the importance of MFA, the temporal encoder, multi-positive loss, and multi-head cross-attention for overall system performance. Removal of any major component yields a significant decrease in both retrieval and grounding accuracy.

6. Spatial Grounding and Evaluation Metrics

TCAM grounds each caption to a subset of motion tracks, forming per-description, sparse masks. The following metrics quantify grounding accuracy:

  • Jaccard Index (IoU): $J = |\hat{M}\cap M_{gt}| \,/\, |\hat{M}\cup M_{gt}|$
  • Boundary F-score ($F$): as defined in standard video segmentation benchmarks.
  • Combined Grounding Score: $J\!+\!F$
  • Tube IoU (HC-STVG): spatio-temporal intersection metrics (m_vIoU, m_tIoU, vIoU@0.5).

These metrics capture mask similarity per frame and across time, reflecting how accurately discovered expressions are spatially localized.
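A minimal implementation of the per-frame Jaccard index and a tube-level mean vIoU makes the distinction concrete. The tube metric here is a simplified stand-in (mean per-frame IoU over frames where the ground truth exists); benchmark definitions may differ in detail.

```python
import numpy as np

def jaccard(pred: np.ndarray, gt: np.ndarray) -> float:
    # Region similarity J between two boolean masks.
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return inter / union if union else 1.0

def tube_viou(pred_masks, gt_masks):
    # Mean per-frame IoU over frames where the ground-truth tube is present;
    # a simplified stand-in for HC-STVG's vIoU.
    ious = [jaccard(p, g) for p, g in zip(pred_masks, gt_masks) if g.any()]
    return float(np.mean(ious)) if ious else 0.0

gt = np.zeros((4, 8, 8), bool); gt[:, 2:6, 2:6] = True        # 4-frame GT tube
pred = np.zeros((4, 8, 8), bool); pred[:, 2:6, 2:4] = True    # half-width prediction
v = tube_viou(pred, gt)
print(round(v, 2))  # 0.5 — prediction covers half of each ground-truth region
```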

7. Limitations and Prospective Developments

TCAM inherits several limitations from its reliance on pointwise motion trajectories:

  • Robustness to Scene Discontinuities: Abrupt scene cuts or point-of-view shifts can disrupt trajectory identity, weakening spatial association between discovered motions and language.
  • Point-Tracking Failures: Errors or dropouts in point tracking directly propagate to failures in semantic grounding.
  • Future Improvements: Integration of temporal-consistency modules, explicit object re-identification, and end-to-end fine-tuning of CLIP may enhance robustness, especially for disorderly motion or long-range temporal reasoning.

A plausible implication is that future architectures combining dense motion correspondence with open-vocabulary vision–language models and strong temporal association will further improve autonomous motion discovery and compositional video understanding.


For a detailed methodological exposition, refer to "Track and Caption Any Motion: Query-Free Motion Discovery and Description in Videos" (Galoaa et al., 11 Dec 2025).
