TCAM: Track and Caption Any Motion
- TCAM is a motion-centric video understanding framework that fuses dense point tracking, vision–language contrastive learning, and spatial grounding to autonomously describe and localize motion.
- It leverages a novel Motion-Field Attention mechanism and multi-head cross-attention to integrate temporal and spatial features, ensuring accurate captioning even under occlusion and rapid movement.
- TCAM achieves state-of-the-art results on benchmarks like MeViS and HC-STVG by combining joint global and spatial losses with robust motion trajectory analysis.
Track and Caption Any Motion (TCAM) is a motion-centric video understanding framework that autonomously discovers and spatially grounds multiple motion patterns in video, describing each with natural language in a completely query-free manner. TCAM fuses dense point tracking, vision–language contrastive learning, and spatial grounding via a novel Motion-Field Attention mechanism. By leveraging motion dynamics instead of static appearance, it addresses video understanding tasks in the presence of occlusion, camouflage, and rapid movement, achieving state-of-the-art results on retrieval and grounding benchmarks (Galoaa et al., 11 Dec 2025).
1. System Architecture and Algorithmic Components
TCAM processes video inputs through parallel appearance and motion pipelines and unifies their representations for motion discovery and description grounding.
- Input Representation: For a video $V$ with $T$ frames, per-frame appearance features are extracted by a frozen CLIP-ViT and aggregated, yielding a 512-dimensional video embedding $v \in \mathbb{R}^{512}$.
- Motion Trajectories: Dense 2D point trajectories are obtained using CoTracker3, represented as $\{\tau_i\}_{i=1}^{N}$ for $N$ tracks, each recording per-frame 2D positions together with visibility/occlusion flags.
- Motion-Field Attention (MFA): MFA fuses each trajectory with visual features into a 512-D motion descriptor $m_i$. This involves three MLPs for per-frame position, velocity, and occlusion encoding, followed by a “spatial” transformer (jointly attending over tracks and visual context at each time step), a “temporal” transformer (aggregating features over time steps for each track), and a linear projection into the CLIP embedding space.
- Text Bank and Caption Embedding: All unique ground-truth descriptions are embedded with the frozen CLIP text encoder.
- Inference Workflow: TCAM enables query-free discovery by computing cosine similarities between the video embedding and the text bank embeddings, applying percentile thresholding, and performing cross-attention-based spatial grounding.
The overall pipeline aligns the temporal visual signal to a large text bank, spatially grounds selected descriptions via cross-attention with motion descriptors, and produces per-trajectory (and optionally per-pixel) semantic masks.
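As a shape-level illustration, the MFA fusion described above can be sketched in NumPy. The three MLPs and the spatial/temporal transformers are replaced here by single linear maps and mean-pooling over time; every dimension and weight matrix is an illustrative stand-in, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
T, N, D = 16, 32, 512      # frames, tracks, embedding dim (illustrative)

# Dense 2D trajectories: per-track positions (N, T, 2) and occlusion flags.
pos = rng.normal(size=(N, T, 2))
occ = (rng.random((N, T)) < 0.1).astype(float)

# Per-frame velocity via finite differences (prepend keeps shapes aligned).
vel = np.diff(pos, axis=1, prepend=pos[:, :1])

# Stand-ins for the three per-frame MLPs: one linear map each.
W_pos = rng.normal(scale=0.1, size=(2, D))
W_vel = rng.normal(scale=0.1, size=(2, D))
W_occ = rng.normal(scale=0.1, size=(1, D))
feat = pos @ W_pos + vel @ W_vel + occ[..., None] @ W_occ    # (N, T, D)

# Stand-in for the spatial + temporal transformers: mean-pool over time,
# then a linear "CLIP projection" into the shared 512-D space.
W_proj = rng.normal(scale=0.1, size=(D, D))
motion_desc = feat.mean(axis=1) @ W_proj                     # (N, D)
motion_desc /= np.linalg.norm(motion_desc, axis=1, keepdims=True)
```

The final L2-normalization makes each motion descriptor directly comparable to CLIP text embeddings via cosine similarity, which is what the retrieval and grounding steps rely on.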
2. Training Objectives and Vision–Language Alignment
TCAM is trained end-to-end using a joint global and spatial loss:
- Global Video–Text Alignment Loss ($\mathcal{L}_{\text{global}}$):
This multi-positive InfoNCE contrastive objective aligns each video with its full set of correct descriptions:

$$\mathcal{L}_{\text{global}} = -\frac{1}{B}\sum_{i=1}^{B}\frac{1}{|P(i)|}\sum_{j \in P(i)} \log \frac{\exp(s_{ij}/\tau)}{\sum_{k}\exp(s_{ik}/\tau)}$$

where $s_{ij}$ denotes the cosine similarity between the $i$-th video and $j$-th text embedding, $P(i)$ is the set of positive descriptions for video $i$, and $\tau$ is a temperature parameter.
- Fine-Grained Spatial Grounding Loss ($\mathcal{L}_{\text{spatial}}$), composed of three terms:
- Diversity: encourages the attention heads to select distinct sets of motion tracks rather than collapsing onto the same ones.
- Sparsity: encourages each head's attention over tracks to concentrate on a small subset.
- Alignment: a margin ranking loss scoring positive (inside-mask) tracks above negative (outside-mask) tracks.
- Total Loss: combined as $\mathcal{L} = \mathcal{L}_{\text{global}} + \lambda_{\text{spatial}}\,\mathcal{L}_{\text{spatial}}$, where $\lambda_{\text{spatial}}$ weights the spatial grounding term.
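The global alignment term can be written out as a small NumPy function. This is a sketch of the standard multi-positive InfoNCE formulation only; the temperature value (0.07, a common CLIP-style default) is an assumption, and the spatial terms are omitted:

```python
import numpy as np

def multi_positive_infonce(sim, positives, tau=0.07):
    """Multi-positive InfoNCE over a video x text-bank similarity matrix.

    sim:       (B, M) cosine similarities between B videos and M texts.
    positives: positives[i] is the index array of correct descriptions
               for video i (a video may have several).
    tau:       temperature (0.07 is an assumed CLIP-style default).
    """
    logits = sim / tau
    m = logits.max(axis=1, keepdims=True)   # subtract max for stability
    log_prob = logits - m - np.log(np.exp(logits - m).sum(axis=1, keepdims=True))
    per_video = [-log_prob[i, p].mean() for i, p in enumerate(positives)]
    return float(np.mean(per_video))

# Toy check: pulling a video's positives to similarity 1 lowers the loss.
rng = np.random.default_rng(0)
sim = rng.uniform(-1.0, 1.0, size=(4, 10))
pos = [np.array([0, 1]), np.array([2]), np.array([3, 4]), np.array([5])]
loss_before = multi_positive_infonce(sim, pos)
sim_aligned = sim.copy()
for i, p in enumerate(pos):
    sim_aligned[i, p] = 1.0
loss_after = multi_positive_infonce(sim_aligned, pos)
```

Averaging the log-probabilities over each video's positive set is what distinguishes this from single-positive InfoNCE: a video with several correct descriptions is pulled toward all of them simultaneously.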
3. Query-Free Motion Discovery and Multi-Expression Handling
During inference, TCAM autonomously discovers multiple salient motions and their language descriptions without any external queries.
- Text Bank Utilization: A large text bank, precomputed from datasets (e.g., MeViS, YouTube-VOS, ActivityNet), captures the diversity of possible natural language expressions.
- Description Selection: The system computes similarities between the video embedding and all text bank embeddings, then applies percentile thresholding (e.g., at the 70th percentile) or gathers top-scoring descriptions via multi-head cross-attention. The “multi-head” mechanism discovers up to 5 relevant expressions per video.
- Spatial Grounding: Each retrieved description is grounded to specific motion tracks by cross-attention between its text embedding and the motion descriptors.
Output tracks may be rendered as masks for per-frame, per-description region segmentation.
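The two inference steps can be sketched together in NumPy: percentile-based retrieval against the text bank, then a softmax attention over motion descriptors to weight the supporting tracks. The paper uses multi-head cross-attention; a single head and an assumed temperature are shown here for brevity, with random stand-in embeddings:

```python
import numpy as np

rng = np.random.default_rng(1)
D = 512

def unit(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

v = unit(rng.normal(size=D))               # video embedding
bank = unit(rng.normal(size=(1000, D)))    # precomputed text bank
tracks = unit(rng.normal(size=(64, D)))    # per-track motion descriptors

# 1) Query-free retrieval: cosine similarity + percentile thresholding.
sims = bank @ v
keep = np.where(sims >= np.percentile(sims, 70))[0]

# 2) Ground each kept description to tracks with single-head softmax
#    cross-attention (temperature 0.05 is an assumed illustrative value).
def ground(text_emb, tracks, tau=0.05):
    attn = np.exp(tracks @ text_emb / tau)
    return attn / attn.sum()               # attention weights over tracks

weights = np.stack([ground(bank[i], tracks) for i in keep])
# weights[k, j]: how strongly kept description k attends to track j
```

High-weight tracks per description would then be rendered as the per-frame, per-description masks described above.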
4. Training Protocol, Datasets, and Model Capacity
TCAM is trained and evaluated primarily on the MeViS dataset (2,006 videos, 28.6K motion expressions, 443K segmentation masks), with additional evaluation on HC-STVG.
- Optimization: AdamW with a OneCycleLR schedule, trained for 500 epochs using PyTorch DDP on 4 NVIDIA V100 GPUs.
- Model Dimensions: the MFA uses 16 attention heads and projects motion descriptors into the 512-D CLIP embedding space; each video is covered by dense CoTracker3 point tracks.
- Loss Weighting: the spatial grounding loss is weighted by $\lambda_{\text{spatial}}$ relative to the global alignment loss.
Larger text banks increase discovery diversity and coverage, with 150K expressions yielding 68.5% coverage and 84.7% precision.
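A minimal PyTorch sketch of the stated recipe (AdamW with a OneCycleLR schedule over 500 epochs). The model, learning rate, batch size, and steps per epoch are placeholders, since those values are not fully specified here, and the DDP wrapping is omitted:

```python
import torch

model = torch.nn.Linear(512, 512)       # placeholder for the MFA stack
epochs, steps_per_epoch = 500, 100      # steps_per_epoch is assumed
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)   # lr is a placeholder
sched = torch.optim.lr_scheduler.OneCycleLR(
    opt, max_lr=1e-4, epochs=epochs, steps_per_epoch=steps_per_epoch)

for step in range(steps_per_epoch):     # one epoch of the loop shape
    opt.zero_grad()
    loss = model(torch.randn(8, 512)).pow(2).mean()    # dummy objective
    loss.backward()
    opt.step()
    sched.step()                        # OneCycleLR steps per batch
```

Note that OneCycleLR is stepped per batch, not per epoch, which is the usual pitfall with this scheduler.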
5. Experimental Results and Benchmarks
TCAM attains high performance on both retrieval and spatial grounding metrics. Key results include:
| Setting / Metric | Value | Dataset |
|---|---|---|
| Video→Text R@1 | 58.4% (no query) | MeViS |
| Spatial grounding | 64.9% | MeViS |
| Expressions Discovered / video | 4.8 | MeViS |
| Coverage | 68.5% | MeViS (adaptive) |
| Precision | 84.7% | MeViS (adaptive) |
| m_vIoU | 42.3% | HC-STVG |
| m_tIoU | 52.8% | HC-STVG |
| vIoU@0.5 | 38.7% | HC-STVG |
Ablation studies emphasize the importance of MFA, the temporal encoder, multi-positive loss, and multi-head cross-attention for overall system performance. Removal of any major component yields a significant decrease in both retrieval and grounding accuracy.
6. Spatial Grounding and Evaluation Metrics
TCAM grounds each caption to a subset of motion tracks, forming per-description, sparse masks. The following metrics quantify grounding accuracy:
- Jaccard Index (IoU): $\mathcal{J} = \dfrac{|M_{\text{pred}} \cap M_{\text{gt}}|}{|M_{\text{pred}} \cup M_{\text{gt}}|}$, the region overlap between predicted and ground-truth masks.
- Boundary F-score ($\mathcal{F}$): boundary precision/recall, computed as in standard video segmentation benchmarks.
- Combined Grounding Score: $\mathcal{J}\&\mathcal{F} = \tfrac{1}{2}(\mathcal{J} + \mathcal{F})$.
- Tube IoU (HC-STVG): Spatio-temporal intersection metrics (m_vIoU, m_tIoU, vIoU@0.5).
These metrics capture mask similarity per frame and across time, reflecting how accurately discovered expressions are spatially localized.
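The region metrics above are straightforward to implement; a minimal NumPy version of the Jaccard index and the combined J&F score (the boundary F computation itself, which requires contour matching, is omitted):

```python
import numpy as np

def jaccard(pred, gt):
    """Region similarity J: IoU between two binary masks."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return inter / union if union else 1.0   # empty-vs-empty counts as perfect

def jf_score(j, f):
    """Combined grounding score: the mean of region J and boundary F."""
    return 0.5 * (j + f)

# Two overlapping 4x4 squares on an 8x8 grid: intersection 9, union 23.
pred = np.zeros((8, 8), dtype=bool); pred[2:6, 2:6] = True
gt = np.zeros((8, 8), dtype=bool);   gt[3:7, 3:7] = True
j = jaccard(pred, gt)
```

Per-frame J values are averaged across frames and descriptions; the tube-level HC-STVG metrics additionally intersect predictions and ground truth over time.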
7. Limitations and Prospective Developments
TCAM’s reliance on pointwise motion trajectories entails inherited limitations:
- Robustness to Scene Discontinuities: Abrupt scene cuts or point-of-view shifts can disrupt trajectory identity, weakening spatial association between discovered motions and language.
- Point-Tracking Failures: Errors or dropouts in point tracking directly propagate to failures in semantic grounding.
- Future Improvements: Integration of temporal-consistency modules, explicit object re-identification, and end-to-end fine-tuning of CLIP may enhance robustness, especially for disorderly motion or long-range temporal reasoning.
A plausible implication is that future architectures combining dense motion correspondence with open-vocabulary vision–language models and strong temporal association will further improve autonomous motion discovery and compositional video understanding.
For a detailed methodological exposition, refer to "Track and Caption Any Motion: Query-Free Motion Discovery and Description in Videos" (Galoaa et al., 11 Dec 2025).