Video Pointing Dataset Overview
- Video pointing datasets are cutting-edge resources offering pixel-level, spatio-temporal annotations for precise localization of objects, actions, and interactions in video frames.
- They employ diverse annotation methods—from 2D points and gaze vectors to 3D scene tracking—to reduce labeling complexity while maintaining task accuracy.
- These datasets underpin advanced applications in video object segmentation, multimodal grounding, and interactive localization across robotics, AR/VR, and vision-language systems.
Video pointing datasets constitute a foundational resource for training and benchmarking algorithms designed to spatially and temporally localize instances, actions, or human interactions within videos via discrete pixel-level annotations. These datasets depart from traditional mask- or bounding-box-centric paradigms by emphasizing concise, spatio-temporal point annotations for grounding and tracking tasks, thus aligning with current trends in interactive localization, multimodal reasoning, and efficient annotation in computer vision.
1. Definition and Taxonomy
A video pointing dataset is defined by its provision of pixel-level temporal annotations (points, clicks, or directions) identifying objects, actions, or relevant semantic events in video frames. The overarching taxonomy includes:
- Temporal localization with spatial granularity: e.g., Molmo2-VideoPoint delineates each instance by a point (frame index/timestamp, (x, y) location), not a dense mask or box (Clark et al., 15 Jan 2026).
- Gesture or interactive pointing: datasets where human pointing (fingertip, gaze, or head orientation) specifies intent, interaction, or saliency, such as EgoGestAR’s FPV gesture corpus (Jain et al., 2019) or Panonut360’s eye/head tracking (Xu et al., 2024).
- Sparse object or region annotation: as in Point-VOS, where annotation protocol samples points for weak-supervision segmentation and grounding (Zulfikar et al., 2024).
- 3D scene-point tracking: TAPVid-3D records camera-space coordinates (X, Y, Z), mapping pixel queries to scene geometry for dynamic video content (Koppula et al., 2024).
A plausible implication is that video pointing datasets serve dual roles as both ground truth for multimodal grounding tasks and as efficient, scalable annotation regimes reducing the burden of dense mask labeling.
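The taxonomy above can be summarized in a single annotation schema: 2D clicks, tracked instances, and 3D scene points differ only in which optional fields are populated. The following is a minimal sketch; the class and field names are hypothetical and do not correspond to any one dataset's actual format.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class VideoPoint:
    """One spatio-temporal point annotation (illustrative schema only;
    field names are assumptions, not a dataset's documented format)."""
    frame_index: int                   # temporal localization
    timestamp_sec: float
    x: float                           # spatial location (pixels or normalized)
    y: float
    object_id: Optional[int] = None    # instance identity, if tracked
    z: Optional[float] = None          # camera-space depth for 3D variants

    def is_3d(self) -> bool:
        # TAPVid-3D-style scene points carry depth; 2D clicks do not
        return self.z is not None

# A 2D click at frame 42 versus a camera-space 3D scene point:
click = VideoPoint(frame_index=42, timestamp_sec=1.4, x=312.0, y=187.5, object_id=7)
scene_pt = VideoPoint(frame_index=42, timestamp_sec=1.4, x=0.8, y=-0.2, z=3.1)
```

A gesture or gaze annotation (EgoGestAR, Panonut360) would instead store a sequence of such points or a directional vector per sample, but the frame-indexed (x, y) core is common to all four taxonomy branches.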
2. Data Collection and Annotation Protocols
Annotation methodologies in video pointing datasets are highly specialized for efficiency, clarity, and generalizability:
- Molmo2-VideoPoint: Annotators select the relevant frame and submit pixel-level coordinates per query (normalized to [0, 1000] scale), guided by free-form referring expressions and subject to LLM and segmentation-based validation (Clark et al., 15 Jan 2026).
- VideoMolmo: Semi-automatic annotation tools use mask-driven point selection, favoring interior points via Euclidean distance and optimizing for maximal IoU with SAM2-generated segmentation (Ahmad et al., 5 Jun 2025).
- Point-VOS: Sparse annotation protocol samples up to 14.7 human-verified points (foreground, background, ambiguous) per object per frame, with temporal sparsity achieved by annotating only 10 frames per object (Zulfikar et al., 2024).
- GIFT: An automated pipeline samples surface points on 3D models, projecting them into 2D for real/synthetic video rendering, and tracks each point’s visibility under complex camera trajectories (Huang et al., 17 Mar 2025).
- Panonut360: Head and gaze pointing are captured via HMD and eye tracker at 120 Hz; points encode instantaneous 3D directional vectors, further transformed to projected image coordinates (Xu et al., 2024).
- EgoGestAR: Fingertip or hand trajectory logging via tablet/camera is used for gesture definition; video runs employ a fingertip regressor as a surrogate for dense annotation (Jain et al., 2019).
Table 1 summarizes annotation characteristics of major datasets:
| Dataset | Annotation Mode | Typical Point Density/frame | Temporal Coverage |
|---|---|---|---|
| Molmo2-VideoPoint | 2D point/click | 6.0 per query | 1.8 frames per query |
| VideoMolmo | Mask-derived point | 1 per object per frame | All frames, with prompt fusion |
| Point-VOS | Sparse points | ∼14.7 per frame/object | 10 frames/object |
| TAPVid-3D | Camera-space 3D pts | 50–1024 per clip | Full clip (25–300 frames) |
| EgoGestAR | Fingertip traj | Variable | Full gesture duration @30FPS |
| Panonut360 | Directional vector | 1 per sample (120Hz) | Full video duration |
3. Dataset Structure, Coverage, and Domains
Video pointing datasets exhibit broad coverage in domains, scale, and point annotation schemes:
- Molmo2-VideoPoint: 250,000 training videos, 450,000 frame–point pairs, spanning eight query categories (objects, animals, actions, spatial, comparative, generative artifacts) (Clark et al., 15 Jan 2026).
- VideoMolmo: 72,000 video–caption pairs, ~100,000 object points; real-world scenarios include cell tracking, egocentric vision, autonomous driving, GUI interaction, and robotics (Ahmad et al., 5 Jun 2025).
- Point-VOS: Aggregates 19M annotations across 133K objects in 32K videos, with open-vocabulary labels and diversified action/taxonomy via PV-Oops and PV-Kinetics splits (Zulfikar et al., 2024).
- TAPVid-3D: Over 4,569 clips, three data sources (Aria Digital Twin, DriveTrack, Panoptic Studio), with dense camera-space 3D point tracks and visibility flags (Koppula et al., 2024).
- GIFT: Synthetic indoor scenes, 1,800 videos, 5–10 points per video, stratified by texture intensity and camera motion complexity (Huang et al., 17 Mar 2025).
- Panonut360: 15 panoramic videos, 50 participants; head/gaze vectors and projected coordinates for user attention localization (Xu et al., 2024).
- EgoGestAR: 2,500 gesture trajectories (touch), 240 egocentric video clips; FPV pointing for AR/MR (Jain et al., 2019).
A plausible implication is that researchers can select datasets aligned to video type (real/synthetic), annotation density, and grounding granularity required for their application.
4. Annotation Formats, Storage, and Access
Standardized annotation formats foster accessibility and interoperability:
- Molmo2-VideoPoint: Each record is a JSON object containing the video ID, query string, and annotations (frame_index, timestamp_sec, object_id, x_norm, y_norm), with directories split into train/valid/test (Clark et al., 15 Jan 2026).
- VideoMolmo: Point annotations per frame/object, stored alongside referring text and masks, with representative pixel coordinates (Ahmad et al., 5 Jun 2025).
- Point-VOS: JSON with frame size, objects, frame indices, points (x, y, type), open-vocabulary object nouns (Zulfikar et al., 2024).
- TAPVid-3D: Numpy arrays of 3D tracks (tracks_xyz), query_xyt, visibility flags, camera intrinsics; RGB frames remain in original formats (Koppula et al., 2024).
- EgoGestAR: CSV or plain-text logs for touch trajectories, raw .mp4 for video clips (Jain et al., 2019).
- Panonut360: Head/eye CSV files, converted to pixel coordinates by scripting, with per-second saliency map outputs (Xu et al., 2024).
Licensing is typically permissive (Apache, MIT, CC BY-NC 4.0), though non-commercial restrictions may apply in some cases.
5. Supported Tasks and Evaluation Protocols
Video pointing datasets support a spectrum of localization, grounding, and tracking tasks via standardized metrics:
- Molmo2-VideoPoint: Primary task is spatio-temporal pointing by instance; metrics are precision, recall, F1, valid-parsing accuracy (based on click inclusion within ground-truth masks) (Clark et al., 15 Jan 2026).
- VideoMolmo: Evaluation on spatio-temporal pointing (precision, recall, F1) and mask fusion (region Jaccard, boundary F-measure, J–F average), with counting metrics (EMA, MAE) (Ahmad et al., 5 Jun 2025).
- Point-VOS: Assesses weakly supervised segmentation via J&F=(Jaccard+F)/2, using sparse points for initialization and mask derivation (Zulfikar et al., 2024).
- TAPVid-3D: Tracks Average Precision-Distance (APD₃D(δ)), Occlusion Accuracy (OA), 3D Average Jaccard (AJ₃D), with scale ambiguity correction and spatio-temporal smoothness (Koppula et al., 2024).
- GIFT: Uses δ_avg^vis(τ) (visible point threshold), AJ, OA, evaluated across texture-level subsets; baselines include RAFT, TAPIR, CoTracker (Huang et al., 17 Mar 2025).
- EgoGestAR: Gesture classification accuracy (LSTM/Bi-LSTM), latency (on-device), trajectory-based validation (Jain et al., 2019).
- Panonut360: Head/gaze offset statistics, saliency map generation by aggregated fixations, assessment of centrality and anisotropy of attention regions (Xu et al., 2024).
A plausible implication is that the “point-driven” paradigm enables benchmarking of both pure geometric localization and multimodal reasoning under weak supervision, supporting both classical vision and vision-language model training.
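The click-inclusion criterion underlying the precision/recall/F1 metrics above can be sketched as follows: a predicted point counts as a true positive if it falls inside a still-unmatched ground-truth instance mask. The greedy one-to-one matching here is a simplification assumed for illustration; the benchmarks' actual matching protocols may differ.

```python
import numpy as np

def point_metrics(pred_points, gt_masks):
    """Precision/recall/F1 for predicted points against ground-truth
    instance masks (boolean HxW arrays). A point is a true positive if
    it lands inside an unmatched mask; each mask may be claimed once.
    Greedy matching is an assumed simplification."""
    matched = [False] * len(gt_masks)
    tp = 0
    for (x, y) in pred_points:
        xi, yi = int(round(x)), int(round(y))
        for i, mask in enumerate(gt_masks):
            if (not matched[i]
                    and 0 <= yi < mask.shape[0] and 0 <= xi < mask.shape[1]
                    and mask[yi, xi]):
                matched[i] = True
                tp += 1
                break
    precision = tp / len(pred_points) if pred_points else 0.0
    recall = tp / len(gt_masks) if gt_masks else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# One of two predicted points hits the single ground-truth mask:
mask = np.zeros((10, 10), dtype=bool)
mask[2:5, 2:5] = True
p, r, f = point_metrics([(3, 3), (8, 8)], [mask])
```

Mask-based metrics such as J&F = (Jaccard + F)/2 instead compare a derived segmentation against the ground-truth mask, which is why Point-VOS-style evaluation requires mask generation from the sparse points first.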
6. Current State, Baselines, and Limitations
Summary baselines from recent benchmarks articulate current performance ceilings and error modes:
| Model | Precision (%) | Recall (%) | F1 / AJ₃D (%) | Dataset |
|---|---|---|---|---|
| Molmo2-4B | 39.4 | 42.7 | 39.9 | Molmo2-VideoPoint (Clark et al., 15 Jan 2026) |
| Molmo2-8B | 38.7 | 39.3 | 38.4 | Molmo2-VideoPoint |
| Gemini 3 Pro | 19.8 | 27.4 | 20.0 | Molmo2-VideoPoint |
| CoTracker+ZoeDepth | 14.8 | — | 8.7 (AJ₃D) | TAPVid-3D (Koppula et al., 2024) |
| Static baseline | 9.4 | — | 4.9 (AJ₃D) | TAPVid-3D |
Key limitations:
- Datasets such as GIFT and TAPVid-3D suffer from domain-constrained coverage (e.g., synthetic indoor scenes, driving, staged actions). Textureless regions and depth scale ambiguity remain significant error modes (Huang et al., 17 Mar 2025, Koppula et al., 2024).
- Annotator variability and runtime regressors induce noise (EgoGestAR's finger positions, Panonut360's eye vector estimation) (Jain et al., 2019, Xu et al., 2024).
- Many datasets (Point-VOS, Molmo2-VideoPoint) rely on sparse spatial and/or temporal sampling for scalability, which limits applications requiring dense trajectory or segmentation supervision (Zulfikar et al., 2024, Clark et al., 15 Jan 2026).
Recommended practices include augmentation with real data, joint training on auxiliary modalities (depth, normals, flow), and use of rich annotation schemas for self-supervised approaches.
7. Applications and Future Directions
Video pointing datasets are increasingly utilized for:
- Multimodal grounding in VLMs: Open datasets (Molmo2-VideoPoint, VideoMolmo) empower progress in vision-language pixel grounding, counting, and reasoning, unattainable with closed proprietary corpora (Clark et al., 15 Jan 2026, Ahmad et al., 5 Jun 2025).
- Point-driven video object segmentation: Sparse point annotations bootstrap competitive performance in VOS with efficiency gains (Point-VOS achieves ≈80% of the full-mask baseline directly, >90% using pseudo-masks) (Zulfikar et al., 2024).
- Robotics and AR/VR interaction: Finger and gaze datasets (EgoGestAR, Panonut360) enable calibration and learning of intent, saliency, and control interfaces for edge deployment (Jain et al., 2019, Xu et al., 2024).
- Robust point tracking in challenging conditions: Synthetic benchmarks like GIFT stress-test algorithms in textureless, occluded, or motion-complex scenes, facilitating development of tracking models robust to real-world phenomena (Huang et al., 17 Mar 2025).
- 3D point tracking for dynamic scenes: TAPVid-3D extends trajectory annotation to three dimensions, supporting research in motion, deformation, and manipulation in complex environments (Koppula et al., 2024).
Future directions include extension to dense scene flow, semantic segmentation, dynamic scene representation, cross-domain adaptation, and integration with language-driven interactive systems. Expansion into underrepresented domains (sports, wildlife, in-the-wild scenes) is indicated as a research need (Koppula et al., 2024, Huang et al., 17 Mar 2025).