TimeLens-100K: High-Quality VTG Dataset
- The paper demonstrates that TimeLens-100K improves VTG training by using an automated, self-verifying re-annotation pipeline that reduces annotation errors to nearly 0%.
- TimeLens-100K comprises 100K query–segment pairs from 20K videos, sampled uniformly over 0–240 seconds, ensuring diverse and robust training inputs.
- Benchmark results show models trained on TimeLens-100K achieving up to a 23.5 percentage point increase in [email protected], underscoring its impact on performance.
TimeLens-100K is a large-scale, high-quality video temporal grounding (VTG) training dataset constructed via automated re-annotation with state-of-the-art multimodal LLMs (MLLMs). Unlike legacy VTG corpora with significant label ambiguity, TimeLens-100K implements rigorous query and segment quality control, establishing it as a modern reference corpus for developing and benchmarking VTG-capable models (Zhang et al., 16 Dec 2025).
1. Dataset Composition and Statistical Properties
TimeLens-100K comprises approximately 100,000 query–segment pairs covering around 20,000 unique videos. The videos originate from diverse open VTG resources, including but not limited to CosMo-Cap, InternVid-VTime, DiDeMo, QueryD, and HiREST, resulting in about five annotations per video on average. All annotations are used strictly for training; validation and test splits are not provided—benchmarking is performed exclusively on the separate TimeLens-Bench.
The video duration distribution is controlled through uniform sampling over the 0–240 s interval, maintaining a mean duration of approximately 107 s, inherited from the sources. The corpus includes a modest long-video tail, with minimum durations around 1 s and maximums extending to several minutes.
Event queries are concise natural-language descriptions (generally 5–15 words; for example, “Person picks up the red cup and drinks from it”) with no imposed taxonomy, so event categories reflect the variety present in the source datasets, spanning daily activities, sports, cooking, and other situational domains.
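These composition figures can be recomputed directly from the annotation file itself. A minimal sketch, assuming per-record `video_id` and `duration` fields as described in the data-format section (the toy records below are illustrative, not real dataset entries):

```python
from collections import defaultdict

def dataset_stats(annotations):
    """Summarize query-segment pairs per video and mean video duration."""
    durations = {}                # one duration per unique video
    per_video = defaultdict(int)  # annotation count per video
    for ann in annotations:
        per_video[ann["video_id"]] += 1
        durations[ann["video_id"]] = ann["duration"]
    n_videos = len(per_video)
    return {
        "num_pairs": len(annotations),
        "num_videos": n_videos,
        "pairs_per_video": len(annotations) / n_videos,
        "mean_duration_s": sum(durations.values()) / n_videos,
    }

# Toy records mirroring the reported ratio of ~5 annotations per video.
toy = [{"video_id": f"v{i // 5}", "duration": 107.0,
        "query": "...", "start_time": 0.0, "end_time": 10.0}
       for i in range(100)]
stats = dataset_stats(toy)
```

On the real corpus the same computation should recover roughly 100K pairs, 20K videos, ~5 pairs per video, and a ~107 s mean duration.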
2. Annotation Pipeline and Quality Control
The annotation process is fully automated and leverages a Gemini-2.5-Pro MLLM, directed by empirically tuned prompts that enforce consistency, coverage, and clarity. The core pipeline, presented in pseudocode within the source, proceeds as follows:
- For each video, the MLLM identifies a set of distinct events, seeking full temporal coverage.
- Each event is described by an unambiguous query and a temporal interval (start and end time in seconds).
- The MLLM self-verifies annotations, enforcing:
  - Uniqueness: No two queries describe the same segment (pairwise temporal intersection over union between segments must stay below a threshold).
  - Existence: Each event occurs as stated and can be supported with visual evidence.
  - Clarity: Queries avoid vague verbs and temporally leaking phrases.
- Samples failing these criteria are either regenerated or discarded.
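The rejection loop can be sketched in Python. The IoU threshold and the `exists`/`clear` hooks, which stand in for the MLLM self-verification calls, are illustrative assumptions rather than the paper's settings:

```python
def iou(a, b):
    """Temporal IoU of two (start, end) intervals in seconds."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def verify(events, iou_thresh=0.5, exists=lambda e: True, clear=lambda e: True):
    """Keep events passing uniqueness, existence, and clarity checks.

    `exists` and `clear` stand in for MLLM self-verification;
    `iou_thresh` is a placeholder, not the paper's value.
    """
    kept = []
    for ev in events:
        seg = (ev["start_time"], ev["end_time"])
        # Uniqueness: reject if the segment overlaps a kept one too much.
        if any(iou(seg, (k["start_time"], k["end_time"])) >= iou_thresh
               for k in kept):
            continue
        if exists(ev) and clear(ev):  # existence + clarity checks
            kept.append(ev)
    return kept

events = [
    {"query": "person enters the room", "start_time": 0.0, "end_time": 10.0},
    {"query": "someone walks in",       "start_time": 1.0, "end_time": 9.0},  # near-duplicate
    {"query": "person sits down",       "start_time": 20.0, "end_time": 30.0},
]
kept = verify(events)  # drops the near-duplicate second event
```

In the actual pipeline, a rejected sample triggers regeneration by the MLLM rather than silent removal.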
Manual spot checks on the original datasets indicated an error rate of at least 35% for ambiguous, missing, or imprecise queries. TimeLens-100K's automated rejection loop reduces annotation failures to approximately 0%. A plausible implication is that this systematic, self-rejecting approach substantially elevates effective dataset quality compared to prior VTG sets.
3. Data Format, Structure, and Metadata
The dataset’s directory layout reflects a modern, program-friendly architecture:
```
TimeLens-100K/
├─ videos/          # symlinks or download links to original videos
└─ annotations/
   └─ timelens_100k.json
```
Each annotation object in `timelens_100k.json` includes:
- `"video_id"` (string): Unique identifier matching a video file.
- `"query"` (string): Natural-language event description.
- `"start_time"` / `"end_time"` (float, sec): Temporal bounds.
- `"duration"` (float, sec): Video duration.
- `"fps"` (int): Sampling rate used (typically 2 Hz).
- `"width"`, `"height"` (int): Frame dimensions in pixels.
There is no per-sample “annotator confidence” field. Instead, dataset-level statistics on query lengths, video durations, and source corpus origin are summarized in the accompanying documentation.
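A minimal loader and validator for this schema might look as follows; the bounds check (start < end <= duration) is an assumption about well-formed records, not a documented guarantee, and the sample record is fabricated for illustration:

```python
import json

# Expected field types, following the schema documented above.
REQUIRED = {"video_id": str, "query": str, "start_time": float,
            "end_time": float, "duration": float, "fps": int,
            "width": int, "height": int}

def validate(record):
    """Check one annotation object against the documented schema."""
    for field, typ in REQUIRED.items():
        if not isinstance(record.get(field), typ):
            raise ValueError(f"bad or missing field: {field}")
    # Assumed invariant: segment bounds lie inside the video.
    if not 0.0 <= record["start_time"] < record["end_time"] <= record["duration"]:
        raise ValueError("need 0 <= start < end <= duration")
    return record

raw = '''{"video_id": "v0001",
          "query": "Person picks up the red cup and drinks from it",
          "start_time": 12.5, "end_time": 18.0, "duration": 107.0,
          "fps": 2, "width": 1280, "height": 720}'''
record = validate(json.loads(raw))
```

The full file would be parsed once with `json.load` and each object passed through the same check before training.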
4. Algorithmic Underpinnings and Filtering Criteria
The annotation pipeline is supplemented by quantitative, algorithmic verification steps. Sample difficulty is defined via the temporal overlap between model predictions and ground-truth segments, e.g. $d = 1 - \mathrm{IoU}(\hat{s}, s)$ for a predicted interval $\hat{s}$ and ground-truth interval $s$. Samples may be weighted during RL-based training using a Gaussian kernel:

$$w(d) = \exp\!\left(-\frac{(d - \mu)^2}{2\sigma^2}\right),$$

with kernel center $\mu$ and bandwidth $\sigma$. The reinforcement learning with verifiable rewards (RLVR) paradigm further instills ground-truth verifiability into model optimization, using IoU-based verifiable rewards and corresponding generalized REINFORCE policy gradient objectives.
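A sketch of this Gaussian-kernel weighting under stated assumptions: difficulty is taken as 1 − IoU between a rollout and the ground-truth segment (one common choice, not necessarily the paper's exact definition), and `mu`/`sigma` are illustrative placeholders:

```python
import math

def temporal_iou(pred, gt):
    """IoU of two (start, end) intervals in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def gaussian_weight(difficulty, mu=0.5, sigma=0.25):
    """Gaussian kernel over sample difficulty; mu/sigma are illustrative."""
    return math.exp(-((difficulty - mu) ** 2) / (2 * sigma ** 2))

# Difficulty as 1 - IoU of a model rollout vs. the ground-truth segment.
d = 1.0 - temporal_iou((10.0, 20.0), (12.0, 22.0))
w = gaussian_weight(d)
```

The kernel peaks at samples of intermediate difficulty (`d == mu`) and down-weights trivially easy or near-impossible ones.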
5. Licensing and Accessibility
The annotations, code, and associated model weights are to be released under the Apache 2.0 license. Videos remain subject to their original third-party licenses but are accessible via the provided directory structure (symlinks or download links). The official release venue is https://timelens-arc-lab.github.io/ (Zhang et al., 16 Dec 2025).
6. Comparative Assessment and Downstream Impact
Relative to previous VTG training sets (typically aggregating ∼50K examples), TimeLens-100K doubles the annotation count and institutes automated end-to-end verification. Prior datasets are estimated to have ≥ 30% ambiguous or no-event queries, while TimeLens-100K is functionally free from such failures due to enforced quality criteria.
In downstream evaluation, models trained on TimeLens-100K demonstrate substantial performance improvements. For instance, Qwen2.5-VL-7B trained on original noisy VTG data achieves 30.4% [email protected] and 35.6% mIoU on Charades; the same model trained on TimeLens-100K attains 53.9% [email protected] and 48.3% mIoU, an increase of 23.5 percentage points in recall. Similar improvements of 12–15 mIoU points are reported for the ActivityNet-TimeLens and QVHighlights-TimeLens benchmarks.
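The metrics quoted above can be computed as follows; `evaluate` is a hypothetical helper assuming one predicted segment per query, aligned with its ground truth:

```python
def temporal_iou(pred, gt):
    """IoU of two (start, end) intervals in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def evaluate(preds, gts, thresh=0.5):
    """[email protected] and mean IoU over aligned prediction/ground-truth pairs."""
    ious = [temporal_iou(p, g) for p, g in zip(preds, gts)]
    recall = sum(i >= thresh for i in ious) / len(ious)
    miou = sum(ious) / len(ious)
    return recall, miou

# Toy example: one exact hit and two partial overlaps.
preds = [(0.0, 10.0), (5.0, 15.0), (20.0, 30.0)]
gts   = [(0.0, 10.0), (0.0, 10.0), (25.0, 35.0)]
recall, miou = evaluate(preds, gts)
```

[email protected] counts a prediction as correct when its IoU with the ground truth reaches 0.5, while mIoU averages the raw overlaps, which is why the two numbers move somewhat independently in the table of results.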
7. Significance and Broader Impact
TimeLens-100K enables the development of MLLMs with open-source reproducibility, bridging much of the performance gap with proprietary models (e.g., GPT-5, Gemini-2.5-Flash) in VTG. The automated, LLM-driven re-annotation mechanism and the enforced dataset design establish new evaluation and training standards for the video temporal grounding community (Zhang et al., 16 Dec 2025).