HowTo100M: Large-Scale Video-Language Dataset
- HowTo100M is a large-scale, weakly supervised instructional video corpus comprising 136M clip–caption pairs drawn from approximately 1.22M YouTube videos across 23,611 tasks.
- The dataset is automatically collected using WikiHow-derived queries and YouTube ASR, resulting in diverse yet noisy video–text pairs that necessitate specialized denoising methods.
- It supports a range of applications including text–video retrieval, action recognition, grounded captioning, and multilingual video understanding through advanced representation learning.
The HowTo100M dataset is a large-scale, weakly supervised instructional video corpus introduced to advance video–language research by providing more than one hundred million video clips paired with natural-language narrations. Sourced from YouTube instructional content covering more than 23,000 distinct visual tasks, HowTo100M offers an unprecedented scale of multimodal data and serves as a key foundation for research on text–video embedding, alignment, and grounded captioning. Because the dataset is collected automatically, without manual annotation, it combines immense diversity with considerable label noise, and exploiting it effectively in deep learning frameworks requires specialized learning and denoising methodologies (Miech et al., 2019).
1. Dataset Composition and Construction
HowTo100M consists of approximately 1.22 million unique YouTube videos, yielding approximately 136 million clip–caption pairs and over 134,000 hours of video. The corpus covers 23,611 visual tasks spanning categories such as Food & Entertaining, Home & Garden, Hobbies & Crafts, and Cars & Other Vehicles (Miech et al., 2019).
Videos were harvested using search queries derived from WikiHow articles. Subtitles come primarily from YouTube's automatic speech recognition (ASR) system, which produces line-level English subtitles with start/end timestamps. Each clip–caption sample pairs the video segment spanned by a timestamped subtitle line with that line's transcript. Videos average 6.5 minutes in length and roughly 110 clips each, with typical clips lasting about 4 seconds.
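The pairing scheme above can be sketched in a few lines. This is a minimal illustration: the `(start, end, text)` subtitle tuple layout and the dictionary field names are assumptions for exposition, not the released file format.

```python
# Illustrative sketch: turn timestamped ASR subtitle lines into
# clip-caption pairs, following HowTo100M's construction. The tuple
# layout and dict fields are assumed here for illustration only.

def make_clip_caption_pairs(video_id, subtitles):
    """subtitles: list of (start_sec, end_sec, text) ASR lines."""
    pairs = []
    for start, end, text in subtitles:
        pairs.append({
            "video_id": video_id,
            "clip": (start, end),  # video segment spanned by the line
            "caption": text,       # the line's transcript
        })
    return pairs

subs = [(0.0, 4.1, "first we peel the garlic"),
        (4.1, 7.9, "then chop it finely")]
pairs = make_clip_caption_pairs("abc123", subs)
```

Scaled over ~1.22M videos, this simple segmentation is what yields the ~136M pairs, and also why captions inherit every ASR and timing error.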
Annotation is entirely automatic, and captions frequently exhibit temporal asynchrony, off-topic narration, or ASR errors. Manual checks reveal that only approximately 51% of pairs maintain direct visual/narrative alignment (Miech et al., 2019).
| Statistic | Value | Description |
|---|---|---|
| #Videos | 1.22M | Unique YouTube instructional videos |
| #Clip–caption pairs | 136M | Each with video segment + ASR caption |
| Avg. video length | 6.5 min | Per video |
| #Tasks | 23,611 | WikiHow-derived visual tasks |
| Language | English | ASR/user captions; later extended to nine languages (Multi-HowTo100M) |
2. Extension to Multi-Modal and Multilingual Domains
HowTo100M supports not only video–text modalities but also video–audio–text triplets. Audio tracks are aligned by timestamp with video frames and subtitles (Wu et al., 2022). To facilitate research in multilingual video understanding, the dataset has been extended to Multi-HowTo100M, incorporating time-aligned subtitles in nine languages (English, German, French, Russian, Spanish, Czech, Swahili, Chinese, Vietnamese) either from user-contributed captions or via Google ASR-translate pipelines. In Multi-HowTo100M, supervision is distributed across languages at scale, with approximately ten billion subtitle tokens per language and about 1.1 million videos with at least one subtitle track (Huang et al., 2021).
3. Dataset Noise and Temporal Misalignment
HowTo100M's automatic collection process introduces significant noise and weak alignment in video–text pairs. Primary sources of noise include:
- Unalignable sentences: Up to 70% of subtitle lines may not directly describe visible actions, but instead provide off-topic commentary or narrative background.
- Temporal asynchrony: Descriptions may precede or follow the described action, often out of order.
- ASR errors: Lead to word substitutions, omitted words, and fragmented or duplicated subtitles.
A 10-hour, manually annotated subset of 80 videos (HTM-Align) quantifies this misalignment: of roughly 49,000 ASR sentences, only ~13,000 were marked as visually alignable, typically less than one-third of the sentences in a given video (Han et al., 2022).
4. Methodologies for Representation Learning and Denoising
Exploiting HowTo100M's scale requires robust learning frameworks capable of handling weak and noisy supervision. Several strategies have been developed:
4.1 Text–Video Embedding via Contrastive Learning
The initial HowTo100M embedding model encodes video and text into a joint space using deep CNNs (ResNet-152, ResNeXt-101 3D for video; word2vec+CNN for text). Learning employs a margin-based contrastive (ranking) loss that penalizes unmatched pairs, with intra- and inter-video negative sampling to maximize discrimination (Miech et al., 2019).
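A minimal sketch of such a margin-based ranking loss follows. Scalar similarity scores stand in for dot products of the learned video and text embeddings, and the margin value is illustrative:

```python
# Hinge-style ranking loss: every negative (mismatched) clip-caption
# pair should score at least `margin` below the positive pair.
# Negatives may come from the same video (intra) or other videos (inter).

def ranking_loss(pos_sim, neg_sims, margin=0.1):
    return sum(max(0.0, margin + neg - pos_sim) for neg in neg_sims)

# A well-separated positive incurs zero loss; a confusable negative
# (similarity close to the positive's) contributes a penalty.
easy = ranking_loss(0.9, [0.2, 0.3])  # 0.0
hard = ranking_loss(0.5, [0.45])      # ~0.05
```

Intra-video negatives (other clips of the same video) are typically harder than inter-video ones, which is why both are sampled.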
4.2 Temporal Alignment and Weak Supervision
For long-form video understanding, co-training methods operate two complementary networks: a Temporal Alignment Network (TAN) with joint video–text encoding, and an auxiliary dual encoder. Self-correction via agreement on pseudo-labels (temporal windows and alignability confidences) enables denoising and hard-negative mining without manual annotation, with contrastive and cross-entropy losses layered for robust temporal alignment. The resulting cleaner pseudo-aligned data, when used for downstream representation learning, yields substantial improvements in both text–video alignment and action recognition (Han et al., 2022).
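The agreement-based self-correction idea can be sketched as follows. This is a hedged simplification: the threshold and the two score lists standing in for the two networks' alignability confidences are illustrative, not the paper's exact procedure.

```python
# Keep only clip-sentence pairs that both networks deem alignable;
# everything else is treated as noise (or mined as a hard negative).

def agreed_pairs(scores_net_a, scores_net_b, threshold=0.5):
    """scores_*: per-pair alignability confidences from two models."""
    return [i for i, (a, b) in enumerate(zip(scores_net_a, scores_net_b))
            if a >= threshold and b >= threshold]

kept = agreed_pairs([0.9, 0.2, 0.7], [0.8, 0.9, 0.4])  # [0]
```

Requiring *both* models to agree trades recall for precision, which is the point: the surviving pairs form a cleaner pseudo-labeled training set.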
4.3 Multimodal Contrastive Pre-training with Gradient Conflict Mitigation
For video–audio–text pre-training, frameworks such as VATT leverage InfoNCE and MIL-NCE losses for video–audio and video–text pairs, respectively. The pipeline identifies gradient conflict between modal losses as an indicator of alignment noisiness, applying gradient surgery (orthogonal projection) and curriculum learning to emphasize high-quality triplets early in training. This gradient harmonization enhances downstream retrieval and classification stability (Wu et al., 2022).
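The orthogonal-projection step ("gradient surgery") can be sketched in a few lines. Gradients are plain Python lists here for clarity; in practice the projection operates on per-parameter gradients of the two modal losses.

```python
# If the video-audio and video-text gradients conflict (negative dot
# product), project the first gradient onto the plane orthogonal to
# the second, removing the conflicting component.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def project_if_conflicting(g1, g2):
    d = dot(g1, g2)
    if d >= 0:
        return list(g1)  # no conflict: leave the gradient untouched
    scale = d / dot(g2, g2)
    return [a - scale * b for a, b in zip(g1, g2)]

g = project_if_conflicting([-1.0, 1.0], [1.0, 0.0])  # -> [0.0, 1.0]
```

After projection the modified gradient is orthogonal to the other loss's gradient, so one modality's update no longer directly undoes the other's.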
5. Grounded Video Captioning and Object Grounding
Recent advances leverage HowTo100M as a raw source for large-scale datasets in grounded video captioning, notably the HowToGround and HowToGround1M corpora (Kazakos et al., 2024, Kazakos et al., 13 Mar 2025). These are built via a three-stage LLM-driven automatic annotation pipeline:
- Frame-wise grounded captioning: Still-image grounded captioners (e.g., GLaMM) yield per-frame captions and segmentation masks, which are converted to bounding boxes.
- Video-level caption aggregation: Extracted subject-verb-object triplets from frame captions are summarized into concise video-level captions, with explicit noun phrase marking.
- Phrase-to-track assignment: An LLM assigns frame-level noun phrases to the marked video-level phrases, producing temporally linked bounding box tubes.
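The final linking stage can be illustrated with a toy grouping step. The data structures are assumptions for exposition; the real pipeline assigns LLM-labeled phrases over detected boxes.

```python
# Group per-frame boxes that were assigned the same video-level noun
# phrase into a temporally ordered bounding-box "tube".

def build_tubes(frame_boxes):
    """frame_boxes: list of (frame_idx, phrase, box) triples,
    where box is an (x1, y1, x2, y2) tuple."""
    tubes = {}
    for frame_idx, phrase, box in sorted(frame_boxes):
        tubes.setdefault(phrase, []).append((frame_idx, box))
    return tubes

boxes = [(1, "knife", (0, 0, 5, 5)),
         (0, "knife", (1, 1, 6, 6)),
         (0, "board", (2, 2, 9, 9))]
tubes = build_tubes(boxes)
# tubes["knife"] == [(0, (1, 1, 6, 6)), (1, (0, 0, 5, 5))]
```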
Key statistics for HowToGround1M include 1 million videos, 43.6M frames, 80M bounding box tubes, and 3.2 million noun-phrase mentions (Kazakos et al., 13 Mar 2025). These large-scale pseudo-labeled datasets enable pre-training of models such as VideoGround and GROVE, which achieve state-of-the-art results on grounded caption generation benchmarks. Multi-task objectives combine a language-modeling loss for captioning with bounding box regression (gIoU and L1) and a binary cross-entropy temporal objectness loss:

$$\mathcal{L} = \lambda_{\text{cap}}\,\mathcal{L}_{\text{cap}} + \lambda_{\text{gIoU}}\,\mathcal{L}_{\text{gIoU}} + \lambda_{\text{L1}}\,\mathcal{L}_{\text{L1}} + \lambda_{\text{obj}}\,\mathcal{L}_{\text{obj}},$$

with all $\lambda$ set to 1 for VideoGround, and weighted 1:2:2:2 (caption : gIoU : L1 : objectness) for GROVE (Kazakos et al., 2024, Kazakos et al., 13 Mar 2025).
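As a trivial sanity check of the weighting schemes, the combined loss can be written out directly (the loss values are placeholders and the function name is illustrative):

```python
# Weighted multi-task loss: captioning + box regression (gIoU, L1) +
# temporal objectness, with one lambda weight per term.

def combined_loss(l_cap, l_giou, l_l1, l_obj, weights=(1, 1, 1, 1)):
    w_cap, w_giou, w_l1, w_obj = weights
    return w_cap * l_cap + w_giou * l_giou + w_l1 * l_l1 + w_obj * l_obj

video_ground = combined_loss(1.0, 1.0, 1.0, 1.0)                 # 4.0
grove = combined_loss(1.0, 1.0, 1.0, 1.0, weights=(1, 2, 2, 2))  # 7.0
```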
6. Downstream Applications and Impact
HowTo100M pre-training, combined with targeted fine-tuning, delivers transferability across a spectrum of benchmarks:
- Text-to-video retrieval: On YouCook2, off-the-shelf embeddings outperform or match domain-specific models; fine-tuned embeddings yield state-of-the-art results (Miech et al., 2019, Han et al., 2022).
- Action segmentation and recognition: TAN-based models zero-shot transfer to datasets such as Breakfast-Action, with significant gains in F-Accuracy and mean IoU (Han et al., 2022).
- Grounded Captioning: GROVE and VideoGround achieve superior CIDEr and AP₅₀ scores on iGround, VidSTG, and ActivityNet-Entities (Kazakos et al., 13 Mar 2025).
- Zero-shot cross-lingual retrieval: Multi-HowTo100M substantially boosts retrieval metrics in languages other than English on MSR-VTT and VATEX (Huang et al., 2021).
The scale of HowTo100M not only supports competitive performance in low-annotation regimes (fine-tuning on small subsets suffices) but also enables progress in challenging scenarios such as video object grounding and spatio-temporal modeling (Kazakos et al., 2024, Kazakos et al., 13 Mar 2025).
7. Limitations and Best Practices
While its scale and diversity are unmatched, HowTo100M remains limited by annotation noise and temporal misalignment. Best practices for leveraging the dataset include:
- Using it for pre-training universal video–language backbones, followed by task-specific fine-tuning.
- Incorporating denoising, hard negative mining, and alignment-aware pre-processing for tasks demanding high precision.
- Applying LLM-driven pipelines for object grounding, given the lack of explicit spatial annotation in the raw dataset.
- Employing cross-modal harmony techniques, such as gradient harmonization, for optimizing representation learning in the presence of weak supervision.
Manual benchmarks (e.g., HTM-Align, iGround, GROC) remain essential for accurate evaluation and protocol tuning due to the intrinsic noisiness of the automatic labels.
References
- "HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips" (Miech et al., 2019)
- "Temporal Alignment Networks for Long-term Video" (Han et al., 2022)
- "Scaling Multimodal Pre-Training via Cross-Modality Gradient Harmonization" (Wu et al., 2022)
- "Multilingual Multimodal Pre-training for Zero-Shot Cross-Lingual Transfer of Vision-Language Models" (Huang et al., 2021)
- "Grounded Video Caption Generation" (Kazakos et al., 2024)
- "Large-scale Pre-training for Grounded Video Caption Generation" (Kazakos et al., 13 Mar 2025)