Learning from 136 Million Video Clips: HowTo100M

This presentation explores HowTo100M, a groundbreaking approach to learning text-video embeddings from 136 million narrated instructional video clips. The authors introduce a massive dataset collected from 1.22 million YouTube instructional videos and demonstrate how weakly-supervised learning at this unprecedented scale enables state-of-the-art performance on video retrieval and action localization tasks across multiple benchmarks.
Script
Imagine trying to teach a machine to understand video by showing it every cooking tutorial, home repair guide, and craft demonstration on the internet. That is exactly what the authors of this paper set out to do, and the results fundamentally changed how we think about learning from video at scale.
Building on that vision, let's examine the core problem they tackled.
Traditional approaches hit a wall because creating labeled video datasets is incredibly expensive. The authors recognized that instructional videos on YouTube already contain perfectly synchronized narration describing what's happening on screen, creating a massive source of weakly-supervised training data just waiting to be harvested.
This insight led them to create something unprecedented.
The resulting HowTo100M dataset dwarfs everything that came before it. While previous datasets contained thousands of clips, HowTo100M delivers 136 million clips spanning everything from knitting to electrical repair, all collected automatically from freely available web videos.
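The key enabler of this automatic collection is that narrated videos already come with time-stamped subtitles, so each subtitle interval can be paired with the clip it spans. Here is a minimal sketch of that pairing step, assuming subtitles arrive as (start, end, text) triples; the function name and filtering rules are illustrative, not the authors' actual tooling.

```python
def clips_from_subtitles(subtitles):
    """Turn time-stamped subtitles into weak (clip, caption) pairs.

    subtitles: list of (start_sec, end_sec, text) triples, e.g. from ASR
    or uploader-provided captions. Each surviving triple defines one
    training pair: the video segment [start, end] and its narration.
    """
    pairs = []
    for start, end, text in subtitles:
        text = text.strip()
        if text and end > start:  # drop empty or zero-length intervals
            pairs.append((start, end, text))
    return pairs
```

Because no human verifies that the narration actually describes what is on screen, the resulting supervision is noisy, which is exactly why the paper treats it as a weak, rather than a clean, training signal.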
Now let's explore how they turned this data into powerful embeddings.
The model learns to map both video clips and their narrations into a shared embedding space. Training uses a max-margin ranking loss with negatives sampled both from other videos and from within the same video, ensuring that matching video-text pairs stay close together while unrelated pairs are pushed apart.
This visualization beautifully demonstrates what the model learned. Each row shows clips retrieved based on embedding similarity, organized into semantic clusters. Notice how the model successfully groups together clips about knitting, measuring wood, seasoning food, and electrical work, proving it captured meaningful visual and linguistic patterns across wildly different instructional domains.
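Retrieval in a shared embedding space like this reduces to nearest-neighbor search: embed the query, then rank clips by similarity. A minimal sketch, assuming embeddings are already L2-normalized so the dot product equals cosine similarity (function and variable names are illustrative):

```python
import numpy as np

def retrieve_clips(query_emb, clip_embs, k=3):
    """Return indices of the k clips most similar to a query embedding.

    query_emb: (dim,) normalized query vector (e.g. an embedded caption).
    clip_embs: (num_clips, dim) normalized clip embeddings.
    """
    sims = clip_embs @ query_emb          # cosine similarity per clip
    return np.argsort(-sims)[:k]          # indices of top-k matches
```

The semantic clusters in the visualization emerge from exactly this mechanism: clips about the same activity end up near each other, so they retrieve one another.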
The real test came when they evaluated these embeddings on established benchmarks.
The results speak for themselves. On CrossTask, the embeddings achieved 33.6 percent average recall, actually surpassing methods trained on manually labeled data. After fine-tuning on MSR-VTT, they hit 52.8 percent recall at rank 10, establishing a new state of the art and proving that scale matters enormously in multi-modal learning.
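The recall figures above measure how often the correct item lands in a query's top-k retrieved results. A small sketch of that metric, assuming a similarity matrix where row i's ground-truth match is column i (a common benchmark convention):

```python
import numpy as np

def recall_at_k(sim_matrix, k):
    """Recall@k for retrieval: fraction of queries whose ground-truth
    item (column i for query row i) appears among the top-k results.

    sim_matrix: (num_queries, num_items) similarity scores.
    """
    # Rank items per query from most to least similar
    ranks = np.argsort(-sim_matrix, axis=1)
    truth = np.arange(sim_matrix.shape[0])[:, None]
    hits = (ranks[:, :k] == truth).any(axis=1)
    return hits.mean()
```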
This work fundamentally shifts our understanding of what's possible with large-scale weakly-supervised learning. The authors demonstrated that with the right data source and enough scale, we can learn robust multi-modal representations without expensive human labeling, a finding with profound implications for future video understanding research.
HowTo100M proves that sometimes the best teacher is simply watching millions of people explain how things work. Visit EmergentMind.com to explore more cutting-edge research transforming how machines learn from video.