OpenLVD200M: 200M-Image Corpus for Vision Distillation
- OpenLVD200M is a 200 million-image dataset curated for efficient multi-teacher distillation, enabling precise and data-efficient representation learning.
- The corpus is built using a four-level hierarchical k-means clustering and balanced sampling strategy to ensure uniform concept coverage, overcoming long-tailed distributions.
- Designed for integration into self-supervised and distillation pipelines, OpenLVD200M improves convergence speed and downstream accuracy in vision foundation models.
OpenLVD200M is a 200 million-image training corpus constructed explicitly to optimize multi-teacher distillation workflows for vision foundation models. Generated through an aggressive hierarchical clustering and balanced sampling strategy from large-scale public datasets, OpenLVD200M addresses the inefficiencies of naïvely sampled web-scale data by ensuring uniform concept coverage, thereby enabling lower-cost, more data-efficient representation learning. The corpus is entirely unsupervised, containing only RGB images without provided captions or labels, and is designed for integration into distillation and self-supervised learning pipelines, notably underpinning AMoE (Agglomerative Mixture-of-Experts Vision Foundation Models) (Chaybouti et al., 23 Dec 2025).
1. Definition and Rationale
OpenLVD200M is defined as a 200 million-image dataset curated for efficient multi-teacher distillation, where a single student encoder learns joint representations from multiple frozen teacher models (such as SigLIP2 for image-text alignment and DINOv3 for dense visual geometry). Standard random sampling of large-scale web data is empirically insufficient: it leads to overrepresentation of head concepts and cannot affordably expose students to the tail of rare or fine-grained concepts. OpenLVD200M solves this by leveraging hierarchical k-means clustering and balanced sampling, ensuring uniform coverage across the semantic space and improving sample efficiency, convergence speed, and downstream accuracy relative to random subsets (Chaybouti et al., 23 Dec 2025).
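As a rough illustration of the multi-teacher setup, the sketch below matches a student embedding to several frozen teacher embeddings via cosine similarity. The actual per-teacher losses and weighting used in AMoE may differ; all names here are illustrative.

```python
import math

def cosine_distill_loss(student_vec, teacher_vecs, weights=None):
    """Weighted average of (1 - cosine similarity) between one student
    embedding and several frozen teacher embeddings: a stand-in for
    per-teacher distillation losses applied to a shared student."""
    if weights is None:
        weights = [1.0 / len(teacher_vecs)] * len(teacher_vecs)

    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm = lambda v: math.sqrt(sum(x * x for x in v))
        return dot / (norm(a) * norm(b))

    return sum(w * (1.0 - cos(student_vec, t))
               for w, t in zip(weights, teacher_vecs))

# A student collinear with both teachers incurs zero loss.
print(cosine_distill_loss([1.0, 0.0], [[2.0, 0.0], [3.0, 0.0]]))  # prints 0.0
```

In practice each teacher (e.g., SigLIP2, DINOv3) would contribute its own projection head and loss term; the scalar weighting shown here is the simplest way to combine them.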
2. Source Datasets and Corpus Composition
OpenLVD200M is drawn from a 2.3 billion-image superset comprising two major sources:
- LAION-5B, contributing approximately 2 billion CC-BY-licensed images.
- DFN (Data Filtering Networks), adding around 300 million images under a CC0-like custom license.
The final OpenLVD200M set contains exactly 200,000,000 RGB images in mixed aspect ratios. It is strictly an unsupervised corpus with no accompanying textual labels or captions. Images span multiple resolutions, reflecting the original sources.
| Source | Images Contributed | License |
|---|---|---|
| LAION-5B | ~2,000,000,000 | CC-BY 4.0 |
| DFN | ~300,000,000 | Custom (CC0-like) |
| OpenLVD200M | 200,000,000 | CC-BY 4.0 (most permissive) |
3. Hierarchical Clustering and Balanced Sampling Pipeline
The construction of OpenLVD200M applies a four-level hierarchical k-means clustering pipeline to flatten the dataset’s long-tailed concept distribution: centroids are computed at four nested levels of increasing granularity, from coarse Level-1 clusters down to fine-grained Level-4 concept clusters.
Pipeline steps:
- All 2.3B candidate images are first embedded with the DINOv3-ViT-B encoder; a 1B-image subset is used to fit the clustering.
- K-means is run at each level to find centroids $\{c_j\}$ minimizing the quantization error $\sum_i \min_j \lVert x_i - c_j \rVert_2^2$ over the embeddings.
- The remaining 1.3B images are assigned to the nearest Level 1 centroids.
- Balanced hierarchical sampling is employed: each level’s clusters are sampled uniformly, then images within each cluster are sampled uniformly, yielding a final corpus with near-uniform representation of all Level-4 concepts.
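The clustering and assignment steps above can be sketched in miniature. This toy Lloyd's-iteration k-means on 2-D points stands in for the large-scale k-means run over DINOv3 embeddings; all names and values are illustrative.

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means on 2-D points: alternate nearest-centroid
    assignment and centroid recomputation."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        buckets = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: (p[0] - centroids[j][0]) ** 2
                                               + (p[1] - centroids[j][1]) ** 2)
            buckets[nearest].append(p)
        # Recompute each centroid as its bucket mean; keep the old
        # centroid if the bucket is empty.
        centroids = [
            (sum(p[0] for p in b) / len(b), sum(p[1] for p in b) / len(b))
            if b else centroids[j]
            for j, b in enumerate(buckets)
        ]
    return centroids

def assign(point, centroids):
    """Nearest-centroid assignment, as applied to the images that were
    not part of the clustering subset."""
    return min(range(len(centroids)),
               key=lambda j: (point[0] - centroids[j][0]) ** 2
                           + (point[1] - centroids[j][1]) ** 2)
```

The production pipeline runs this at each of the four levels, clustering within each parent cluster to build the hierarchy.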
The final per-image sampling probability under this scheme is $p(x) = \prod_{\ell=1}^{4} \frac{1}{n_\ell(x)} \cdot \frac{1}{|C_4(x)|}$, where $n_\ell(x)$ is the number of sibling clusters among which $x$'s level-$\ell$ cluster is drawn uniformly and $|C_4(x)|$ is the size of $x$'s Level-4 cluster.
Images are drawn without replacement until 200M are selected. This approach systematically overcomes the long-tailed distribution challenge endemic to web-scraped datasets (Chaybouti et al., 23 Dec 2025).
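A minimal sketch of the resulting sampling probabilities, assuming a nested-dict representation of the cluster hierarchy (an illustrative data structure, not the released format):

```python
def draw_prob(node, path):
    """Probability of drawing one specific image under balanced
    hierarchical sampling: at each level a child cluster is chosen
    uniformly at random, then one image is chosen uniformly inside the
    final (leaf) cluster. `node` is a nested dict of clusters whose
    leaves are image counts; `path` lists the cluster keys from root
    to leaf."""
    p = 1.0
    for key in path:
        p /= len(node)   # uniform choice among sibling clusters
        node = node[key]
    return p / node      # uniform choice among images in the leaf

# Toy two-level hierarchy: one 1000-image "head" leaf and three
# 10-image "tail" leaves. Every leaf receives equal total mass (1/4),
# so each tail image is 100x more likely to be drawn than each head
# image, which is exactly the flattening effect described above.
tree = {"A": {"a1": 1000, "a2": 10}, "B": {"b1": 10, "b2": 10}}
```

This makes the head/tail rebalancing concrete: cluster-level uniformity upweights rare concepts in inverse proportion to their cluster sizes.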
4. Preprocessing and Training Integration
OpenLVD200M’s design explicitly supports advanced batching and training strategies:
- Token-Balanced Batching: Each accelerator processes image sequences as packed tokens; sequences combine images of different resolutions within an overall per-device token budget, using FlexAttention masks that prevent attention across image boundaries. This maintains uniform load across GPUs, prevents over-weighting of high-resolution examples, and normalizes per-image losses by patch count.
- Two-Stage Distillation:
- Stage 1: All images are resized so their longer side does not exceed a fixed maximum, retaining aspect ratio and relying on token batching for padding/packing.
- Stage 2: Joint mixing of:
  - OpenLVD200M at the Stage-1 base resolution
  - Natural-size images within a bounded resolution range
  - High-resolution images downsampled to two fixed target resolutions
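The token-balanced packing described above can be sketched as follows. The patch size, token budget, and function names are illustrative assumptions, not values from the paper:

```python
def pack_sequences(image_sizes, patch=16, token_budget=4096):
    """Greedily pack images, given as (height, width) in pixels, into
    sequences whose total patch-token count stays within a budget."""
    seqs, cur, cur_tokens = [], [], 0
    for h, w in image_sizes:
        tokens = (h // patch) * (w // patch)  # patch tokens per image
        if cur and cur_tokens + tokens > token_budget:
            seqs.append(cur)                  # budget hit: new sequence
            cur, cur_tokens = [], 0
        cur.append((h, w))
        cur_tokens += tokens
    if cur:
        seqs.append(cur)
    return seqs

def block_diag_mask(token_counts):
    """Block-diagonal boolean mask permitting attention only within
    each image's own tokens (the role played by the FlexAttention
    masks in the pipeline)."""
    total = sum(token_counts)
    mask = [[False] * total for _ in range(total)]
    start = 0
    for n in token_counts:
        for i in range(start, start + n):
            for j in range(start, start + n):
                mask[i][j] = True
        start += n
    return mask
```

Normalizing each image's loss by its patch count, as the text notes, keeps a 512-pixel image from contributing four times the gradient signal of a 256-pixel one.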
Standard random cropping, horizontal flips, and color jitter are the only augmentations applied; most images are otherwise used in their native form.
| Image Resolution | Fraction of Corpus |
|---|---|
| 256×256 | 60% |
| 384×384 | 20% |
| 512×512 | 15% |
| >512 | 5% (Stage 2 only) |
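The Stage-1 aspect-ratio-preserving resize can be sketched as below. The actual maximum side length is not reproduced in this text, so 512 is used purely for illustration:

```python
def resize_to_max_side(h, w, max_side=512):
    """Shrink (never upscale) an image so its longer side is at most
    `max_side`, preserving aspect ratio; padding/packing of the
    resulting variable shapes is then handled by token batching."""
    longest = max(h, w)
    if longest <= max_side:
        return h, w          # already within bounds: keep native size
    scale = max_side / longest
    return round(h * scale), round(w * scale)
```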
5. Licensing, Distribution, and Usage Guidelines
OpenLVD200M inherits its licensing from the underlying datasets, resolving to CC-BY 4.0 to maximize permissiveness. The corpus is distributed as a list of 200M URLs and corresponding SHA-256 hashes, supporting automated download and data-integrity verification. No official validation split is provided; users are advised to carve out a 1M uniformly sampled hold-out set, stratified across clusters, for in-distribution evaluation.
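Given the URL-plus-SHA-256 manifest format described above, integrity verification of a downloaded image can be sketched as follows (the function name is illustrative):

```python
import hashlib

def verify_image(path, expected_sha256):
    """Stream a downloaded file through SHA-256 and compare against the
    manifest entry; returns True only on an exact hash match."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):  # 1 MiB chunks
            h.update(chunk)
    return h.hexdigest() == expected_sha256
```

Streaming in chunks keeps memory flat regardless of image size, which matters when verifying 200M downloads.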
Recommended best practice limits OpenLVD200M use to unsupervised distillation and self-supervised learning. Training supervised classifiers directly with this corpus requires compliance with the originating image licenses.
6. Empirical Impact and Benchmark Results
In the AMoE framework, OpenLVD200M enables highly sample-efficient training:
- AMoE trained on 200M OpenLVD images (230B tokens) achieves:
- State-of-the-art zero-shot image-text classification: macro-avg 84.13 (vs. 82.26 for RADIOv2.5, which used 1B random images and 1.1T tokens).
- Competitive segmentation and linear probe transfer.
- Superior Recall@1 for retrieval on MSCOCO5k and Flickr30k.
Ablation evidence:
- Replacing OpenLVD200M with a random 200M-image subset reduces image-text classification by 4.15% and kNN macro-avg by 2.42%.
- On fine-grained long-tail benchmarks such as FGVC-Aircraft, OpenLVD200M delivers gains exceeding +18% in top-1 accuracy, underscoring the effectiveness of hierarchical balancing for rare concepts (Chaybouti et al., 23 Dec 2025).
7. Loading, Data Access, and Practicalities
OpenLVD200M’s manifest is partitioned into shards of 1M URLs each. Cluster assignments are stored in an accompanying file (e.g., "openlvd_clusters.json"). Sampling and batching logic, as used in experimental pipelines, is illustrated by the following procedural outline:
```python
import json
import random
from collections import defaultdict

import torch
from torch.utils.data import DataLoader

# Group image IDs by their Level-4 (leaf) cluster assignment.
clusters = json.load(open("openlvd_clusters.json"))
lvl4_index = defaultdict(list)
for img_id, (c1, c2, c3, c4) in clusters.items():
    lvl4_index[c4].append(img_id)

# Sample an equal number of images from each leaf cluster, capped by
# the cluster's size, toward the 200M budget.
sampled = []
per_leaf = 200_000_000 // len(lvl4_index)
for leaf, img_ids in lvl4_index.items():
    sampled.extend(random.sample(img_ids, min(per_leaf, len(img_ids))))

with open("train_200M.txt", "w") as f:
    for img_id in sampled:
        f.write(img_id + "\n")

class OpenLVDLoader(torch.utils.data.Dataset):
    def __init__(self, manifest, transforms):
        self.ids = open(manifest).read().splitlines()
        self.transforms = transforms

    def __len__(self):
        return len(self.ids)

    def __getitem__(self, idx):
        url = self.ids[idx]
        img = load_and_decode(url)  # implement caching & retry
        return self.transforms(img)

loader = DataLoader(
    OpenLVDLoader("train_200M.txt", tfs),
    batch_size=None,          # batching handled by the token packer
    collate_fn=token_packer,
)
```
OpenLVD200M thus serves as a drop-in replacement for random web-scraped datasets in large-scale vision foundation model training, offering superior coverage, efficiency, and empirical performance in multi-teacher distillation settings (Chaybouti et al., 23 Dec 2025).