OpenLVD200M: 200M-Image Corpus for Vision Distillation
- OpenLVD200M is a 200 million-image dataset curated for efficient multi-teacher distillation, enabling precise and data-efficient representation learning.
- The corpus is built using a four-level hierarchical k-means clustering and balanced sampling strategy to ensure uniform concept coverage, overcoming long-tailed distributions.
- Designed for integration into self-supervised and distillation pipelines, OpenLVD200M improves convergence speed and downstream accuracy in vision foundation models.
OpenLVD200M is a 200 million-image training corpus constructed explicitly to optimize multi-teacher distillation workflows for vision foundation models. Generated through an aggressive hierarchical clustering and balanced sampling strategy from large-scale public datasets, OpenLVD200M addresses the inefficiencies of naïvely sampled web-scale data by ensuring uniform concept coverage, thereby enabling lower-cost, more data-efficient representation learning. The corpus is entirely unsupervised, containing only RGB images without provided captions or labels, and is designed for integration into distillation and self-supervised learning pipelines, notably underpinning AMoE (Agglomerative Mixture-of-Experts Vision Foundation Models) (Chaybouti et al., 23 Dec 2025).
1. Definition and Rationale
OpenLVD200M is defined as a 200 million-image dataset curated for efficient multi-teacher distillation, where a single student encoder learns joint representations from multiple frozen teacher models (such as SigLIP2 for image-text alignment and DINOv3 for dense visual geometry). Standard random sampling of large-scale web data is empirically insufficient: it leads to overrepresentation of head concepts and cannot affordably expose students to the tail of rare or fine-grained concepts. OpenLVD200M solves this by leveraging hierarchical k-means clustering and balanced sampling, ensuring uniform coverage across the semantic space and improving sample efficiency, convergence speed, and downstream accuracy relative to random subsets (Chaybouti et al., 23 Dec 2025).
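As a rough illustration of the multi-teacher setup, the sketch below matches a student embedding to several frozen teacher embeddings via cosine similarity. The actual per-teacher losses and weighting used in AMoE may differ; all names here are illustrative.

```python
import math

def cosine_distill_loss(student_vec, teacher_vecs, weights=None):
    """Weighted average of (1 - cosine similarity) between one student
    embedding and several frozen teacher embeddings: a stand-in for
    per-teacher distillation losses applied to a shared student."""
    if weights is None:
        weights = [1.0 / len(teacher_vecs)] * len(teacher_vecs)

    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm = lambda v: math.sqrt(sum(x * x for x in v))
        return dot / (norm(a) * norm(b))

    return sum(w * (1.0 - cos(student_vec, t))
               for w, t in zip(weights, teacher_vecs))

# A student collinear with both teachers incurs zero loss.
print(cosine_distill_loss([1.0, 0.0], [[2.0, 0.0], [3.0, 0.0]]))  # prints 0.0
```

In practice each teacher (e.g., SigLIP2, DINOv3) would contribute its own projection head and loss term; the scalar weighting shown here is the simplest way to combine them.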
2. Source Datasets and Corpus Composition
OpenLVD200M is drawn from a 2.3 billion-image superset comprising two major sources:
- LAION-5B, contributing approximately 2 billion CC-BY-licensed images.
- DFN (Data Filtering Networks), adding around 300 million images under a CC0-like custom license.
The final OpenLVD200M set contains exactly 200,000,000 RGB images in mixed aspect ratios. It is strictly an unsupervised corpus with no accompanying textual labels or captions. Images span multiple resolutions, reflecting the original sources.
| Source | Images Contributed | License |
|---|---|---|
| LAION-5B | ~2,000,000,000 | CC-BY 4.0 |
| DFN | ~300,000,000 | Custom (CC0-like) |
| OpenLVD200M | 200,000,000 | CC-BY 4.0 (most permissive) |
3. Hierarchical Clustering and Balanced Sampling Pipeline
The construction of OpenLVD200M applies a four-level hierarchical k-means clustering pipeline to flatten the dataset’s long-tailed concept distribution: centroids are computed at four nested levels of increasing granularity, from coarse Level-1 clusters down to fine-grained Level-4 concept clusters.
Pipeline steps:
- All 2.3B candidate images are first embedded with the DINOv3-ViT-B encoder; a 1B-image subset is used to fit the clustering.
- K-means is run at each level to find centroids $\{c_j\}$ minimizing the quantization error $\sum_i \min_j \lVert x_i - c_j \rVert_2^2$ over the embeddings.
- The remaining 1.3B images are assigned to the nearest Level 1 centroids.
- Balanced hierarchical sampling is employed: each level’s clusters are sampled uniformly, then images within each cluster are sampled uniformly, yielding a final corpus with near-uniform representation of all Level-4 concepts.
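The clustering and assignment steps above can be sketched in miniature. This toy Lloyd's-iteration k-means on 2-D points stands in for the large-scale k-means run over DINOv3 embeddings; all names and values are illustrative.

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means on 2-D points: alternate nearest-centroid
    assignment and centroid recomputation."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        buckets = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: (p[0] - centroids[j][0]) ** 2
                                               + (p[1] - centroids[j][1]) ** 2)
            buckets[nearest].append(p)
        # Recompute each centroid as its bucket mean; keep the old
        # centroid if the bucket is empty.
        centroids = [
            (sum(p[0] for p in b) / len(b), sum(p[1] for p in b) / len(b))
            if b else centroids[j]
            for j, b in enumerate(buckets)
        ]
    return centroids

def assign(point, centroids):
    """Nearest-centroid assignment, as applied to the images that were
    not part of the clustering subset."""
    return min(range(len(centroids)),
               key=lambda j: (point[0] - centroids[j][0]) ** 2
                           + (point[1] - centroids[j][1]) ** 2)
```

The production pipeline runs this at each of the four levels, clustering within each parent cluster to build the hierarchy.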
The final per-image sampling probability under this scheme is $p(x) = \prod_{\ell=1}^{4} \frac{1}{n_\ell(x)} \cdot \frac{1}{|C_4(x)|}$, where $n_\ell(x)$ is the number of sibling clusters among which $x$'s level-$\ell$ cluster is drawn uniformly and $|C_4(x)|$ is the size of $x$'s Level-4 cluster.
Images are drawn without replacement until 200M are selected. This approach systematically overcomes the long-tailed distribution challenge endemic to web-scraped datasets (Chaybouti et al., 23 Dec 2025).
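A minimal sketch of the resulting sampling probabilities, assuming a nested-dict representation of the cluster hierarchy (an illustrative data structure, not the released format):

```python
def draw_prob(node, path):
    """Probability of drawing one specific image under balanced
    hierarchical sampling: at each level a child cluster is chosen
    uniformly at random, then one image is chosen uniformly inside the
    final (leaf) cluster. `node` is a nested dict of clusters whose
    leaves are image counts; `path` lists the cluster keys from root
    to leaf."""
    p = 1.0
    for key in path:
        p /= len(node)   # uniform choice among sibling clusters
        node = node[key]
    return p / node      # uniform choice among images in the leaf

# Toy two-level hierarchy: one 1000-image "head" leaf and three
# 10-image "tail" leaves. Every leaf receives equal total mass (1/4),
# so each tail image is 100x more likely to be drawn than each head
# image, which is exactly the flattening effect described above.
tree = {"A": {"a1": 1000, "a2": 10}, "B": {"b1": 10, "b2": 10}}
```

This makes the head/tail rebalancing concrete: cluster-level uniformity upweights rare concepts in inverse proportion to their cluster sizes.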
4. Preprocessing and Training Integration
OpenLVD200M’s design explicitly supports advanced batching and training strategies:
- Token-Balanced Batching: Each accelerator processes image sequences as packed tokens; sequences combine images of different resolutions within an overall per-device token budget, using FlexAttention masks that prevent attention across image boundaries. This maintains uniform load across GPUs, prevents over-weighting of high-resolution examples, and normalizes per-image losses by patch count.
- Two-Stage Distillation:
- Stage 1: All images are resized so their longer side does not exceed a fixed maximum, retaining aspect ratio and relying on token batching for padding/packing.
- Stage 2: Joint mixing of:
  - OpenLVD200M at the Stage-1 base resolution
  - Natural-size images within a bounded resolution range
  - High-resolution images downsampled to two fixed target resolutions
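The token-balanced packing described above can be sketched as follows. The patch size, token budget, and function names are illustrative assumptions, not values from the paper:

```python
def pack_sequences(image_sizes, patch=16, token_budget=4096):
    """Greedily pack images, given as (height, width) in pixels, into
    sequences whose total patch-token count stays within a budget."""
    seqs, cur, cur_tokens = [], [], 0
    for h, w in image_sizes:
        tokens = (h // patch) * (w // patch)  # patch tokens per image
        if cur and cur_tokens + tokens > token_budget:
            seqs.append(cur)                  # budget hit: new sequence
            cur, cur_tokens = [], 0
        cur.append((h, w))
        cur_tokens += tokens
    if cur:
        seqs.append(cur)
    return seqs

def block_diag_mask(token_counts):
    """Block-diagonal boolean mask permitting attention only within
    each image's own tokens (the role played by the FlexAttention
    masks in the pipeline)."""
    total = sum(token_counts)
    mask = [[False] * total for _ in range(total)]
    start = 0
    for n in token_counts:
        for i in range(start, start + n):
            for j in range(start, start + n):
                mask[i][j] = True
        start += n
    return mask
```

Normalizing each image's loss by its patch count, as the text notes, keeps a 512-pixel image from contributing four times the gradient signal of a 256-pixel one.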
Standard random cropping, horizontal flips, and color jitter are the only augmentations applied; most images are otherwise used in their native form.
| Image Resolution | Fraction of Corpus |
|---|---|
| 256×256 | 60% |
| 384×384 | 20% |
| 512×512 | 15% |
| >512 | 5% (Stage 2 only) |
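The Stage-1 aspect-ratio-preserving resize can be sketched as below. The actual maximum side length is not reproduced in this text, so 512 is used purely for illustration:

```python
def resize_to_max_side(h, w, max_side=512):
    """Shrink (never upscale) an image so its longer side is at most
    `max_side`, preserving aspect ratio; padding/packing of the
    resulting variable shapes is then handled by token batching."""
    longest = max(h, w)
    if longest <= max_side:
        return h, w          # already within bounds: keep native size
    scale = max_side / longest
    return round(h * scale), round(w * scale)
```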
5. Licensing, Distribution, and Usage Guidelines
OpenLVD200M inherits its licensing from the underlying datasets, resolving to CC-BY 4.0 to maximize permissiveness. The corpus is distributed as a list of 200M URLs and corresponding SHA-256 hashes, supporting automated download and data-integrity verification. No official validation split is provided; users are advised to carve out a 1M uniformly sampled hold-out set, stratified across clusters, for in-distribution evaluation.
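Given the URL-plus-SHA-256 manifest format described above, integrity verification of a downloaded image can be sketched as follows (the function name is illustrative):

```python
import hashlib

def verify_image(path, expected_sha256):
    """Stream a downloaded file through SHA-256 and compare against the
    manifest entry; returns True only on an exact hash match."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):  # 1 MiB chunks
            h.update(chunk)
    return h.hexdigest() == expected_sha256
```

Streaming in chunks keeps memory flat regardless of image size, which matters when verifying 200M downloads.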
Recommended best practice limits OpenLVD200M use to unsupervised distillation and self-supervised learning. Training supervised classifiers directly with this corpus requires compliance with the originating image licenses.
6. Empirical Impact and Benchmark Results
In the AMoE framework, OpenLVD200M enables highly sample-efficient training:
- AMoE trained on 200M OpenLVD images (230B tokens) achieves:
- State-of-the-art zero-shot image-text classification: macro-avg 84.13 (vs. 82.26 for RADIOv2.5, which used 1B random images and 1.1T tokens).
- Competitive segmentation and linear probe transfer.
- Superior Recall@1 for retrieval on MSCOCO5k and Flickr30k.
Ablation evidence:
- Replacing OpenLVD200M with a random 200M-image subset reduces image-text classification by 4.15% and kNN macro-avg by 2.42%.
- On fine-grained long-tail benchmarks such as FGVC-Aircraft, OpenLVD200M delivers gains exceeding +18% in top-1 accuracy, underscoring the effectiveness of hierarchical balancing for rare concepts (Chaybouti et al., 23 Dec 2025).
7. Loading, Data Access, and Practicalities
OpenLVD200M’s manifest is partitioned into shards of 1M URLs each. Cluster assignments are stored in an accompanying file (e.g., "openlvd_clusters.json"). Sampling and batching logic, as used in experimental pipelines, is illustrated by the following procedural outline:
```python
import json
import random
from collections import defaultdict

import torch
from torch.utils.data import DataLoader

# Group image IDs by their Level-4 (leaf) cluster assignment.
clusters = json.load(open("openlvd_clusters.json"))
lvl4_index = defaultdict(list)
for img_id, (c1, c2, c3, c4) in clusters.items():
    lvl4_index[c4].append(img_id)

# Sample an equal number of images from each leaf cluster, capped by
# the cluster's size, toward the 200M budget.
sampled = []
per_leaf = 200_000_000 // len(lvl4_index)
for leaf, img_ids in lvl4_index.items():
    sampled.extend(random.sample(img_ids, min(per_leaf, len(img_ids))))

with open("train_200M.txt", "w") as f:
    for img_id in sampled:
        f.write(img_id + "\n")

class OpenLVDLoader(torch.utils.data.Dataset):
    def __init__(self, manifest, transforms):
        self.ids = open(manifest).read().splitlines()
        self.transforms = transforms

    def __len__(self):
        return len(self.ids)

    def __getitem__(self, idx):
        url = self.ids[idx]
        img = load_and_decode(url)  # implement caching & retry
        return self.transforms(img)

loader = DataLoader(
    OpenLVDLoader("train_200M.txt", tfs),
    batch_size=None,          # batching handled by the token packer
    collate_fn=token_packer,
)
```
OpenLVD200M thus serves as a drop-in replacement for random web-scraped datasets in large-scale vision foundation model training, offering superior coverage, efficiency, and empirical performance in multi-teacher distillation settings (Chaybouti et al., 23 Dec 2025).