OpenLVD200M: 200M-Image Corpus for Vision Distillation

Updated 30 December 2025
  • OpenLVD200M is a 200 million-image dataset curated for efficient multi-teacher distillation, enabling precise and data-efficient representation learning.
  • The corpus is built using a four-level hierarchical k-means clustering and balanced sampling strategy to ensure uniform concept coverage, overcoming long-tailed distributions.
  • Designed for integration into self-supervised and distillation pipelines, OpenLVD200M improves convergence speed and downstream accuracy in vision foundation models.

OpenLVD200M is a 200 million-image training corpus constructed explicitly to optimize multi-teacher distillation workflows for vision foundation models. Generated through an aggressive hierarchical clustering and balanced sampling strategy from large-scale public datasets, OpenLVD200M addresses the inefficiencies of naïvely sampled web-scale data by ensuring uniform concept coverage, thereby enabling lower-cost, more data-efficient representation learning. The corpus is entirely unsupervised, containing only RGB images without provided captions or labels, and is designed for integration into distillation and self-supervised learning pipelines, notably underpinning AMoE (Agglomerative Mixture-of-Experts Vision Foundation Models) (Chaybouti et al., 23 Dec 2025).

1. Definition and Rationale

OpenLVD200M is defined as a 200 million-image dataset curated for efficient multi-teacher distillation, where a single student encoder learns joint representations from multiple frozen teacher models (such as SigLIP2 for image-text alignment and DINOv3 for dense visual geometry). Standard random sampling of large-scale web data is empirically insufficient: it leads to overrepresentation of head concepts and cannot affordably expose students to the tail of rare or fine-grained concepts. OpenLVD200M solves this by leveraging hierarchical k-means clustering and balanced sampling, ensuring uniform coverage across the semantic space and improving sample efficiency, convergence speed, and downstream accuracy relative to random subsets (Chaybouti et al., 23 Dec 2025).
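A minimal sketch of this multi-teacher setup, assuming per-teacher student projection heads and a simple cosine-distillation loss; the arrays `siglip_like` and `dinov3_like` are illustrative stand-ins for teacher embeddings, not the actual model outputs, and the paper's exact loss may differ:

```python
import numpy as np

def cosine_distill_loss(student_feats, teacher_feats):
    """Mean (1 - cosine similarity) between matched student/teacher embeddings."""
    s = student_feats / np.linalg.norm(student_feats, axis=1, keepdims=True)
    t = teacher_feats / np.linalg.norm(teacher_feats, axis=1, keepdims=True)
    return float(np.mean(1.0 - np.sum(s * t, axis=1)))

def multi_teacher_loss(student_heads_out, teacher_outs, weights=None):
    """Sum per-teacher distillation losses (equal weights by default)."""
    weights = weights or [1.0] * len(teacher_outs)
    return sum(w * cosine_distill_loss(s, t)
               for w, s, t in zip(weights, student_heads_out, teacher_outs))

# Toy batch of 4 images, two teachers with different embedding dimensions
rng = np.random.default_rng(0)
siglip_like = rng.normal(size=(4, 768))   # stand-in for SigLIP2 embeddings
dinov3_like = rng.normal(size=(4, 1024))  # stand-in for DINOv3 embeddings
# Student projections: near the teachers, perturbed with small noise
student_proj = [siglip_like + 0.1 * rng.normal(size=(4, 768)),
                dinov3_like + 0.1 * rng.normal(size=(4, 1024))]
loss = multi_teacher_loss(student_proj, [siglip_like, dinov3_like])
```

In the real pipeline each head is a learned projection from the shared student trunk, so one encoder absorbs both image-text alignment and dense-geometry signals.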

2. Source Datasets and Corpus Composition

OpenLVD200M is drawn from a 2.3 billion-image superset comprising two major sources:

  • LAION-5B, contributing approximately 2 billion CC-BY licensed images.
  • DFN (Data Filtering Networks), adding around 300 million images under a CC0-like custom license.

The final OpenLVD200M set contains exactly 200,000,000 RGB images in mixed aspect ratios. It is strictly an unsupervised corpus with no accompanying textual labels or captions. Images span multiple resolutions, reflecting the original sources.

| Source | Images Contributed | License |
|---|---|---|
| LAION-5B | ~2,000,000,000 | CC-BY 4.0 |
| DFN | ~300,000,000 | Custom (CC0-like) |
| OpenLVD200M | 200,000,000 | CC-BY 4.0 (most permissive) |

3. Hierarchical Clustering and Balanced Sampling Pipeline

The construction of OpenLVD200M applies a four-level hierarchical k-means clustering pipeline to flatten the dataset’s long-tailed concept distribution:

  • Level 1: $K_1 = 20{,}000{,}000$ centroids
  • Level 2: $K_2 = 500{,}000$ centroids
  • Level 3: $K_3 = 50{,}000$ centroids
  • Level 4: $K_4 = 20{,}000$ centroids

Pipeline steps:

  1. All 2.3B candidate images are first embedded with the DINOv3-ViT-B encoder; a 1B-image subset is then used to fit the clustering.
  2. K-means is computed at each level $\ell$ to find centroids $\{\mu_{\ell, j}\}_{j=1..K_\ell}$, minimizing $\sum_{i \in I} \min_{j \in [1..K_\ell]} \|x_i - \mu_{\ell,j}\|^2$.
  3. The remaining 1.3B images are assigned to the nearest Level 1 centroids.
  4. Balanced hierarchical sampling is employed: each level’s clusters are sampled uniformly, then images within each cluster are sampled uniformly, yielding a final corpus with near-uniform representation of all Level-4 concepts.
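The pipeline above can be illustrated on toy data. The sketch below assumes a common construction for such hierarchies, in which each coarser level re-clusters the centroids of the level below; it uses plain Lloyd's k-means with two tiny levels in place of the paper's four levels and millions of centroids:

```python
import numpy as np

def kmeans(x, k, iters=20, seed=0):
    """Plain Lloyd's algorithm: returns (centroids, assignments)."""
    rng = np.random.default_rng(seed)
    centroids = x[rng.choice(len(x), size=k, replace=False)]
    for _ in range(iters):
        # Assign each point to its nearest centroid, then re-estimate means
        d = np.linalg.norm(x[:, None, :] - centroids[None, :, :], axis=-1)
        assign = d.argmin(axis=1)
        for j in range(k):
            pts = x[assign == j]
            if len(pts):
                centroids[j] = pts.mean(axis=0)
    return centroids, assign

# Toy stand-in for DINOv3-ViT-B embeddings: 400 points in 8-D
rng = np.random.default_rng(1)
emb = rng.normal(size=(400, 8))

c1, a1 = kmeans(emb, k=40)   # "Level 1" on the embedded subset
c2, a2 = kmeans(c1, k=8)     # "Level 2" re-clusters the Level-1 centroids
leaf_of_image = a2[a1]       # propagate each image to its coarse cluster
```

New images outside the fitted subset are handled exactly as in step 3: embed, then assign to the nearest Level-1 centroid.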

The final per-image sampling weight is:

$$w_i \propto \prod_{\ell=1}^{4} \frac{1}{|C_\ell(\text{cluster}(i))|}$$

Images are drawn without replacement until 200M are selected. This approach systematically overcomes the long-tailed distribution challenge endemic to web-scraped datasets (Chaybouti et al., 23 Dec 2025).
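The weight formula can be made concrete with a toy set of per-level cluster assignments; the image ids and `paths` below are purely illustrative:

```python
from collections import Counter

# Hypothetical cluster paths: image id -> (c1, c2, c3, c4) level assignments
paths = {
    "img_a": (0, 0, 0, 0),
    "img_b": (0, 0, 0, 0),
    "img_c": (0, 0, 0, 0),   # a crowded "head" leaf
    "img_d": (1, 1, 1, 1),   # a rare "tail" leaf
}

# |C_l(cluster(i))|: size of the cluster containing image i at each level
sizes = [Counter(p[level] for p in paths.values()) for level in range(4)]

def weight(img_id):
    """w_i proportional to prod over levels of 1 / |C_l(cluster(i))|."""
    w = 1.0
    for level in range(4):
        w /= sizes[level][paths[img_id][level]]
    return w

raw = {i: weight(i) for i in paths}
total = sum(raw.values())
probs = {i: w / total for i, w in raw.items()}
```

As intended, the image in the rare leaf ends up with a far higher selection probability than any individual head-cluster image.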

4. Preprocessing and Training Integration

OpenLVD200M’s design explicitly supports advanced batching and training strategies:

  • Token-Balanced Batching: Each accelerator processes image sequences as packed tokens; sequences combine images of different resolutions within an overall token budget (e.g., $C_\text{max} = 20{,}000$) using FlexAttention masks that prevent inter-image attention. This maintains uniform load across GPUs, prevents over-weighting high-resolution examples, and normalizes losses per image by patch count.
  • Two-Stage Distillation:
    • Stage 1: All images resized to a maximum of $256 \times 256$ pixels, retaining aspect ratio and relying on token batching for padding/packing.
    • Stage 2: Joint mixing of:
      • OpenLVD200M at $256 \times 256$
      • Natural-size images between $256 \times 256$ and $384 \times 384$
      • High-resolution images downsampled to both $256 \times 256$ and $512 \times 512$

Standard random cropping, horizontal flips, and color jitter are the only augmentations applied; most images are otherwise used in their native form.

| Image Resolution | Fraction of Corpus |
|---|---|
| 256×256 | 60% |
| 384×384 | 20% |
| 512×512 | 15% |
| >512 | 5% (Stage 2 only) |
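One way to realize the token-budget constraint is a simple greedy first-fit packer; the 16-pixel patch size and the image sizes below are assumptions for illustration, and the actual packer used in the paper may differ:

```python
def patch_count(h, w, patch=16):
    """Number of ViT patches for an h x w image (ceil division per axis)."""
    return -(-h // patch) * -(-w // patch)

def pack_by_tokens(images, budget=20_000):
    """Greedy packing: group images so each pack stays within the budget."""
    packs, current, used = [], [], 0
    for img_id, (h, w) in images:
        t = patch_count(h, w)
        if current and used + t > budget:
            packs.append(current)       # close the current pack
            current, used = [], 0
        current.append((img_id, t))
        used += t
    if current:
        packs.append(current)
    return packs

# Mixed resolutions as in Stage 2 (sizes and budget are illustrative)
imgs = [("a", (256, 256)), ("b", (512, 512)), ("c", (384, 384)),
        ("d", (256, 256)), ("e", (512, 512))]
packs = pack_by_tokens(imgs, budget=2_000)
```

Because packs are balanced by token count rather than image count, a pack may hold one 512×512 image or many 256×256 ones, keeping per-device load uniform.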

5. Licensing, Distribution, and Usage Guidelines

OpenLVD200M inherits its licensing from the underlying datasets, resolving to CC-BY 4.0, the most permissive applicable license. The corpus is distributed as a list of 200M URLs with corresponding SHA-256 hashes, supporting automated download and data-integrity verification. No official validation split is provided; users are advised to carve out a 1M uniformly sampled hold-out set, stratified across clusters, for in-distribution evaluation.
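Integrity checking against the distributed hashes can be done with the standard-library `hashlib`; the file name and bytes below are placeholders for a downloaded image and its manifest entry:

```python
import hashlib

def sha256_of(path, chunk=1 << 20):
    """Stream a file through SHA-256 so large images never sit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while block := f.read(chunk):
            h.update(block)
    return h.hexdigest()

def verify(path, expected_hex):
    """Return True iff the downloaded file matches its manifest hash."""
    return sha256_of(path) == expected_hex

# Example with a locally written stand-in for a downloaded image
with open("sample.bin", "wb") as f:
    f.write(b"fake image bytes")
manifest_hash = hashlib.sha256(b"fake image bytes").hexdigest()
ok = verify("sample.bin", manifest_hash)
```

Files that fail verification should be re-fetched or dropped, since silently corrupted images would otherwise enter the training stream.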

Recommended best practice limits OpenLVD200M use to unsupervised distillation and self-supervised learning. Training supervised classifiers directly with this corpus requires compliance with the originating image licenses.

6. Empirical Impact and Benchmark Results

In the AMoE framework, OpenLVD200M enables highly sample-efficient training:

  • AMoE trained on 200M OpenLVD images (~230B tokens) achieves:
    • State-of-the-art zero-shot image-text classification: macro-avg 84.13 (vs. 82.26 for RADIOv2.5, which used 1B random images and ~1.1T tokens).
    • Competitive segmentation and linear-probe transfer.
    • Superior Recall@1 for retrieval on MSCOCO5k and Flickr30k.

Ablation evidence:

  • Replacing OpenLVD200M with a random 200M-image subset reduces image-text classification by ~4.15% and kNN macro-avg by ~2.42%.
  • For fine-grained long-tail benchmarks such as FGVC-Aircraft, OpenLVD200M delivers gains exceeding +18% in top-1 accuracy, underscoring the effectiveness of hierarchical balancing for rare concepts (Chaybouti et al., 23 Dec 2025).

7. Loading, Data Access, and Practicalities

OpenLVD200M’s manifest is partitioned into shards of 1M URLs each. Cluster assignments are stored in an accompanying file (e.g., "openlvd_clusters.json"). Sampling and batching logic, as used in experimental pipelines, is illustrated by the following procedural outline:

import json
import random
from collections import defaultdict

import torch
from torch.utils.data import DataLoader

# Build a leaf-level (Level-4) index from the published cluster assignments.
clusters = json.load(open("openlvd_clusters.json"))
lvl4_index = defaultdict(list)
for img_id, (c1, c2, c3, c4) in clusters.items():
    lvl4_index[c4].append(img_id)

# Sample uniformly per leaf; clamp for leaves smaller than the quota.
sampled = []
per_leaf = 200_000_000 // len(lvl4_index)
for leaf, img_ids in lvl4_index.items():
    sampled.extend(random.sample(img_ids, min(per_leaf, len(img_ids))))

with open("train_200M.txt", "w") as f:
    for img_id in sampled:
        f.write(img_id + "\n")

class OpenLVDLoader(torch.utils.data.Dataset):
    def __init__(self, manifest, transforms):
        self.ids = open(manifest).read().splitlines()
        self.transforms = transforms

    def __len__(self):
        return len(self.ids)

    def __getitem__(self, idx):
        url = self.ids[idx]
        img = load_and_decode(url)     # implement caching & retry
        return self.transforms(img)

loader = DataLoader(OpenLVDLoader("train_200M.txt", tfs),
                    batch_size=None,  # batching handled by the token packer
                    collate_fn=token_packer)
Collation uses the token-balanced packer from Dehghani et al. (2023) or FlexAttention code. Images are loaded, decoded, and transformed for training, with batching handled at the token level for optimal throughput.
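The "no inter-image attention" constraint enforced during collation can be pictured as a block-diagonal boolean mask over the packed sequence. FlexAttention typically expresses this as a mask predicate rather than a materialized matrix, so the dense version below is only an illustrative equivalent:

```python
import numpy as np

def packed_attention_mask(lengths):
    """Boolean mask for a packed sequence: token i may attend to token j
    only if both belong to the same image (block-diagonal structure)."""
    total = sum(lengths)
    mask = np.zeros((total, total), dtype=bool)
    start = 0
    for n in lengths:
        mask[start:start + n, start:start + n] = True
        start += n
    return mask

# Three images packed into one sequence: 4 + 2 + 3 = 9 tokens
mask = packed_attention_mask([4, 2, 3])
```

With this structure, packing many small images into one sequence is mathematically equivalent to processing them as separate batch elements, which is what makes token-balanced batching safe.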

OpenLVD200M thus serves as a drop-in replacement for random web-scraped datasets in large-scale vision foundation model training, offering superior coverage, efficiency, and empirical performance in multi-teacher distillation settings (Chaybouti et al., 23 Dec 2025).
