DINOv3: Self-Distillation without Labels
- The paper presents a self-distillation framework that integrates DINO, iBOT, and Koleo losses to learn robust visual representations from massive unlabeled datasets.
- It introduces a novel Gram anchoring technique for spatial feature regularization, yielding measurable improvements (e.g., +2 mIoU on segmentation benchmarks) without sacrificing global metrics.
- The model employs a scalable ViT-7B architecture with innovative training and distillation protocols, achieving state-of-the-art performance in global, dense, and multimodal vision tasks.
Self-Distillation with No Labels v3 (DINOv3) is a self-supervised vision foundation model designed to learn high-quality visual representations from massive image datasets without manual annotation or task-specific adaptation. DINOv3 integrates comprehensive scaling strategies, a novel Gram anchoring method for feature regularization, and post-hoc model tuning protocols, resulting in state-of-the-art global, dense, and multimodal performance across a variety of vision tasks, often without any fine-tuning (Siméoni et al., 13 Aug 2025).
1. Self-Distillation Training Paradigm
DINOv3 is built on a teacher–student self-distillation framework, extending the protocol introduced in DINOv2. In each training iteration, a single image yields two global crops (256×256) and eight local crops (112×112); all ten are processed by the student. The objective aggregates three components:
- The DINO loss applies a SwAV-style Sinkhorn clustering to the class tokens across global crops, matching prototype assignments between student and teacher outputs.
- The iBOT reconstruction loss is computed at the patch level on masked local crops, aligning student predictions to normalized teacher patch features.
- The Koleo regularizer distributes class-token embeddings uniformly over the hypersphere.
The combined objective is

$$
\mathcal{L}_{\text{total}} \;=\; \mathcal{L}_{\text{DINO}} \;+\; \mathcal{L}_{\text{iBOT}} \;+\; \lambda_{\text{Koleo}}\,\mathcal{L}_{\text{Koleo}},
$$

where $\lambda_{\text{Koleo}}$ weights the regularizer.
The teacher network parameters are maintained as an exponential moving average (EMA) of the student weights and are fixed for loss computations.
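The Koleo term can be sketched with NumPy. This is an illustrative stand-in, not the paper's implementation (which runs on GPU tensors inside the training loop); it follows the standard Kozachenko–Leonenko-style formulation the name refers to, penalizing small nearest-neighbor distances among class-token embeddings:

```python
import numpy as np

def koleo_loss(z, eps=1e-8):
    """Koleo regularizer sketch: encourage embeddings to spread uniformly
    over the hypersphere by penalizing small nearest-neighbor distances.
    z: (n, d) array of embeddings (L2-normalized internally)."""
    z = z / (np.linalg.norm(z, axis=1, keepdims=True) + eps)
    d = np.linalg.norm(z[:, None, :] - z[None, :, :], axis=-1)  # pairwise distances
    np.fill_diagonal(d, np.inf)   # exclude self-distance
    nn = d.min(axis=1)            # nearest-neighbor distance per sample
    return -np.log(nn + eps).mean()
```

Clustered embeddings (tiny nearest-neighbor distances) incur a large penalty, while well-spread embeddings incur a small one, which is exactly the uniformity pressure described above.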
2. Scaling Datasets and Model Architectures
Data Curation
DINOv3 leverages over 17 billion content-moderated Instagram images. For training efficiency and diversity, several curated pools are sampled:
- Raw Pool: Unmodified set.
- Clustering-Balanced ('LVD-1689M'): Hierarchical k-means on DINOv2 embeddings, sampling from 25,000 clusters to yield 1,689M balanced images.
- Retrieval Subset: Images nearest to seed datasets (ImageNet, etc.) in embedding space.
- Supervised Pool: ImageNet-1k, ImageNet-22k, Mapillary samples.
During training, 10% of batches are homogeneous (drawn entirely from ImageNet-1k) and 90% are heterogeneous draws from the curated pools. Ablations demonstrate that this mixed curation achieves the best results on global and retrieval metrics.
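A minimal sketch of the per-batch pool selection described above; the pool labels are hypothetical names for illustration, not identifiers from the paper's codebase:

```python
import random

def sample_batch_source(rng, p_homogeneous=0.1):
    """Pick the data pool for the next batch: with probability 0.1 the batch
    is homogeneous (pure ImageNet-1k), otherwise it is drawn from the
    heterogeneous curated pools."""
    return "imagenet1k" if rng.random() < p_homogeneous else "curated_mix"
```

Over many batches the homogeneous fraction converges to 10%, matching the stated schedule.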
Model Architecture
The backbone is a custom Vision Transformer (ViT-7B): 40 transformer blocks, embedding dimension 4096, patch size 16, SwiGLU feed-forward layers with a hidden dimension of 8192, and 32 attention heads of dimension 128. Rotary positional embeddings (RoPE) with random box jitter enhance resilience to varying resolutions and aspect ratios. Four register tokens are prepended to absorb background outliers.
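These hyperparameters can be captured in a small config object. The class and method names here are hypothetical, but the arithmetic (heads × head dimension = embedding dimension; patch grid + class token + registers = sequence length) follows directly from the figures above:

```python
from dataclasses import dataclass

@dataclass
class ViT7BConfig:
    """Hypothetical config mirroring the reported ViT-7B hyperparameters."""
    depth: int = 40          # transformer blocks
    embed_dim: int = 4096    # token embedding dimension
    patch_size: int = 16
    ffn_hidden: int = 8192   # SwiGLU hidden dimension
    num_heads: int = 32
    head_dim: int = 128
    num_registers: int = 4   # register tokens for outlier absorption

    def tokens_per_image(self, h: int, w: int) -> int:
        """Sequence length: patch tokens + 1 class token + register tokens."""
        return (h // self.patch_size) * (w // self.patch_size) + 1 + self.num_registers
```

For a 256×256 global crop this gives a 16×16 patch grid, i.e. 256 patch tokens plus the class token and four registers.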
Optimization Protocol
Training proceeds with the AdamW optimizer at batch size 4096 across 256 GPUs, using a constant learning rate (4e-4), weight decay (0.04), and EMA momentum (0.999) after a 100k-step warmup. Multi-crop training over 1M iterations exposes the model to 2.56M unique images.
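A sketch of the resulting schedule, assuming linear warmup (the warmup shape is not specified above) followed by the constant learning rate:

```python
def learning_rate(step, base_lr=4e-4, warmup_steps=100_000):
    """Assumed schedule: linear warmup to base_lr over 100k steps,
    then constant for the remainder of training."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    return base_lr
```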
3. Gram Anchoring: Spatial Feature Regularization
Degradation of Dense Features
Extended self-distillation training without regularization degrades the utility of dense representations (e.g., for segmentation and depth estimation): linear-probe accuracy may continue to improve, yet patch-wise feature maps lose localization and structure.
Gram Anchoring Method
Gram anchoring introduces an additional objective that preserves patch–patch similarity. Given the L2-normalized student patch-feature matrix $X_S$ (global crop) and the corresponding matrix $X_G$ from a Gram teacher (an early EMA snapshot),

$$
\mathcal{L}_{\text{Gram}} \;=\; \left\lVert X_S X_S^{\top} - X_G X_G^{\top} \right\rVert_F^{2}.
$$

This aligns patch-similarity maps rather than absolute feature values.
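A NumPy sketch of this objective (the paper's version operates on GPU tensors, but the Gram computation is the same). Note that the loss is invariant to any rotation of the feature space, which is precisely the "similarity maps, not absolute features" property:

```python
import numpy as np

def gram_anchor_loss(xs, xg, eps=1e-8):
    """Squared Frobenius distance between the patch-similarity (Gram)
    matrices of student features xs and Gram-teacher features xg,
    both of shape (num_patches, dim); rows are L2-normalized first."""
    xs = xs / (np.linalg.norm(xs, axis=1, keepdims=True) + eps)
    xg = xg / (np.linalg.norm(xg, axis=1, keepdims=True) + eps)
    return np.linalg.norm(xs @ xs.T - xg @ xg.T, ord="fro") ** 2
```

Because an orthogonal transform of the features leaves the Gram matrix unchanged, the student is free to drift in feature space as long as patch-to-patch similarity structure is preserved.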
Integration and High-Resolution Teacher Features
After 1M pre-training iterations, a refinement phase incorporates the Gram anchoring loss (weight 2), updating the Gram teacher every 10k steps. Performance on dense tasks markedly improves (ADE20k +2 mIoU, better depth RMSE) with no detriment to global metrics. Feeding a twofold higher-resolution image to the Gram teacher, then bicubically downsampling its features, further enhances dense feature quality (+2 mIoU on ADE20k).
4. Post-Training Adaptation Protocols
High-Resolution Feature Adaptation
Fine-tuning for 10k steps on mixed global and local crop sizes, with Gram anchoring still applied, ensures the backbone generalizes from the standard resolution (224) up to very high resolutions (4096), producing sharp feature maps whose quality improves steadily with input resolution.
Model Distillation
A frozen ViT-7B backbone is used to distill representations into a suite of smaller models (ViT-S, ViT-B, ViT-L, ViT-H+, ConvNeXt variants). Loss computation mirrors pre-training but omits EMA; an efficient multi-student distillation setup allows for teacher-cost sharing and batch loss parallelization.
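The teacher-cost-sharing idea can be sketched as follows; `teacher_fn`, the student callables, and the squared-error objective are illustrative stand-ins for the actual distillation losses:

```python
import numpy as np

def distill_losses(teacher_fn, students, batch):
    """Sketch of teacher-cost sharing: the frozen ViT-7B teacher runs once
    per batch, and every smaller student is trained against the same
    targets. No EMA is maintained during distillation."""
    targets = teacher_fn(batch)  # single teacher forward pass, reused by all
    return {name: float(np.mean((student_fn(batch) - targets) ** 2))
            for name, student_fn in students.items()}
```

In the actual setup the per-student losses are backpropagated in parallel, so adding students amortizes the (dominant) teacher forward cost.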
Vision–Text Alignment
Text alignment applies a contrastive image–text loss (LiT protocol) with a frozen vision encoder and a new text encoder. Two transformer layers are appended on the vision side; mean-pooled patch and CLS token representations are concatenated. This enables zero-shot global and local retrieval akin to CLIP.
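A sketch of the image-side representation used for alignment; the concatenation order (pooled patch features first, then the class token) is an assumption, as it is not specified above:

```python
import numpy as np

def visual_embedding(cls_token, patch_tokens):
    """Build the image representation fed to the contrastive image-text
    loss: mean-pooled patch features concatenated with the class token.
    cls_token: (d,), patch_tokens: (num_patches, d); output: (2d,)."""
    return np.concatenate([patch_tokens.mean(axis=0), cls_token])
```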
5. Experimental Results and Comparisons
Dense Representation Quality
DINOv3 surpasses previous models (DINOv2, SigLIP2, PEspatial) in feature quality (PCA visualization). Dense linear probes yield ADE20k 55.9 mIoU (+6 over DINOv2), Cityscapes 81.1, VOC 86.6, and best depth RMSE on NYU 0.309, KITTI 2.346. Keypoint matching and unsupervised object discovery set SOTA on NAVI, SPair, and VOC/COCO datasets. Video instance tracking (DAVIS J&F 83.3, +6.7 over DINOv2) and video classification (UCF101 93.5) are on par with leading models.
Global Representations
Results include 88.4% ImageNet top-1, 81.4% V2, 90.4% ReaL, with best-in-class out-of-distribution robustness (ImageNet-R 91.1, Sketch 71.3, Adversarial 86.9, ObjectNet 79.0, ImageNet-C mCE 19.6).
System-Level Downstream Tasks
Tables summarize frozen backbone performance on core benchmarks:
| Task | Dataset | Result |
|---|---|---|
| Detection | COCO | 65.6 mAP |
| Segmentation | ADE20k | 63.0 mIoU |
| Depth Estimation | NYU | 4.3 %Rel |
| 3D Pose/Matching | Re10K/ScanNet | 86.3 %/35.2 % |
SOTA is observed in multiple segmentation, detection, and depth metrics without end-to-end fine-tuning.
6. Ablation Studies and Technical Analyses
Systematic ablation reveals:
- Mixed data curation is optimal for global, linear, and kNN retrieval tasks.
- Layerwise analysis shows global features peak at the final block, while geometric tasks peak near block 32.
- Register tokens eliminate high-norm patch outliers without impacting accuracy, outperforming attn-bias and value gating techniques.
- Early Gram teacher snapshots (100–200k iters) outperform late snapshots for dense feature regularization.
- Twofold downsampling of high-res teacher features maximizes dense task gains.
7. Implementation, Computational Considerations, and Domain Adaptation
Training Infrastructure and Efficiency
Pre-training employs batch size 4096, the AdamW optimizer, bfloat16 precision with 8-bit matrix multiplications, a constant LR (4e-4), WD (0.04), and 1M iterations following a 100k-step warmup. Gram anchoring uses weight 2 and a teacher update every 10k steps, with at most 3 updates. Multi-student distillation utilizes NCCL all-gather for aggregation and parallel backpropagation.
Resource and Environmental Impact
ViT-7B pre-training on H100 GPUs accounts for 61k GPU hours (47 MWh, ≈18 t CO₂eq). All distillation and downstream activities total ~9M GPU-hours (≈2600 t CO₂eq).
Domain Transferability
To adapt DINOv3 to new domains, swap in the target unlabeled corpus, adjust normalization, and maintain core hyperparameters. For metric-based tasks, Gram anchoring or lightweight decoder fine-tuning on synthetic data is recommended. Performance in remote sensing (e.g., on 493M Maxar satellite images) demonstrates state-of-the-art results for canopy height and semantic segmentation using the identical protocol.
DINOv3 establishes a unified, scalable, and robust self-supervised paradigm—combining teacher–student distillation, the Gram anchoring regularizer, and flexible post-training adaptation. It yields frozen vision backbones (up to 7B parameters) providing state-of-the-art performance for dense, global, multimodal, and domain-specific applications without end-to-end fine-tuning (Siméoni et al., 13 Aug 2025).