
DINOv3

Published 13 Aug 2025 in cs.CV and cs.LG | (2508.10104v1)

Abstract: Self-supervised learning holds the promise of eliminating the need for manual data annotation, enabling models to scale effortlessly to massive datasets and larger architectures. By not being tailored to specific tasks or domains, this training paradigm has the potential to learn visual representations from diverse sources, ranging from natural to aerial images -- using a single algorithm. This technical report introduces DINOv3, a major milestone toward realizing this vision by leveraging simple yet effective strategies. First, we leverage the benefit of scaling both dataset and model size by careful data preparation, design, and optimization. Second, we introduce a new method called Gram anchoring, which effectively addresses the known yet unsolved issue of dense feature maps degrading during long training schedules. Finally, we apply post-hoc strategies that further enhance our models' flexibility with respect to resolution, model size, and alignment with text. As a result, we present a versatile vision foundation model that outperforms the specialized state of the art across a broad range of settings, without fine-tuning. DINOv3 produces high-quality dense features that achieve outstanding performance on various vision tasks, significantly surpassing previous self- and weakly-supervised foundation models. We also share the DINOv3 suite of vision models, designed to advance the state of the art on a wide spectrum of tasks and data by providing scalable solutions for diverse resource constraints and deployment scenarios.

Summary

  • The paper introduces a scalable self-supervised learning framework using Gram anchoring to preserve dense feature consistency while creating universal vision encoders.
  • It integrates composite loss functions, multi-crop augmentations, and high-resolution adaptation, leveraging a massive, curated dataset for robust global and dense task performance.
  • The work demonstrates efficient distillation and domain generalization, enabling SOTA applications from object detection to remote sensing without task-specific fine-tuning.

DINOv3: Scalable Self-Supervised Vision Foundation Models with Gram Anchoring

Introduction and Motivation

DINOv3 advances self-supervised learning (SSL) for vision foundation models by scaling both dataset and model size, introducing novel regularization for dense features, and providing a suite of distilled models for diverse deployment scenarios. The work demonstrates that SSL, when properly scaled and regularized, can match or surpass weakly- and fully-supervised approaches on both global and dense vision tasks, without requiring fine-tuning or task-specific adaptation. DINOv3 is positioned as a universal visual encoder, capable of robust generalization across domains, including natural and aerial imagery.

Figure 1: (a) Linear probing accuracy on ImageNet1k over time for SL, WSL, and SSL methods; (b) DINOv3 dense task performance vs. WSL; (c,d) PCA maps of DINOv3 features for natural and aerial images.

Data Scaling and Curation

DINOv3 leverages a massive, curated dataset (LVD-1689M) constructed from 17B Instagram images, using hierarchical k-means clustering and retrieval-based sampling to ensure both diversity and relevance for downstream tasks. The data pipeline mixes curated, retrieval, and raw datasets, with a batch sampling strategy that includes homogeneous ImageNet1k batches for optimization. Ablation studies confirm that this hybrid curation yields superior downstream performance compared to single-method curation.
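The clustering-plus-balanced-sampling idea behind this curation can be sketched as follows. This is a minimal single-level sketch in NumPy; the real pipeline is hierarchical and operates on billions of image embeddings, the retrieval scoring is omitted, and the function names here are illustrative, not the paper's:

```python
import numpy as np

def kmeans_assign(x, k, iters=20, seed=0):
    """Plain k-means (one level of the hierarchy); x: (n, d) image embeddings.
    Returns the cluster index of every embedding."""
    rng = np.random.default_rng(seed)
    centers = x[rng.choice(len(x), size=k, replace=False)]
    for _ in range(iters):
        # assign each embedding to its nearest center
        d = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        assign = d.argmin(1)
        for j in range(k):
            if (assign == j).any():
                centers[j] = x[assign == j].mean(0)
    return assign

def balanced_sample(x, k, per_cluster, seed=0):
    """Curate by sampling the same number of images from every cluster,
    flattening the head-heavy concept distribution of raw web data."""
    rng = np.random.default_rng(seed)
    assign = kmeans_assign(x, k, seed=seed)
    picks = []
    for j in range(k):
        idx = np.flatnonzero(assign == j)
        if len(idx):
            picks.extend(rng.choice(idx, size=min(per_cluster, len(idx)), replace=False))
    return np.array(picks)
```

The design point is that uniform sampling over clusters, rather than over raw images, prevents a handful of dominant visual concepts from monopolizing the training distribution.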

Model Architecture and Training

The main DINOv3 backbone is a custom ViT-7B (6.7B params, 40 blocks, patch size 16, axial RoPE positional embeddings with jittering), trained with a composite SSL objective: global DINO loss, local iBOT loss, and distributed Koleo regularization. Training uses constant hyperparameters and multi-crop augmentation, allowing schedules to be extended indefinitely without retuning and without instability. Register tokens are used to mitigate high-norm patch outliers, and layer normalization is applied to backbone outputs for both local and global crops, improving both kNN and dense task metrics.
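The KoLeo term in this composite objective is a differential-entropy regularizer that spreads features in a batch apart by maximizing each sample's log-distance to its nearest neighbor. A minimal single-device sketch (the distributed variant used here gathers features across GPUs before computing the loss, which is omitted):

```python
import numpy as np

def koleo_loss(z, eps=1e-8):
    """KoLeo regularizer: penalizes batches whose features bunch together,
    by maximizing the log nearest-neighbor distance of each sample.
    z: (n, d) batch of features."""
    z = z / (np.linalg.norm(z, axis=1, keepdims=True) + eps)  # L2-normalize rows
    d = np.linalg.norm(z[:, None, :] - z[None, :, :], axis=-1)  # pairwise distances
    np.fill_diagonal(d, np.inf)  # exclude self-distance
    nn = d.min(axis=1)  # nearest-neighbor distance per sample
    return -np.log(nn + eps).mean()  # small when features are spread out
```

A collapsed batch (all features identical) yields a very large loss, while well-spread features yield a small one, which is the behavior the regularizer exploits to keep the embedding space isotropic.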

Gram Anchoring: Regularization for Dense Features

Extended SSL training improves global metrics but degrades dense feature quality due to loss of patch-level consistency. DINOv3 introduces Gram anchoring, a regularization phase that aligns the Gram matrix (pairwise patch similarities) of the student to that of an early-stage teacher (Gram teacher), using the loss:

\mathcal{L}_{\text{Gram}} = \left\| \mathbf{X}_S \mathbf{X}_S^\top - \mathbf{X}_G \mathbf{X}_G^\top \right\|_F^2

where $\mathbf{X}_S$ and $\mathbf{X}_G$ are $L_2$-normalized patch features from the student and the Gram teacher, respectively. This loss is applied post-hoc, after 1M iterations, and the Gram teacher is periodically updated. High-resolution Gram anchoring further improves dense feature quality by using teacher features from upsampled images, then downsampling them to match the student output.
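The loss above translates directly into code. A minimal NumPy sketch (batching and the choice of feature layer are simplifications):

```python
import numpy as np

def gram_anchoring_loss(x_student, x_gram_teacher):
    """Squared Frobenius distance between the Gram matrices (pairwise patch
    similarities) of student and Gram-teacher features.
    Inputs: (num_patches, dim) arrays of patch features."""
    # L2-normalize rows so Gram entries are cosine similarities between patches
    xs = x_student / np.linalg.norm(x_student, axis=1, keepdims=True)
    xg = x_gram_teacher / np.linalg.norm(x_gram_teacher, axis=1, keepdims=True)
    return np.sum((xs @ xs.T - xg @ xg.T) ** 2)
```

Because the loss constrains only pairwise similarities, the student's features can rotate freely in feature space (any orthogonal transform leaves the Gram matrix unchanged) while the patch-level similarity structure is held to the teacher's.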

Figure 2: Evolution of cosine similarities and task accuracy for ViT-g and ViT-7B; segmentation peaks when patch-class similarities are low, then degrades as similarities increase.

Figure 3: Gram matrices at different input resolutions; downsampling high-res features preserves patch-level consistency.

Figure 4: Cosine similarity maps before and after Gram anchoring; the refinement objective $\mathcal{L}_{\mathrm{HRef}}$ yields cleaner, more localized features.

Post-Training: Resolution Adaptation and Distillation

A high-resolution adaptation phase enables DINOv3 to generalize across input sizes, using mixed-resolution crops and Gram anchoring. Empirically, this step is essential for maintaining dense feature quality at high resolutions, with models supporting inference at up to $4096 \times 4096$ pixels.

Distillation transfers knowledge from the 7B teacher to smaller ViT and ConvNeXt variants, using a multi-student pipeline that shares teacher inference across GPUs for efficiency. Distilled models (ViT-S, B, L, H+, CNX-T/B/L) achieve performance close to the teacher, with ViT-H+ nearly matching ViT-7B despite 10x fewer parameters.

Figure 5: Multi-student distillation: teacher inference shared across all nodes, students trained in parallel with synchronized groups.
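The exact distillation losses are not spelled out in this summary; a plausible sketch of the shared-teacher step, assuming simple cosine feature matching (an assumption, not the paper's stated loss), looks like:

```python
import numpy as np

def cosine_distill_loss(student_feats, teacher_feats):
    """Feature-matching distillation loss: 1 - mean cosine similarity
    between student and (frozen) teacher features."""
    s = student_feats / np.linalg.norm(student_feats, axis=-1, keepdims=True)
    t = teacher_feats / np.linalg.norm(teacher_feats, axis=-1, keepdims=True)
    return 1.0 - (s * t).sum(-1).mean()

def multi_student_step(teacher, students, batch):
    """One step of the shared-teacher pipeline: run the expensive 7B teacher
    once per batch, then reuse its features for every student."""
    t_feats = teacher(batch)  # computed once, amortized across all students
    return [cosine_distill_loss(st(batch), t_feats) for st in students]
```

The efficiency win is in `multi_student_step`: teacher inference dominates the cost, so amortizing one forward pass over all students (as in Figure 5) scales nearly linearly in the number of students.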

Figure 6

Figure 6: DINOv3 family of models: parameter counts and FLOPs for ViT and ConvNeXt variants.

Dense Feature Quality and Stability

DINOv3 produces high-quality, stable dense features across resolutions, outperforming both self- and weakly-supervised baselines (DINOv2, SigLIP2, PEspatial, AM-RADIO) on segmentation, depth estimation, 3D correspondence, object discovery, and video tracking. Dense features are visualized via PCA, showing sharp, semantically coherent maps with minimal noise.
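The PCA visualization mentioned here can be reproduced in a few lines: project the patch features onto their top three principal components and display them as RGB. This is a generic sketch of the technique, not necessarily the paper's exact recipe:

```python
import numpy as np

def pca_rgb(patch_feats, grid_hw):
    """Project dense patch features onto their top-3 principal components
    and rescale to [0, 1] so they can be shown as an RGB map.
    patch_feats: (num_patches, dim); grid_hw: (h, w) patch grid shape."""
    h, w = grid_hw
    x = patch_feats - patch_feats.mean(0, keepdims=True)  # center the features
    # top-3 right singular vectors are the leading principal directions
    _, _, vt = np.linalg.svd(x, full_matrices=False)
    proj = x @ vt[:3].T  # (num_patches, 3)
    # min-max rescale each component into [0, 1] for display
    proj = (proj - proj.min(0)) / (proj.max(0) - proj.min(0) + 1e-8)
    return proj.reshape(h, w, 3)
```

Semantically similar patches receive similar colors, so crisp color boundaries in the resulting map indicate sharp, object-aligned dense features.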

Figure 7: Cosine similarity maps for $4096 \times 4096$ input; DINOv3 features are highly localized and consistent.

Figure 8: PCA visualization of dense features at increasing resolutions; DINOv3 maintains semantic structure and crispness.

Figure 9: Feature stability across resolutions for ViT-S, S+, B, L, H+; features remain consistent before drifting at extreme sizes.

System-Level Applications

DINOv3 serves as a frozen backbone for state-of-the-art systems in object detection (Plain-DETR), semantic segmentation (Mask2Former + ViT-Adapter), monocular depth estimation (Depth Anything v2), and 3D scene understanding (VGGT). In all cases, DINOv3-based systems match or exceed prior SOTA, often with fewer trainable parameters and no backbone fine-tuning.

Domain Generalization: Geospatial and Remote Sensing

DINOv3 is applied to satellite imagery (SAT-493M, Open-Canopy), achieving SOTA in canopy height estimation, semantic segmentation, and object detection, outperforming domain-specific models (Prithvi-v2, DOFA) even with RGB-only input. Both web- and satellite-pretrained DINOv3 models generalize well, with domain-specific pretraining yielding best results for metric tasks.

Figure 10: DINOv3 features and segmentation for remote sensing; PCA maps show finer details than DINOv2, with segmentation and canopy height prediction performed on a frozen backbone.

Figure 11: Qualitative comparison of DINOv3 7B satellite model to prior work on Open-Canopy; DINOv3 yields more accurate height maps.

Zero-Shot and Multimodal Alignment

A text encoder is trained via LiT-style contrastive alignment to DINOv3 features, enabling zero-shot classification and open-vocabulary segmentation. DINOv3-based dino.txt achieves competitive global alignment and SOTA dense alignment, outperforming CLIP and EVA-02-CLIP on segmentation tasks.
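LiT-style alignment trains only the text tower against the frozen image features with a symmetric contrastive (InfoNCE) loss over a batch of matched pairs. A minimal NumPy sketch of that loss, with the temperature value and batching details being assumptions:

```python
import numpy as np

def lit_contrastive_loss(img_feats, txt_feats, temperature=0.07):
    """Symmetric InfoNCE loss for LiT-style alignment: row i of each input is
    a matched image/caption pair. The image tower stays frozen; gradients
    would flow only through txt_feats in training."""
    i = img_feats / np.linalg.norm(img_feats, axis=1, keepdims=True)
    t = txt_feats / np.linalg.norm(txt_feats, axis=1, keepdims=True)
    logits = (i @ t.T) / temperature  # (n, n) cosine similarity matrix
    labels = np.arange(len(logits))   # matched pair sits on the diagonal

    def xent(lg):
        # cross-entropy of the diagonal (correct pair) under a row softmax
        lg = lg - lg.max(1, keepdims=True)  # numerical stability
        logp = lg - np.log(np.exp(lg).sum(1, keepdims=True))
        return -logp[labels, labels].mean()

    # average the image-to-text and text-to-image directions
    return 0.5 * (xent(logits) + xent(logits.T))
```

Freezing the image tower is the key design choice: the text encoder inherits DINOv3's dense, well-structured visual space instead of reshaping it, which is what enables the strong open-vocabulary segmentation results.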

Implementation Considerations

  • Compute: Training ViT-7B requires 61,440 GPU hours (H100), with a carbon footprint of ~18 tCOā‚‚eq per model.
  • Scaling: Gram anchoring and register tokens are essential for stability and dense feature quality at scale.
  • Distillation: Multi-student distillation is efficient and enables deployment across resource budgets.
  • Resolution: High-res adaptation and RoPE positional embeddings allow inference at arbitrary resolutions.
  • Domain Transfer: SSL recipe is generic; domain-specific pretraining improves metric tasks, but web-pretrained models generalize well for semantic tasks.

Implications and Future Directions

DINOv3 demonstrates that SSL, when scaled and regularized, can produce universal vision encoders with robust, high-quality dense and global features. The Gram anchoring method resolves a key limitation of prior SSL scaling, enabling indefinite training without dense feature collapse. The model family supports deployment from edge devices to large-scale servers, and the approach generalizes to specialized domains such as remote sensing.

Future work may explore:

  • Further scaling of model and data size, leveraging unlabeled data from diverse domains.
  • Integration of multimodal alignment during pretraining, rather than post-hoc.
  • Efficient quantization and deployment strategies for transformer-based vision models.
  • Application to lifelong learning and continual adaptation scenarios.
  • Extension to video and 3D modalities, leveraging DINOv3's strong temporal and geometric consistency.

Conclusion

DINOv3 sets a new standard for self-supervised vision foundation models, achieving SOTA on dense and global tasks with a frozen backbone, scalable architecture, and robust regularization. The Gram anchoring technique is critical for maintaining dense feature quality at scale, and the distilled model family enables practical deployment. The approach generalizes across domains and tasks, supporting both universal and specialized applications in computer vision.

Knowledge gaps, limitations, and open questions

Below is a consolidated list of what remains missing, uncertain, or unexplored in the paper, phrased to be concrete and actionable for future work.

Data curation, composition, and ethics

  • Quantify selection bias from using Instagram-only web data (e.g., geography, demographics, topics) and its downstream impact on fairness and domain generalization; provide bias audits across protected attributes and content types.
  • Reproducibility of the data pipeline: the curated 1.689B-image set (LVD-1689M) is not available; specify exact filtering, deduplication, near-duplicate thresholds, and sampling seeds, or release an open substitute and its statistics.
  • Ablate the 10% homogeneous ImageNet-1k batch ratio: sweep ratios and assess trade-offs on global vs dense tasks, OOD robustness, and domain transfer; test removal of ImageNet-1k entirely.
  • Analyze the effect of clustering vs retrieval curation at scale beyond 200k steps (full 1M schedule), including how each scales with model size, and whether mixture benefits persist for larger backbones.
  • Provide a detailed breakdown of data diversity/coverage (object/scene taxonomies), duplication rates, and content moderation categories retained/removed.
  • Clarify legal/ethical considerations and privacy safeguards (consent, memorization risk, personal data leakage) and quantify memorization via membership inference or canary exposure tests.

Gram anchoring method

  • Formalize the Gram anchoring objective: exact loss definition, normalization, layers/features used, temperature, weighting schedules, and computational cost; provide pseudocode to ensure reproducibility.
  • Theoretical explanation: why and when Gram anchoring mitigates patch-level inconsistency; relate to optimization dynamics (eigenspectrum, attention entropy, CLS dominance) and provide causal evidence.
  • Sensitivity analysis: sweep Gram loss weight, which layers to anchor (early/mid/late), anchor frequency, and teacher snapshot cadence; measure Pareto frontier between global accuracy and dense quality.
  • Teacher choice for anchoring: justify using early snapshots vs EMA teachers vs checkpoints from different training stages; compare single static teacher vs rolling teacher and their compute/latency trade-offs.
  • Evaluate potential lock-in of early-teacher biases/errors: does anchoring prevent beneficial representation drift for rare/long-tail concepts?
  • Generality across objectives: test Gram anchoring with other SSL families (e.g., MAE/JEPA, VICReg/L) and with supervised/weakly supervised pretraining to assess method universality.
  • Robustness to resolution/AR changes under Gram anchoring: verify that anchoring does not overfit to specific crop statistics or harm extreme-resolution behavior.

Architecture and optimization

  • Positional embeddings: ablate axial RoPE and box-jittering ranges (e.g., s ∈ [0.25, 3]) and their effect on resolution/AR extrapolation, dense tasks, and metric-sensitive geometry tasks.
  • Register tokens: quantify the contribution and optimal number/placement of register tokens for dense features; compare with register-free techniques or learned feature adapters.
  • Patch size 16 vs 14: control for token budget and isolate the impact on dense/local detail vs throughput; test mixed patch sizes or hybrid hierarchical tokenization.
  • Constant schedules: provide head-to-head comparisons with cosine schedules and other long-horizon schedulers across total steps, including convergence speed, stability, and compute efficiency.
  • Koleo regularizer: ablate weight, batch size (local vs global), and interaction with Gram anchoring; measure its effects on feature isotropy and clustering behavior.
  • Training stability at scale: report failure modes, collapse indicators, and monitoring signals; share intervention strategies (e.g., temperature schedules) to avoid late-stage degradation.

Distillation and model family

  • Single-teacher multi-student distillation: specify losses (e.g., cosine, feature matching, logits), temperatures, layer mapping strategies, and training data; release ablations on preserving dense feature quality.
  • Measure fidelity: quantify how much dense and global performance is lost from 7B → Small/Base/Large; provide feature-similarity metrics and downstream gaps by task.
  • Data for distillation: test in-domain vs out-of-domain and curriculum strategies; evaluate whether the student inherits the teacher's resolution scalability and Gram improvements.
  • Architecture diversity: evaluate how well ConvNeXt vs ViT students retain the teacher's dense properties; test cross-architecture anchoring.

Evaluation breadth and protocols

  • Decoder dependence: for "frozen backbone SOTA," report decoder/head capacity, training budget, and standardized protocols; ablate light vs heavy heads to isolate backbone contributions.
  • Dense tasks scope: extend evaluation to optical flow, stereo, depth (indoor/outdoor), pose estimation, SLAM/keypoint matching, and high-res instance/semantic segmentation; quantify small-object/edge-detail performance.
  • Cross-domain generalization: systematically assess transfer to medical, histopathology, biology, and remote sensing beyond the single satellite case; report negative transfer and data-mixing strategies for robustness.
  • OOD robustness: expand beyond ObjectNet to distributional shifts (ImageNet-C/A/R, synthetic corruptions, weather, viewpoint) and measure calibration and abstention behavior.
  • High-resolution consistency: provide quantitative tests for tiling/cropping invariance, patch-boundary artifacts, and multi-scale consistency at 2k–8k resolutions.
  • Comparisons with WSL/multimodal baselines: ensure consistent training/evaluation budgets and decoders; include recent PE/SigLIP2/AM-RADIO dense variants under identical protocols.

Robustness, safety, and security

  • Adversarial robustness: evaluate Lp-bounded and patch attacks; analyze whether Gram anchoring hardens or weakens robustness relative to DINOv2/CLIP.
  • Spurious correlations: test controlled datasets for shortcut reliance; measure subgroup robustness and worst-group accuracy.
  • Continual/lifelong learning: substantiate the claim by running streaming or incremental benchmarks; test whether constant schedules and Gram anchoring mitigate catastrophic interference.

Efficiency, scalability, and reproducibility

  • Compute/energy reporting: detail GPU hours, training efficiency, memory footprint, and carbon estimates for 7B at 1M steps; provide scaling laws for accuracy vs tokens/model size/steps.
  • Inference efficiency: benchmark throughput/latency/memory at high resolutions and with tiling; provide guidance for edge deployment and on-device trade-offs.
  • Release artifacts: clarify which weights, code, and recipes (including Gram anchoring and distillation) will be released and under what licenses; include seeds and exact configs to enable replication.
