Vision Transformers Don't Need Trained Registers

Published 9 Jun 2025 in cs.CV and cs.AI | (2506.08010v4)

Abstract: We investigate the mechanism underlying a previously identified phenomenon in Vision Transformers -- the emergence of high-norm tokens that lead to noisy attention maps. We observe that in multiple models (e.g., CLIP, DINOv2), a sparse set of neurons is responsible for concentrating high-norm activations on outlier tokens, leading to irregular attention patterns and degrading downstream visual processing. While the existing solution for removing these outliers involves retraining models from scratch with additional learned register tokens, we use our findings to create a training-free approach to mitigate these artifacts. By shifting the high-norm activations from our discovered register neurons into an additional untrained token, we can mimic the effect of register tokens on a model already trained without registers. We demonstrate that our method produces cleaner attention and feature maps, enhances performance over base models across multiple downstream visual tasks, and achieves results comparable to models explicitly trained with register tokens. We then extend test-time registers to off-the-shelf vision-LLMs to improve their interpretability. Our results suggest that test-time registers effectively take on the role of register tokens at test-time, offering a training-free solution for any pre-trained model released without them.

Summary

The paper introduces a training-free, test-time register method that mitigates high-norm tokens in Vision Transformers.
It leverages a sparse set of register neurons to shift outlier tokens, enhancing attention map clarity across classification, segmentation, and object discovery tasks.
Experimental results show that this approach maintains or improves performance and robustness, even in multimodal and adversarial settings.

Introduction

The paper "Vision Transformers Don't Need Trained Registers" addresses the formation of high-norm tokens in the internal computation of Vision Transformers (ViTs). These tokens create artifacts in attention maps and degrade downstream visual processing. The existing fix is to add learned register tokens, but that requires retraining the model from scratch, which limits its practical use. The paper instead proposes a training-free approach built on a sparse set of neurons responsible for outlier formation, referred to as register neurons. By intervening on these neurons at test time, the paper claims, models released without registers can achieve performance comparable to models trained with registers across various tasks.

Mechanism of High-Norm Tokens

High-norm tokens, or outlier patches, typically emerge after an MLP block in ViTs such as OpenCLIP ViT-B/16, and attention sinks then form at those positions (Figure 1).

Figure 1: Outlier patches appear after MLPs; attention sinks appear after outlier patches.
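To make "high-norm token" concrete, here is a minimal sketch of how outlier patches could be flagged from per-token L2 norms. The function names and the `ratio` threshold are assumptions made for illustration; this summary does not state the criterion the paper uses.

```python
import math
import statistics

def token_norms(tokens):
    """Return the L2 norm of each token vector (each a list of floats)."""
    return [math.sqrt(sum(x * x for x in tok)) for tok in tokens]

def find_outlier_tokens(tokens, ratio=5.0):
    """Flag tokens whose norm exceeds `ratio` times the median norm.

    `ratio` is an illustrative threshold, not a value from the paper.
    """
    norms = token_norms(tokens)
    median = statistics.median(norms)
    return [i for i, n in enumerate(norms) if n > ratio * median]

# Toy example: four 3-dimensional patch tokens, one with a very large norm.
patch_tokens = [
    [0.1, 0.2, 0.1],
    [0.3, 0.1, 0.2],
    [9.0, 12.0, 0.0],  # norm 15.0, far above the others
    [0.2, 0.2, 0.1],
]
outliers = find_outlier_tokens(patch_tokens)
```

In this toy input only the third token (index 2) is flagged. In a real model the same check would be run on the residual stream after each block to see where the norms first spike.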

A small subset of neurons consistently shows high activations just before these outlier patches appear (Figure 2), indicating that these neurons play a key role in forming the outliers.

Figure 2: Neuron activation distributions differ between outlier and non-outlier patches.

These neurons, termed register neurons, are not position-specific: they activate at every outlier location (Figure 3).

Figure 3: Highly activated neurons on the top outlier activate on all outlier positions.
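One way to picture the search for register neurons is to rank MLP neurons by their mean activation over known outlier positions. The sketch below assumes exactly that scoring rule; the function name, `top_k`, and the toy activations are hypothetical, and the paper's actual selection procedure may differ.

```python
def rank_register_neuron_candidates(activations, outlier_positions, top_k=2):
    """Rank neurons by mean activation over outlier token positions.

    activations: list over tokens, each a list of per-neuron activations.
    Returns the indices of the `top_k` neurons with the highest mean.
    """
    if not outlier_positions:
        raise ValueError("need at least one outlier position")
    num_neurons = len(activations[0])
    means = []
    for n in range(num_neurons):
        vals = [activations[t][n] for t in outlier_positions]
        means.append(sum(vals) / len(vals))
    order = sorted(range(num_neurons), key=lambda n: means[n], reverse=True)
    return order[:top_k]

# Toy MLP activations: 4 tokens x 3 neurons; tokens 1 and 3 are outliers.
acts = [
    [0.1, 0.0, 0.2],
    [0.2, 8.0, 0.1],
    [0.0, 0.1, 0.3],
    [0.1, 7.5, 0.2],
]
candidates = rank_register_neuron_candidates(acts, outlier_positions=[1, 3], top_k=1)
```

Here neuron 1 is the only candidate returned, because it fires strongly on both outlier tokens and stays near zero elsewhere.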

Implementation of Test-Time Registers

By harnessing register neurons, the paper introduces a test-time intervention method to shift outliers to arbitrary positions or to an added test-time register token.

To achieve this, for each register neuron, the highest activation across all tokens is copied to the chosen target positions, and that neuron's activations at all other tokens are cleared. This controls where high-norm tokens emerge and can move them outside the image area entirely (Figure 4).

Figure 4: Intervening on activations of register neurons effectively shifts outliers to random patches and test-time registers.
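The intervention described above can be sketched on plain nested lists. This is an illustrative reading, not the authors' code: "cleared" is assumed to mean set to zero, the register is modeled as one extra token appended after the image patches, and all names are hypothetical.

```python
def shift_register_activations(activations, register_neurons, target_position):
    """Move each register neuron's peak activation to one token position.

    activations: list over tokens of per-neuron activation lists.
    For every neuron in `register_neurons`, the maximum activation across
    tokens is written at `target_position` and zero is written elsewhere.
    Other neurons are left unchanged. Returns a new nested list.
    """
    out = [row[:] for row in activations]
    for n in register_neurons:
        peak = max(row[n] for row in activations)
        for t in range(len(out)):
            out[t][n] = peak if t == target_position else 0.0
    return out

# Three image patches plus one appended token (index 3) acting as the
# hypothetical test-time register.
acts = [
    [0.1, 0.0],
    [0.2, 8.0],  # neuron 1 fires here, so this patch would become an outlier
    [0.0, 0.1],
    [0.0, 0.0],  # appended test-time register token
]
shifted = shift_register_activations(acts, register_neurons=[1], target_position=3)
```

After the call, neuron 1 is active only on the appended token, while neuron 0 is untouched. In a real model this would be applied to the MLP hidden activations of the relevant layers during the forward pass, for example through a forward hook.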

Test-time registers can mimic the behavior of learned registers without retraining: they absorb the high norms and yield clean attention maps, as demonstrated in DINOv2 (Figure 5). Linear probing on several datasets indicates that test-time registers also hold global image information.

Figure 5: Qualitative results on attention maps w/ test-time registers.

Experimental Evaluation

Classification and Dense Prediction:

Models with test-time registers maintain or improve performance metrics on ImageNet classification, ADE20k segmentation, and NYUv2 depth estimation (Figure 4).

Zero-Shot Segmentation:

Attention maps become more interpretable, with test-time registers boosting mean IoU and mAP in segmentation tasks.
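As a reminder of what the IoU metric measures, the sketch below thresholds an attention map into a binary mask and scores it against a ground-truth mask. The threshold and the toy attention values are invented for illustration, and the paper's evaluation protocol is not described in this summary.

```python
def threshold_attention(attn, thresh):
    """Turn a flat attention map into a binary mask."""
    return [1 if a >= thresh else 0 for a in attn]

def iou(pred_mask, true_mask):
    """Intersection over union of two equally sized binary masks."""
    inter = sum(1 for p, t in zip(pred_mask, true_mask) if p and t)
    union = sum(1 for p, t in zip(pred_mask, true_mask) if p or t)
    return inter / union if union else 1.0

# Six patches; the object covers the first two.
truth = [1, 1, 0, 0, 0, 0]
# Toy attention with an artifact spike on a background patch (index 4).
noisy_attn = [0.30, 0.25, 0.02, 0.03, 0.38, 0.02]
# Toy attention with the spike absorbed elsewhere.
clean_attn = [0.45, 0.40, 0.05, 0.04, 0.03, 0.03]

noisy_iou = iou(threshold_attention(noisy_attn, 0.2), truth)
clean_iou = iou(threshold_attention(clean_attn, 0.2), truth)
```

In this toy case the artifact patch is counted as foreground and lowers the score to 2/3, while the clean map scores 1.0. Mean IoU averages such scores over a dataset.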

Unsupervised Object Discovery:

Test-time registers significantly improve performance in object discovery tasks by producing cleaner attention and feature maps.

Application to Vision-LLMs

The paper extends the application of test-time registers to vision-LLMs, such as LLaVA-Llama-3-8B. Here, the introduction of a test-time register enhances interpretability without affecting performance across multimodal benchmarks, mitigating artifacts in cross-modal attention maps (Figure 6).

Figure 6: Test-time registers improve interpretability of LLaVA-Llama-3-8B.

Robustness to Typographic Attacks

By focusing high-norm tokens on text locations, test-time registers can mask adversarial text without affecting semantic content. This technique proves effective against typographic attacks, where traditional interventions fail, highlighting its robustness and practical applicability (Figure 7).

Figure 7: Qualitative results on typographic attacks.

Conclusion

The findings suggest that test-time registers offer a viable, training-free alternative to trained registers for mitigating high-norm artifacts in Vision Transformers. Because the method applies to models that have already been released, it avoids the cost of retraining, and the paper's experiments indicate that it improves interpretability while maintaining or improving performance and robustness across tasks.
