Q-Former Autoencoder: A Modern Framework for Medical Anomaly Detection

Published 24 Jul 2025 in cs.CV | (2507.18481v1)

Abstract: Anomaly detection in medical images is an important yet challenging task due to the diversity of possible anomalies and the practical impossibility of collecting comprehensively annotated data sets. In this work, we tackle unsupervised medical anomaly detection proposing a modernized autoencoder-based framework, the Q-Former Autoencoder, that leverages state-of-the-art pretrained vision foundation models, such as DINO, DINOv2 and Masked Autoencoder. Instead of training encoders from scratch, we directly utilize frozen vision foundation models as feature extractors, enabling rich, multi-stage, high-level representations without domain-specific fine-tuning. We propose the usage of the Q-Former architecture as the bottleneck, which enables the control of the length of the reconstruction sequence, while efficiently aggregating multiscale features. Additionally, we incorporate a perceptual loss computed using features from a pretrained Masked Autoencoder, guiding the reconstruction towards semantically meaningful structures. Our framework is evaluated on four diverse medical anomaly detection benchmarks, achieving state-of-the-art results on BraTS2021, RESC, and RSNA. Our results highlight the potential of vision foundation model encoders, pretrained on natural images, to generalize effectively to medical image analysis tasks without further fine-tuning. We release the code and models at https://github.com/emirhanbayar/QFAE.

Summary

The paper introduces an unsupervised Q-Former Autoencoder that combines pretrained encoders, a Q-Former bottleneck, and a perceptual loss for semantically guided medical anomaly detection.
It leverages features from models like DINOv2 and Masked Autoencoder to achieve state-of-the-art AUROC scores of 94.3% on BraTS2021 and 83.8% on RSNA.
The framework keeps its pretrained encoders frozen, which avoids encoder training; computational demands and adaptation to diverse clinical applications remain open challenges.

Q-Former Autoencoder for Medical Anomaly Detection

Introduction

The paper "Q-Former Autoencoder: A Modern Framework for Medical Anomaly Detection" presents an unsupervised, autoencoder-based model that builds on pretrained vision foundation models to improve anomaly detection in medical imaging. The proposed architecture uses frozen foundation models such as DINO, DINOv2, and Masked Autoencoder as feature extractors, places a Q-Former at the bottleneck to aggregate their high-level features, and trains with a perceptual loss so that reconstruction is guided by semantics rather than raw pixels.

Methodology

Architecture

The Q-Former Autoencoder (QFAE) architecture comprises three main components: a pretrained encoder, a Q-Former bottleneck, and a lightweight Transformer-based decoder. The pretrained encoder is kept frozen and supplies rich, multi-stage feature representations without domain-specific fine-tuning. The Q-Former aggregates these multi-scale features into an output sequence whose length is controlled by the model rather than by the input. Training uses a perceptual loss computed from Masked Autoencoder features, so the model is pushed to reproduce semantic structure rather than pixel-level detail, in contrast to standard MSE-driven autoencoders.

Encoder: Utilizes pretrained vision models like DINOv2 to extract semantically rich features.
Q-Former Bottleneck: Acts as a selective bottleneck controlling reconstruction granularity by adapting the feature representation from the encoder.
Decoder: Transforms the bottleneck output back into the image space, emphasizing reconstructive accuracy.

Figure 1: We illustrate the traditional autoencoder for anomaly detection (top) versus our Q-Former Autoencoder enhanced with Q-Former and perceptual loss (bottom).
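
To make the bottleneck idea concrete, the following is a minimal NumPy sketch of the core Q-Former operation: a fixed set of query vectors cross-attends over encoder tokens, so the output length depends on the number of queries and not on the number of input tokens. It is an illustration under stated assumptions, not the paper's implementation: the function name `qformer_cross_attention`, the dimensions, and the random "features" are hypothetical, and the sketch omits the learned projections, multi-head attention, query self-attention, and feed-forward layers of a full Q-Former.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def qformer_cross_attention(queries, features):
    """Single-head cross-attention (hypothetical, simplified).

    queries:  (num_queries, dim) query vectors; learnable in a real model, random here
    features: (num_tokens, dim) frozen encoder tokens, e.g. concatenated across stages
    returns:  (num_queries, dim) aggregated representation
    """
    dim = queries.shape[-1]
    scores = queries @ features.T / np.sqrt(dim)   # (num_queries, num_tokens)
    weights = softmax(scores, axis=-1)             # each query's weights sum to 1
    return weights @ features                      # weighted average of tokens

rng = np.random.default_rng(0)
dim, num_queries = 64, 16                          # assumed sizes, for illustration only
queries = rng.normal(size=(num_queries, dim))

# Stand-ins for tokens from two encoder stages, concatenated along the token axis.
small_input = np.concatenate([rng.normal(size=(196, dim)), rng.normal(size=(196, dim))])
large_input = np.concatenate([rng.normal(size=(784, dim)), rng.normal(size=(784, dim))])

out_small = qformer_cross_attention(queries, small_input)
out_large = qformer_cross_attention(queries, large_input)
print(out_small.shape, out_large.shape)            # both have num_queries rows
```

Both outputs have the same shape even though the second input has four times as many tokens, which is the property that lets the bottleneck control the length of the sequence passed to the decoder.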

Training and Loss

The model is trained with a perceptual loss computed on features from a pretrained Masked Autoencoder, which steers reconstruction toward meaningful structures rather than purely minimizing pixel-level error. Anomalies are then detected through differences in high-level semantics between normal and abnormal images, not only through pixel intensity differences.

Figure 2: The training of our Q-Former Autoencoder for medical anomaly detection.
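
The difference between a pixel-level loss and a perceptual loss can be sketched as follows. This is a toy NumPy example under explicit assumptions: `frozen_features` is a hypothetical stand-in (a fixed random projection with ReLU) for the frozen Masked Autoencoder, and the mean squared distance in feature space is an assumed choice, since this summary does not state the exact distance used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for a frozen pretrained feature extractor.
# The weights are fixed and never updated, mirroring a frozen network.
W_FROZEN = rng.normal(size=(256, 32)) / np.sqrt(256)

def frozen_features(image_vec):
    # Linear projection followed by ReLU; a real extractor would be a deep network.
    return np.maximum(image_vec @ W_FROZEN, 0.0)

def pixel_mse(x, x_hat):
    # Standard reconstruction error measured directly in input space.
    return float(np.mean((x - x_hat) ** 2))

def perceptual_loss(x, x_hat):
    # Assumed form: mean squared distance between frozen features of the
    # input and of its reconstruction.
    return float(np.mean((frozen_features(x) - frozen_features(x_hat)) ** 2))

x = rng.normal(size=256)                    # flattened toy "image"
x_hat = x + 0.1 * rng.normal(size=256)      # imperfect reconstruction

print("pixel MSE:      ", pixel_mse(x, x_hat))
print("perceptual loss:", perceptual_loss(x, x_hat))
```

In a trainable setting, gradients would flow through the reconstruction `x_hat` only; the feature extractor stays fixed, so the loss measures how far apart the two images are in the extractor's representation.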

Results

QFAE was evaluated on four medical anomaly detection benchmarks and achieves state-of-the-art results on BraTS2021, RESC, and RSNA, with reported AUROC scores of 94.3% on BraTS2021 and 83.8% on RSNA. These results were obtained with frozen encoders pretrained on natural images, without domain-specific fine-tuning, which suggests that such features transfer well across medical imaging modalities.

Figure 3: Qualitative examples of anomaly localization on several samples from the BraTS2021.
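
For readers unfamiliar with the metric, AUROC can be read as the probability that a randomly chosen anomalous sample receives a higher anomaly score than a randomly chosen normal one. The sketch below computes it by direct pairwise comparison in plain Python. The scores and labels are made-up values for illustration, not results from the paper, and the quadratic pairwise loop is chosen for clarity rather than efficiency.

```python
def auroc(scores, labels):
    """Area under the ROC curve via pairwise comparison.

    scores: anomaly scores, higher means more anomalous
    labels: 1 for anomalous samples, 0 for normal samples
    Ties between an anomalous and a normal score count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one normal and one anomalous sample")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical per-image anomaly scores (e.g. a reconstruction error per image).
scores = [0.10, 0.20, 0.15, 0.80, 0.30, 0.12]
labels = [0, 0, 0, 1, 1, 1]
print(auroc(scores, labels))
```

In this toy example, 7 of the 9 anomalous-versus-normal pairs are ranked correctly, so the AUROC is 7/9. A score of 1.0 means every anomalous sample outranks every normal one, and 0.5 corresponds to random ranking.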

Implementation Considerations

Scalability and Resource Requirements

Because the pretrained encoders are frozen, no encoder weights are updated during training, which reduces training cost. However, computing the perceptual loss on high-dimensional features requires significant memory, particularly with multi-scale feature extraction. Real-world deployment must also account for available compute, especially in clinical settings where real-time processing may be required.

Limitations

The model relies on pretrained vision encoders. Although this removes the need to train an encoder from scratch, such features may not capture every domain-specific nuance when the relevant imaging diversity is absent from the foundation models' training sets. Processing efficiency and robustness to unexpected kinds of anomalies also need further study, and may require architectural changes before the approach is viable in a wider range of clinical applications.

Conclusion

The Q-Former Autoencoder advances unsupervised anomaly detection by using frozen pretrained models for feature extraction and a perceptual loss for semantically driven training. It achieves strong performance across multiple datasets, while adaptation challenges and computational demands remain areas for future work. As vision foundation models continue to improve, the same framework could extend to a broader range of medical domains.
