- The paper introduces the ODIN method, which combines temperature scaling and input preprocessing to enhance the detection of out-of-distribution images.
- It leverages temperature-calibrated softmax scores to sharply reduce false positives, e.g., a drop in FPR (at 95% TPR) from 34.7% to 4.3% on DenseNet with CIFAR-10 as in-distribution and TinyImageNet as out-of-distribution data.
- Its simple yet effective approach makes it easily deployable on pre-trained networks, improving safety in real-world applications like autonomous driving and healthcare.
Enhancing The Reliability of Out-of-distribution Image Detection in Neural Networks
Authors: Shiyu Liang, Yixuan Li, R. Srikant
Overview
The paper addresses the challenge of detecting out-of-distribution (OOD) images in neural networks using a novel approach called ODIN (Out-of-DIstribution detector for Neural networks). The significance of ODIN lies in its simplicity and effectiveness: it requires no modification or retraining of pre-trained neural networks. The method leverages temperature scaling and input preprocessing to enhance the detection of OOD images, thereby improving the reliability of neural networks in real-world applications where test data distributions may differ significantly from the training distribution.
Methodology
The authors propose a straightforward method combining temperature scaling and input perturbation to distinguish between in-distribution and OOD images effectively. The key components are as follows:
- Temperature Scaling: The softmax scores produced by a neural network are calibrated with a temperature parameter T. Increasing T smooths the softmax distribution, which widens the gap between the scores assigned to in-distribution and OOD images. The scaled score is computed as:
$$S_i(x;T)=\frac{\exp\big(f_i(x)/T\big)}{\sum_{j=1}^{N}\exp\big(f_j(x)/T\big)}$$

where $S_i$ is the softmax probability for class $i$, $f_i(x)$ is the logit for class $i$ (of $N$ classes), and $T$ is the temperature parameter.
- Input Preprocessing: A small perturbation is added to the input:
$$\tilde{x} = x - \epsilon \cdot \operatorname{sign}\big({-\nabla_x \log S_{\hat{y}}(x;T)}\big)$$
Here, $\epsilon$ is the perturbation magnitude and $S_{\hat{y}}$ is the softmax probability of the predicted class $\hat{y}$. This preprocessing step is similar in spirit to adversarial perturbations, but applied in the opposite direction: it nudges the input to increase the softmax score, which raises the score of in-distribution images more than that of OOD images and thus sharpens the separation.
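The two components above can be sketched in PyTorch. This is an illustrative reimplementation, not the authors' code; the helper name `odin_score` is hypothetical, and the default T = 1000 and ε = 0.0014 are example values in the range the paper reports, with both hyperparameters tuned per dataset in practice:

```python
import torch
import torch.nn.functional as F

def odin_score(model, x, temperature=1000.0, epsilon=0.0014):
    """ODIN confidence score for a batch of inputs (illustrative sketch)."""
    x = x.clone().detach().requires_grad_(True)

    # Temperature-scaled log-softmax score of the predicted class.
    log_probs = F.log_softmax(model(x) / temperature, dim=1)
    pred_log_prob = log_probs.max(dim=1).values

    # Gradient of -log S_yhat(x; T) w.r.t. the input.
    (-pred_log_prob.sum()).backward()

    # Input preprocessing: perturb x to *increase* the softmax score.
    x_tilde = x - epsilon * x.grad.sign()

    # Final score: maximum temperature-scaled softmax probability on x~.
    with torch.no_grad():
        probs = F.softmax(model(x_tilde) / temperature, dim=1)
    return probs.max(dim=1).values
```

An input is then flagged as OOD when its score falls below a threshold chosen on in-distribution validation data (e.g., the threshold achieving 95% TPR).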
Results
ODIN was benchmarked against the baseline method of Hendrycks and Gimpel (2017). The evaluations used state-of-the-art network architectures such as DenseNet and Wide ResNet across a variety of in- and out-of-distribution datasets, including CIFAR-10, CIFAR-100, TinyImageNet, and LSUN. The key findings are:
- Performance Metrics: ODIN outperforms the baseline by large margins across multiple metrics including False Positive Rate (FPR) at 95% True Positive Rate (TPR), Detection Error, AUROC, and AUPR.
- For instance, on DenseNet with CIFAR-10 as in-distribution and TinyImageNet (crop) as OOD, ODIN reduced the FPR at 95% TPR from 34.7% to 4.3%.
- On CIFAR-100, ODIN significantly improved detection performance, bringing down FPR and achieving better AUROC and AUPR scores.
- Effectiveness Across Architectures: The method proved effective on various architectures suggesting robustness and versatility.
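As a concrete reading of the headline metric, the sketch below (a hypothetical NumPy helper, not from the paper) computes FPR at 95% TPR from two arrays of detector scores, assuming higher scores mean "more in-distribution":

```python
import numpy as np

def fpr_at_95_tpr(in_scores, out_scores):
    """FPR on OOD data at the threshold that accepts 95% of
    in-distribution data, assuming higher = more in-distribution."""
    # Threshold such that 95% of in-distribution scores lie above it.
    threshold = np.percentile(in_scores, 5)
    # Fraction of OOD samples wrongly accepted as in-distribution.
    return float(np.mean(out_scores >= threshold))
```

Under this metric, a drop from 34.7% to 4.3% means the detector wrongly admits roughly one in twenty-three OOD images instead of one in three, at the same 95% acceptance rate for in-distribution images.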
Analysis
The performance evaluation and theoretical grounding provide insights into why ODIN achieves high effectiveness:
- Temperature Scaling: The authors derive a Taylor expansion of the softmax score for large T, showing that a sufficiently high temperature makes the score depend primarily on the gap between the maximum logit and the remaining logits, suppressing second-order effects from the non-predicted classes. This gap tends to be larger for in-distribution inputs, which is what temperature scaling exposes.
- Gradient Norms: Empirically, in-distribution images tend to have a larger norm of the gradient of the log-softmax score with respect to the input than OOD images, so the fixed-size perturbation raises their scores more. This supports the efficacy of the input preprocessing step.
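The gradient-norm observation can be probed with a short PyTorch sketch (illustrative only; `softmax_grad_norm` is a hypothetical helper, and the L1 norm is one reasonable choice of norm):

```python
import torch
import torch.nn.functional as F

def softmax_grad_norm(model, x, temperature=1000.0):
    """Per-sample L1 norm of the gradient of the temperature-scaled
    log-softmax score of the predicted class w.r.t. the input."""
    x = x.clone().detach().requires_grad_(True)
    log_probs = F.log_softmax(model(x) / temperature, dim=1)
    # Summing over the batch lets one backward pass produce every
    # per-sample input gradient at once.
    log_probs.max(dim=1).values.sum().backward()
    return x.grad.flatten(1).abs().sum(dim=1)
```

Comparing the output of this helper on batches of in-distribution and OOD images is one way to reproduce the paper's empirical observation for a given trained model.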
Implications and Future Work
The implications of improved OOD detection are vast, including enhanced safety and reliability in critical applications such as autonomous driving and healthcare diagnostics. The method's simplicity, requiring only post-processing of existing networks, makes it highly valuable for deployment in existing AI systems without the need for retraining.
Future directions may include:
- Generalization Across Domains: Testing ODIN's applicability in domains beyond image classification such as natural language processing and speech recognition.
- Combining with Other Techniques: Integrating ODIN with other uncertainty modeling approaches to further improve detection robustness.
- Scalability: Examining ODIN's performance on larger and more diverse datasets to understand its limitations and potential adaptations.
Conclusion
The paper presents a method that significantly enhances the detection of OOD images in neural networks through a combination of temperature scaling and input preprocessing. ODIN's simplicity and effectiveness mark a notable advancement in ensuring the reliability of deep learning models in practical, real-world scenarios.