
Learning to Compare Image Patches via Convolutional Neural Networks

Published 14 Apr 2015 in cs.CV, cs.LG, and cs.NE | (1504.03641v1)

Abstract: In this paper we show how to learn directly from image data (i.e., without resorting to manually-designed features) a general similarity function for comparing image patches, which is a task of fundamental importance for many computer vision problems. To encode such a function, we opt for a CNN-based model that is trained to account for a wide variety of changes in image appearance. To that end, we explore and study multiple neural network architectures, which are specifically adapted to this task. We show that such an approach can significantly outperform the state-of-the-art on several problems and benchmark datasets.

Citations (1,423)

Summary

  • The paper demonstrates that CNNs can effectively learn similarity functions for image patches, surpassing traditional methods like SIFT.
  • It explores diverse architectures—2-channel, Siamese, and pseudo-Siamese—highlighting trade-offs in performance and computational efficiency.
  • Experimental results on benchmark datasets confirm that learned descriptors improve matching accuracy in various computer vision tasks.

The paper "Learning to Compare Image Patches via Convolutional Neural Networks" by Sergey Zagoruyko and Nikos Komodakis addresses the task of automatically learning a similarity function for image patches. This problem is fundamental in computer vision and underpins numerous applications, ranging from low-level tasks such as image super-resolution to high-level tasks such as object recognition.

Overview

The authors argue against the long-established paradigm of manually designed feature descriptors such as SIFT, which, although impactful, are suboptimal at handling the wide variety of transformations that affect patch appearance. By leveraging large annotated datasets of patch correspondences, the authors propose to learn a similarity function directly from image data using convolutional neural networks (CNNs).

Methodology

The key innovation lies in exploring a variety of neural network architectures adapted to the task of patch comparison. Three primary architectures are scrutinized: 2-channel, Siamese, and pseudo-Siamese. Each offers a different trade-off between accuracy and computational efficiency, motivating a thorough evaluation to identify the structure best suited to patch comparison.

  • 2-Channel Network: This architecture processes the two patches jointly right from the first layer. The network treats the two patches as a 2-channel image, thereby incorporating feature interactions from the outset.
  • Siamese Network: Comprising two branches with shared weights, this network first computes descriptors independently for each patch, which are then used to estimate similarity.
  • Pseudo-Siamese Network: Similar to the Siamese network but without shared weights between branches, allowing for greater flexibility while retaining some efficiency benefits during test time.

Additionally, the authors propose enhanced network architectures, including deeper networks and multi-resolution (central-surround two-stream) networks. The Spatial Pyramid Pooling (SPP) network handles input patches of arbitrary size, removing the need for resizing and thereby preserving spatial resolution in comparisons.
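The mechanism that makes arbitrary input sizes possible can be sketched as follows. This is a minimal single-channel NumPy version of spatial pyramid max-pooling; the pyramid levels `(1, 2, 4)` are an illustrative choice, not necessarily those used in the paper:

```python
import numpy as np

def spp_pool(feature_map, levels=(1, 2, 4)):
    """Spatial pyramid pooling sketch: max-pool the feature map over
    an n x n grid for each pyramid level and concatenate the results,
    giving a fixed-length vector regardless of input size."""
    h, w = feature_map.shape
    out = []
    for n in levels:
        rows = np.linspace(0, h, n + 1, dtype=int)
        cols = np.linspace(0, w, n + 1, dtype=int)
        for i in range(n):
            for j in range(n):
                cell = feature_map[rows[i]:rows[i + 1], cols[j]:cols[j + 1]]
                out.append(cell.max())
    return np.array(out)          # length = sum(n * n for n in levels)

rng = np.random.default_rng(1)
a = spp_pool(rng.random((37, 53)))
b = spp_pool(rng.random((64, 64)))
print(a.shape, b.shape)           # both (21,) despite different inputs
```

Because the grid is defined relative to the feature map's own dimensions, two patches of different sizes still yield descriptors of identical length and can be compared directly.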

Results

The experimental evaluation is extensive and convincing. The networks were tested on multiple benchmark datasets, most notably the local image patches benchmark dataset and the Mikolajczyk dataset.

  • Local Image Patches Benchmark: The 2-channel network and its variants exhibited superior performance, significantly outperforming state-of-the-art approaches. The best-performing model (2ch-2stream) achieved a false positive rate at 95% recall (FPR95) less than half that of the previous best, and also outperformed SIFT by a large margin.
  • Wide Baseline Stereo Evaluation: When evaluated on the Strecha dataset, the networks demonstrated robustness in computing photometric costs for stereo matching, yielding high-quality depth maps. Siamese-based networks, particularly the two-stream, showed a significant reduction in deviation from ground truth depth maps.
  • Local Descriptors Evaluation: On the Mikolajczyk dataset, the networks were competitive with or surpassed established methods, further asserting the efficacy of the proposed neural network architectures.
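The FPR95 figure cited above is the benchmark's standard metric: the false positive rate at the decision threshold where 95% of true matching pairs are accepted (lower is better). A minimal NumPy implementation, with hypothetical toy scores for illustration:

```python
import numpy as np

def fpr95(scores, labels):
    """False positive rate at the threshold where 95% of true matches
    are accepted. `labels` is 1 for matching pairs, 0 otherwise."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    pos = np.sort(scores[labels == 1])
    # threshold keeping the top 95% of positive scores
    thr = pos[int(np.floor(0.05 * len(pos)))]
    neg = scores[labels == 0]
    return float((neg >= thr).mean())

# toy scores: matching pairs score high, non-matching pairs low
pos = np.linspace(0.5, 1.0, 100)
neg = np.linspace(0.0, 0.6, 100)
scores = np.concatenate([pos, neg])
labels = np.concatenate([np.ones(100), np.zeros(100)])
print(fpr95(scores, labels))    # -> 0.13
```

A perfect similarity function would separate the two score distributions entirely and score an FPR95 of 0; halving FPR95 therefore halves the number of false matches admitted at the same recall.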

Implications and Future Work

The implications of this research are profound. By successfully learning similarity functions directly from image data, the authors provide a method that can adaptively handle various sources of variation in image patches—be it changes in viewpoint, illumination, or occlusions—without the need for manually crafted feature descriptors. The described architectures and methodologies set a new standard in the field and pave the way for future advancements.

Moving forward, several intriguing avenues present themselves. A crucial point is improving the evaluation efficiency of 2-channel architectures, which currently incur high computational costs at test time. Further, training on larger datasets could further improve the models' performance. Combining these learned similarity models with more sophisticated global optimization techniques promises even higher accuracy, particularly in tasks such as depth estimation and feature matching.

In conclusion, this paper solidifies the utility and superiority of learned similarity functions via CNNs for image patch comparison. The comprehensive investigation into various architectures provides a robust foundation for future research and development in computer vision applications.