MatchBench: Feature Matcher Benchmark
- The paper introduces MatchBench, a benchmark that comprehensively evaluates feature matchers based on matching ability, correspondence sufficiency, and efficiency.
- It reorganizes popular datasets like TUM, KITTI, and Strecha, covering both short-baseline and wide-baseline scenarios to mimic real-world conditions.
- It employs a rigorous two-view pose estimation pipeline with metrics such as rotational and translational error to objectively assess matcher performance.
MatchBench is a benchmark designed to provide the first uniform, comprehensive evaluation of feature matchers in computer vision. Unlike previous benchmarks that focused solely on individual aspects such as feature detectors or descriptors, MatchBench directly assesses feature matchers—algorithms that output correspondences between image pairs—which are foundational for high-level applications including Structure-from-Motion (SfM) and Visual SLAM. The benchmark evaluates matchers along three primary axes: matching ability (geometric correctness), correspondence sufficiency (number of inlier correspondences), and efficiency (processing runtime). It encompasses diverse scenario types, supporting both short-baseline (SLAM/video) and wide-baseline (SfM) image pairs (Bian et al., 2018).
1. Dataset Organization and Scene Coverage
MatchBench repurposes and reorganizes sequences from established public datasets to ensure comprehensive coverage of various scene types:
- TUM RGB-D (indoor, office settings)
  - 01-office: indoor, textured, short baseline.
  - 02-teddy: indoor, non-planar, short baseline.
  - 03-large-cabinet: indoor, weak texture, short baseline.
- KITTI Odometry (urban outdoor)
  - 04-kitti: street view, high resolution, short baseline.
- Strecha SfM (urban buildings)
  - 05-castle: outdoor, wide baseline.
- Subsampled wide-baseline TUM sequences
  - 06-office-wide, 07-teddy-wide, 08-large-cabinet-wide: increased viewpoint changes, frames up to 5 seconds apart.
Each sequence is characterized by the number of images, resolution, total image pairs, and scene attributes (e.g., planarity, texture richness). Short-baseline portions mimic video/odometry scenarios; wide-baseline portions reflect challenging SfM use cases.
2. Evaluation Metrics and Pose Estimation Pipeline
MatchBench employs a two-view pose estimation framework to objectively assess matching quality. The evaluation proceeds as follows:
- Essential Matrix Estimation: Given correspondences $\{(x_i, x_i')\}$ and camera intrinsics $K, K'$, the essential matrix $E$ is estimated from the epipolar constraint on normalized coordinates, $\hat{x}_i'^{\top} E \hat{x}_i = 0$, where $\hat{x}_i = K^{-1} x_i$ and $\hat{x}_i' = K'^{-1} x_i'$.
- Pose Decomposition: The relative pose $(R, t)$ is extracted from $E$ via SVD-based decomposition.
- Pose Error Calculation:
  - Rotational error: $e_R = \arccos\left(\frac{\mathrm{tr}(R_{gt}^{\top} R_{est}) - 1}{2}\right)$
  - Translational error: $e_t = \arccos\left(\frac{t_{gt}^{\top} t_{est}}{\lVert t_{gt} \rVert \, \lVert t_{est} \rVert}\right)$ (angular, since two-view translation is recoverable only up to scale)
  - Combined error: $e = \max(e_R, e_t)$
- An image pair counts as a "correct match" if $e < \theta$, where the threshold $\theta$ varies from 1° to 10°.
- Matching Ability (Success Ratio & AUC of SP Curve): The success ratio $SP(\theta)$ is the fraction of image pairs matched correctly at threshold $\theta$; sweeping $\theta$ from 1° to 10° traces the SP curve, summarized by its area under the curve (AUC).
- Correspondence Sufficiency (AP Bar): The mean number of inlier correspondences per correctly matched pair, with the count capped at a fixed maximum for practical reporting.
- Efficiency: Mean runtime per image pair (CPU and/or GPU), including detection, matching, and geometric verification (e.g., RANSAC).
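The pose-error metrics above can be sketched in NumPy. The function names are illustrative choices, and the angular-error formulas are the standard two-view definitions consistent with the description above:

```python
import numpy as np

def rotation_error_deg(R_gt, R_est):
    """Angle between two rotation matrices, in degrees."""
    cos = (np.trace(R_gt.T @ R_est) - 1.0) / 2.0
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

def translation_error_deg(t_gt, t_est):
    """Angle between translation directions (scale is unobservable), in degrees."""
    cos = np.dot(t_gt, t_est) / (np.linalg.norm(t_gt) * np.linalg.norm(t_est))
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

def pose_error_deg(R_gt, t_gt, R_est, t_est):
    """Combined error: the worse of the rotational and translational errors."""
    return max(rotation_error_deg(R_gt, R_est),
               translation_error_deg(t_gt, t_est))

def sp_auc(errors, thresholds=range(1, 11)):
    """Mean success ratio over pose-error thresholds of 1..10 degrees."""
    errors = np.asarray(errors)
    return float(np.mean([(errors < t).mean() for t in thresholds]))
```

Clipping the cosine into [-1, 1] before `arccos` guards against NaNs from floating-point round-off, a common pitfall when the estimated pose is nearly exact.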
3. Experimental Protocol and Workflow
The evaluation protocol mirrors realistic application settings. For short baselines, each video is split into fixed-length segments (segment lengths are set separately for TUM and KITTI), and frames 2, 3, … of each segment are matched against its first frame. Wide-baseline protocols match all image pairs in the Strecha sequences and TUM frame pairs sampled roughly 5 seconds apart.
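The segment-based pairing for short baselines can be sketched as follows; the helper name is mine, and the per-dataset segment lengths are parameters set by the benchmark:

```python
def short_baseline_pairs(num_frames, seg_len):
    """Split a video into consecutive segments of seg_len frames; within
    each segment, match every later frame against the segment's first frame."""
    pairs = []
    for start in range(0, num_frames, seg_len):
        end = min(start + seg_len, num_frames)
        pairs.extend((start, j) for j in range(start + 1, end))
    return pairs

# e.g. short_baseline_pairs(6, 3) -> [(0, 1), (0, 2), (3, 4), (3, 5)]
```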
Keypoint and Descriptor Extraction: Each matcher uses its canonical detector-descriptor pipeline (e.g., SIFT, SURF, ORB, BRISK, KAZE, AKAZE, DLCO, FREAK, BinBoost, LATCH, DAISY, ASIFT for CODE/RepMatch).
Nearest-Neighbor Matching: For floating-point descriptors, FLANN with Euclidean distance is used; for binary descriptors, brute-force matching with Hamming distance. Ambiguous matches are filtered using the ratio test (threshold 0.8).
Correspondence Selection: Sparse matchers (e.g., SIFT, DAISY) use OpenCV’s five-point algorithm with RANSAC for geometric verification. Rich matchers (CODE, RepMatch, GMS) employ their own, typically more sophisticated, pose estimators.
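The nearest-neighbor step with the 0.8 ratio test can be sketched in plain NumPy (the function name is mine; a real pipeline would use FLANN for float descriptors, or brute-force Hamming matching for binary ones, as described above):

```python
import numpy as np

def ratio_test_matches(desc1, desc2, ratio=0.8):
    """Nearest-neighbor matching with Lowe's ratio test for float descriptors.
    Keeps (i, j) pairs where the best match in desc2 is sufficiently better
    than the second-best, filtering ambiguous correspondences."""
    matches = []
    for i, d in enumerate(desc1):
        dists = np.linalg.norm(desc2 - d, axis=1)  # Euclidean distances
        j1, j2 = np.argsort(dists)[:2]             # two nearest neighbors
        if dists[j1] < ratio * dists[j2]:
            matches.append((i, int(j1)))
    return matches
```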
4. Evaluated Algorithms
MatchBench covers 16 distinct feature matching systems, summarized in the following table:
| Name | Key Detector/Descriptor | Notable Property / Addition |
|---|---|---|
| SIFT | DoG + 128D gradient | Ratio test + FLANN + RANSAC |
| SURF | Fast Hessian + 64D descriptor | |
| ORB | FAST+Harris + BRIEF | Binary; high speed |
| BRISK | AGAST + binary descriptor | Memory efficient |
| KAZE | Nonlinear scale space + M-SURF | Robust to varying image structure |
| AKAZE | Fast KAZE approx. | Compact, fast |
| DLCO | Descriptor learned via convex optimization | Learning-based descriptor |
| FREAK | Retina-inspired binary | Biomimetic pattern |
| BinBoost | Boosted binary code | Learning-based binary |
| LATCH | Patch-triplet descriptor | Learning-based |
| DAISY | Dense gradient-based | Suited for dense matching |
| KVLD | Virtual Line + semi-local check | Extra geometric/photometric verification |
| GAIM | Affine simulation + SURF | Simulates view changes |
| CODE | ASIFT + global optimization | Rich matches, high cost |
| RepMatch | Geometry-aware extension of CODE | Suited for repetitive structures |
| GMS | ORB + grid-based filtering | Fast, effective in real time |
5. Quantitative Results and Analysis
Matching Ability: On short-baseline tasks, GMS outperforms all other methods (SP AUC ≈0.51/0.61/0.25/0.96 on Seqs 01–04), followed by DLCO and KAZE. For wide baselines, RepMatch leads (AUC ≈0.54/0.77/0.43/0.47), followed by CODE and GMS. Sparse, classical matchers (SIFT, SURF, ORB) underperform in wide-baseline settings, particularly in low-texture or geometrically complex scenes.
Correspondence Sufficiency: Rich matchers (e.g., CODE, RepMatch) produce >1,000 inliers, GMS achieves ~100–300, while classical matchers supply <200 inliers.
Efficiency: ORB and GMS demonstrate high efficiency (ORB: ≈48 ms/pair, GMS: ≈46 ms/pair on CPU/GPU). High-performing rich matchers are costly (RepMatch: ≈10,780 ms/pair for selection; CODE: ≈1,365 ms on GPU + 3,080 ms selection).
Scene Dependence: All matchers achieve high AUC (>0.87) on high-resolution, well-textured street scenes (Seq 04). Scene complexity and texture scarcity (e.g., indoor, non-planar) introduce larger performance disparities and highlight the strengths of global or learning-based methods.
Methodological Trade-offs:
- KVLD and GAIM introduce geometric/photometric checks—offering modest benefits with considerable computational expense.
- CODE and RepMatch perform global optimization for high robustness at the cost of speed.
- GMS implements grid-based motion statistics, providing a favorable balance between speed and robustness.
6. Practical Guidelines
- Real-time SLAM/Visual Odometry: ORB combined with GMS is recommended for short-baseline use cases, delivering robust matching at ≈45 ms/pair and sufficient inliers.
- Offline Wide-baseline SfM: For maximal matching ability and correspondence sufficiency, RepMatch or CODE is optimal if runtime is not constrained; GMS provides efficient and adequate performance when <1 s/pair is required.
- Memory/Compute-limited Platforms: Binary features such as ORB, BRISK, or AKAZE used with the ratio test yield efficient correspondences; GMS can be added for improved inlier selection.
- Low-texture/Non-planar Scenes: Global optimization methods (RepMatch, CODE) or deep/local enhancements (DLCO, KAZE) are advantageous.
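For the binary-feature recommendation above, brute-force Hamming matching can be sketched in NumPy. The helper name and the `max_dist` acceptance threshold are illustrative assumptions, not part of the benchmark:

```python
import numpy as np

def hamming_matches(desc1, desc2, max_dist=64):
    """Brute-force Hamming matching for binary descriptors (e.g., ORB's
    256-bit codes stored as 32 uint8 bytes each)."""
    matches = []
    for i, d in enumerate(desc1):
        # XOR then bit-count gives the Hamming distance to every candidate.
        dists = np.unpackbits(np.bitwise_xor(desc2, d), axis=1).sum(axis=1)
        j = int(np.argmin(dists))
        if dists[j] <= max_dist:
            matches.append((i, j, int(dists[j])))
    return matches
```

XOR-plus-popcount is why binary descriptors are cheap to match on constrained hardware: no floating-point arithmetic is needed.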
7. Open Challenges and Future Directions
Current benchmarking via pose-based verification does not address dense, per-pixel correspondence evaluation; thus, establishing high-precision, dense ground truth remains an open issue. Evaluation under severe illumination and appearance changes, such as day-night or weather variation, is lacking. There is a need for extension toward larger-scale, multi-camera datasets (e.g., Internet photo collections) with reliable structural ground truth. Reducing the computational bottleneck in RANSAC-style geometric verification, particularly via GPU acceleration, and developing end-to-end learned systems that unify detection, description, and matching, are proposed as key avenues for closing the gap between accuracy and speed in feature matching (Bian et al., 2018).