MatchBench: Feature Matcher Benchmark
- The paper introduces MatchBench, a benchmark that comprehensively evaluates feature matchers based on matching ability, correspondence sufficiency, and efficiency.
- It reorganizes popular datasets like TUM, KITTI, and Strecha, covering both short-baseline and wide-baseline scenarios to mimic real-world conditions.
- It employs a rigorous two-view pose estimation pipeline with metrics such as rotational and translational error to objectively assess matcher performance.
MatchBench is a benchmark designed to provide the first uniform, comprehensive evaluation of feature matchers in computer vision. Unlike previous benchmarks that focused solely on individual aspects such as feature detectors or descriptors, MatchBench directly assesses feature matchers—algorithms that output correspondences between image pairs—which are foundational for high-level applications including Structure-from-Motion (SfM) and Visual SLAM. The benchmark evaluates matchers along three primary axes: matching ability (geometric correctness), correspondence sufficiency (number of inlier correspondences), and efficiency (processing runtime). It encompasses diverse scenario types, supporting both short-baseline (SLAM/video) and wide-baseline (SfM) image pairs (Bian et al., 2018).
1. Dataset Organization and Scene Coverage
MatchBench repurposes and reorganizes sequences from established public datasets to ensure comprehensive coverage of various scene types:
- TUM RGB-D (indoor, office settings)
  - 01-office: indoor, textured, short baseline.
  - 02-teddy: indoor, non-planar, short baseline.
  - 03-large-cabinet: indoor, weak texture, short baseline.
- KITTI Odometry (urban outdoor)
  - 04-kitti: street view, high resolution, short baseline.
- Strecha SfM (urban buildings)
  - 05-castle: outdoor, wide baseline.
- Subsampled wide-baseline TUM sequences
  - 06-office-wide, 07-teddy-wide, 08-large-cabinet-wide: increased viewpoint changes, frames up to 5 seconds apart.
Each sequence is characterized by the number of images, resolution, total image pairs, and scene attributes (e.g., planarity, texture richness). Short-baseline portions mimic video/odometry scenarios; wide-baseline portions reflect challenging SfM use cases.
2. Evaluation Metrics and Pose Estimation Pipeline
MatchBench employs a two-view pose estimation framework to objectively assess matching quality. The evaluation proceeds as follows:
- Essential Matrix Estimation: Given correspondences $\{(x_i, x_i')\}$ and camera intrinsics $K, K'$, the essential matrix $E$ is estimated from the epipolar constraint on normalized coordinates, $\hat{x}_i'^{\top} E \hat{x}_i = 0$, where $\hat{x}_i = K^{-1} x_i$ and $\hat{x}_i' = K'^{-1} x_i'$.
- Pose Decomposition: The relative pose $(R, t)$ is extracted from $E$ via SVD-based decomposition.
- Pose Error Calculation:
  - Rotational error: $e_R = \arccos\left(\frac{\mathrm{tr}(R_{gt}^{\top} R_{est}) - 1}{2}\right)$
  - Translational error: $e_t = \arccos\left(\frac{t_{gt}^{\top} t_{est}}{\lVert t_{gt} \rVert \, \lVert t_{est} \rVert}\right)$ (angular, since two-view translation is recoverable only up to scale)
  - Combined error: $e = \max(e_R, e_t)$
- An image pair counts as a "correct match" if $e < \theta$, where the threshold $\theta$ varies from 1° to 10°.
- Matching Ability (Success Ratio & AUC of SP Curve): The success ratio $SP(\theta)$ is the fraction of image pairs matched correctly at threshold $\theta$; sweeping $\theta$ from 1° to 10° traces the SP curve, summarized by its area under the curve (AUC).
- Correspondence Sufficiency (AP Bar): The mean number of inlier correspondences per correctly matched pair, with the count capped at a fixed maximum for practical reporting.
- Efficiency: Mean runtime per image pair (CPU and/or GPU), including detection, matching, and geometric verification (e.g., RANSAC).
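The pose-error metrics above can be sketched in NumPy. The function names are illustrative choices, and the angular-error formulas are the standard two-view definitions consistent with the description above:

```python
import numpy as np

def rotation_error_deg(R_gt, R_est):
    """Angle between two rotation matrices, in degrees."""
    cos = (np.trace(R_gt.T @ R_est) - 1.0) / 2.0
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

def translation_error_deg(t_gt, t_est):
    """Angle between translation directions (scale is unobservable), in degrees."""
    cos = np.dot(t_gt, t_est) / (np.linalg.norm(t_gt) * np.linalg.norm(t_est))
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

def pose_error_deg(R_gt, t_gt, R_est, t_est):
    """Combined error: the worse of the rotational and translational errors."""
    return max(rotation_error_deg(R_gt, R_est),
               translation_error_deg(t_gt, t_est))

def sp_auc(errors, thresholds=range(1, 11)):
    """Mean success ratio over pose-error thresholds of 1..10 degrees."""
    errors = np.asarray(errors)
    return float(np.mean([(errors < t).mean() for t in thresholds]))
```

Clipping the cosine into [-1, 1] before `arccos` guards against NaNs from floating-point round-off, a common pitfall when the estimated pose is nearly exact.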
3. Experimental Protocol and Workflow
The evaluation protocol mirrors realistic application settings. For short baselines, each video is split into fixed-length segments (segment lengths are set separately for TUM and KITTI), and frames 2, 3, … of each segment are matched against its first frame. Wide-baseline protocols match all image pairs in the Strecha sequences and TUM frame pairs sampled roughly 5 seconds apart.
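The segment-based pairing for short baselines can be sketched as follows; the helper name is mine, and the per-dataset segment lengths are parameters set by the benchmark:

```python
def short_baseline_pairs(num_frames, seg_len):
    """Split a video into consecutive segments of seg_len frames; within
    each segment, match every later frame against the segment's first frame."""
    pairs = []
    for start in range(0, num_frames, seg_len):
        end = min(start + seg_len, num_frames)
        pairs.extend((start, j) for j in range(start + 1, end))
    return pairs

# e.g. short_baseline_pairs(6, 3) -> [(0, 1), (0, 2), (3, 4), (3, 5)]
```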
Keypoint and Descriptor Extraction: Each matcher uses its canonical detector-descriptor pipeline (e.g., SIFT, SURF, ORB, BRISK, KAZE, AKAZE, DLCO, FREAK, BinBoost, LATCH, DAISY, ASIFT for CODE/RepMatch).
Nearest-Neighbor Matching: For floating-point descriptors, FLANN with Euclidean distance is used; for binary descriptors, brute-force matching with Hamming distance. Ambiguous matches are filtered using the ratio test (threshold 0.8).
Correspondence Selection: Sparse matchers (e.g., SIFT, DAISY) use OpenCV’s five-point algorithm with RANSAC for geometric verification. Rich matchers (CODE, RepMatch, GMS) employ their own, typically more sophisticated, pose estimators.
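The nearest-neighbor step with the 0.8 ratio test can be sketched in plain NumPy (the function name is mine; a real pipeline would use FLANN for float descriptors, or brute-force Hamming matching for binary ones, as described above):

```python
import numpy as np

def ratio_test_matches(desc1, desc2, ratio=0.8):
    """Nearest-neighbor matching with Lowe's ratio test for float descriptors.
    Keeps (i, j) pairs where the best match in desc2 is sufficiently better
    than the second-best, filtering ambiguous correspondences."""
    matches = []
    for i, d in enumerate(desc1):
        dists = np.linalg.norm(desc2 - d, axis=1)  # Euclidean distances
        j1, j2 = np.argsort(dists)[:2]             # two nearest neighbors
        if dists[j1] < ratio * dists[j2]:
            matches.append((i, int(j1)))
    return matches
```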
4. Evaluated Algorithms
MatchBench covers 16 distinct feature matching systems, summarized in the following table:
| Name | Key Detector/Descriptor | Notable Property / Addition |
|---|---|---|
| SIFT | DoG + 128D gradient | Ratio test + FLANN + RANSAC |
| SURF | Fast Hessian + 64D descriptor | |
| ORB | FAST+Harris + BRIEF | Binary; high speed |
| BRISK | AGAST + binary descriptor | Memory efficient |
| KAZE | Nonlinear scale space + M-SURF | Robust to varying image structure |
| AKAZE | Fast KAZE approx. | Compact, fast |
| DLCO | Descriptor learned via convex optimization | Learning-based descriptor |
| FREAK | Retina-inspired binary | Biomimetic pattern |
| BinBoost | Boosted binary code | Learning-based binary |
| LATCH | Patch-triplet descriptor | Learning-based |
| DAISY | Dense gradient-based | Suited for dense matching |
| KVLD | Virtual Line + semi-local check | Extra geometric/photometric verification |
| GAIM | Affine simulation + SURF | Simulates view changes |
| CODE | ASIFT + global optimization | Rich matches, high cost |
| RepMatch | Geometry-aware extension of CODE | Suited for repetitive structures |
| GMS | ORB + grid-based filtering | Fast, effective in real time |
5. Quantitative Results and Analysis
Matching Ability: On short-baseline tasks, GMS outperforms all other methods (SP AUC ≈0.51/0.61/0.25/0.96 on Seqs 01–04), followed by DLCO and KAZE. For wide baselines, RepMatch leads (AUC ≈0.54/0.77/0.43/0.47), followed by CODE and GMS. Sparse, classical matchers (SIFT, SURF, ORB) underperform in wide-baseline settings, particularly in low-texture or geometrically complex scenes.
Correspondence Sufficiency: Rich matchers (e.g., CODE, RepMatch) produce >1,000 inliers, GMS achieves ~100–300, while classical matchers supply <200 inliers.
Efficiency: ORB and GMS demonstrate high efficiency (ORB: ≈48 ms/pair, GMS: ≈46 ms/pair on CPU/GPU). High-performing rich matchers are costly (RepMatch: ≈10,780 ms/pair for selection; CODE: ≈1,365 ms on GPU + 3,080 ms selection).
Scene Dependence: All matchers achieve high AUC (>0.87) on high-resolution, well-textured street scenes (Seq 04). Scene complexity and texture scarcity (e.g., indoor, non-planar) introduce larger performance disparities and highlight the strengths of global or learning-based methods.
Methodological Trade-offs:
- KVLD and GAIM introduce geometric/photometric checks—offering modest benefits with considerable computational expense.
- CODE and RepMatch perform global optimization for high robustness at the cost of speed.
- GMS implements grid-based motion statistics, providing a favorable balance between speed and robustness.
6. Practical Guidelines
- Real-time SLAM/Visual Odometry: ORB combined with GMS is recommended for short-baseline use cases, delivering robust matching at ≈45 ms/pair and sufficient inliers.
- Offline Wide-baseline SfM: For maximal matching ability and correspondence sufficiency, RepMatch or CODE is optimal if runtime is not constrained; GMS provides efficient and adequate performance when <1 s/pair is required.
- Memory/Compute-limited Platforms: Binary features such as ORB, BRISK, or AKAZE used with the ratio test yield efficient correspondences; GMS can be added for improved inlier selection.
- Low-texture/Non-planar Scenes: Global optimization methods (RepMatch, CODE) or deep/local enhancements (DLCO, KAZE) are advantageous.
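For the binary-feature recommendation above, brute-force Hamming matching can be sketched in NumPy. The helper name and the `max_dist` acceptance threshold are illustrative assumptions, not part of the benchmark:

```python
import numpy as np

def hamming_matches(desc1, desc2, max_dist=64):
    """Brute-force Hamming matching for binary descriptors (e.g., ORB's
    256-bit codes stored as 32 uint8 bytes each)."""
    matches = []
    for i, d in enumerate(desc1):
        # XOR then bit-count gives the Hamming distance to every candidate.
        dists = np.unpackbits(np.bitwise_xor(desc2, d), axis=1).sum(axis=1)
        j = int(np.argmin(dists))
        if dists[j] <= max_dist:
            matches.append((i, j, int(dists[j])))
    return matches
```

XOR-plus-popcount is why binary descriptors are cheap to match on constrained hardware: no floating-point arithmetic is needed.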
7. Open Challenges and Future Directions
Current benchmarking via pose-based verification does not address dense, per-pixel correspondence evaluation; thus, establishing high-precision, dense ground truth remains an open issue. Evaluation under severe illumination and appearance changes, such as day-night or weather variation, is lacking. There is a need for extension toward larger-scale, multi-camera datasets (e.g., Internet photo collections) with reliable structural ground truth. Reducing the computational bottleneck in RANSAC-style geometric verification, particularly via GPU acceleration, and developing end-to-end learned systems that unify detection, description, and matching, are proposed as key avenues for closing the gap between accuracy and speed in feature matching (Bian et al., 2018).