XRefine: Detector-Agnostic Subpixel Refinement

Updated 22 January 2026
  • XRefine is a detector-agnostic neural network architecture that refines keypoint correspondences from raw grayscale patches using cross-attention for subpixel accuracy.
  • It employs a five-layer CNN for patch embedding and a differentiable soft-argmax to predict precise offsets, achieving state-of-the-art improvements in metrics like AUC5 on benchmarks.
  • The method extends to multi-view feature track refinement, ensuring consistent keypoint alignment across frames and offering efficient runtime across various detectors.

XRefine is a detector-agnostic neural network architecture for sub-pixel refinement of keypoint correspondences in images, with applications in 3D vision such as structure-from-motion, visual odometry, and multi-view geometry. Unlike existing learned refinement methods, which require detector-specific training and access to detector-internal features, XRefine operates solely on raw grayscale image patches around matched keypoints and generalizes across arbitrary detectors without retraining. The architecture employs a cross-attention mechanism to predict refined keypoint locations, facilitating improved downstream geometric estimation and offering an efficient, lightweight forward pass. XRefine also extends naturally to multi-view feature track refinement, ensuring consistent correspondences across multiple frames. Experimental results on benchmarks such as MegaDepth, ScanNet, and KITTI demonstrate state-of-the-art accuracy and runtime advantages compared to both prior learned methods and dense optimization-based approaches (Schmid et al., 18 Jan 2026).

1. Motivation and Prior Methodologies

Sparse keypoint matching pipelines for 3D vision tasks typically consist of keypoint detection (at pixel or coarse precision), descriptor computation, feature matching, and geometric estimation (relative pose, depth). Even advanced detectors like SuperPoint and XFeat exhibit sub-pixel misalignments, with keypoint localization errors modeled as zero-mean Gaussian noise of standard deviation $\sigma \approx 1$–$3$ pixels. These inaccuracies degrade downstream geometric tasks; for example, relative pose AUC5 declines rapidly once $\sigma > 1$ pixel.

Recent learned refiners—such as Keypt2Subpx, XFeat-Refine, and PixSfM—address localization errors by learning to predict small corrective offsets. However, these approaches require access to internal representations of the detector (e.g., descriptors, score maps) and are typically retrained for each specific detector. Feature-metric optimizers (e.g., Lucas–Kanade or PixSfM) achieve improved accuracy but are computationally intensive and remain detector-specific (Schmid et al., 18 Jan 2026).

2. XRefine Network Architecture

XRefine is structured as a lightweight, detector-agnostic network that receives as input only localized grayscale image patches. Its main architectural stages are as follows:

2.1 Patch Extraction and Representation

For each matched keypoint between two images, XRefine extracts $11 \times 11$ grayscale patches $p_A, p_B \in \mathbb{R}^{1 \times 11 \times 11}$ centered at the estimated keypoint coordinates in images $A$ and $B$. Each patch is independently embedded using a five-layer convolutional neural network with the following structure:

  • Conv1: $1 \to 16$ channels, no padding, ReLU
  • Conv2: $16 \to 16$ channels, no padding
  • Conv3: $16 \to 64$ channels, no padding
  • Conv4: $64 \to 64$ channels, no padding
  • Conv5: $64 \to 64$ channels, no padding

The output is an embedding $e_A, e_B \in \mathbb{R}^{64 \times 3 \times 3}$, with $9$ spatial tokens per patch.
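As a shape check, the embedding stage can be sketched in plain NumPy. The summary specifies channel widths but not kernel sizes; the split below (four $3 \times 3$ valid convolutions followed by one $1 \times 1$, with ReLU after every layer) is an assumption chosen so that an $11 \times 11$ patch reduces to the stated $3 \times 3$ output:

```python
import numpy as np

def conv2d_valid(x, w):
    """Valid (unpadded) 2D convolution: x (C_in, H, W), w (C_out, C_in, k, k)."""
    c_out, c_in, k, _ = w.shape
    _, h, wd = x.shape
    out = np.zeros((c_out, h - k + 1, wd - k + 1))
    for o in range(c_out):
        for i in range(out.shape[1]):
            for j in range(out.shape[2]):
                out[o, i, j] = np.sum(x[:, i:i + k, j:j + k] * w[o])
    return out

rng = np.random.default_rng(0)
patch = rng.standard_normal((1, 11, 11))     # one grayscale 11x11 patch
channels = [1, 16, 16, 64, 64, 64]           # Conv1..Conv5 widths from the text
kernels = [3, 3, 3, 3, 1]                    # assumed: four 3x3 layers, one 1x1
x = patch
for c_in, c_out, k in zip(channels[:-1], channels[1:], kernels):
    w = rng.standard_normal((c_out, c_in, k, k)) * 0.1
    x = np.maximum(conv2d_valid(x, w), 0.0)  # ReLU (assumed after every layer)
print(x.shape)  # (64, 3, 3): nine 64-dimensional spatial tokens
```

The spatial walk is $11 \to 9 \to 7 \to 5 \to 3 \to 3$; each $3 \times 3$ valid convolution removes a one-pixel border, and the final $1 \times 1$ preserves the grid.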

2.2 Cross-Attention Module

To identify the precise sub-pixel correspondence, XRefine uses cross-attention. The $3 \times 3$ embeddings are flattened into sequences of shape $9 \times 64$ for each image, and learned positional encodings are added. The attention mechanism computes

$$\textrm{Attn}(E_A \leftarrow E_B) = \textrm{softmax}\!\left(Q_A K_B^\top / \sqrt{d}\right) V_B$$

where $Q_A$ is a linear projection of the positionally-encoded tokens of image $A$, and $K_B$, $V_B$ are linear projections of the positionally-encoded tokens of image $B$. Attention is applied symmetrically, updating both $e_A$ and $e_B$. The "Large" XRefine variant stacks three such cross-attention blocks.
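A minimal single-head NumPy sketch of one symmetric cross-attention update follows; sharing one set of projection matrices across both directions is an assumption for brevity (the actual model may use separate projections and multiple heads):

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(E_a, E_b, Wq, Wk, Wv):
    """Update image-A tokens by attending to image-B tokens.
    E_a, E_b: (9, 64) token sequences with positional encodings added."""
    Q = E_a @ Wq                         # queries from A
    K = E_b @ Wk                         # keys from B
    V = E_b @ Wv                         # values from B
    d = Q.shape[-1]
    A = softmax(Q @ K.T / np.sqrt(d))    # (9, 9) attention weights
    return A @ V

rng = np.random.default_rng(0)
d = 64
E_a, E_b = rng.standard_normal((9, d)), rng.standard_normal((9, d))
Wq, Wk, Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
# Symmetric application: A attends to B, and B attends to A.
upd_a = cross_attend(E_a, E_b, Wq, Wk, Wv)
upd_b = cross_attend(E_b, E_a, Wq, Wk, Wv)
print(upd_a.shape, upd_b.shape)  # (9, 64) (9, 64)
```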

2.3 Score Map Head and Soft-Argmax

The cross-attended embeddings are passed through a $3 \times 3$ convolution to yield score maps $S \in \mathbb{R}^{1 \times 3 \times 3}$, followed by a $\tanh$ non-linearity. A differentiable soft-argmax computes the refined sub-pixel offset $(\Delta x, \Delta y)$:

$$\Delta x = \frac{\sum_{i,j} x_i \exp(S_{i,j})}{\sum_{i,j} \exp(S_{i,j})}, \quad \Delta y = \frac{\sum_{i,j} y_j \exp(S_{i,j})}{\sum_{i,j} \exp(S_{i,j})}$$

The final refined keypoint is $x_{\textrm{refined}} = x_{\textrm{initial}} + \Delta x$, $y_{\textrm{refined}} = y_{\textrm{initial}} + \Delta y$.
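The soft-argmax above is a softmax-weighted expectation over grid offsets. In the sketch below, the $3 \times 3$ score map is assumed to span pixel offsets $\{-1, 0, +1\}$ around the current keypoint:

```python
import numpy as np

def soft_argmax(S, coords=(-1.0, 0.0, 1.0)):
    """Differentiable soft-argmax over a 3x3 score map S.
    coords: assumed pixel offsets of grid cells relative to the patch center."""
    w = np.exp(S - S.max())             # softmax weights (shift for stability)
    w = w / w.sum()
    xs = np.asarray(coords)
    dx = float((w.sum(axis=0) * xs).sum())  # expected column (x) offset
    dy = float((w.sum(axis=1) * xs).sum())  # expected row (y) offset
    return dx, dy

S = np.array([[0.0, 0.0, 0.0],
              [0.0, 0.0, 5.0],          # strong peak one pixel right of center
              [0.0, 0.0, 0.0]])
dx, dy = soft_argmax(S)
refined = np.array([120.0, 80.0]) + np.array([dx, dy])  # refined keypoint
```

A sharp peak pulls the offset close to its grid cell (here $\Delta x \approx 0.94$, $\Delta y = 0$), while a flat map yields an offset near zero, so the operation stays differentiable end-to-end.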

3. Training Objectives and Losses

XRefine employs an epipolar geometry-based loss to train without requiring dense, per-detector sub-pixel ground truth. Given pairs of patches with known ground-truth relative pose (i.e., fundamental or essential matrix $F$), and refined keypoint homogeneous coordinates $\tilde{x}_{A,i}$ and $\tilde{x}_{B,i}$, the epipolar residual is:

$$L_i = \left| \tilde{x}_{B,i}^\top F \, \tilde{x}_{A,i} \right|$$

The total loss over $M$ matches is:

$$L = \frac{1}{M} \sum_{i=1}^{M} L_i$$

This training regime enforces geometric consistency of the refined correspondences.
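The loss can be sketched directly in NumPy. The toy check below uses a pure horizontal translation with identity intrinsics, for which $F = [t]_\times$ and matches on the same image row have zero residual:

```python
import numpy as np

def epipolar_loss(xA, xB, F):
    """Mean absolute epipolar residual |x_B^T F x_A| over M matches.
    xA, xB: (M, 2) pixel coordinates; F: (3, 3) fundamental matrix."""
    M = xA.shape[0]
    hA = np.hstack([xA, np.ones((M, 1))])   # homogeneous coordinates
    hB = np.hstack([xB, np.ones((M, 1))])
    residuals = np.abs(np.einsum('mi,ij,mj->m', hB, F, hA))
    return float(residuals.mean())

# F = [t]_x for t = (1, 0, 0): skew-symmetric cross-product matrix.
F = np.array([[0.0, 0.0,  0.0],
              [0.0, 0.0, -1.0],
              [0.0, 1.0,  0.0]])
xA = np.array([[10.0, 5.0], [30.0, 7.0]])
xB = xA + np.array([4.0, 0.0])  # shifted horizontally only: zero residual
print(epipolar_loss(xA, xB, F))  # 0.0
```

Perturbing the $y$ coordinates of `xB` makes the residual grow by exactly the vertical misalignment, which is the signal the loss backpropagates into the offset head.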

4. Detector-Agnostic Generalization and Multi-View Refinement

XRefine is explicitly trained to operate without descriptors or score maps, relying only on raw pixel patches. Two training paradigms are defined:

  • “Specific”: train per detector/matcher by sampling the top 4096 matches per image and perturbing them with Gaussian noise $\mathcal{N}(0, \sigma^2 I)$, $\sigma = 1.5$ pixels.
  • “General”: randomly sample 4096 depth-validated pixels in the source image, project them to the target, perturb identically, and train only once.

The "general" model can refine matches from arbitrary detectors (SIFT, SuperPoint, XFeat, ALIKED, etc.) at inference without retraining.
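The "general" sampling scheme can be sketched as follows; the projection into the target view (via known depth and pose) is omitted here, and the function name is illustrative, not part of any published interface:

```python
import numpy as np

def sample_general_training_points(valid_depth_mask, n=4096, sigma=1.5, rng=None):
    """Draw depth-validated source pixels and perturb them with zero-mean
    Gaussian noise of std sigma (1.5 px), as in the 'general' training mode.
    Projection of the clean points into the target view is omitted."""
    rng = rng or np.random.default_rng()
    ys, xs = np.nonzero(valid_depth_mask)
    idx = rng.choice(len(xs), size=min(n, len(xs)), replace=False)
    clean = np.stack([xs[idx], ys[idx]], axis=1).astype(float)
    noisy = clean + rng.normal(0.0, sigma, size=clean.shape)
    return clean, noisy

mask = np.ones((480, 640), dtype=bool)  # toy mask: every pixel depth-valid
clean, noisy = sample_general_training_points(mask, rng=np.random.default_rng(0))
print(clean.shape, noisy.shape)  # (4096, 2) (4096, 2)
```

Because the network only ever sees the noisy patch pair and must recover the clean correspondence, nothing in this setup depends on any particular detector.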

Multi-view feature track refinement is achieved by treating one keypoint as the track reference and comparing all others against it via patch pairs. Offset predictions are made with respect to the reference, ensuring global track consistency. Each comparison uses the same epipolar loss, computed relative to the known pose between the reference and other views.
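The reference-anchored scheme amounts to a small driver loop; `refine_pair` below stands in for a hypothetical pairwise XRefine forward pass on the (reference, view-$i$) patch pair and is not part of the published interface:

```python
def refine_track(keypoints, refine_pair, ref_idx=0):
    """Refine a multi-view feature track against one reference view.
    keypoints: list of (x, y) keypoints, one per view.
    refine_pair(ref_idx, i): hypothetical pairwise refiner comparing the patch
    in view i against the reference patch, returning an offset (dx, dy)."""
    refined = list(keypoints)
    for i in range(len(keypoints)):
        if i == ref_idx:
            continue                  # the reference keypoint anchors the track
        dx, dy = refine_pair(ref_idx, i)
        x, y = keypoints[i]
        refined[i] = (x + dx, y + dy)
    return refined
```

Because every offset is predicted against the same reference, the refined track cannot drift apart the way chained pairwise refinements can.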

5. Experimental Evaluation

XRefine's effectiveness is demonstrated through extensive experiments on the MegaDepth, ScanNet, and KITTI datasets, evaluating against both detector-specific and agnostic refinement methods and across different keypoint detectors and matchers.

Key Results Overview

| Benchmark | No Refine (AUC5) | Keypt2Subpx | PixSfM | XRefine General | XRefine Specific |
|---|---|---|---|---|---|
| MegaDepth | 34.91 | 36.20 | 38.30 | 38.87 | 38.86 |
| ScanNet | — | — | — | — | — |
| KITTI | — | — | — | — | — |

(On MegaDepth with SuperPoint+MNN, AUC5 rises from 34.91 without refinement to 38.87 with XRefine General; Keypt2Subpx and PixSfM yield intermediate gains.)

In ETH3D multi-view triangulation at 2 cm accuracy, the fraction of correctly triangulated points increases from 87.73% (no refinement) to 90.80% (XRefine General); PixSfM's joint optimization achieves higher values still.

Runtime and Ablation Studies

For 2048 matches per image on an NVIDIA RTX A5000:

| Method | Runtime (ms) |
|---|---|
| XFeat-Refine | 0.55 |
| Keypt2Subpx | 3.43 |
| XRefine | 3.61 |
| PixSfM | 70.28 (1435.7 w/ S2DNet) |

Critical ablation findings include a loss of performance when removing cross-attention (AUC5 drops from 47.52 to 41.20) or replacing the soft-argmax head with a descriptor-similarity head (AUC5 = 45.58). A "Large Specific" model variant improves AUC5 to 50.05, at increased computation time (19.7 ms).

6. Analysis and Implications

XRefine demonstrates that localized, cross-attention-driven refinement over small image patches enables detector-agnostic sub-pixel keypoint localization. This architecture separates the refinement process from detector-specific features and representations, facilitating both rapid inference and simple deployment across different matching pipelines. The use of an epipolar loss eliminates the need for dense annotation of sub-pixel ground truth, reducing training data requirements and improving flexibility.

The multi-view extension, pulling all track points relative to a reference frame, directly addresses inconsistency issues inherent in naïve pairwise refinement, ensuring that all refined matches remain coherent under multi-view geometry constraints.

A plausible implication is that this approach may be adaptable to further modalities, such as multimodal keypoint matching or domains with domain shifts, provided that relevant patch-based information can be exploited through cross-attention mechanisms.

7. Summary of Contributions

  • Introduction of a cross-attention, patch-based refinement network that is fully detector-agnostic.
  • A single, unified training procedure enables generalization across arbitrary detectors without retraining.
  • Extension to consistent multi-view track refinement.
  • Empirically demonstrated state-of-the-art gains in both geometric accuracy (relative pose AUC, triangulation) and runtime efficiency.
  • Simplicity of the architecture and applicability to standard benchmarks substantiate its relevance for broader 3D vision systems (Schmid et al., 18 Jan 2026).