XRefine: Detector-Agnostic Subpixel Refinement
- XRefine is a detector-agnostic neural network architecture that refines keypoint correspondences from raw grayscale patches using cross-attention for subpixel accuracy.
- It employs a five-layer CNN for patch embedding and a differentiable soft-argmax to predict precise offsets, achieving state-of-the-art improvements in metrics like AUC5 on benchmarks.
- The method extends to multi-view feature track refinement, ensuring consistent keypoint alignment across frames and offering efficient runtime across various detectors.
XRefine is a detector-agnostic neural network architecture for sub-pixel refinement of keypoint correspondences in images, with applications in 3D vision such as structure-from-motion, visual odometry, and multi-view geometry. Unlike existing learned refinement methods, which require detector-specific training and access to detector-internal features, XRefine operates solely on raw grayscale image patches around matched keypoints and generalizes across arbitrary detectors without retraining. The architecture employs a cross-attention mechanism to predict refined keypoint locations, facilitating improved downstream geometric estimation and offering an efficient, lightweight forward pass. XRefine also extends naturally to multi-view feature track refinement, ensuring consistent correspondences across multiple frames. Experimental results on benchmarks such as MegaDepth, ScanNet, and KITTI demonstrate state-of-the-art accuracy and runtime advantages compared to both prior learned methods and dense optimization-based approaches (Schmid et al., 18 Jan 2026).
1. Motivation and Prior Methodologies
Sparse keypoint matching pipelines for 3D vision tasks typically consist of keypoint detection (at pixel or coarse precision), descriptor computation, feature matching, and geometric estimation (relative pose, depth). Even advanced detectors such as SuperPoint and XFeat exhibit sub-pixel misalignments, with keypoint localization errors modeled as zero-mean Gaussian noise with standard deviation of up to $3$ pixels. These inaccuracies degrade downstream geometric tasks; for example, relative pose AUC5 declines rapidly as the keypoint noise level increases.
Recent learned refiners—such as Keypt2Subpx, XFeat-Refine, and PixSfM—address localization errors by learning to predict small corrective offsets. However, these approaches require access to internal representations of the detector (e.g., descriptors, score maps) and are typically retrained for each specific detector. Feature-metric optimizers (e.g., Lucas–Kanade or PixSfM) achieve improved accuracy but are computationally intensive and remain detector-specific (Schmid et al., 18 Jan 2026).
2. XRefine Network Architecture
XRefine is structured as a lightweight, detector-agnostic network that receives as input only localized grayscale image patches. Its main architectural stages are as follows:
2.1 Patch Extraction and Representation
For each matched keypoint between two images, XRefine extracts grayscale patches centered at the estimated keypoint coordinates in images $I_1$ and $I_2$. Each patch is independently embedded using a five-layer convolutional neural network with the following structure:
- Conv1: channels, no padding, ReLU
- Conv2: channels, no padding
- Conv3: channels, no padding
- Conv4: channels, no padding
- Conv5: channels, no padding
The output is an embedding with $9$ spatial tokens per patch.
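The shrinkage down to $9$ spatial tokens follows directly from the unpadded convolutions. A minimal sketch of the shape arithmetic, assuming illustrative hyperparameters not stated in the summary (3×3 kernels, a 13×13 input patch):

```python
# Spatial-shape arithmetic for the five-layer patch CNN.
# Kernel size and input patch size are illustrative assumptions:
# 3x3 kernels on a 13x13 grayscale patch.
def conv_out(size: int, kernel: int = 3, padding: int = 0, stride: int = 1) -> int:
    """Output side length of a square, unpadded convolution."""
    return (size + 2 * padding - kernel) // stride + 1

patch = 13                      # assumed input patch side length
for _ in range(5):              # five unpadded conv layers
    patch = conv_out(patch)     # each layer shrinks the side by 2

tokens = patch * patch          # spatial tokens fed to cross-attention
print(patch, tokens)
```

Under these assumptions the patch shrinks to a 3×3 grid, i.e. the $9$ tokens per patch mentioned above.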
2.2 Cross-Attention Module
To identify the precise sub-pixel correspondence, XRefine uses cross-attention. The embeddings of the two patches are flattened into sequences of $9$ tokens each, and learned positional encodings are added. The attention mechanism computes

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d}}\right)V,$$

where $Q$, $K$, and $V$ are linear projections of the positional-augmented embeddings from each image. Attention is applied symmetrically, updating the token embeddings of both patches. The "Large" XRefine variant stacks three such cross-attention blocks.
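A minimal numpy sketch of one symmetric cross-attention update between the two token sets; the embedding width, projection weights, and residual form are illustrative assumptions, not the paper's exact layer:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16                                   # embedding width (illustrative)
E1 = rng.standard_normal((9, d))         # 9 patch tokens, image 1
E2 = rng.standard_normal((9, d))         # 9 patch tokens, image 2
Wq, Wk, Wv = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(queries, keys_values):
    """Tokens of `queries` attend to tokens of `keys_values`."""
    Q, K, V = queries @ Wq, keys_values @ Wk, keys_values @ Wv
    A = softmax(Q @ K.T / np.sqrt(d))     # (9, 9) attention weights
    return A @ V

# Symmetric update: each patch's tokens are refined against the other patch.
E1_new = E1 + cross_attend(E1, E2)
E2_new = E2 + cross_attend(E2, E1)
```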
2.3 Score Map Head and Soft-Argmax
The cross-attended embeddings are passed through a convolution to yield score maps $S_1, S_2$, followed by a non-linearity. A differentiable soft-argmax computes the refined sub-pixel offset $\Delta$:

$$\Delta = \sum_{\mathbf{x}} \mathrm{softmax}(S)(\mathbf{x})\,\mathbf{x},$$

where the sum runs over the spatial positions $\mathbf{x}$ of the score map, measured relative to the patch center. The final refined keypoints are $\mathbf{p}_1' = \mathbf{p}_1 + \Delta_1$ and $\mathbf{p}_2' = \mathbf{p}_2 + \Delta_2$.
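The soft-argmax can be sketched in a few lines of numpy: a softmax over the score map gives a probability mass per pixel, and the offset is the probability-weighted mean of pixel coordinates relative to the patch center (a standard formulation; the patch size here is illustrative):

```python
import numpy as np

def soft_argmax(score_map: np.ndarray) -> np.ndarray:
    """Differentiable sub-pixel offset: softmax-weighted mean of
    pixel coordinates, measured from the patch center."""
    h, w = score_map.shape
    p = np.exp(score_map - score_map.max())
    p /= p.sum()                               # softmax over the whole map
    ys, xs = np.mgrid[0:h, 0:w]
    # Offsets relative to the patch center, so a symmetric map gives (0, 0).
    dx = (p * (xs - (w - 1) / 2)).sum()
    dy = (p * (ys - (h - 1) / 2)).sum()
    return np.array([dx, dy])

# A sharp peak one pixel right of center yields an offset close to (1, 0).
S = np.full((3, 3), -10.0)
S[1, 2] = 10.0
offset = soft_argmax(S)
refined = np.array([100.0, 50.0]) + offset     # p' = p + delta
```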
3. Training Objectives and Losses
XRefine employs an epipolar geometry-based loss to train without requiring dense, per-detector sub-pixel ground truth. Given pairs of patches with known ground-truth relative pose (i.e., fundamental or essential matrix $F$), and refined keypoint homogeneous coordinates $\mathbf{x}_1'$ and $\mathbf{x}_2'$, the epipolar residual is

$$r_i = \mathbf{x}_{2,i}'^{\top} F\, \mathbf{x}_{1,i}'.$$

The total loss over $N$ matches is

$$\mathcal{L} = \frac{1}{N} \sum_{i=1}^{N} \lvert r_i \rvert.$$
This training regime enforces geometric consistency of the refined correspondences.
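The epipolar residual is cheap to evaluate in batch. A numpy sketch, using a toy essential matrix for a pure x-translation (the mean-absolute penalty is an assumption of this sketch, not necessarily the paper's exact robust loss):

```python
import numpy as np

def epipolar_residuals(F, pts1, pts2):
    """Algebraic epipolar residuals r_i = x2_i^T F x1_i for matched
    homogeneous keypoints pts1, pts2 of shape (N, 3)."""
    return np.einsum('ni,ij,nj->n', pts2, F, pts1)

def epipolar_loss(F, pts1, pts2):
    # Mean absolute residual; the exact penalty is an assumption here.
    return np.abs(epipolar_residuals(F, pts1, pts2)).mean()

# Essential matrix of a pure x-translation: E = [t]_x with t = (1, 0, 0),
# whose epipolar lines are horizontal (matches must share a y-coordinate).
E = np.array([[0.0, 0.0,  0.0],
              [0.0, 0.0, -1.0],
              [0.0, 1.0,  0.0]])
pts1 = np.array([[0.0, 2.0, 1.0], [5.0, 3.0, 1.0]])
pts2 = np.array([[4.0, 2.0, 1.0], [7.0, 3.0, 1.0]])   # same y -> residual 0
loss_good = epipolar_loss(E, pts1, pts2)
loss_bad = epipolar_loss(E, pts1, pts2 + np.array([0.0, 0.5, 0.0]))
```

Points satisfying the epipolar constraint give zero loss; vertically perturbed matches are penalized, which is exactly the signal the refiner is trained on.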
4. Detector-Agnostic Generalization and Multi-View Refinement
XRefine is explicitly trained to operate without descriptors or score maps, relying only on raw pixel patches. Two training paradigms are defined:
- “Specific”: Train per detector/matcher by sampling the top 4096 matches per image and perturbing them with Gaussian noise.
- “General”: Randomly sample 4096 depth-validated pixels in the source image, project them to the target image, perturb identically, and train only once.
The "general" model can refine matches from arbitrary detectors (SIFT, SuperPoint, XFeat, ALIKED, etc.) at inference without retraining.
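The "general" sampling scheme can be sketched as follows; the noise level and mask construction are illustrative assumptions (the projection to the target image is omitted for brevity):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_general_training_points(valid_depth_mask, n=4096, noise_sigma=1.0):
    """Sketch of the 'general' sampling scheme: draw depth-validated
    source pixels and perturb them with Gaussian noise before refinement.
    noise_sigma is illustrative; the exact value is not stated here."""
    ys, xs = np.nonzero(valid_depth_mask)          # depth-validated pixels
    k = min(n, xs.size)
    idx = rng.choice(xs.size, size=k, replace=False)
    pts = np.stack([xs[idx], ys[idx]], axis=1).astype(np.float64)
    noisy = pts + rng.normal(0.0, noise_sigma, size=pts.shape)
    return pts, noisy

mask = np.ones((64, 64), dtype=bool)               # toy depth-validity mask
clean, noisy = sample_general_training_points(mask, n=256)
```

Because the perturbation is detector-independent, a single model trained this way sees the same noise statistics regardless of which detector later produces the keypoints.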
Multi-view feature track refinement is achieved by treating one keypoint as the track reference and comparing all others against it via patch pairs. Offset predictions are made with respect to the reference, ensuring global track consistency. Each comparison uses the same epipolar loss, computed relative to the known pose between the reference and other views.
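The reference-anchored scheme above can be sketched as a simple loop; `refine_pair` is a hypothetical stand-in for XRefine's two-view forward pass:

```python
def refine_track(track, refine_pair):
    """Sketch of reference-anchored multi-view track refinement.
    `track` is a list of (view_id, keypoint) observations; `refine_pair`
    stands in for a two-view refiner and returns the corrected keypoint
    in the second view."""
    ref_view, ref_kp = track[0]              # first observation anchors the track
    refined = [(ref_view, ref_kp)]
    for view, kp in track[1:]:
        # Each keypoint is corrected against the same reference patch,
        # keeping the whole track consistent rather than chaining
        # pairwise refinements.
        refined.append((view, refine_pair(ref_view, ref_kp, view, kp)))
    return refined

# Toy refiner that snaps every observation onto the reference coordinates.
track = [(0, (10.0, 20.0)), (1, (10.6, 19.5)), (2, (9.4, 20.3))]
out = refine_track(track, lambda rv, rkp, v, kp: rkp)
```

Anchoring all comparisons to one view is what prevents drift: pairwise chains (view 1 vs. 2, 2 vs. 3, ...) can accumulate offsets, while a shared reference cannot.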
5. Experimental Evaluation
XRefine's effectiveness is demonstrated through extensive experiments on the MegaDepth, ScanNet, and KITTI datasets, evaluating against both detector-specific and agnostic refinement methods and across different keypoint detectors and matchers.
Key Results Overview
| Benchmark | No Refine (AUC5) | Keypt2Subpx | PixSfM | XRefine General | XRefine Specific |
|---|---|---|---|---|---|
| MegaDepth (SuperPoint+MNN) | 34.91 | 36.20 | 38.30 | 38.87 | 38.86 |

On MegaDepth with SuperPoint+MNN, AUC5 rises from 34.91 without refinement to 38.87 with XRefine general; Keypt2Subpx and PixSfM yield intermediate gains.
In ETH3D multi-view triangulation at 2 cm accuracy, the fraction of correctly triangulated points increases from 87.73% (no refinement) to 90.80% (XRefine general); only PixSfM's joint optimization attains higher values.
Runtime and Ablation Studies
For 2048 matches per image on an NVIDIA RTX A5000:
| Method | Runtime (ms) |
|---|---|
| XFeat-Refine | 0.55 |
| Keypt2Subpx | 3.43 |
| XRefine | 3.61 |
| PixSfM | 70.28 (1435.7 w/ S2DNet) |
Critical ablation findings include loss of performance when removing cross-attention (AUC5 drop from 47.52 to 41.20) or replacing the soft-argmax head with a descriptor similarity head (AUC5=45.58). A "Large Specific" model variant improves AUC5 to 50.05 but at increased computation time (19.7 ms).
6. Analysis and Implications
XRefine demonstrates that localized, cross-attention-driven refinement over small image patches enables detector-agnostic sub-pixel keypoint localization. This architecture separates the refinement process from detector-specific features and representations, facilitating both rapid inference and simple deployment across different matching pipelines. The use of an epipolar loss eliminates the need for dense annotation of sub-pixel ground truth, reducing training data requirements and improving flexibility.
The multi-view extension, pulling all track points relative to a reference frame, directly addresses inconsistency issues inherent in naïve pairwise refinement, ensuring that all refined matches remain coherent under multi-view geometry constraints.
A plausible implication is that this approach may be adaptable to further settings, such as multimodal keypoint matching or domains subject to distribution shift, provided that relevant patch-based information can be exploited through cross-attention mechanisms.
7. Summary of Contributions
- Introduction of a cross-attention, patch-based refinement network that is fully detector-agnostic.
- A single, unified training procedure enables generalization across arbitrary detectors without retraining.
- Extension to consistent multi-view track refinement.
- Empirically demonstrated state-of-the-art gains in both geometric accuracy (relative pose AUC, triangulation) and runtime efficiency.
- Simplicity of the architecture and applicability to standard benchmarks substantiate its relevance for broader 3D vision systems (Schmid et al., 18 Jan 2026).