
CUDA LATCH (CLATCH): GPU-Accelerated Descriptor

Updated 16 January 2026
  • CUDA LATCH (CLATCH) is a GPU-accelerated binary descriptor that leverages discriminative patch comparisons to efficiently extract local image features.
  • It employs CUDA optimizations such as warp-shuffle reductions, shared memory use, and branchless bit setting to achieve real-time processing speeds.
  • Integrated with SfM pipelines, CLATCH offers up to a 10× speedup compared to SIFT while maintaining competitive reconstruction quality.

CUDA LATCH (CLATCH) is a high-performance, GPU-accelerated variant of the LATCH (Learned Arrangements of Three Patch Codes) binary descriptor designed for rapid extraction and matching of local image features. The formulation and port to CUDA enable efficient real-time operation, substantially reducing computational time required for tasks such as structure-from-motion (SfM) and large-scale matching, while maintaining competitive recognition accuracy compared to established descriptors such as SIFT and deep-learned alternatives (Levi et al., 2015, Parker et al., 2016).

1. Mathematical Formulation and Descriptor Encoding

LATCH encodes local image appearance using binary strings derived from discriminative comparisons among patches. For each keypoint, consider a square window W of size N×N pixels, rotated to a canonical orientation. The representation is a T-bit integer b(W) resulting from T binary tests. Each test is defined by a triplet ŝ_t = (p_{t,a}, p_{t,1}, p_{t,2}), specifying the positions of three k×k patches in W.

Extract the patch intensities P_{t,a} (anchor), P_{t,1} and P_{t,2} (companions). The t-th bit is set as follows:

g(W, ŝ_t) = 1 if ‖P_{t,a} − P_{t,1}‖²_F > ‖P_{t,a} − P_{t,2}‖²_F, and 0 otherwise

with

‖P − Q‖²_F = Σ_{i=1}^{k} Σ_{j=1}^{k} (P_{ij} − Q_{ij})²

The descriptor is:

b(W) = Σ_{t=1}^{T} g(W, ŝ_t) · 2^{t−1} ∈ {0, …, 2^T − 1}

Descriptors are compared by Hamming distance, computed efficiently on bit-packed representations (Levi et al., 2015, Parker et al., 2016).
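The encoding above can be condensed into a few lines. The following is an illustrative Python sketch of the math, not the CUDA implementation; the toy 2×2 patches and triplet layout are hypothetical:

```python
def frob_sq(p, q):
    """Squared Frobenius distance between two k x k patches (lists of rows)."""
    return sum((a - b) ** 2 for ra, rb in zip(p, q) for a, b in zip(ra, rb))

def latch_bit(anchor, comp1, comp2):
    """One LATCH test: 1 iff the anchor is farther from companion 1 than from companion 2."""
    return 1 if frob_sq(anchor, comp1) > frob_sq(anchor, comp2) else 0

def latch_descriptor(triplets):
    """Pack T binary tests into one integer; test t (0-indexed) contributes weight 2^t."""
    b = 0
    for t, (pa, p1, p2) in enumerate(triplets):
        b |= latch_bit(pa, p1, p2) << t
    return b

def hamming(b1, b2):
    """Hamming distance on bit-packed descriptors via XOR + popcount."""
    return bin(b1 ^ b2).count("1")

# Toy 2x2 "patches" (k = 2) forming a single test triplet:
anchor = [[10, 10], [10, 10]]
near   = [[11, 10], [10, 10]]   # close to the anchor
far    = [[90, 90], [90, 90]]   # far from the anchor
assert latch_bit(anchor, far, near) == 1   # ||Pa - P1||^2 > ||Pa - P2||^2
assert latch_bit(anchor, near, far) == 0
assert hamming(0b1011, 0b0001) == 2
```

Note that a 512-bit descriptor fits in a single arbitrary-precision Python integer here, whereas the GPU code packs it into eight 64-bit words.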

2. Learning Discriminative Patch Arrangements

The selection of patch-triplet arrangements is performed offline via discriminative learning. Using the Brown et al. same/not-same patch benchmark, a large pool of candidate triplets is sampled, and each candidate is scored by its agreement with the pair labels:

score(ŝ_t) = Σ_{i=1}^{N_p} [ I(g_{it} = g_{i′t}) · I(ℓ_i = same) + I(g_{it} ≠ g_{i′t}) · I(ℓ_i = not-same) ]

where I(⋅) is the indicator function, g_{it} and g_{i′t} are the candidate's responses on the two patches of pair i, and ℓ_i is the pair's label. Candidates are ranked by score and greedily selected, controlling for redundancy via a Pearson correlation threshold τ (e.g., τ = 0.2). The final T triplets maximize discriminability under label supervision.
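The scoring and greedy selection can be sketched as follows. This is an illustrative sketch, assuming candidate responses have been precomputed as 0/1 vectors over the training pairs; the toy candidates and the strict-inequality correlation test are assumptions, not the paper's exact procedure:

```python
def pearson(x, y):
    """Pearson correlation between two equal-length response vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy) if sx and sy else 0.0

def score(resp_a, resp_b, same):
    """Agreement score over labeled pairs: +1 when the test responds
    identically on a 'same' pair or differently on a 'not-same' pair."""
    return sum(1 for ga, gb, s in zip(resp_a, resp_b, same) if (ga == gb) == s)

def greedy_select(candidates, T, tau=0.2):
    """Rank (score, responses) candidates by score, then greedily keep those
    whose responses correlate below tau (absolute) with every kept one."""
    chosen = []
    for s, resp in sorted(candidates, key=lambda c: -c[0]):
        if all(abs(pearson(resp, prev)) < tau for _, prev in chosen):
            chosen.append((s, resp))
            if len(chosen) == T:
                break
    return chosen

# Toy pool: the second candidate duplicates the first (corr = 1, skipped);
# the third is uncorrelated with the first (corr = 0, kept).
cands = [(10, [1, 0, 1, 0]), (9, [1, 0, 1, 0]), (8, [1, 1, 0, 0])]
assert [s for s, _ in greedy_select(cands, T=2)] == [10, 8]
```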

3. GPU Architecture and Computational Optimizations

CLATCH maps naturally onto the CUDA architecture through data parallelism and bitwise operations. Extraction employs a kernel grid in which each thread block processes 16 descriptors. Within a block, warps (32 threads) cooperate on the bit computations:

  • Each warp computes a share of the descriptor bits (e.g., each thread handles bits b_t, b_{t+32}, …, where t is its lane index).
  • Precomputed triplet tables are cached in device or constant memory.
  • Image patches are loaded into shared memory, with padding to avoid bank conflicts.
  • Bilinear rotation into canonical orientation is achieved using texture fetches over 64×64 (or as specified) pixel windows.
  • Arithmetic within patch comparisons is fully unrolled for register-level speed.
  • Bit setting is branchless: mask = (d1 > d2) ? (1ULL << lane_id) : 0; descriptor_reg |= mask.
  • Efficient Hamming matching is performed via packed XOR and hardware popcount instructions.

Coalesced memory layout and warp-shuffle reductions (using __shfl_xor) further optimize throughput. The table below summarizes the main resource usage per 512-bit descriptor (k = 8):

Operation        Count                Example (T = 512, k = 8)
Memory loads     3Tk² floats          98,304
Arithmetic ops   6Tk² + O(T) FLOPs    196,608
Bit operations   O(T)                 512
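The branchless bit setting described above can be mirrored in scalar Python. This is a loose, illustrative emulation: lane_mask stands in for the per-thread CUDA computation, and the OR-reduction stands in for how 32 per-lane masks are combined into one descriptor word (the actual kernel uses warp intrinsics for this):

```python
def lane_mask(d1, d2, lane_id):
    """Branchless bit setting: the comparison result is used directly as a
    0/1 integer and shifted to the thread's lane position (no if/else)."""
    return int(d1 > d2) << lane_id

def warp_or_reduce(masks):
    """Combine 32 per-lane masks into one 32-bit descriptor word,
    emulating the warp-wide reduction performed on the GPU."""
    word = 0
    for m in masks:
        word |= m
    return word

# 32 lanes, each contributing one bit; distances alternate so that
# even lanes set their bit (d1 > d2) and odd lanes do not:
masks = [lane_mask(d1, d2, lane)
         for lane, (d1, d2) in enumerate([(5, 3), (1, 2)] * 16)]
assert warp_or_reduce(masks) == 0x55555555   # bits 0, 2, 4, ... set
```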

4. Extraction and Matching Pipeline

The full feature pipeline consists of:

  1. FAST keypoint detection (GPU/OpenCV).
  2. Orientation estimation for detected keypoints.
  3. CLATCH extraction kernel:
    • Partition keypoints into blocks of 16 (or as hardware allows).
    • Load and rotate a 64×64 window per keypoint, stored in shared memory.
    • For each window, compute 512 binary tests using 8×8 patch triplets; pack the results into eight 64-bit words.
  4. GPU-accelerated matching:
    • Each thread processes a probe descriptor against a gallery via parallel XOR and popcount.
    • Nearest neighbor indices and distances are written out in GPU memory.

Matching time per descriptor pair is approximately 0.5 ns (XOR + popcount, GPU), facilitating brute-force matching for large datasets (Parker et al., 2016).
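The matching step above reduces to XOR + popcount against every gallery descriptor. A minimal CPU-side Python sketch of that logic (the GPU version parallelizes the outer loop across threads; the toy 4-bit descriptors are hypothetical):

```python
def hamming(a, b):
    """XOR + popcount on bit-packed descriptors; a 512-bit descriptor's
    eight 64-bit words fit in one Python integer here."""
    return bin(a ^ b).count("1")

def match(probes, gallery):
    """Brute-force nearest neighbour: for each probe descriptor, return the
    index and Hamming distance of its closest gallery descriptor."""
    out = []
    for p in probes:
        dists = [hamming(p, g) for g in gallery]
        j = min(range(len(gallery)), key=dists.__getitem__)
        out.append((j, dists[j]))
    return out

gallery = [0b0000, 0b1111, 0b0110]
assert match([0b1011], gallery) == [(1, 1)]   # nearest is 0b1111, distance 1
```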

5. Performance Analysis and Reference Benchmarks

Extensive benchmarks highlight the computational gains of CLATCH and its impact on downstream reconstruction quality:

Descriptor             CPU Extraction (µs)   GPU Extraction (µs)   Size (bytes)
SIFT (128-float)       3,290                 –                     512
ORB (256-bit)          486                   0.5                   32
LATCH (512-bit)        616                   –                     64
PN-Net (128-float)     10,000                10                    512
CLATCH (512-bit)       –                     0.5                   64

Structure-from-motion (SfM) results demonstrate parity in reprojection RMSE (within ΔRMSE < 0.1 px) compared to SIFT and deep models, while the total runtime for multi-image 3D reconstruction is reduced by 10× (vs. SIFT) and 3× (vs. PN-Net) on high-resolution datasets. Descriptor matching and memory footprint scale efficiently for large-scale dense correspondence (Levi et al., 2015, Parker et al., 2016).

6. Integration and Application: Structure-from-Motion

CLATCH is directly integrated within OpenMVG and other SfM toolkits:

  • Pipeline: CUDA FAST detector → CLATCH extraction → GPU Hamming matcher.
  • No changes required in incremental bundle adjustment or 3D mesh refinement; CLATCH-matched correspondences are compatible with standard SfM processing.
  • Typical pipelines (e.g., ~5k × 3.7k images) execute in 15–20 s (GTX 1080), whereas SIFT- or DNN-based methods require 150–900 s for comparable reconstructions.
  • Completeness of 3D point recovery differs by less than 5% compared to SIFT (Parker et al., 2016).

CLATCH achieves extreme speedup for feature extraction and matching with negligible loss of accuracy. Recommended operational parameters:

  • Patch size k = 8 (with weight masking to reproduce LATCH's effective 7×7).
  • Descriptor length T = 512 bits (64 bytes).
  • CUDA block size: 16 descriptors; 2 blocks per streaming multiprocessor for occupancy.
  • Extraction time: 0.5 µs/descriptor (GTX 970M/1080).
  • Matching: XOR + popcount in hardware, 0.5 ns per comparison.
  • Applicable for real-time SLAM, dense correspondence, and large-scale photogrammetry.
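The parameters above admit a simple back-of-envelope throughput estimate. This sketch uses only the figures quoted in this section; the 10,000-keypoint workload is a hypothetical example:

```python
T_bits = 512
desc_bytes = T_bits // 8        # 64 bytes per descriptor
extract_us = 0.5                # per descriptor (GTX 970M/1080 figure above)
match_ns = 0.5                  # per XOR + popcount comparison

keypoints = 10_000              # hypothetical workload
extract_ms = keypoints * extract_us / 1000
match_ms = keypoints * keypoints * match_ns / 1e6   # brute-force all-pairs

assert desc_bytes == 64
assert extract_ms == 5.0        # ~5 ms to extract 10k descriptors
assert match_ms == 50.0         # ~50 ms for 10k x 10k brute-force matching
```

At these rates, brute-force matching remains practical at scales where float descriptors would require approximate nearest-neighbour indexing.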

An open-source reference implementation is available from the original authors at www.openu.ac.il/home/hassner/projects/LATCH (Parker et al., 2016).

When throughput and scalability of local feature descriptors are the priority (massively parallel, high-density matching; real-time requirements), CLATCH establishes state-of-the-art efficiency while delivering robust, competitive recognition rates, as substantiated in peer-reviewed benchmarks and empirical reconstructions.
