CUDA LATCH (CLATCH): GPU-Accelerated Descriptor
- CUDA LATCH (CLATCH) is a GPU-accelerated binary descriptor that leverages discriminative patch comparisons to efficiently extract local image features.
- It employs CUDA optimizations such as warp-shuffle reductions, shared memory use, and branchless bit setting to achieve real-time processing speeds.
- Integrated with SfM pipelines, CLATCH offers up to a 10× speedup compared to SIFT while maintaining competitive reconstruction quality.
CUDA LATCH (CLATCH) is a high-performance, GPU-accelerated variant of the LATCH (Learned Arrangements of Three Patch Codes) binary descriptor designed for rapid extraction and matching of local image features. The formulation and port to CUDA enable efficient real-time operation, substantially reducing computational time required for tasks such as structure-from-motion (SfM) and large-scale matching, while maintaining competitive recognition accuracy compared to established descriptors such as SIFT and deep-learned alternatives (Levi et al., 2015, Parker et al., 2016).
1. Mathematical Formulation and Descriptor Encoding
LATCH encodes local image appearance using binary strings derived from discriminative comparisons among patches. For each keypoint, consider a square window $\mathcal{W}$ of $64 \times 64$ pixels, rotated to a canonical orientation. The representation is a $T$-bit integer ($T = 512$) resulting from $T$ binary tests. Each test $t$ is defined by a triplet $(p_a, p_1, p_2)$, specifying the positions of three $k \times k$ patches in $\mathcal{W}$ (here $k = 8$).
Extract patch intensities $P_a$ (anchor) and $P_1, P_2$ (companions). The $t$-th bit is set as follows:

$$b_t = \begin{cases} 1 & \text{if } \|P_a - P_1\|_F^2 > \|P_a - P_2\|_F^2 \\ 0 & \text{otherwise} \end{cases}$$

with

$$\|P - Q\|_F^2 = \sum_{i,j} \left( P(i,j) - Q(i,j) \right)^2.$$

The descriptor is:

$$D = \sum_{t=1}^{T} b_t \, 2^{t-1}.$$
Descriptors are compared by Hamming distance, computed efficiently on bit-packed representations (Levi et al., 2015, Parker et al., 2016).
2. Learning Discriminative Patch Arrangements
The selection of patch triplet arrangements is performed offline via discriminative learning. Using the Brown et al. same/not-same patch benchmark, a large pool of candidate triplets is sampled. Each candidate $c$ is scored by its agreement with the ground-truth labels:

$$S(c) = \sum_{i} \mathbb{1}\left[ f_c(\mathcal{W}_i) = y_i \right],$$

where $\mathbb{1}[\cdot]$ indicates agreement between the candidate's binary response $f_c(\mathcal{W}_i)$ and the same/not-same label $y_i$. Candidates are ranked by $S(c)$ and greedily selected, controlling for feature redundancy by rejecting candidates whose Pearson correlation with an already-selected triplet exceeds a threshold. The final $T$ triplets maximize discriminability under label supervision.
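A minimal CPU sketch of this greedy, correlation-pruned selection (hypothetical names; it assumes each candidate has already been evaluated as a 0/1 response vector over the labeled pairs) could look like:

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

// Pearson correlation between two binary response vectors.
double pearson(const std::vector<int>& x, const std::vector<int>& y) {
    double n = (double)x.size(), sx = 0, sy = 0, sxx = 0, syy = 0, sxy = 0;
    for (size_t i = 0; i < x.size(); ++i) {
        sx += x[i]; sy += y[i];
        sxx += x[i] * x[i]; syy += y[i] * y[i]; sxy += x[i] * y[i];
    }
    double cov = sxy - sx * sy / n;
    double vx = sxx - sx * sx / n, vy = syy - sy * sy / n;
    return (vx > 0 && vy > 0) ? cov / std::sqrt(vx * vy) : 0.0;
}

// Score = number of pairs where the candidate's response agrees with the label.
int score(const std::vector<int>& resp, const std::vector<int>& labels) {
    int s = 0;
    for (size_t i = 0; i < resp.size(); ++i) s += (resp[i] == labels[i]);
    return s;
}

// Rank candidates by score, then greedily keep those whose absolute
// correlation with every already-chosen candidate stays below tau.
std::vector<int> greedySelect(const std::vector<std::vector<int>>& cands,
                              const std::vector<int>& labels,
                              size_t T, double tau) {
    std::vector<std::pair<int, int>> ranked;          // (score, index)
    for (size_t i = 0; i < cands.size(); ++i)
        ranked.push_back({score(cands[i], labels), (int)i});
    std::sort(ranked.rbegin(), ranked.rend());        // best score first
    std::vector<int> chosen;
    for (auto [s, i] : ranked) {
        bool redundant = false;
        for (int j : chosen)
            if (std::abs(pearson(cands[i], cands[j])) > tau) { redundant = true; break; }
        if (!redundant) chosen.push_back(i);
        if (chosen.size() == T) break;
    }
    return chosen;
}
```

The correlation check plays the same role as in BRIEF/ORB-style test selection: without it, the highest-scoring candidates tend to be near-duplicates of one another.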
3. GPU Architecture and Computational Optimizations
CLATCH maps naturally onto CUDA's architecture via data-parallelism and bitwise operations. Extraction employs a kernel grid where each thread block processes 16 descriptors. Within a block, warps (32 threads) cooperate on bit computations:
- Each warp computes a share of the descriptor bits (e.g., $512 / 32 = 16$ bits per thread).
- Precomputed triplet tables are cached in device or constant memory.
- Image patches are loaded into shared memory, with padding to avoid bank conflicts.
- Bilinear rotation into canonical orientation is achieved using texture fetches over $64 \times 64$ (or as specified) pixel windows.
- Arithmetic within patch comparisons is fully unrolled for register-level speed.
- Bit setting is branchless: `bit = (d1 > d2); word |= bit << (t & 63);`.
- Efficient Hamming matching is performed via packed XOR and hardware popcount instructions.
Coalesced memory layout and warp-shuffle reductions (using `__shfl_xor`) further optimize throughput. The table below summarizes main resource usage per 512-bit descriptor ($T$ tests over $k \times k$ patches):

| Operation | Count | Example (T=512, k=8) |
|---|---|---|
| Memory loads | $3Tk^2$ floats | 98,304 |
| Arithmetic ops | $6Tk^2$ FLOPs | 196,608 |
| Bit operations | $T$ | 512 |
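The branchless bit setting and word packing described above can be emulated on the CPU as follows (a hypothetical illustration of the technique, not the actual CLATCH kernel; it assumes the $2T$ patch distances have already been computed):

```cpp
#include <array>
#include <cstdint>

// Pack 512 test outcomes into eight 64-bit words with no conditional branch:
// d1[t] and d2[t] hold the two patch distances for test t.
std::array<uint64_t, 8> packBits(const std::array<float, 512>& d1,
                                 const std::array<float, 512>& d2) {
    std::array<uint64_t, 8> desc{};
    for (int t = 0; t < 512; ++t) {
        uint64_t bit = d1[t] > d2[t];        // branchless comparison -> 0 or 1
        desc[t >> 6] |= bit << (t & 63);     // word index t/64, bit index t%64
    }
    return desc;
}
```

Avoiding the branch matters on the GPU because divergent branches within a warp serialize execution; a comparison that materializes directly as a 0/1 integer keeps all 32 lanes in lockstep.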
4. Extraction and Matching Pipeline
The full feature pipeline consists of:
- FAST keypoint detection (GPU/OpenCV).
- Orientation estimation for detected keypoints.
- CLATCH extraction kernel:
- Partition keypoints into blocks of 16 (or as hardware allows).
- Load and rotate windows per keypoint, stored in shared memory.
- For each, compute 512 binary tests using patch-triplets; pack into eight 64-bit words.
- GPU-accelerated matching:
- Each thread processes a probe descriptor against a gallery via parallel XOR and popcount.
- Nearest neighbor indices and distances are written out in GPU memory.
Matching time per descriptor pair is approximately 0.5 ns (XOR + popcount on the GPU), facilitating brute-force matching for large datasets (Parker et al., 2016).
5. Performance Analysis and Reference Benchmarks
Extensive benchmarks highlight the computational gains of CLATCH and its impact on downstream reconstruction quality:
| Descriptor | CPU Extraction (µs) | GPU Extraction (µs) | Size (bytes) |
|---|---|---|---|
| SIFT (128-float) | 3,290 | – | 512 |
| ORB (256-bit) | 486 | 0.5 | 32 |
| LATCH (512-bit) | 616 | – | 64 |
| PN-Net (128-float) | 10,000 | 10 | 512 |
| CLATCH (512-bit) | – | 0.5 | 64 |
Structure-from-motion (SfM) results demonstrate parity in reprojection RMSE (differences within a fraction of a pixel) compared to SIFT and deep models, while total runtime for multi-image 3D reconstruction is reduced by roughly an order of magnitude relative to SIFT and by an even larger factor relative to PN-Net on high-resolution datasets. Descriptor matching and memory footprint scale efficiently to large-scale dense correspondence (Levi et al., 2015, Parker et al., 2016).
6. Integration and Application: Structure-from-Motion
CLATCH is directly integrated within OpenMVG and other SfM toolkits:
- Pipeline: CUDA FAST detector → CLATCH extraction → GPU Hamming matcher.
- No changes required in incremental bundle adjustment or 3D mesh refinement; CLATCH-matched correspondences are compatible with standard SfM processing.
- Typical pipelines (e.g., 5k × 3.7k-pixel images) execute in 15–20 s (GTX 1080), whereas SIFT- or DNN-based methods require 150–900 s for comparable reconstructions.
- Completeness of 3D point recovery differs by less than 5% compared to SIFT (Parker et al., 2016).
7. Best Practices and Recommended Configurations
CLATCH achieves substantial speedups for feature extraction and matching with negligible loss of accuracy. Recommended operational parameters:
- Patch window: $64 \times 64$ pixels (with weight masking to reproduce LATCH's effective $48 \times 48$ support).
- Descriptor length: $T = 512$ bits (64 bytes).
- CUDA block size: 16 descriptors per block; 2 blocks per streaming multiprocessor to maintain occupancy.
- Extraction time: ≈0.5 µs per descriptor (GTX 970M/1080).
- Matching: XOR + popcount in hardware, ≈0.5 ns per comparison.
- Applicable for real-time SLAM, dense correspondence, and large-scale photogrammetry.
An open-source reference implementation is available from the original author at www.openu.ac.il/home/hassner/projects/LATCH (Parker et al., 2016).
When throughput and scalability of local feature descriptors are the priority (massively parallel, high-density matching under real-time constraints), the CLATCH binary descriptor establishes state-of-the-art efficiency while delivering robust, competitive recognition rates, as substantiated in peer-reviewed benchmarks and empirical reconstructions.