CUDA LATCH (CLATCH): GPU-Accelerated Descriptor
- CUDA LATCH (CLATCH) is a GPU-accelerated binary descriptor that leverages discriminative patch comparisons to efficiently extract local image features.
- It employs CUDA optimizations such as warp-shuffle reductions, shared memory use, and branchless bit setting to achieve real-time processing speeds.
- Integrated with SfM pipelines, CLATCH offers up to a 10× speedup compared to SIFT while maintaining competitive reconstruction quality.
CUDA LATCH (CLATCH) is a high-performance, GPU-accelerated variant of the LATCH (Learned Arrangements of Three Patch Codes) binary descriptor designed for rapid extraction and matching of local image features. The formulation and port to CUDA enable efficient real-time operation, substantially reducing computational time required for tasks such as structure-from-motion (SfM) and large-scale matching, while maintaining competitive recognition accuracy compared to established descriptors such as SIFT and deep-learned alternatives (Levi et al., 2015, Parker et al., 2016).
1. Mathematical Formulation and Descriptor Encoding
LATCH encodes local image appearance using binary strings derived from discriminative comparisons among patches. For each keypoint, consider a square window $\mathcal{W}$ of $64 \times 64$ pixels, rotated to a canonical orientation. The representation is a $T$-bit integer ($T = 512$) resulting from $T$ binary tests. Each test $t$ is defined by a triplet $(p_a, p_1, p_2)$, specifying the positions of three $k \times k$ patches in $\mathcal{W}$ (here $k = 8$).
Extract patch intensities $P_a$ (anchor) and $P_1, P_2$ (companions). The $t$-th bit is set as follows:

$$b_t = \begin{cases} 1 & \text{if } \|P_a - P_1\|_F^2 > \|P_a - P_2\|_F^2 \\ 0 & \text{otherwise} \end{cases}$$

with

$$\|P - Q\|_F^2 = \sum_{i,j} \left( P(i,j) - Q(i,j) \right)^2.$$

The descriptor is:

$$D = \sum_{t=1}^{T} b_t \, 2^{t-1}.$$
Descriptors are compared by Hamming distance, computed efficiently on bit-packed representations (Levi et al., 2015, Parker et al., 2016).
2. Learning Discriminative Patch Arrangements
The selection of patch triplet arrangements is performed offline via discriminative learning. Using the Brown et al. same/not-same patch benchmark, a large pool of candidate triplets is sampled. Each candidate $c$ is scored by its agreement with the ground-truth labels:

$$S(c) = \sum_{i} \mathbb{1}\left[ f_c(\mathcal{W}_i) = y_i \right],$$

where $\mathbb{1}[\cdot]$ indicates agreement between the candidate's binary response $f_c(\mathcal{W}_i)$ and the same/not-same label $y_i$. Candidates are ranked by $S(c)$ and greedily selected, controlling for feature redundancy by rejecting candidates whose Pearson correlation with an already-selected triplet exceeds a threshold. The final $T$ triplets maximize discriminability under label supervision.
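A minimal CPU sketch of this greedy, correlation-pruned selection (hypothetical names; it assumes each candidate has already been evaluated as a 0/1 response vector over the labeled pairs) could look like:

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

// Pearson correlation between two binary response vectors.
double pearson(const std::vector<int>& x, const std::vector<int>& y) {
    double n = (double)x.size(), sx = 0, sy = 0, sxx = 0, syy = 0, sxy = 0;
    for (size_t i = 0; i < x.size(); ++i) {
        sx += x[i]; sy += y[i];
        sxx += x[i] * x[i]; syy += y[i] * y[i]; sxy += x[i] * y[i];
    }
    double cov = sxy - sx * sy / n;
    double vx = sxx - sx * sx / n, vy = syy - sy * sy / n;
    return (vx > 0 && vy > 0) ? cov / std::sqrt(vx * vy) : 0.0;
}

// Score = number of pairs where the candidate's response agrees with the label.
int score(const std::vector<int>& resp, const std::vector<int>& labels) {
    int s = 0;
    for (size_t i = 0; i < resp.size(); ++i) s += (resp[i] == labels[i]);
    return s;
}

// Rank candidates by score, then greedily keep those whose absolute
// correlation with every already-chosen candidate stays below tau.
std::vector<int> greedySelect(const std::vector<std::vector<int>>& cands,
                              const std::vector<int>& labels,
                              size_t T, double tau) {
    std::vector<std::pair<int, int>> ranked;          // (score, index)
    for (size_t i = 0; i < cands.size(); ++i)
        ranked.push_back({score(cands[i], labels), (int)i});
    std::sort(ranked.rbegin(), ranked.rend());        // best score first
    std::vector<int> chosen;
    for (auto [s, i] : ranked) {
        bool redundant = false;
        for (int j : chosen)
            if (std::abs(pearson(cands[i], cands[j])) > tau) { redundant = true; break; }
        if (!redundant) chosen.push_back(i);
        if (chosen.size() == T) break;
    }
    return chosen;
}
```

The correlation check plays the same role as in BRIEF/ORB-style test selection: without it, the highest-scoring candidates tend to be near-duplicates of one another.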
3. GPU Architecture and Computational Optimizations
CLATCH maps naturally onto CUDA's architecture via data-parallelism and bitwise operations. Extraction employs a kernel grid where each thread block processes 16 descriptors. Within a block, warps (32 threads) cooperate on bit computations:
- Each warp computes a share of the descriptor bits (e.g., $512 / 32 = 16$ bits per thread).
- Precomputed triplet tables are cached in device or constant memory.
- Image patches are loaded into shared memory, with padding to avoid bank conflicts.
- Bilinear rotation into canonical orientation is achieved using texture fetches over $64 \times 64$ (or as specified) pixel windows.
- Arithmetic within patch comparisons is fully unrolled for register-level speed.
- Bit setting is branchless: `bit = (d1 > d2); word |= bit << (t & 63);`.
- Efficient Hamming matching is performed via packed XOR and hardware popcount instructions.
Coalesced memory layout and warp-shuffle reductions (using `__shfl_xor`) further optimize throughput. The table below summarizes main resource usage per 512-bit descriptor ($T$ tests over $k \times k$ patches):

| Operation | Count | Example (T=512, k=8) |
|---|---|---|
| Memory loads | $3Tk^2$ floats | 98,304 |
| Arithmetic ops | $6Tk^2$ FLOPs | 196,608 |
| Bit operations | $T$ | 512 |
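The branchless bit setting and word packing described above can be emulated on the CPU as follows (a hypothetical illustration of the technique, not the actual CLATCH kernel; it assumes the $2T$ patch distances have already been computed):

```cpp
#include <array>
#include <cstdint>

// Pack 512 test outcomes into eight 64-bit words with no conditional branch:
// d1[t] and d2[t] hold the two patch distances for test t.
std::array<uint64_t, 8> packBits(const std::array<float, 512>& d1,
                                 const std::array<float, 512>& d2) {
    std::array<uint64_t, 8> desc{};
    for (int t = 0; t < 512; ++t) {
        uint64_t bit = d1[t] > d2[t];        // branchless comparison -> 0 or 1
        desc[t >> 6] |= bit << (t & 63);     // word index t/64, bit index t%64
    }
    return desc;
}
```

Avoiding the branch matters on the GPU because divergent branches within a warp serialize execution; a comparison that materializes directly as a 0/1 integer keeps all 32 lanes in lockstep.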
4. Extraction and Matching Pipeline
The full feature pipeline consists of:
- FAST keypoint detection (GPU/OpenCV).
- Orientation estimation for detected keypoints.
- CLATCH extraction kernel:
- Partition keypoints into blocks of 16 (or as hardware allows).
- Load and rotate windows per keypoint, stored in shared memory.
- For each, compute 512 binary tests using patch-triplets; pack into eight 64-bit words.
- GPU-accelerated matching:
- Each thread processes a probe descriptor against a gallery via parallel XOR and popcount.
- Nearest neighbor indices and distances are written out in GPU memory.
Matching time per descriptor pair is approximately 0.5 ns (XOR + popcount on the GPU), facilitating brute-force matching for large datasets (Parker et al., 2016).
5. Performance Analysis and Reference Benchmarks
Extensive benchmarks highlight the computational gains of CLATCH and its impact on downstream reconstruction quality:
| Descriptor | CPU Extraction (µs) | GPU Extraction (µs) | Size (bytes) |
|---|---|---|---|
| SIFT (128-float) | 3,290 | – | 512 |
| ORB (256-bit) | 486 | 0.5 | 32 |
| LATCH (512-bit) | 616 | – | 64 |
| PN-Net (128-float) | 10,000 | 10 | 512 |
| CLATCH (512-bit) | – | 0.5 | 64 |
Structure-from-motion (SfM) results demonstrate parity in reprojection RMSE (differences within a fraction of a pixel) compared to SIFT and deep models, while total runtime for multi-image 3D reconstruction is reduced by roughly an order of magnitude relative to SIFT and by an even larger factor relative to PN-Net on high-resolution datasets. Descriptor matching and memory footprint scale efficiently to large-scale dense correspondence (Levi et al., 2015, Parker et al., 2016).
6. Integration and Application: Structure-from-Motion
CLATCH is directly integrated within OpenMVG and other SfM toolkits:
- Pipeline: CUDA FAST detector → CLATCH extraction → GPU Hamming matcher.
- No changes required in incremental bundle adjustment or 3D mesh refinement; CLATCH-matched correspondences are compatible with standard SfM processing.
- Typical pipelines (e.g., 5k × 3.7k-pixel images) execute in 15–20 s (GTX 1080), whereas SIFT- or DNN-based methods require 150–900 s for comparable reconstructions.
- Completeness of 3D point recovery differs by less than 5% compared to SIFT (Parker et al., 2016).
7. Best Practices and Recommended Configurations
CLATCH achieves substantial speedups for feature extraction and matching with negligible loss of accuracy. Recommended operational parameters:
- Patch window: $64 \times 64$ pixels (with weight masking to reproduce LATCH's effective $48 \times 48$ support).
- Descriptor length: $T = 512$ bits (64 bytes).
- CUDA block size: 16 descriptors per block; 2 blocks per streaming multiprocessor to maintain occupancy.
- Extraction time: ≈0.5 µs per descriptor (GTX 970M/1080).
- Matching: XOR + popcount in hardware, ≈0.5 ns per comparison.
- Applicable for real-time SLAM, dense correspondence, and large-scale photogrammetry.
An open-source reference implementation is available from the original author at www.openu.ac.il/home/hassner/projects/LATCH (Parker et al., 2016).
When throughput and scalability of local feature descriptors are the priority (massively parallel, high-density matching under real-time constraints), the CLATCH binary descriptor establishes state-of-the-art efficiency while delivering robust, competitive recognition rates, as substantiated in peer-reviewed benchmarks and empirical reconstructions.