CLIP$^2$: Contrastive Language-Image-Point Pretraining from Real-World Point Cloud Data

Published 22 Mar 2023 in cs.CV | (2303.12417v2)

Abstract: Contrastive Language-Image Pre-training, benefiting from large-scale unlabeled text-image pairs, has demonstrated great performance in open-world vision understanding tasks. However, due to the limited Text-3D data pairs, adapting the success of 2D Vision-Language Models (VLM) to the 3D space remains an open problem. Existing works that leverage VLM for 3D understanding generally resort to constructing intermediate 2D representations for the 3D data, but at the cost of losing 3D geometry information. To take a step toward open-world 3D vision understanding, we propose Contrastive Language-Image-Point Cloud Pretraining (CLIP$^2$) to directly learn the transferable 3D point cloud representation in realistic scenarios with a novel proxy alignment mechanism. Specifically, we exploit naturally-existed correspondences in 2D and 3D scenarios, and build well-aligned and instance-based text-image-point proxies from those complex scenarios. On top of that, we propose a cross-modal contrastive objective to learn semantic and instance-level aligned point cloud representation. Experimental results on both indoor and outdoor scenarios show that our learned 3D representation has great transfer ability in downstream tasks, including zero-shot and few-shot 3D recognition, which boosts the state-of-the-art methods by large margins. Furthermore, we provide analyses of the capability of different representations in real scenarios and present the optional ensemble scheme.

Summary

The paper introduces a novel contrastive pretraining framework that directly aligns 3D point clouds with open-vocabulary language and images without human curation.
It employs dual-level semantic and instance alignment, yielding state-of-the-art zero-shot performance across diverse indoor and outdoor datasets.
The framework demonstrates robust generalization, setting new benchmarks in 3D recognition and paving the way for flexible, open-world perception systems.

Introduction

CLIP $^2$ proposes a paradigm for open-world 3D vision that extends large-scale vision-language model (VLM) pretraining to the 3D point cloud domain. It targets two limitations of previous approaches: they either require expensive 3D annotations or rely on indirect 2D projections, which sacrifice geometric information and generalization capability. CLIP $^2$ introduces a scalable data collection strategy for language-image-point proxies, enabling direct alignment across the three modalities and zero-shot recognition.

Framework Overview and Triplet Proxy Generation

CLIP $^2$ comprises two interleaved components: triplet proxy collection and cross-modal contrastive pretraining. The framework leverages unlabelled real-world data, exploiting naturally occurring correspondences among 3D point clouds, 2D images, and open-vocabulary language. The authors establish proxies by applying open-set detectors to extract image proposals, automatically linking them with language descriptors and localizing the corresponding 3D point clusters via precise geometric transformations for both indoor (RGB-D) and outdoor (LiDAR-camera) scenarios.
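The outdoor (LiDAR-camera) linking step described above can be sketched roughly as follows. This is a minimal illustration, assuming a pinhole camera with known intrinsics and a rigid LiDAR-to-camera transform. The function names (`transform`, `project`, `points_in_box`) and the box format are hypothetical, and the paper's actual pipeline may include further steps, such as clustering or depth filtering, that are not shown here.

```python
from typing import List, Optional, Sequence, Tuple

Point = Tuple[float, float, float]
Box = Tuple[float, float, float, float]  # (u_min, v_min, u_max, v_max) in pixels


def transform(point: Point,
              rotation: Sequence[Sequence[float]],
              translation: Sequence[float]) -> Point:
    """Apply a rigid transform (assumed: sensor frame -> camera frame)."""
    x, y, z = point
    return tuple(
        rotation[i][0] * x + rotation[i][1] * y + rotation[i][2] * z + translation[i]
        for i in range(3)
    )


def project(point_cam: Point, fx: float, fy: float,
            cx: float, cy: float) -> Optional[Tuple[float, float]]:
    """Pinhole projection; returns None for points at or behind the camera."""
    x, y, z = point_cam
    if z <= 0:
        return None
    return (fx * x / z + cx, fy * y / z + cy)


def points_in_box(points: Sequence[Point], box: Box,
                  rotation: Sequence[Sequence[float]],
                  translation: Sequence[float],
                  intrinsics: Tuple[float, float, float, float]) -> List[Point]:
    """Keep the 3D points whose image projection falls inside a 2D proposal box."""
    fx, fy, cx, cy = intrinsics
    u_min, v_min, u_max, v_max = box
    selected = []
    for p in points:
        uv = project(transform(p, rotation, translation), fx, fy, cx, cy)
        if uv is not None and u_min <= uv[0] <= u_max and v_min <= uv[1] <= v_max:
            selected.append(p)
    return selected


# Toy usage with an identity extrinsic transform (illustrative values only).
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
cloud = [(0.0, 0.0, 2.0), (1.0, 0.0, 2.0), (0.0, 0.0, -1.0), (-1.0, -1.0, 4.0)]
proxy_points = points_in_box(cloud, (40.0, 40.0, 60.0, 60.0),
                             identity, [0.0, 0.0, 0.0], (100.0, 100.0, 50.0, 50.0))
```

The selected points, the image crop under the box, and the detector's text label would then form one language-image-point triplet.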

Figure 1: Schematic of the CLIP $^2$ architecture: triplet proxy collection and joint cross-modal contrastive learning.

Figure 2: Visualization of the triplet proxy generation pipeline for aligning text, image proposals, and 3D point clouds.

By amassing over 1.6 million language-image-point triplets from large-scale datasets without human curation, the method provides broad semantic and geometric coverage and avoids the annotation bottleneck of prior work.

Figure 3: Examples of multi-modal representation for two 3D objects in diverse environments: raw point cloud, projected depth maps, and image patches.

For representation learning, CLIP $^2$ implements a dual-level alignment objective:

Semantic-level: Point cloud instances are aligned to canonical embeddings of open-vocabulary language queries, facilitating compositional and zero-shot generalization.
Instance-level: The same point clouds are additionally matched to visual embeddings from corresponding image proposals, further regularizing feature consistency and improving robustness to spatial variation and incomplete or occluded observations.
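To make the two alignment levels concrete, here is a minimal sketch of a combined objective. This summary does not give the exact form of the paper's losses, so a standard InfoNCE loss is swapped in for both levels. The names `info_nce` and `dual_level_loss`, the temperature of 0.07, and the equal weighting are all assumptions for illustration.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def normalize(v: Vector) -> List[float]:
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def info_nce(anchors: Sequence[Vector], targets: Sequence[Vector],
             temperature: float = 0.07) -> float:
    """Mean cross-entropy of matching anchors[i] to targets[i] against all targets."""
    a = [normalize(v) for v in anchors]
    t = [normalize(v) for v in targets]
    total = 0.0
    for i, ai in enumerate(a):
        logits = [sum(x * y for x, y in zip(ai, tj)) / temperature for tj in t]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denominator - logits[i]
    return total / len(a)


def dual_level_loss(point_emb: Sequence[Vector], text_emb: Sequence[Vector],
                    image_emb: Sequence[Vector],
                    semantic_weight: float = 1.0,
                    instance_weight: float = 1.0) -> float:
    """Semantic-level (point-text) plus instance-level (point-image) alignment."""
    semantic = info_nce(point_emb, text_emb)
    # Symmetric form for the instance level: point->image and image->point.
    instance = 0.5 * (info_nce(point_emb, image_emb) + info_nce(image_emb, point_emb))
    return semantic_weight * semantic + instance_weight * instance


# Toy batch of two proxies with 2-D embeddings (illustrative values only).
points = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[1.0, 0.0], [0.0, 1.0]]
swapped = [[0.0, 1.0], [1.0, 0.0]]
loss_good = dual_level_loss(points, aligned, aligned)
loss_bad = dual_level_loss(points, swapped, swapped)
```

One caveat of this simplification: when several proxies in a batch share the same text label, plain InfoNCE treats them as negatives of each other, so a practical semantic-level loss would likely need to group same-label proxies as positives.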

Instead of compressing 3D data into limited 2D views or depth maps, CLIP $^2$ directly encodes point cloud geometry, maintaining fine-grained structural information crucial for realistic scenarios.

Figure 4: Comparative schematic of pretraining paradigms for 3D recognition: legacy methods (via depth or image alignment) vs. CLIP $^2$'s alignment of raw point clouds with both image and language modalities.

Experimental Results

CLIP $^2$ achieves state-of-the-art zero-shot transfer performance across a spectrum of indoor and outdoor datasets, with marked improvements over contemporary baselines:

SUNRGB-D (Indoor): Top-1 accuracy of 61.3%, a margin of more than 5% over the previous best, and up to 69.6% with modality ensembling.
ScanNet (Indoor): Top-1 accuracy of 38.5%, outperforming all alternatives. The gains in Top-5 accuracy grow as the label set expands in large open-vocabulary settings (e.g., +15.6% over prior work for 384 classes).
nuScenes and ONCE (Outdoor): Average Top-1 accuracy of 37.8% on nuScenes, exceeding depth-based approaches by >20%.
ScanObjectNN (Few-Shot): Outperforms earlier 3D pretraining paradigms (CrossPoint, PointCLIP, CLIP2Point) by a significant margin in both zero-shot and few-shot settings.

Figure 5: Illustrative open-world recognition examples: point cloud features robustly correlated with textual concepts in both indoor and outdoor environments.
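The Top-1 and Top-5 figures reported above come from zero-shot classification, where a point cloud feature is compared against text embeddings of the candidate class names. The sketch below shows how such a metric is typically computed, assuming cosine similarity as the score. The helper names and the toy embeddings are hypothetical and do not come from the paper.

```python
import math
from typing import Dict, List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def top_k_labels(point_feature: Vector, text_features: Dict[str, Vector],
                 k: int) -> List[str]:
    """Rank class names by similarity to the point feature and keep the best k."""
    ranked = sorted(text_features,
                    key=lambda name: cosine(point_feature, text_features[name]),
                    reverse=True)
    return ranked[:k]


def top_k_accuracy(point_features: Sequence[Vector], labels: Sequence[str],
                   text_features: Dict[str, Vector], k: int) -> float:
    """Fraction of samples whose true label is among the k best-scoring classes."""
    hits = sum(label in top_k_labels(feature, text_features, k)
               for feature, label in zip(point_features, labels))
    return hits / len(labels)


# Toy text embeddings for three class prompts (illustrative values only).
text = {"chair": [1.0, 0.0, 0.0], "table": [0.0, 1.0, 0.0], "sofa": [0.8, 0.0, 0.6]}
features = [[1.0, 0.2, 0.0], [0.1, 1.0, 0.0], [0.9, 0.0, 0.2]]
truth = ["chair", "table", "sofa"]
top1 = top_k_accuracy(features, truth, text, k=1)
top2 = top_k_accuracy(features, truth, text, k=2)
```

In this toy case the third sample is ranked "chair" first and "sofa" second, so it counts as a miss at Top-1 but a hit at Top-2, which is why Top-k accuracy with larger k is the more forgiving metric as the label set grows.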

Figure 6: Zero-shot localization and recognition: CLIP $^2$ identifies and classifies open-vocabulary objects in complex 3D scenes with no supervision.

Figure 7: Generalization to out-of-vocabulary categories in nuScenes: localization and recognition of objects beyond ground-truth annotations, such as ‘Tire’ and ‘Debris’.

Ablation Analysis: Direct alignment in the point cloud space yields consistently superior results over projected depth approaches; joint image-language-point supervision contributes additional gains.
Ensembling: Incorporating predictions from image and depth modalities with the primary 3D representation further boosts performance, especially when all signals are available.
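One simple way to combine modalities is to average per-modality class probabilities. The sketch below assumes this late-fusion scheme with optional weights; it is not confirmed to be the ensemble used in the paper, and the function names and modality keys are hypothetical.

```python
import math
from typing import Dict, List, Optional, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of class scores."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


def ensemble(logits_by_modality: Dict[str, Sequence[float]],
             weights: Optional[Dict[str, float]] = None) -> List[float]:
    """Weighted average of per-modality class probabilities (assumed scheme)."""
    names = list(logits_by_modality)
    if weights is None:
        weights = {name: 1.0 for name in names}  # uniform weighting by default
    weight_sum = sum(weights[name] for name in names)
    num_classes = len(logits_by_modality[names[0]])
    fused = [0.0] * num_classes
    for name in names:
        probs = softmax(logits_by_modality[name])
        for c, p in enumerate(probs):
            fused[c] += weights[name] * p / weight_sum
    return fused


# Toy scores for three classes from three modalities (illustrative values only).
scores = {
    "point": [2.0, 1.0, 0.0],
    "image": [0.0, 3.0, 0.0],
    "depth": [1.0, 1.0, 1.0],
}
fused = ensemble(scores)
fused_prediction = fused.index(max(fused))
point_prediction = scores["point"].index(max(scores["point"]))
```

Here the point-only prediction is class 0, while the fused prediction is class 1 because the image modality is confident about it. Averaging probabilities rather than raw logits keeps modalities with different score scales comparable.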

Qualitative Insights

Extensive visualizations demonstrate CLIP $^2$'s robust zero-shot open-world recognition, including saliency analyses revealing fine-grained alignment between text prompts and 3D spatial structure, and detection of novel, long-tail, and previously unlabelled object categories.

Figure 8: Saliency maps evidencing alignment between text semantics and 3D geometry in point cloud scenes.

Figure 9: Additional zero-shot recognition visualizations on SUNRGB-D, illustrating the framework’s robustness in uncurated, cluttered, real-world environments.

Discussion and Implications

The results substantiate that CLIP $^2$ addresses a critical gap in 3D representation learning: obtaining transferable, annotation-free embeddings tied to open-vocabulary semantics without geometric information loss. This enables:

Open-world object discovery in safety-critical systems (e.g., autonomous driving, robotics) where taxonomies are inherently incomplete and semantic labels costly or unavailable.
Zero-shot and few-shot learning in new environments with arbitrary categories, powering rapid adaptation.
Modality-agnostic fusion: The presented ensembling analysis suggests future frameworks could further integrate signals from all available sensory inputs for improved reliability.

By highlighting the centrality of genuine 3D geometric representation aligned with semantic priors, CLIP $^2$ establishes a foundation for more capable and adaptive 3D perception systems.

Limitations and Future Outlook

The current framework exhibits limited spatial resolution for tight bounding box regression, as localization is based on point clusters rather than explicit box fitting. However, the approach naturally complements and can seed supervised or hybrid 3D detectors for downstream tasks. Scaling proxy generation via further multimodal mining, as well as extending the backbone architectures, offers clear avenues for enhancing performance and applicability.

The work implies a trajectory where future open-world 3D systems will unify cross-sensor, cross-modal data, leveraging Internet-scale language corpora and geometry-aware models to achieve semantic completeness and robust generalization in complex, dynamic environments.

Conclusion

CLIP $^2$ delivers a comprehensive solution for open-world zero-shot 3D recognition, outperforming prior art by directly encoding point cloud data and aligning it with both visual and linguistic representations. With scalable, annotation-free proxy generation and flexible cross-modal objectives, it significantly advances the state of transferable 3D learning and sets the groundwork for universal, open-vocabulary 3D perception.