ClipGS-VR: 3D Gaussian Splatting in VR

Updated 3 February 2026
  • ClipGS-VR is a family of techniques that combines 3D Gaussian Splatting with CLIP-based vision-language alignment to enable immersive visualization and multi-modal interaction in VR.
  • It employs offline precomputation and efficient GPU-based rendering to manage complex volumetric data on mobile VR hardware, optimizing performance and memory usage.
  • Empirical evaluations demonstrate state-of-the-art results in medical visualization, cross-modal retrieval, and VR sketch-guided 3D generation with interactive frame rates and high-fidelity outputs.

ClipGS-VR refers to a family of techniques and frameworks that leverage 3D Gaussian Splatting in combination with CLIP-based vision-language alignment for interactive and immersive visualization, retrieval, and generation of volumetric data in virtual reality (VR). These approaches address the challenges of high-fidelity 3D rendering and efficient multi-modal interaction on VR hardware, especially mobile headsets with limited computational resources. ClipGS-VR encompasses (1) unified cinematic visualization pipelines for volumetric medical data on mobile VR, (2) vision-language-driven 3D content retrieval, and (3) multi-modal VR sketch-guided 3D object generation—each exploiting Gaussian Splatting representations with explicit CLIP integration (Tong et al., 27 Jan 2026, Jiao et al., 2024, Gu et al., 16 Mar 2025).

1. Architectural Principles and Data Pipeline

ClipGS-VR's core is the explicit 3D Gaussian Splatting (3DGS) representation, where volumes or object surfaces are modeled as a set of Gaussians, each parameterized by a spatial mean μ ∈ ℝ³, covariance Σ, color, and opacity α. In medical visualization, as in "ClipGS-VR: Immersive and Interactive Cinematic Visualization of Volumetric Medical Data in Mobile Virtual Reality," this model is extended and optimized for consumer VR with the following key strategies (Tong et al., 27 Jan 2026):

  • Offline Preprocessing: Rather than relying on runtime neural networks, all learnable adjustments (e.g., from ClipGS modules for truncation and deformation) are precomputed for a quantized set of slicing planes (e.g., M = 200 orientations). The per-state outputs—including visibility and deformation—are baked into the Gaussian attributes, forming discrete "rendering layers."
  • Consolidated Layered Storage: The static and dynamic (slice-local) Gaussians are organized into one unified asset (~40 MB/case), with a single copy for shared data and differential "delta" layers for high-fidelity transitions around the slicing planes.
  • Shader-Based Interpolation and Opacity Modulation: At runtime, the nearest precomputed layers are selected and blended on-GPU, with per-Gaussian opacity smoothly modulated using a gradient function based on plane-centric signed distances:

σ = clamp(1/2 + (μ·n − c) / (2 s_n), 0, 1),   α′ = α · σ

This yields anti-aliased, visually coherent cross-sections even at arbitrary slicing angles.
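The opacity modulation above is simple to express in code. The following is a minimal NumPy sketch of that formula; the function name and array shapes are illustrative, not from the paper:

```python
import numpy as np

def modulated_opacity(mu, alpha, n, c, s_n):
    """Soft opacity falloff around a slicing plane.

    mu    : (N, 3) Gaussian means
    alpha : (N,)   base opacities
    n     : (3,)   unit plane normal
    c     : float  plane offset along n
    s_n   : float  transition half-width controlling the gradient
    """
    signed_dist = mu @ n - c                                  # plane-centric signed distance
    sigma = np.clip(0.5 + signed_dist / (2.0 * s_n), 0.0, 1.0)
    return alpha * sigma                                      # alpha' = alpha * sigma
```

Gaussians far on the visible side of the plane keep their full opacity, those far on the clipped side vanish, and those within the transition band fade smoothly, which is what produces the anti-aliased cross-section.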

For multi-modal retrieval ("CLIP-GS: Unifying Vision-Language Representation with 3D Gaussian Splatting"), the pipeline additionally includes a 3DGS encoder (GS Tokenizer + transformer), ViT-based image encoder, and CLIP text encoder, producing normalized embeddings for downstream tasks (Jiao et al., 2024). Triplet-based training enforces cross-modal alignment.

In VR sketch-guided generation ("VRsketch2Gaussian"), a dedicated Sketch-CLIP alignment module embeds sparse VR line drawings into the shared CLIP-3DGS space, and a diffusion-based generative model synthesizes 3D Gaussians conditioned on sketch and text features (Gu et al., 16 Mar 2025).

2. Interactive Rendering and Mobile VR Optimization

ClipGS-VR achieves interactive frame rates (~72 FPS at <20 ms latency on a Meta Quest 3) by eliminating all runtime neural inference and relying on fast GPU-based rendering of precomputed Gaussian layers (Tong et al., 27 Jan 2026). The inference pipeline proceeds as follows:

  1. Input from the 6-DoF VR controller, yielding slicing plane parameters (n, c).
  2. Determination of the two nearest discrete slicing states, fetching layer data for GPU blending.
  3. Dynamic interpolation and gradient-based opacity adjustment per Gaussian, achieving artifact-free anti-aliased rendering.
  4. Integrated UI and VR locomotion for immersive manipulation (e.g., snap-turn, smooth move, anatomical focusing).

Through hierarchical storage, shared spherical harmonics, and culling, GPU memory remains tightly bounded (e.g., 40 MB data + 150 MB buffer), enabling deployment on stand-alone mobile hardware.
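Steps 2–3 above amount to a nearest-neighbor lookup over the baked states followed by linear blending. A simplified NumPy sketch, using a hypothetical single-angle parameterization of the slicing state for brevity:

```python
import numpy as np

def blend_nearest_layers(plane_angle, layer_angles, layer_opacities):
    """Select the two nearest precomputed slicing states and linearly
    blend their baked per-Gaussian opacities (illustrative sketch).

    plane_angle     : float, current slicing-plane orientation (radians)
    layer_angles    : (M,) sorted angles of the baked states
    layer_opacities : (M, N) baked opacity per state per Gaussian
    """
    i = np.searchsorted(layer_angles, plane_angle)   # index of upper neighbor
    i = np.clip(i, 1, len(layer_angles) - 1)
    lo, hi = layer_angles[i - 1], layer_angles[i]
    t = (plane_angle - lo) / (hi - lo)               # blend weight in [0, 1]
    return (1.0 - t) * layer_opacities[i - 1] + t * layer_opacities[i]
```

In the actual system this blend runs in a shader over all baked Gaussian attributes, not just opacity, and the state index is derived from the full (n, c) plane parameters.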

3. Vision-Language Alignment and Multi-Modal VR Retrieval

In the vision-language paradigm, CLIP-GS (and its VR adaptation) fuse 3DGS with CLIP embedding spaces for cross-modal retrieval and understanding (Jiao et al., 2024). The main processes are:

  • GS Tokenizer: Converts variable-sized 3D Gaussian sets into a sequence of tokens via Farthest Point Sampling, patch-wise 1×3 convolutions, and a point-cloud MLP. The tokens are aggregated with learnable positional embeddings in a transformer, yielding the [CLS] global 3DGS embedding E^g.
  • Contrastive and Voting Losses: The 3DGS encoder is trained to match CLIP-derived text (E^T) and image (E^I) embeddings via a symmetric contrastive loss and an image-voting mechanism for more view-consistent alignment.
  • Triplet Training: Constructed over large-scale datasets (e.g., 240K meshes), each training step involves sampling triplets, encoding them via the respective modules, and backpropagating through the 3DGS branch only.
  • Latency Optimization for VR: Model size and patch count may be reduced (g ≈ 64, INT8 quantization) for on-device inference. Pre-computed embeddings and FAISS indexing allow sub-10 ms cross-modal lookup, enabling speech-driven and pointing-based retrieval in VR.
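The symmetric contrastive objective can be illustrated with a minimal NumPy sketch; the paper's exact loss, temperature, and image-voting term may differ from this simplified version:

```python
import numpy as np

def symmetric_contrastive_loss(E_g, E_t, temperature=0.07):
    """CLIP-style symmetric InfoNCE between 3DGS embeddings E_g and
    text embeddings E_t, where row i of each matrix is a matched pair."""
    # L2-normalize so dot products are cosine similarities
    E_g = E_g / np.linalg.norm(E_g, axis=1, keepdims=True)
    E_t = E_t / np.linalg.norm(E_t, axis=1, keepdims=True)
    logits = E_g @ E_t.T / temperature          # (B, B) similarity matrix
    idx = np.arange(len(E_g))

    def xent(l):                                # cross-entropy on the diagonal
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[idx, idx].mean()

    # Average the 3DGS-to-text and text-to-3DGS directions
    return 0.5 * (xent(logits) + xent(logits.T))
```

Perfectly aligned pairs (diagonal dominance in the similarity matrix) drive the loss toward zero, while mismatched pairings are penalized in both retrieval directions.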

4. Multi-Modal VR Sketch-Guided 3D Generation

In the VRsketch2Gaussian approach, the pipeline consists of two main stages (Gu et al., 16 Mar 2025):

  • Stage 1: Sketch-CLIP Feature Alignment. A dual-phase contrastive pre-training and fine-tuning regime first aligns dense ShapeNet point clouds with CLIP image/text embeddings, then aligns VR sketch point clouds (via a lightweight Point-BERT-like transformer encoder) into the same embedding space using combined symmetric contrastive and triplet losses. This ensures that sparse VR sketches map compatibly onto visual and textual descriptions.
  • Stage 2: Conditioned 3DGS Generation. The fused embedding (sketch + text) conditions a 3D U-Net-based diffusion model, which generates a grid of Gaussian parameters; differentiable rasterization and additional RGB/depth/normal matching losses refine the output. The Efficient Constrained Densification algorithm keeps the number of output Gaussians near an optimal count.
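A margin-based triplet loss of the kind used in Stage 1 can be sketched as follows; the margin value and normalization details here are assumptions, not the paper's exact settings:

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Margin-based triplet loss on L2-normalized embeddings: pull the
    anchor (e.g., a VR sketch embedding) toward its matching positive
    (e.g., the CLIP embedding of the same shape) and away from a negative."""
    def normalize(x):
        return x / np.linalg.norm(x, axis=-1, keepdims=True)

    a, p, n = normalize(anchor), normalize(positive), normalize(negative)
    d_ap = np.linalg.norm(a - p, axis=-1)       # anchor-positive distance
    d_an = np.linalg.norm(a - n, axis=-1)       # anchor-negative distance
    return np.maximum(d_ap - d_an + margin, 0.0).mean()
```

The loss is zero once each anchor is closer to its positive than to its negative by at least the margin, which is what keeps sparse-sketch embeddings inside the shared CLIP-3DGS space rather than collapsing toward unrelated shapes.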

The VRSS dataset, supporting this research, includes over 2K VR sketch/text/image/3DGS/point cloud tuples across 55 categories, enabling quantitative evaluation for retrieval, classification, and 3D generation.

5. Empirical Evaluation and Quantitative Results

Across high-fidelity VR rendering and multi-modal retrieval/generation benchmarks, ClipGS-VR reports state-of-the-art empirical performance:

  • Medical Visualization (ClipGS-VR on Quest 3): In uniaxial slicing, PSNR = 33.40 dB and SSIM = 0.9698 (+2.85 dB PSNR and +0.0156 SSIM over the hard-cut baseline); maintains 72 FPS with <20 ms latency and a 40 MB memory footprint (Tong et al., 27 Jan 2026).
  • Usability: In clinical tasks, 6-DoF arbitrary slicing yields SUS = 88.2 ± 9.4 and Efficiency = 4.7 ± 0.5 vs. uniaxial slicing (SUS = 71.8 ± 14.4), both differences statistically significant.
  • Retrieval and Classification (CLIP-GS): Text→3D R@1 = 36.8%, Image→3D R@1 = 75.6%, improving by 5–10 points over point-cloud baselines; zero-shot classification Top-1 = 48.5% (Jiao et al., 2024).
  • VR Sketch-Based Generation: Outperforms alternatives on FVRS/VRSS across top-k retrieval, shape-generation Chamfer/FID/CLIP scores, and generation latency (1.46 s), with superior texture and geometry alignment (Gu et al., 16 Mar 2025).

6. Limitations and Prospective Extensions

Known limitations include (Tong et al., 27 Jan 2026, Gu et al., 16 Mar 2025):

  • Precomputation Cost: Offline baking of all slicing states (e.g., M = 200 per case) imposes time and storage overhead, limiting updateability for dynamic volumes or patient-specific data.
  • Static Volumes/Objects: Current methods are restricted to static geometry, with no real-time deformation or physics support.
  • Interpolation Artifacts: Using discrete precomputed states, interpolation may cause fidelity loss when slicing planes are far from stored orientations.
  • Model Compression vs. Quality: Reducing token count or float precision for VR acceleration may slightly degrade retrieval/generation accuracy.
  • Semantic Generalization: While CLIP-based approaches yield strong open-vocabulary retrieval and segmentation, scenarios with ambiguous or noisy input (e.g., extremely sparse sketches) still challenge alignment.

Future work focuses on adaptive state sampling (favoring clinically or semantically significant slicing states), lightweight modules for real-time deformation, and extensions to multimodal fusion (e.g., overlaying segmentations or functional imaging) (Tong et al., 27 Jan 2026). In generation, continued advances in cross-modal contrastive learning, dataset size and diversity, and differentiable-rendering scalability are expected.

ClipGS-VR unifies advances in 3D vision-language representation (CLIP-based contrastive learning), explicit surface/volume rendering (Gaussian Splatting), real-time GPU rasterization, and interactive VR/AR workflows. The medical visualization framework brings cinematic (photorealistic, high-detail) rendering to untethered mobile VR, overcoming the computational barriers of heavy neural inference at runtime. The multimodal pipelines inherit from and extend both representation learning (Uni3D, OpenShape, Point-BERT) and generative diffusion architectures, facilitating retrieval and conditional 3D content creation from free-form natural language or VR sketch input (Jiao et al., 2024, Gu et al., 16 Mar 2025).

The VRSS dataset provides a comprehensive foundation for ongoing research in multi-modal, sketch-coherent, and CLIP-aligned 3D generation and retrieval, with the entire ClipGS-VR ecosystem reinforcing the feasibility of high-fidelity interactive 3D applications within the constraints of current-generation consumer VR hardware.
