- The paper introduces a two-level acceleration structure that reduces BVH memory from 2.42GB to 432MB while enhancing cache locality for Gaussian primitives.
- It implements per-ray traversal checkpointing and replay to eliminate redundant node fetches, achieving speedups up to 6.09× over previous methods.
- The paper evaluates GRTX in simulation and on current NVIDIA and AMD GPUs, showing that its software-only variant already runs on existing hardware and supports ray-traced rendering of large Gaussian scenes.
GRTX: Architectural Advances for Efficient Ray Tracing of 3D Gaussian Scenes
Introduction
Three-dimensional Gaussian Splatting (3DGS) has established itself as the cornerstone for real-time neural rendering by representing radiance fields with anisotropic Gaussian primitives, exploiting rasterization for fast and photorealistic synthesis. However, rasterization-based Gaussian rendering inherently fails to support complex ray-dependent effects such as accurate handling of optical distortions, reflections, refractions, and shadows, which severely limits its applicability in robotics, AR/VR, and scientific visualization. Recent works have demonstrated ray tracing-based Gaussian rendering ("3DGRT") to overcome these limits, but such approaches suffer from excessive acceleration structure overheads and inefficient multi-round traversal, resulting in significantly degraded performance compared to rasterization.
"GRTX: Efficient Ray Tracing for 3D Gaussian-Based Rendering" (2601.20429) examines the algorithmic and architectural bottlenecks in Gaussian ray tracing and introduces the GRTX system, which combines complementary software and hardware optimizations for modern GPU ray tracing units. GRTX reduces BVH memory footprint, traversal cost, and redundant work, achieving a 4.36× average speedup (up to 6.09×) over the 3DGRT baseline with only a small hardware extension. The following exposition details the technical contributions, empirical findings, and potential impact of this work.
Technical Innovations
Streamlined Acceleration Structures for Gaussian Primitives
Gaussian scenes typically comprise hundreds of thousands to millions of anisotropic Gaussians. Prior ray tracing approaches build a monolithic BVH in which each Gaussian is wrapped in its own triangle mesh (e.g., a 20-face icosahedron or an 80-face icosahedron-derived mesh). This enables hardware-accelerated ray-triangle intersection, but it substantially inflates BVH size and degrades cache locality. The core insight of GRTX is that once a ray is transformed into a Gaussian's local space, which happens at the leaf instance nodes of the TLAS, the Gaussian's ellipsoid becomes a unit sphere. Modern GPUs perform this ray transformation natively, which permits a two-level acceleration structure: a TLAS that holds one instance per Gaussian and a shared BLAS that contains a single template sphere mesh.
This structural optimization results in several benefits:
- Drastic reduction in overall BVH size and memory consumption (e.g., from 2.42 GB to 432 MB for the Truck scene, more than a 5× reduction).
- Increased cache hit rates due to improved node access locality.
- Compatibility with hardware-accelerated intersection tests (native ray-sphere tests on NVIDIA Blackwell, triangle meshes elsewhere), while minimizing false positives.
This architectural paradigm is shown in the following illustration:
Figure 1: TLAS and shared BLAS organization in GRTX, yielding minimal BVH overhead by instantiating Gaussians as transformed unit spheres.
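The ray-space transformation behind the shared BLAS can be illustrated with a small Python sketch. This is not the paper's implementation: on a GPU the transform is applied by the ray tracing hardware at instance nodes, whereas here it is written out by hand. The function names and the parameterization (a center, a rotation matrix whose columns are the ellipsoid axes, and three semi-axis lengths) are assumptions made for illustration.

```python
import math

def world_to_unit_sphere(origin, direction, mean, rotation, scale):
    """Map a world-space ray into the frame where the ellipsoid is a unit sphere.

    rotation: 3x3 row-major matrix whose columns are the ellipsoid axes.
    scale: the three semi-axis lengths. All names here are hypothetical.
    """
    rel = [origin[i] - mean[i] for i in range(3)]
    # Apply R^T (the inverse rotation), then divide by the semi-axis lengths.
    local_o = [sum(rotation[r][c] * rel[r] for r in range(3)) / scale[c]
               for c in range(3)]
    local_d = [sum(rotation[r][c] * direction[r] for r in range(3)) / scale[c]
               for c in range(3)]
    return local_o, local_d

def intersect_unit_sphere(o, d):
    """Return (t_near, t_far) solving |o + t*d| = 1, or None on a miss.

    d is deliberately left unnormalized so that t equals the world-space
    ray parameter.
    """
    a = sum(x * x for x in d)
    b = 2.0 * sum(x * y for x, y in zip(o, d))
    c = sum(x * x for x in o) - 1.0
    disc = b * b - 4.0 * a * c
    if disc < 0.0:
        return None
    root = math.sqrt(disc)
    return (-b - root) / (2.0 * a), (-b + root) / (2.0 * a)

# Axis-aligned ellipsoid centered at z = 5 with semi-axes (2, 1, 0.5),
# hit by a ray travelling along +z from the world origin.
identity = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
o, d = world_to_unit_sphere((0, 0, 0), (0, 0, 1), (0, 0, 5), identity, (2, 1, 0.5))
print(intersect_unit_sphere(o, d))  # (4.5, 5.5)
```

Because every Gaussian reduces to the same unit sphere after its own transform, a single sphere template can serve all instances; only the per-instance transform differs.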
Hardware Checkpointing for Multi-Round Traversal
Ray tracing-based Gaussian rendering typically employs k-buffer accumulation and early ray termination (ERT): each ray undergoes multiple traversal rounds, each of which collects the next k closest Gaussians for blending, and the ray terminates early once sufficient opacity is reached. In prior work, every traversal round restarts from the BVH root, incurring frequent redundant node visits and excessive intersection tests. GRTX proposes per-ray traversal checkpointing and replay, a minimal extension to existing RT hardware units:
- Nodes encountered in one round but falling outside the current interval (t_hit > t_max) are checkpointed in global memory.
- In subsequent rounds, traversal resumes directly from checkpointed nodes instead of the BVH root, drastically reducing redundant fetches and intersection tests.
- Gaussians that were intersected but did not remain in the k-buffer are kept in eviction buffers, sorted so that later traversal rounds can reuse them.
The mechanism is exemplified below:
Figure 2: Overview of GRTX's checkpointing and replay mechanism in GPU RT units for efficient multi-round traversal.
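The checkpoint-and-replay idea can be sketched in software with a toy model. This is an illustration under stated assumptions, not the paper's hardware design: the `Node` layout, the helper names, and the fetch counter are all hypothetical, each node stores precomputed ray entry and exit distances for a single ray, and hit distances are assumed to be distinct. The baseline restarts every round at the root, while the second variant resumes from checkpointed nodes and reuses evicted hits.

```python
import math
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Node:
    t_enter: float                  # where the ray enters this node's bounds
    t_exit: float                   # where the ray leaves this node's bounds
    children: List["Node"] = field(default_factory=list)
    t_hit: Optional[float] = None   # set on leaves only (one Gaussian hit)

def leaf(t):
    return Node(t, t, t_hit=t)

def box(*kids):
    return Node(min(c.t_enter for c in kids), max(c.t_exit for c in kids), list(kids))

def trace_from_root(root, k):
    """Baseline: each round restarts at the root and skips what lies before t_min."""
    order, fetches, t_min = [], 0, -math.inf
    while True:
        kbuf, stack = [], [root]
        while stack:
            node = stack.pop()
            t_max = kbuf[-1] if len(kbuf) == k else math.inf
            if node.t_enter > t_max or node.t_exit <= t_min:
                continue
            fetches += 1
            if node.t_hit is None:
                stack.extend(node.children)
            else:
                kbuf.append(node.t_hit)
                kbuf.sort()
                del kbuf[k:]            # keep only the k closest hits
        if not kbuf:
            return order, fetches
        order.extend(kbuf)
        t_min = kbuf[-1]

def trace_with_checkpoints(root, k):
    """Replay variant: resume from checkpointed nodes and reuse evicted hits."""
    order, fetches = [], 0
    start_nodes, carried = [root], []
    while start_nodes or carried:
        kbuf, evicted, checkpoints = [], [], []

        def offer(t):
            kbuf.append(t)
            kbuf.sort()
            if len(kbuf) > k:
                evicted.append(kbuf.pop())   # farthest hit goes to the eviction buffer

        for t in carried:                    # hits already found in earlier rounds
            offer(t)
        stack = list(start_nodes)
        while stack:
            node = stack.pop()
            t_max = kbuf[-1] if len(kbuf) == k else math.inf
            if node.t_enter > t_max:
                checkpoints.append(node)     # beyond this round's interval: save for replay
                continue
            fetches += 1
            if node.t_hit is None:
                stack.extend(node.children)
            else:
                offer(node.t_hit)
        order.extend(kbuf)
        start_nodes, carried = checkpoints, evicted
    return order, fetches

# Four overlapping boxes along one ray, two hits collected per round.
root = box(box(leaf(1), leaf(6)), box(leaf(2), leaf(5)),
           box(leaf(3), leaf(4)), box(leaf(7), leaf(8)))
print(trace_from_root(root, k=2))
print(trace_with_checkpoints(root, k=2))
```

On this toy tree both functions return the hits in the same front-to-back order, and the replay variant reports fewer node fetches. The counts are illustrative only and say nothing about the speedups measured in the paper.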
The proposed hardware extension incurs only ~1 KB per RT core for checkpointing data, with checkpoint/eviction buffers placed in global memory and memory usage capped below 6.3% of total GPU memory for the largest evaluated scenes.
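For context on why traversal happens in rounds at all, the following Python sketch blends hits front to back in groups of k and stops once the remaining transmittance is negligible. It is a simplified illustration: the function name, the `(alpha, radiance)` hit format, the 0.01 threshold, and the choice to test for termination once per round are all assumptions, not details taken from the paper.

```python
def blend_in_rounds(sorted_hits, k, min_transmittance=0.01):
    """Blend (alpha, radiance) hits front to back, k per round.

    sorted_hits must already be ordered by distance along the ray.
    min_transmittance is an assumed early-termination threshold.
    """
    radiance, transmittance, rounds = 0.0, 1.0, 0
    for start in range(0, len(sorted_hits), k):
        rounds += 1                       # one traversal round yields up to k hits
        for alpha, value in sorted_hits[start:start + k]:
            radiance += transmittance * alpha * value
            transmittance *= 1.0 - alpha
        if transmittance < min_transmittance:
            break                         # early ray termination: the rest is occluded
    return radiance, transmittance, rounds

# Nearly opaque Gaussians end the ray after one round;
# faint ones need every round.
print(blend_in_rounds([(0.95, 1.0)] * 4, k=2))
print(blend_in_rounds([(0.1, 1.0)] * 4, k=2))
```

Rays that pass through many faint Gaussians need many rounds, which is the case where restarting each round from the BVH root is most wasteful.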
Empirical Evaluation
The paper benchmarks GRTX against the state-of-the-art 3DGRT using both simulated (Vulkan-Sim) and real hardware (RTX 5090, AMD Radeon RX 9070 XT), with complex indoor and outdoor scenes (up to 2.43M Gaussians). Key findings include:
- End-to-end performance: GRTX achieves an average speedup of 4.36× (up to 6.09×); the software-only variant, GRTX-SW, achieves speedups of 1.44–2.15× on commodity GPUs (Figure 3).
- Reduction in node fetches: Node fetches are reduced by 3.03× on average, and cache hit rates for BVH node accesses more than double, rising above 70%. L2 cache accesses decrease by 4.75×.
- Checkpointing efficacy: Hardware checkpointing (GRTX-HW) delivers up to 1.94× additional speedup by avoiding redundant traversals, most pronounced in scenes with large/overlapping Gaussian bounding boxes.
- Ray-sphere intersection: Direct use of sphere primitives further reduces intersection overhead in hardware supporting native ray-sphere tests (Figure 4).
- Secondary rays: GRTX-HW retains strong benefits for both primary and secondary rays (reflections, refractions) regardless of ray coherence (Figure 5).
- Cross-vendor applicability: GRTX-SW avoids monolithic BVH buffer overflows on AMD GPUs, achieving up to 3.42× speedup (Figure 6).
- k-buffer sizing & scaling: Performance impact of k-buffer size, resolution, and field-of-view variations is systematically analyzed (Figure 7, Figure 8).
Core benefits are summarized visually:
Figure 3: Speedup of GRTX over baseline RT with icosahedron primitives, demonstrating consistent substantial acceleration.
Figure 9: L1 cache hit rates for BVH node fetches, markedly improved by GRTX’s shared BLAS scheme.
Figure 10: Normalized node fetch counts demonstrating the elimination of redundant accesses.
Implications and Future Directions
GRTX addresses the two dominant inefficiencies in Gaussian ray tracing: acceleration structure bloat and redundant traversal. In practice, this makes ray-traced rendering of large Gaussian scenes considerably more efficient and extends neural rendering to domains that require global illumination, transparency, non-standard optics, and real-time dynamics.
The shared BLAS paradigm and per-ray checkpointing are compatible with current graphics pipeline runtimes (Vulkan, DirectX, OptiX), hardware architectures (NVIDIA, AMD, Intel), and multi-level instancing for dynamic/multi-object scenes. Importantly, GRTX’s ideas can generalize to other volumetric primitive types and hierarchical data structures, where similar bloat and traversal redundancy remain bottlenecks.
Open future directions revealed by the work include:
- Enhancing throughput of ray-sphere intersection hardware and reducing atomic/serialization costs in shader management.
- Exploring further memory hierarchy optimizations and treelet-prefetching in conjunction with GRTX.
- Evaluating the integration of GRTX-like acceleration structures for mixed-primitives scenes (triangles + volumetrics).
- Extending checkpoint/replay concepts to general hierarchical search and data analytics workloads utilizing RT cores.
Conclusion
"GRTX: Efficient Ray Tracing for 3D Gaussian-Based Rendering" analyzes the principal bottlenecks in Gaussian scene ray tracing and validates two remedies for them: a streamlined two-level acceleration structure and per-ray traversal checkpointing. Both require only small changes to existing software and hardware, and together they substantially accelerate photorealistic neural rendering via ray tracing, with applicability across GPU architectures and potentially to other volumetric primitive types. GRTX offers a blueprint for next-generation neural rendering hardware and algorithms, with implications for graphics, vision, and simulation workloads.