ONNX-Gaussian Generator for Real-Time 3DGS
- The ONNX-based Gaussian Generator is a neural module in ONNX format that produces 3D Gaussian primitives for rendering applications.
- It integrates dynamic inference with GPU-accelerated pipelines, enabling real-time 3D Gaussian Splatting and eliminating CPU bottlenecks.
- The system leverages strict I/O contracts and efficient GPU pre-processing to sustain millisecond-scale frame times in WebGPU rendering.
An ONNX-based Gaussian Generator is a standardized neural network module, exported in Open Neural Network Exchange (ONNX) format, for producing frame-specific Gaussian primitives within a neural rendering pipeline. Deployed primarily in the context of real-time 3D Gaussian Splatting (3DGS), these generators integrate dynamic inference with GPU-accelerated rendering, eliminating legacy pipeline constraints and CPU bottlenecks. The ONNX-based Gaussian Generator is a core element in platforms such as Visionary, which unifies the inference and rendering of generative and reconstructive world models directly in the browser, leveraging WebGPU for high-throughput interactive synthesis and visualization (Gong et al., 9 Dec 2025).
1. Interface Schema and Data Contract
The ONNX-based Gaussian Generator adheres to a strict I/O schema per animation frame, ensuring compatibility and efficient interoperation with WebGPU-based rendering engines.
Inputs:
- `camera_extrinsic`: float32[4,4], representing the world-to-camera transformation matrix.
- `camera_intrinsic`: float32[4,4], encoding the camera's projection parameters.
- Optional control variables: sequence index, timestamp, or latent vector, each as a float32 scalar or 1D tensor.
Outputs:
- `N` (int32 scalar): specifies the number of Gaussian primitives produced per frame.
- `gaussians`: float16[N,13] packed tensor, each row parameterizing a 3D Gaussian with:
  - Mean: 3 values
  - Covariance (upper-triangular): 6 values
  - RGB color: 3 values
  - Opacity: 1 value
- `meta` (optional): dictionary with fields such as `"packed_dtype"` (`"FP16"` or `"FP32"`) and an upper bound for N.
The contract enforces that exactly N Gaussian tuples are output per frame. The packed float16 layout minimizes upload bandwidth requirements; unpacking occurs on the GPU at runtime. Covariance is expressed in a compressed 6-value format corresponding to the independent elements of the symmetric 3×3 matrix.
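As an illustration of the packed layout, the following TypeScript sketch decodes one row of the float16[N,13] tensor on the CPU. In the deployed pipeline this unpacking happens in a WebGPU shader; `decodeF16`, `GaussianRow`, and `unpackRow` are illustrative names, not part of the published API.

```typescript
// Hypothetical helper: decode one IEEE 754 half-precision value to a JS number.
function decodeF16(bits: number): number {
  const sign = (bits & 0x8000) ? -1 : 1;
  const exp = (bits >> 10) & 0x1f;
  const frac = bits & 0x03ff;
  if (exp === 0) return sign * Math.pow(2, -14) * (frac / 1024);  // subnormal
  if (exp === 0x1f) return frac ? NaN : sign * Infinity;          // inf / NaN
  return sign * Math.pow(2, exp - 15) * (1 + frac / 1024);
}

// One row of the float16[N,13] tensor, unpacked into named fields.
interface GaussianRow {
  mean: [number, number, number];                          // 3 values
  cov: [number, number, number, number, number, number];   // upper-triangular, 6 values
  rgb: [number, number, number];                           // 3 values
  opacity: number;                                         // 1 value
}

function unpackRow(packed: Uint16Array, row: number): GaussianRow {
  const v = Array.from(packed.subarray(row * 13, row * 13 + 13), decodeF16);
  return {
    mean: [v[0], v[1], v[2]],
    cov: [v[3], v[4], v[5], v[6], v[7], v[8]],
    rgb: [v[9], v[10], v[11]],
    opacity: v[12],
  };
}
```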
2. Mathematical Foundations of Gaussian Splatting
Each rendered primitive is modeled as an anisotropic Gaussian in $\mathbb{R}^3$, parameterized by weight $o_i$, mean vector $\boldsymbol{\mu}_i \in \mathbb{R}^3$, and covariance matrix $\Sigma_i \in \mathbb{R}^{3\times 3}$. Under camera projection $\pi$, the 3D Gaussian maps to a 2D ellipse in image space:
- Projected mean: $\boldsymbol{\mu}'_i = \pi(\boldsymbol{\mu}_i)$
- Projected covariance: $\Sigma'_i = J W \Sigma_i W^\top J^\top$, where $W$ is the viewing transformation and $J$ is the Jacobian of the affine approximation to $\pi$.

The influence on pixel $\mathbf{x}$ is given by:

$$\alpha_i(\mathbf{x}) = o_i \exp\!\left(-\tfrac{1}{2}\,(\mathbf{x}-\boldsymbol{\mu}'_i)^\top (\Sigma'_i)^{-1} (\mathbf{x}-\boldsymbol{\mu}'_i)\right)$$

with view-independent color $c_i$. Final color compositing proceeds front-to-back:

$$C(\mathbf{x}) = \sum_i c_i\, \alpha_i(\mathbf{x}) \prod_{j<i}\bigl(1-\alpha_j(\mathbf{x})\bigr)$$

The generator network maps its outputs to these parameters: $o_i$ (output via sigmoid); $\boldsymbol{\mu}_i$ (3-vector head); $\Sigma_i$ (6-vector head, upper-triangular entries, positivity enforced via softplus); $c_i$ (3-vector head, postprocessed if using spherical harmonics coefficients).
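The front-to-back compositing rule can be sketched as a scalar CPU reference, assuming per-splat alphas $\alpha_i(\mathbf{x})$ have already been evaluated for the pixel and the splats are depth-sorted. The function name and early-termination threshold are illustrative.

```typescript
// Front-to-back alpha compositing of depth-sorted splats for one pixel:
// C = sum_i c_i * alpha_i * prod_{j<i} (1 - alpha_j)
function compositeFrontToBack(
  splats: { color: [number, number, number]; alpha: number }[]
): [number, number, number] {
  const out: [number, number, number] = [0, 0, 0];
  let transmittance = 1; // running product of (1 - alpha_j) over splats in front
  for (const { color, alpha } of splats) {
    for (let k = 0; k < 3; k++) out[k] += transmittance * alpha * color[k];
    transmittance *= 1 - alpha;
    if (transmittance < 1e-4) break; // early termination once nearly opaque
  }
  return out;
}
```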
3. ONNX Inference Pipeline and Runtime Integration
Model deployment and inference occur directly within WebGPU-enabled browser contexts. The pipeline comprises several phases:
- Model Loading: Performed once at application initialization using `onnxruntime-web`. The ONNX model is loaded, often with parameters quantized to FP16 and constants baked in at export (opset ≥ 14, constant folding enabled).
- Warm-Up: A dummy inference primes the WebGPU execution graph and cache, optimizing subsequent per-frame executions.
- Per-Frame Schedule:
- Camera and optional control data are packed into WebGPU buffers.
- Synchronous/asynchronous forward pass via ONNX runtime, with outputs streamed directly as GPU buffers (avoiding CPU roundtrips).
- Gaussian buffer is bound as storage for GPU compute pre-processing: transform, cull, and radix-sort the Gaussian splats.
- Instanced draw call issues one triangle-strip per splat; the fragment shader evaluates $\alpha_i(\mathbf{x})$ and performs compositing.
Reusing persistent WebGPU bind groups and ONNX session bindings enables predictable low-latency execution and amortized dispatch costs within the frame budget.
The following table summarizes the operational pipeline:
| Step | Action | Technology |
|---|---|---|
| Model load | ONNX import, session instantiate | onnxruntime-web |
| Warm-up | Dummy inference to cache WebGPU execution graph | WebGPU |
| Per-frame inference | Input assembly, neural decoding, GPU buffer bind | WebGPU, ONNX |
| Preprocessing | Splat preprocess, cull, radix-sort (O(N)) | WebGPU compute |
| Rendering | Instanced draw, fragment compositing | WebGPU fragment |
4. Network Architectures and Training Paradigms
Several network designs are supported:
- MLP-based 3DGS: Follows Scaffold-GS style. Input is a per-anchor feature $f$ (dimension 256–512) concatenated with the view direction $d$ (dimension 3). 4–6 fully connected layers with 256–512 hidden units (GELU or ReLU activations). Four output heads: $\mu$ (3 floats), $\Sigma$ (6 floats via softplus), $c$ (3 floats, either direct RGB or SH coefficients), and $o$ (1 float via sigmoid). Trained via L2 or L1 pixel loss plus regularization on scale. ONNX export uses dynamic axes for batch/view dimensions.
- 4D Gaussian Splatting: Temporal input (timestamp $t$) with canonical parameters and multi-scale feature planes. Feature lookups (bilinear sampling) are concatenated and fed to a small MLP (2–3 layers, 128 units). Outputs are a delta mean (3), a delta quaternion for rotation (4), and a delta scale (3). The covariance updates as $\Sigma_t = R_t S_t S_t^\top R_t^\top$, with $R_t$ built from the updated quaternion and $S_t$ the diagonal matrix of the updated scale.
- Animatable Avatars (e.g., LHM, R3-Avatar): Input is SMPL-X pose (72D), shape (10D), and an optional frame index. Internally, canonical means $\mu_c$, covariances $\Sigma_c$, and per-Gaussian skinning weights are used. Forward kinematics and LBS are included within the ONNX graph. Output is the deformed $\mu$ and $\Sigma$ in the current observation frame.
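The LBS step used by the avatar generators can be sketched for Gaussian means as follows. This is a CPU reference under illustrative names (`Mat4`, `lbsMean`); in the deployed system the equivalent computation lives inside the ONNX graph.

```typescript
// Linear blend skinning of a canonical Gaussian mean:
// mu = sum_b w_b * (B_b * mu_c), with skinning weights w_b summing to 1.
type Mat4 = number[]; // 16 values, row-major

function transformPoint(m: Mat4, p: [number, number, number]): [number, number, number] {
  return [
    m[0] * p[0] + m[1] * p[1] + m[2]  * p[2] + m[3],
    m[4] * p[0] + m[5] * p[1] + m[6]  * p[2] + m[7],
    m[8] * p[0] + m[9] * p[1] + m[10] * p[2] + m[11],
  ];
}

function lbsMean(
  muCanonical: [number, number, number],
  boneTransforms: Mat4[],
  weights: number[] // one weight per bone, summing to 1
): [number, number, number] {
  const out: [number, number, number] = [0, 0, 0];
  boneTransforms.forEach((B, b) => {
    const q = transformPoint(B, muCanonical);
    for (let k = 0; k < 3; k++) out[k] += weights[b] * q[k];
  });
  return out;
}
```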
Training commonly employs mixed-precision (fp16) weights, opset ≥ 14, and constant folding. Large Concats/Splits are refactored in export to conform with WebGPU operational constraints.
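To connect the rotation/scale heads above with the 6-value upper-triangular covariance of the I/O contract, here is a hedged sketch of building $\Sigma = R S S^\top R^\top$ from a unit quaternion and per-axis scales. Function names are illustrative; in practice this runs inside the exported graph or a shader.

```typescript
// Rotation matrix from a unit quaternion (x, y, z, w).
function quatToRot(q: [number, number, number, number]): number[][] {
  const [x, y, z, w] = q;
  return [
    [1 - 2 * (y * y + z * z), 2 * (x * y - w * z),     2 * (x * z + w * y)],
    [2 * (x * y + w * z),     1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
    [2 * (x * z - w * y),     2 * (y * z + w * x),     1 - 2 * (x * x + y * y)],
  ];
}

// Sigma = (R S)(R S)^T, returned as the 6 upper-triangular entries
// [S00, S01, S02, S11, S12, S22], matching the float16[N,13] packing.
function covarianceUpperTri(
  q: [number, number, number, number],
  scale: [number, number, number]
): [number, number, number, number, number, number] {
  const R = quatToRot(q);
  const M = R.map((row) => row.map((v, j) => v * scale[j])); // M = R * diag(scale)
  const S = (i: number, j: number) =>
    M[i][0] * M[j][0] + M[i][1] * M[j][1] + M[i][2] * M[j][2];
  return [S(0, 0), S(0, 1), S(0, 2), S(1, 1), S(1, 2), S(2, 2)];
}
```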
5. Browser-Based Integration and Application Example
Integration is facilitated by a concise TypeScript API compatible with three.js and the “visionary-webgpu” library. The typical workflow involves initializing a WebGPU renderer and Gaussian renderer, loading the ONNX model, warm-up, and a main loop that, per frame, updates camera matrices, performs ONNX inference, streams results to GPU, and dispatches render calls.
```typescript
import * as THREE from 'three';
import { InferenceSession, Tensor } from 'onnxruntime-web';
import { WebGPURenderer, GaussianRenderer } from 'visionary-webgpu';

async function initVisionary(canvas: HTMLCanvasElement) {
  const renderer = new WebGPURenderer(canvas, { useCompute: true });
  const gaussRenderer = new GaussianRenderer(renderer);
  const camera = new THREE.PerspectiveCamera(60, canvas.width / canvas.height, 0.1, 1000);

  const session = await InferenceSession.create('/models/gauss_gen.onnx', {
    executionProviders: ['webgpu'],
  });

  // Warm-up: a dummy inference primes the WebGPU execution graph and cache.
  const dummyExtrinsic = new Tensor('float32', new Float32Array(16), [4, 4]);
  const dummyIntrinsic = new Tensor('float32', new Float32Array(16), [4, 4]);
  await session.run({ camera_extrinsic: dummyExtrinsic, camera_intrinsic: dummyIntrinsic });

  function frame(time: number) {
    camera.updateMatrixWorld();
    camera.matrixWorldInverse.copy(camera.matrixWorld).invert();
    const extrinsicTensor = new Tensor(
      'float32', new Float32Array(camera.matrixWorldInverse.toArray()), [4, 4]);
    const intrinsicTensor = new Tensor(
      'float32', new Float32Array(camera.projectionMatrix.toArray()), [4, 4]);

    session.run({
      camera_extrinsic: extrinsicTensor,
      camera_intrinsic: intrinsicTensor,
      // optional control...
    }).then((outputs) => {
      const gaussBuffer = outputs.gaussians.data as Uint16Array; // packed FP16
      const count = outputs.N.data[0] as number;
      gaussRenderer.updatePrimitiveBuffer(gaussBuffer, count);
      renderer.beginFrame();
      gaussRenderer.draw();
      renderer.endFrame();
    });
    requestAnimationFrame(frame);
  }
  requestAnimationFrame(frame);
}
```
This workflow maintains all data and computation on the GPU, minimizing latency and eliminating CPU-GPU roundtrips.
6. Performance Characteristics and Throughput
The ONNX+WebGPU approach yields high throughput and low latency:
- Per-frame preprocessing (frustum culling, clipping, ellipse-axis computation) runs in a single WebGPU compute pass.
- Depth keys and indices are updated via GPU atomics, and a GPU radix sort organizes splats at runtime, achieving sub-millisecond sort times for millions of points.
- Instanced triangle-strip draws and fragment-based Gaussian summation enable efficient compositing and blending.
- Captured and replayed ONNX execution graphs result in stable dispatch overhead, with reported per-frame decoding times:
  - Scaffold-GS: 2.5M splats in ≈9 ms
  - 4DGS: 0.06M splats in ≈8 ms
  - Avatars: 0.2M splats in ≈30 ms
- Aggregate end-to-end frame times for static scenes of 6M Gaussians are ≈2 ms, representing approximately a 100× speedup over CPU-based WebGL sorting.
- The entire pipeline remains within a single-frame time budget, leveraging unified memory architectures for device-local execution and rapid updates (Gong et al., 9 Dec 2025).
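As a CPU reference for the depth-key radix sort mentioned above, the sketch below sorts splat indices by quantized view-space depth with an LSD radix sort over 8-bit digits. It is sequential and illustrative; the WebGPU version performs the same digit passes in parallel compute dispatches, and the depth quantization scheme is an assumption.

```typescript
// Sort splat indices front-to-back by quantized depth (LSD radix sort, 4 passes).
function radixSortByDepth(depths: Float32Array): Uint32Array {
  const n = depths.length;
  // Quantize non-negative depths to uint32 keys (monotonic mapping).
  const keys = new Uint32Array(n);
  for (let i = 0; i < n; i++) keys[i] = Math.min(0xffffffff, Math.round(depths[i] * 1e4));

  let idx = Uint32Array.from({ length: n }, (_, i) => i);
  let tmp = new Uint32Array(n);
  for (let shift = 0; shift < 32; shift += 8) {
    // Stable counting sort on the current 8-bit digit.
    const counts = new Uint32Array(256);
    for (let i = 0; i < n; i++) counts[(keys[idx[i]] >>> shift) & 0xff]++;
    let sum = 0;
    for (let d = 0; d < 256; d++) { const c = counts[d]; counts[d] = sum; sum += c; }
    for (let i = 0; i < n; i++) tmp[counts[(keys[idx[i]] >>> shift) & 0xff]++] = idx[i];
    [idx, tmp] = [tmp, idx];
  }
  return idx; // splat indices ordered from smallest (nearest) to largest depth
}
```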
A plausible implication is that the ONNX-based Gaussian Generator contract enables real-time, browser-native world model rendering and generative processing suitable for both reconstruction and synthetic content, with substantial practical advantages for reproducibility and deployment.