Depth Watertight Mesh (DW-Mesh)
- DW-Mesh is a watertight 3D surface representation built from depth data, forming closed, manifold meshes with explicit occlusion encoding.
- It uses depth unprojection, boundary capping, and consistent triangulation to deliver photorealistic and artifact-free mesh reconstructions.
- DW-Mesh underpins efficient multiview textured recovery and 4D video synthesis by ensuring stable geometry across varying viewpoints.
A Depth Watertight Mesh (DW-Mesh) is a 3D surface representation constructed from estimated depth data that enforces watertightness and explicit occlusion encoding, enabling robust geometric priors for tasks such as multiview textured mesh recovery and monocular-to-4D video synthesis. DW-Mesh forms a closed, manifold mesh structure by leveraging depth unprojection, boundary capping, occlusion detection, and face-connectivity strategies, and is designed to remain free of geometric inconsistencies (e.g., holes, boundary artifacts) under arbitrary viewpoints. This approach has been adopted for efficient, photometrically faithful mesh reconstruction from multi-view or monocular image inputs, serving as the geometric backbone in recent systems such as differentiable rendering pipelines (Lin et al., 2022) and as the core spatial representation for learning-based 4D video frameworks (Hu et al., 5 Jun 2025).
1. Mathematical Foundations and Formal Definition
For a given image (or video frame) $I \in \mathbb{R}^{H \times W \times 3}$, a DW-Mesh is defined as
$$\mathcal{M} = (V, F, T, O)$$
where:
- $V = \{v_{ij}\}$ is the set of vertices, one per input pixel $(i, j)$,
- $F$ is the set of triangular faces (each a 3-element subset of $V$),
- $T = \{T_f\}$ are per-face RGB textures,
- $O = \{O_f\} \in \{0, 1\}^{|F|}$ denotes a binary occlusion/degeneracy flag per face.
Given a per-pixel depth map $D \in \mathbb{R}^{H \times W}$, each vertex is unprojected by
$$v_{ij} = \mathbf{o} + D(i, j)\,\mathbf{r}_{ij},$$
where $\mathbf{o}$ is the camera origin and $\mathbf{r}_{ij}$ are the per-pixel ray directions. Boundary pixels are set to a large depth $d_{\max}$ to ensure surface closure, and two triangular "cap" faces are added between the border corners.
Face connectivity is standardized: each $2 \times 2$ pixel block forms two triangles,
$$\{v_{i,j},\, v_{i+1,j},\, v_{i,j+1}\} \quad \text{and} \quad \{v_{i+1,j},\, v_{i+1,j+1},\, v_{i,j+1}\}.$$
Occlusion/degeneracy is determined per face by
$$O_f = \mathbb{1}\big[\theta_{\min}(f) < \tau_\theta \;\lor\; \Delta d_{\max}(f) > \tau_d\big],$$
where $\theta_{\min}(f)$ is the minimum internal angle of face $f$, $\Delta d_{\max}(f)$ its maximum depth discontinuity, and $\tau_\theta$, $\tau_d$ are geometric thresholds. Degenerate (occluded) faces ($O_f = 1$) are textured black ($T_f = (0, 0, 0)$); all other faces keep their original pixel color.
The result is a watertight, manifold mesh that explicitly models both visible and occluded surface regions (Hu et al., 5 Jun 2025).
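The unprojection and face-connectivity rules above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function and parameter names are invented, a simple pinhole camera at the origin is assumed, and the two cap faces are omitted for brevity.

```python
import numpy as np

def dw_mesh_vertices_faces(depth, fx, fy, cx, cy, d_max=100.0):
    """Illustrative sketch: boundary padding, ray unprojection, and
    consistent grid triangulation (cap faces omitted for brevity)."""
    D = depth.copy()
    # Push boundary pixels to a large depth to close the surface.
    D[0, :] = D[-1, :] = d_max
    D[:, 0] = D[:, -1] = d_max

    H, W = D.shape
    j, i = np.meshgrid(np.arange(W), np.arange(H))
    # Unproject via pinhole rays: v_ij = o + D(i,j) * r_ij, camera at origin.
    rays = np.stack([(j - cx) / fx, (i - cy) / fy, np.ones_like(D)], axis=-1)
    V = (D[..., None] * rays).reshape(-1, 3)

    # Two triangles per 2x2 pixel block, with consistent winding.
    idx = np.arange(H * W).reshape(H, W)
    a, b = idx[:-1, :-1].ravel(), idx[1:, :-1].ravel()
    c, d = idx[:-1, 1:].ravel(), idx[1:, 1:].ravel()
    F = np.concatenate([np.stack([a, b, c], -1), np.stack([b, d, c], -1)])
    return V, F
```

Because every pixel becomes a vertex and every interior $2\times 2$ block contributes exactly two faces, an $H \times W$ depth map yields $HW$ vertices and $2(H-1)(W-1)$ grid faces.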
2. DW-Mesh Construction Algorithms
The DW-Mesh construction pipeline consists of the following deterministic and differentiable stages, tailored for multiview mesh recovery and 4D video synthesis:
Step 1: Depth Prediction
- For single images/monocular video, apply a pre-trained depth estimator to obtain a depth map $D_t$ per frame.
- For multi-view settings, multi-view stereo (MVS) produces per-view depth hypotheses $\{D_k\}$.
Step 2: Boundary Padding
- Set all border pixels’ depth to a large constant $d_{\max}$, ensuring closure.
Step 3: Vertex Unprojection
- Unproject each pixel to 3D via camera intrinsics/extrinsics: $v_{ij} = \mathbf{o} + D(i, j)\,\mathbf{r}_{ij}$.
Step 4: Face Construction
- Connect vertices into face sets using a consistent triangulation, and add cap faces.
Step 5: Occlusion Detection and Texturing
- For each face, compute minimum internal angle and maximum depth jump.
- Set occlusion flag accordingly.
- Assign the per-face texture as black ($T_f = (0, 0, 0)$) if occluded or degenerate, otherwise the original color.
Step 6: Mesh Assembly
- Collate $(V, F, T, O)$ to form the frame mesh $\mathcal{M}$.
This mesh can be rendered from arbitrary viewpoints for mask/raster prior generation, or supplied as a geometric proxy for further scene understanding (Hu et al., 5 Jun 2025).
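Step 5 above, the per-face degeneracy test, might look like the following NumPy sketch. The threshold values are illustrative assumptions, not numbers from the paper, and the depth jump is approximated here by the z-spread of each face's corners.

```python
import numpy as np

def occlusion_flags(V, F, tau_angle_deg=1.0, tau_depth=0.5):
    """Flag faces that are near-degenerate (tiny minimum internal angle)
    or that span a depth discontinuity. Thresholds are illustrative."""
    tri = V[F]                          # (num_faces, 3, 3) corner coordinates
    e0 = tri[:, 1] - tri[:, 0]          # edge vectors around each triangle
    e1 = tri[:, 2] - tri[:, 1]
    e2 = tri[:, 0] - tri[:, 2]

    def angle(u, v):
        cosv = np.einsum('ij,ij->i', u, v) / (
            np.linalg.norm(u, axis=1) * np.linalg.norm(v, axis=1) + 1e-12)
        return np.degrees(np.arccos(np.clip(cosv, -1.0, 1.0)))

    # Minimum internal angle per face.
    angs = np.stack([angle(e0, -e2), angle(e1, -e0), angle(e2, -e1)], axis=1)
    min_angle = angs.min(axis=1)
    # Maximum depth (z) jump across the three corners.
    z = tri[..., 2]
    depth_jump = z.max(axis=1) - z.min(axis=1)
    return (min_angle < tau_angle_deg) | (depth_jump > tau_depth)
```

Faces straddling a foreground/background boundary get a large depth jump and are flagged, which is exactly what lets the mesh encode occlusion explicitly rather than leaving a hole.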
3. DW-Mesh in Multi-View Textured Mesh Recovery
DW-Mesh serves as the foundation of efficient multiview surface and texture reconstruction pipelines, such as in differentiable rendering with physically-based losses. The workflow is characterized by a coarse-to-fine optimization:
- Initialization: a volumetric visual hull is constructed on a coarse grid, and an initial mesh is extracted via marching cubes.
- Coarse Shape Optimization: sample 10,000 oriented points from the mesh, apply a differentiable Poisson solver (spectral-domain Poisson surface reconstruction) to obtain an SDF, then extract a mesh via differentiable marching cubes. Render depth and silhouettes and match them to MVS predictions and observed silhouettes via backpropagation.
- Fine Shape and Texture Optimization: increase the SDF grid resolution and iterate with a learnable 7-channel dense texture volume, trilinearly interpolated to per-vertex materials. Jointly optimize geometry and texture with depth, silhouette, and photometric-consistency losses under a Cook–Torrance BRDF and environment lighting.
- Export: Final mesh is watertight, per-vertex materialized, and directly usable for real-time relighting (Lin et al., 2022).
This pipeline explicitly leverages depth-based watertight priors to regularize the geometry and ensures high-resolution, topologically robust reconstructions.
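As a concrete anchor for the photometric term, the specular lobe of a Cook–Torrance BRDF can be evaluated as below. The paper specifies Cook–Torrance shading; the particular GGX distribution, Schlick-GGX geometry factor, and Fresnel-Schlick terms combined here are a common textbook formulation assumed for illustration, not lifted from the paper.

```python
import numpy as np

def cook_torrance_specular(n, v, l, roughness=0.3, f0=0.04):
    """Standard GGX-style Cook-Torrance specular term for unit vectors
    n (normal), v (view), l (light). Illustrative formulation."""
    h = (v + l) / np.linalg.norm(v + l)          # half vector
    nl, nv = max(n @ l, 1e-6), max(n @ v, 1e-6)
    nh, hv = max(n @ h, 0.0), max(h @ v, 0.0)
    a2 = roughness ** 4                          # alpha = roughness^2 remap
    D = a2 / (np.pi * (nh * nh * (a2 - 1) + 1) ** 2)   # GGX distribution
    k = (roughness + 1) ** 2 / 8                 # Schlick-GGX factor
    G = (nl / (nl * (1 - k) + k)) * (nv / (nv * (1 - k) + k))
    F = f0 + (1 - f0) * (1 - hv) ** 5            # Fresnel-Schlick
    return D * G * F / (4 * nl * nv)
```

During fine optimization, a rendered color built from such a term (plus diffuse and environment lighting) is compared against the observed pixels, and gradients flow back to the per-vertex material channels.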
4. Masking and Supervision Strategies
In EX-4D (Hu et al., 5 Jun 2025), the DW-Mesh enables simulated multi-view training signals from monocular videos via mask synthesis:
- Rendering masks: rasterize the DW-Mesh from novel viewpoints to obtain binary visibility masks, then apply morphological dilation. These masks define the visible regions used for pseudo multi-view supervision.
- Tracking masks: Keypoints are tracked across time, and rectangular masks are drawn for persistent simulated occlusions, ensuring temporally consistent, geometry-aware training signals.
All masks and masked color frames are fed as conditions into a video diffusion model for synthesizing physically consistent, camera-controllable 4D outputs.
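The mask-dilation step can be sketched with plain NumPy. The 8-connected structuring element and the iteration count are assumptions for illustration; the paper does not specify them here.

```python
import numpy as np

def dilate_mask(mask, iterations=2):
    """Grow a binary visibility mask so that pseudo multi-view
    supervision tolerates thin rasterization artifacts at boundaries.
    Kernel (8-connected) and iteration count are illustrative."""
    m = mask.astype(bool)
    for _ in range(iterations):
        p = np.pad(m, 1)  # zero-pad so shifts stay in bounds
        m = (m | p[:-2, 1:-1] | p[2:, 1:-1] | p[1:-1, :-2] | p[1:-1, 2:]
               | p[:-2, :-2] | p[:-2, 2:] | p[2:, :-2] | p[2:, 2:])
    return m
```

Each iteration ORs the mask with its eight one-pixel shifts, so a single visible pixel grows to a 3×3 block after one pass and a 5×5 block after two.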
5. Model Architecture and Optimization Schemes
DW-Mesh is integrated within both optimization-driven and learning-based systems:
Multiview Mesh Learning (Lin et al., 2022):
- SDF/mesh training alternates between geometry refinement (Adam optimizer; 150+150 epochs; 10,000/60,000 sampled points) and texture learning (a dense texture grid, progressively upsampled), with joint depth, silhouette, and photometric losses.
Video Diffusion Adapter (Hu et al., 5 Jun 2025):
- Frozen diffusion backbones are conditioned by a LoRA-based adapter (140 M parameters), trained with the standard denoising-diffusion objective.
- A video VAE encodes prior information from the color-masked and binary-mask videos; the resulting features are merged via Conv3D layers and injected at each block of the UNet.
- Mask-aware priors enhance 4D video consistency under camera motions up to ±90°, with only the small adapter/LoRA weights trained.
No additional explicit regularization is imposed beyond mask-driven and geometry-driven consistency; stability relies on the mesh’s inherent properties and data augmentations (cropping along Bézier paths).
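The LoRA-style conditioning can be illustrated with a minimal linear-layer sketch: a frozen weight matrix is augmented by a trainable low-rank update. The function name, shapes, and scaling convention below are illustrative assumptions; EX-4D applies this idea throughout the UNet blocks of a frozen video diffusion backbone.

```python
import numpy as np

def lora_linear(x, W, A, B, alpha=1.0):
    """Minimal LoRA sketch: output = x @ (W + alpha * B @ A)^T, with W
    frozen and only the low-rank factors A (r, d_in) and B (d_out, r)
    trainable. Shapes and scaling are illustrative, not from the paper."""
    return x @ W.T + alpha * (x @ A.T) @ B.T
```

Because only $A$ and $B$ carry gradients, the trainable parameter count scales with the rank $r$ rather than with the full weight matrix, which is why only the small adapter weights need training.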
6. Empirical Performance and Evaluation
DW-Mesh-based pipelines exhibit strong quantitative and qualitative results:
Multi-View Mesh Recovery (Lin et al., 2022):
- DTU dataset: Chamfer distance of 0.68 mm, outperforming IDR (0.90 mm), MVSDF (0.88 mm), and Vis-MVS (0.88 mm); PSNR 26.04 dB, SSIM 0.91; rendering takes 0.04 s per 1600×1200 frame (roughly 30× faster than alternatives).
- EPFL dataset: Chamfer distance (×10⁻²) of 6.01 and PSNR 27.52 dB, consistently better than earlier mesh-based and implicit methods.
- Ablations: depth supervision is essential; omitting it degrades the Chamfer distance from 0.68 mm to 3.46 mm. The silhouette loss eliminates spurious surface artifacts, and visual-hull and sphere initializations yield equivalent outcomes.
- Qualitative: Topology-agnostic, watertight, and high-detail models with robust real-time relighting.
4D Video Generation (Hu et al., 5 Jun 2025):
- EX-4D demonstrates superior physical and geometric consistency in camera-controllable videos from monocular input, with DW-Mesh preventing boundary artifacts and holes, and providing accurately rendered occlusions across extreme viewpoints and camera trajectories.
7. Significance and Applications
The DW-Mesh construct has become pivotal for:
- Efficient, high-fidelity multiview or monocular mesh recovery with explicit topology guarantees.
- Generation of physically and geometrically consistent 4D videos under extreme viewpoint changes, where occlusion modeling is essential.
- Enabling robust, mask-driven self-supervision for learning frameworks in the absence of true multi-view data.
- Serving as an export-ready format for real-time rendering and relighting, bridging optimization-based and generative paradigms.
The explicit encoding of occluded regions and watertightness principles inherent to DW-Mesh ensure geometric regularity under novel camera poses, making it a foundational primitive for contemporary vision, graphics, and 4D scene synthesis pipelines (Lin et al., 2022, Hu et al., 5 Jun 2025).