Depth Watertight Mesh (DW-Mesh)
- DW-Mesh is a watertight 3D surface representation built from depth data, forming closed, manifold meshes with explicit occlusion encoding.
- It uses depth unprojection, boundary capping, and consistent triangulation to deliver photorealistic and artifact-free mesh reconstructions.
- DW-Mesh underpins efficient multiview textured recovery and 4D video synthesis by ensuring stable geometry across varying viewpoints.
A Depth Watertight Mesh (DW-Mesh) is a 3D surface representation constructed from estimated depth data that enforces watertightness and explicit occlusion encoding, enabling robust geometric priors for tasks such as multiview textured mesh recovery and monocular-to-4D video synthesis. DW-Mesh forms a closed, manifold mesh structure by leveraging depth unprojection, boundary capping, occlusion detection, and face-connectivity strategies, and is designed to remain free of geometric inconsistencies (e.g., holes, boundary artifacts) under arbitrary viewpoints. This approach has been adopted for efficient, photometrically faithful mesh reconstruction from multi-view or monocular image inputs, serving as the geometric backbone in recent systems such as differentiable rendering pipelines (Lin et al., 2022) and as the core spatial representation for learning-based 4D video frameworks (Hu et al., 5 Jun 2025).
1. Mathematical Foundations and Formal Definition
For a given image (or video frame) $I \in \mathbb{R}^{H \times W \times 3}$, a DW-Mesh is defined as
$$\mathcal{M} = (V, F, T, O)$$
where:
- $V = \{v_{ij}\}$ is the set of vertices, one per input pixel $(i, j)$,
- $F$ is the set of triangular faces (each a 3-element subset of $V$),
- $T = \{T_f\}$ are per-face RGB textures,
- $O = \{O_f\} \in \{0, 1\}^{|F|}$ denotes a binary occlusion/degeneracy flag per face.
Given a per-pixel depth map $D \in \mathbb{R}^{H \times W}$, each vertex is unprojected by
$$v_{ij} = \mathbf{o} + D(i, j)\,\mathbf{r}_{ij},$$
where $\mathbf{o}$ is the camera origin and $\mathbf{r}_{ij}$ are the per-pixel ray directions. Boundary pixels are set to a large depth $d_{\max}$ to ensure surface closure, and two triangular "cap" faces are added between the border corners.
Face connectivity is standardized: each $2 \times 2$ pixel block forms two triangles,
$$\{v_{i,j},\, v_{i+1,j},\, v_{i,j+1}\} \quad \text{and} \quad \{v_{i+1,j},\, v_{i+1,j+1},\, v_{i,j+1}\}.$$
Occlusion/degeneracy is determined per face by
$$O_f = \mathbb{1}\big[\theta_{\min}(f) < \tau_\theta \;\lor\; \Delta d_{\max}(f) > \tau_d\big],$$
where $\theta_{\min}(f)$ is the minimum internal angle of face $f$, $\Delta d_{\max}(f)$ its maximum depth discontinuity, and $\tau_\theta$, $\tau_d$ are geometric thresholds. Degenerate (occluded) faces ($O_f = 1$) are textured black ($T_f = (0, 0, 0)$); all other faces keep their original pixel color.
The result is a watertight, manifold mesh that explicitly models both visible and occluded surface regions (Hu et al., 5 Jun 2025).
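The unprojection and face-connectivity rules above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function and parameter names are invented, a simple pinhole camera at the origin is assumed, and the two cap faces are omitted for brevity.

```python
import numpy as np

def dw_mesh_vertices_faces(depth, fx, fy, cx, cy, d_max=100.0):
    """Illustrative sketch: boundary padding, ray unprojection, and
    consistent grid triangulation (cap faces omitted for brevity)."""
    D = depth.copy()
    # Push boundary pixels to a large depth to close the surface.
    D[0, :] = D[-1, :] = d_max
    D[:, 0] = D[:, -1] = d_max

    H, W = D.shape
    j, i = np.meshgrid(np.arange(W), np.arange(H))
    # Unproject via pinhole rays: v_ij = o + D(i,j) * r_ij, camera at origin.
    rays = np.stack([(j - cx) / fx, (i - cy) / fy, np.ones_like(D)], axis=-1)
    V = (D[..., None] * rays).reshape(-1, 3)

    # Two triangles per 2x2 pixel block, with consistent winding.
    idx = np.arange(H * W).reshape(H, W)
    a, b = idx[:-1, :-1].ravel(), idx[1:, :-1].ravel()
    c, d = idx[:-1, 1:].ravel(), idx[1:, 1:].ravel()
    F = np.concatenate([np.stack([a, b, c], -1), np.stack([b, d, c], -1)])
    return V, F
```

Because every pixel becomes a vertex and every interior $2\times 2$ block contributes exactly two faces, an $H \times W$ depth map yields $HW$ vertices and $2(H-1)(W-1)$ grid faces.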
2. DW-Mesh Construction Algorithms
The DW-Mesh construction pipeline consists of the following deterministic and differentiable stages, tailored for multiview mesh recovery and 4D video synthesis:
Step 1: Depth Prediction
- For single images/monocular video, apply a pre-trained depth estimator to obtain a depth map $D_t$ per frame.
- For multi-view settings, multi-view stereo (MVS) produces per-view depth hypotheses $\{D_k\}$.
Step 2: Boundary Padding
- Set all border pixels’ depth to a large constant $d_{\max}$, ensuring closure.
Step 3: Vertex Unprojection
- Unproject each pixel to 3D via camera intrinsics/extrinsics: $v_{ij} = \mathbf{o} + D(i, j)\,\mathbf{r}_{ij}$.
Step 4: Face Construction
- Connect vertices into face sets using a consistent triangulation, and add cap faces.
Step 5: Occlusion Detection and Texturing
- For each face, compute minimum internal angle and maximum depth jump.
- Set occlusion flag accordingly.
- Assign the per-face texture as black ($T_f = (0, 0, 0)$) if occluded or degenerate, otherwise the original color.
Step 6: Mesh Assembly
- Collate $(V, F, T, O)$ to form the frame mesh $\mathcal{M}$.
This mesh can be rendered from arbitrary viewpoints for mask/raster prior generation, or supplied as a geometric proxy for further scene understanding (Hu et al., 5 Jun 2025).
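Step 5 above, the per-face degeneracy test, might look like the following NumPy sketch. The threshold values are illustrative assumptions, not numbers from the paper, and the depth jump is approximated here by the z-spread of each face's corners.

```python
import numpy as np

def occlusion_flags(V, F, tau_angle_deg=1.0, tau_depth=0.5):
    """Flag faces that are near-degenerate (tiny minimum internal angle)
    or that span a depth discontinuity. Thresholds are illustrative."""
    tri = V[F]                          # (num_faces, 3, 3) corner coordinates
    e0 = tri[:, 1] - tri[:, 0]          # edge vectors around each triangle
    e1 = tri[:, 2] - tri[:, 1]
    e2 = tri[:, 0] - tri[:, 2]

    def angle(u, v):
        cosv = np.einsum('ij,ij->i', u, v) / (
            np.linalg.norm(u, axis=1) * np.linalg.norm(v, axis=1) + 1e-12)
        return np.degrees(np.arccos(np.clip(cosv, -1.0, 1.0)))

    # Minimum internal angle per face.
    angs = np.stack([angle(e0, -e2), angle(e1, -e0), angle(e2, -e1)], axis=1)
    min_angle = angs.min(axis=1)
    # Maximum depth (z) jump across the three corners.
    z = tri[..., 2]
    depth_jump = z.max(axis=1) - z.min(axis=1)
    return (min_angle < tau_angle_deg) | (depth_jump > tau_depth)
```

Faces straddling a foreground/background boundary get a large depth jump and are flagged, which is exactly what lets the mesh encode occlusion explicitly rather than leaving a hole.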
3. DW-Mesh in Multi-View Textured Mesh Recovery
DW-Mesh serves as the foundation of efficient multiview surface and texture reconstruction pipelines, such as in differentiable rendering with physically-based losses. The workflow is characterized by a coarse-to-fine optimization:
- Initialization: a volumetric visual hull is constructed on a coarse grid, and an initial mesh is extracted via marching cubes.
- Coarse Shape Optimization: sample 10,000 oriented points from the mesh, apply a differentiable Poisson solver (spectral-domain Poisson surface reconstruction) to obtain an SDF, then extract a mesh via differentiable marching cubes. Render depth and silhouettes and match them to MVS predictions and observed silhouettes via backpropagation.
- Fine Shape and Texture Optimization: increase the SDF grid resolution and iterate with a learnable 7-channel dense texture volume, trilinearly interpolated to per-vertex materials. Jointly optimize geometry and texture with depth, silhouette, and photometric-consistency losses under a Cook–Torrance BRDF and environment lighting.
- Export: Final mesh is watertight, per-vertex materialized, and directly usable for real-time relighting (Lin et al., 2022).
This pipeline explicitly leverages depth-based watertight priors to regularize the geometry and ensures high-resolution, topologically robust reconstructions.
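As a concrete anchor for the photometric term, the specular lobe of a Cook–Torrance BRDF can be evaluated as below. The paper specifies Cook–Torrance shading; the particular GGX distribution, Schlick-GGX geometry factor, and Fresnel-Schlick terms combined here are a common textbook formulation assumed for illustration, not lifted from the paper.

```python
import numpy as np

def cook_torrance_specular(n, v, l, roughness=0.3, f0=0.04):
    """Standard GGX-style Cook-Torrance specular term for unit vectors
    n (normal), v (view), l (light). Illustrative formulation."""
    h = (v + l) / np.linalg.norm(v + l)          # half vector
    nl, nv = max(n @ l, 1e-6), max(n @ v, 1e-6)
    nh, hv = max(n @ h, 0.0), max(h @ v, 0.0)
    a2 = roughness ** 4                          # alpha = roughness^2 remap
    D = a2 / (np.pi * (nh * nh * (a2 - 1) + 1) ** 2)   # GGX distribution
    k = (roughness + 1) ** 2 / 8                 # Schlick-GGX factor
    G = (nl / (nl * (1 - k) + k)) * (nv / (nv * (1 - k) + k))
    F = f0 + (1 - f0) * (1 - hv) ** 5            # Fresnel-Schlick
    return D * G * F / (4 * nl * nv)
```

During fine optimization, a rendered color built from such a term (plus diffuse and environment lighting) is compared against the observed pixels, and gradients flow back to the per-vertex material channels.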
4. Masking and Supervision Strategies
In EX-4D (Hu et al., 5 Jun 2025), the DW-Mesh enables simulated multi-view training signals from monocular videos via mask synthesis:
- Rendering masks: rasterize the DW-Mesh from novel viewpoints to obtain binary visibility masks, then apply morphological dilation. These masks define the visible regions used for pseudo multi-view supervision.
- Tracking masks: Keypoints are tracked across time, and rectangular masks are drawn for persistent simulated occlusions, ensuring temporally consistent, geometry-aware training signals.
All masks and masked color frames are fed as conditions into a video diffusion model for synthesizing physically consistent, camera-controllable 4D outputs.
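The mask-dilation step can be sketched with plain NumPy. The 8-connected structuring element and the iteration count are assumptions for illustration; the paper does not specify them here.

```python
import numpy as np

def dilate_mask(mask, iterations=2):
    """Grow a binary visibility mask so that pseudo multi-view
    supervision tolerates thin rasterization artifacts at boundaries.
    Kernel (8-connected) and iteration count are illustrative."""
    m = mask.astype(bool)
    for _ in range(iterations):
        p = np.pad(m, 1)  # zero-pad so shifts stay in bounds
        m = (m | p[:-2, 1:-1] | p[2:, 1:-1] | p[1:-1, :-2] | p[1:-1, 2:]
               | p[:-2, :-2] | p[:-2, 2:] | p[2:, :-2] | p[2:, 2:])
    return m
```

Each iteration ORs the mask with its eight one-pixel shifts, so a single visible pixel grows to a 3×3 block after one pass and a 5×5 block after two.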
5. Model Architecture and Optimization Schemes
DW-Mesh is integrated within both optimization-driven and learning-based systems:
Multiview Mesh Learning (Lin et al., 2022):
- SDF/mesh training alternates between geometry refinement (Adam optimizer; 150+150 epochs; 10,000/60,000 sampled points) and texture learning (a dense texture grid, progressively upsampled), with joint depth, silhouette, and photometric losses.
Video Diffusion Adapter (Hu et al., 5 Jun 2025):
- Frozen diffusion backbones are conditioned by a LoRA-based adapter (140 M parameters), trained with the standard denoising-diffusion objective.
- A video VAE encodes prior information from the color-masked and binary-mask videos; the resulting features are merged via Conv3D layers and injected at each block of the UNet.
- Mask-aware priors enhance 4D video consistency under camera motions up to ±90°, with only the small adapter/LoRA weights trained.
No additional explicit regularization is imposed beyond mask-driven and geometry-driven consistency; stability relies on the mesh’s inherent properties and data augmentations (cropping along Bézier paths).
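The LoRA-style conditioning can be illustrated with a minimal linear-layer sketch: a frozen weight matrix is augmented by a trainable low-rank update. The function name, shapes, and scaling convention below are illustrative assumptions; EX-4D applies this idea throughout the UNet blocks of a frozen video diffusion backbone.

```python
import numpy as np

def lora_linear(x, W, A, B, alpha=1.0):
    """Minimal LoRA sketch: output = x @ (W + alpha * B @ A)^T, with W
    frozen and only the low-rank factors A (r, d_in) and B (d_out, r)
    trainable. Shapes and scaling are illustrative, not from the paper."""
    return x @ W.T + alpha * (x @ A.T) @ B.T
```

Because only $A$ and $B$ carry gradients, the trainable parameter count scales with the rank $r$ rather than with the full weight matrix, which is why only the small adapter weights need training.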
6. Empirical Performance and Evaluation
DW-Mesh-based pipelines exhibit strong quantitative and qualitative results:
Multi-View Mesh Recovery (Lin et al., 2022):
- DTU dataset: Chamfer distance of 0.68 mm, outperforming IDR (0.90 mm), MVSDF (0.88 mm), and Vis-MVS (0.88 mm); PSNR 26.04 dB, SSIM 0.91; rendering takes 0.04 s per 1600×1200 frame (roughly 30× faster than alternatives).
- EPFL dataset: Chamfer distance (×10⁻²) of 6.01 and PSNR 27.52 dB, consistently better than earlier mesh-based and implicit methods.
- Ablations: depth supervision is essential; omitting it degrades the Chamfer distance from 0.68 mm to 3.46 mm. The silhouette loss eliminates spurious surface artifacts, and visual-hull and sphere initializations yield equivalent outcomes.
- Qualitative: Topology-agnostic, watertight, and high-detail models with robust real-time relighting.
4D Video Generation (Hu et al., 5 Jun 2025):
- EX-4D demonstrates superior physical and geometric consistency in camera-controllable videos from monocular input, with DW-Mesh preventing boundary artifacts and holes, and providing accurately rendered occlusions across extreme viewpoints and camera trajectories.
7. Significance and Applications
The DW-Mesh construct has become pivotal for:
- Efficient, high-fidelity multiview or monocular mesh recovery with explicit topology guarantees.
- Generation of physically and geometrically consistent 4D videos under extreme viewpoint changes, where occlusion modeling is essential.
- Enabling robust, mask-driven self-supervision for learning frameworks in the absence of true multi-view data.
- Serving as an export-ready format for real-time rendering and relighting, bridging optimization-based and generative paradigms.
The explicit encoding of occluded regions and watertightness principles inherent to DW-Mesh ensure geometric regularity under novel camera poses, making it a foundational primitive for contemporary vision, graphics, and 4D scene synthesis pipelines (Lin et al., 2022, Hu et al., 5 Jun 2025).