WoundNeRF: Neural 3D Wound Segmentation
- WoundNeRF is an SDF-based neural field system that integrates multi-view RGB images to produce a continuous, multi-view consistent 3D wound segmentation.
- It leverages dual MLPs—one for geometry and one for appearance—to convert 3D coordinates and view directions into accurate volumetric reconstructions.
- Robust training combines fine-tuning on noisy 2D annotations with volumetric and semantic losses, delivering superior segmentation accuracy over traditional methods.
WoundNeRF is a signed distance function (SDF)-based neural field system for multi-view consistent 3D wound segmentation from standard RGB images. Developed within the NeRF + SDF framework and augmented by a semantic decoder, WoundNeRF aggregates automatically generated 2D wound annotations into a unified 3D segmentation, addressing longstanding challenges in acquiring robust, view-consistent representations of wound-bed tissue from sparse and noisy clinical imagery (Chierchia et al., 23 Jan 2026).
1. Underlying Framework and Model Architecture
WoundNeRF operates within the Neural Radiance Fields (NeRF) + Signed Distance Function (SDF) paradigm, closely following the architecture of NeuS [wang2021neus], but with additional semantic segmentation capability. The backbone comprises two multilayer perceptrons (MLPs):
- Geometry MLP: Receives a 3D coordinate $\mathbf{x} \in \mathbb{R}^3$; outputs the signed distance $d(\mathbf{x})$ to the nearest surface and a latent feature vector $\mathbf{z}(\mathbf{x})$. The SDF is translated to a volumetric density via the logistic density $\sigma(\mathbf{x}) = s\,e^{-s\,d(\mathbf{x})} / \bigl(1 + e^{-s\,d(\mathbf{x})}\bigr)^2$, where $s$ is a learned scale.
- Appearance MLP: Consumes a 3D position $\mathbf{x}$, a view direction $\mathbf{v}$, and the geometry feature $\mathbf{z}(\mathbf{x})$; outputs an RGB value $\mathbf{c}(\mathbf{x}, \mathbf{v})$.
Color rendering along camera rays $\mathbf{r}(t) = \mathbf{o} + t\,\mathbf{v}$ is performed via volumetric integration
$$\hat{C}(\mathbf{r}) = \int_{t_n}^{t_f} T(t)\,\sigma(\mathbf{r}(t))\,\mathbf{c}(\mathbf{r}(t), \mathbf{v})\,dt,$$
with accumulated transmittance
$$T(t) = \exp\!\left(-\int_{t_n}^{t} \sigma(\mathbf{r}(u))\,du\right).$$
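As a concrete illustration of this rendering step, the following NumPy sketch discretizes the integration along one ray: SDF samples are converted to densities with a logistic-density mapping and then alpha-composited. This is not the authors' code; the exact SDF-to-density conversion, the scale `s = 10`, and the sample count are assumptions.

```python
import numpy as np

def sdf_to_density(d, s=10.0):
    """Logistic-density mapping (assumed NeuS-style): peaks where the SDF crosses zero."""
    e = np.exp(-s * d)
    return s * e / (1.0 + e) ** 2

def render_ray(sdf_vals, colors, deltas):
    """Discretized volume rendering: alpha-composite per-sample colors.
    sdf_vals: (N,) SDF samples along the ray; colors: (N, 3); deltas: (N,) step sizes."""
    sigma = sdf_to_density(sdf_vals)
    alpha = 1.0 - np.exp(-sigma * deltas)                            # per-sample opacity
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alpha]))[:-1]    # transmittance T_i
    weights = trans * alpha
    return (weights[:, None] * colors).sum(axis=0), weights

# Toy ray crossing a surface at t = 0.5: the SDF decreases linearly through zero.
t = np.linspace(0.0, 1.0, 64)
sdf = 0.5 - t
rgb = np.tile([0.8, 0.2, 0.1], (64, 1))
color, w = render_ray(sdf, rgb, np.full(64, t[1] - t[0]))
```

The rendering weights concentrate near the zero-crossing of the SDF, which is why supervising rendered colors (and, later, rendered semantics) constrains the surface itself.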
Through this mechanism, every input image contributes not just pixel information, but constraints on the underlying 3D semantic and geometric field.
2. Semantic Segmentation Extraction in 3D and 2D
The geometry MLP is extended with a semantic head mapping the latent feature $\mathbf{z}(\mathbf{x})$ to per-class logits $\ell_k(\mathbf{x})$, $k \in \{0, \dots, 5\}$, across six categories: background and five wound-bed tissue classes (granulation, slough, necrotic, epithelial, unknown).
To facilitate wound-bed prediction, the five tissue-class logits are aggregated using a log-sum-exp operation:
$$\ell_{\text{wound}}(\mathbf{x}) = \log \sum_{k=1}^{5} \exp\bigl(\ell_k(\mathbf{x})\bigr).$$
Class probabilities $p_k(\mathbf{x})$ are produced via a softmax over the logits at each location.
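The aggregation can be illustrated with a short NumPy sketch (class ordering and array shapes are assumptions, not taken from the paper):

```python
import numpy as np

def wound_bed_logit(logits):
    """Aggregate the five tissue-class logits (indices 1..5; index 0 is background)
    into a single wound-bed logit via a numerically stable log-sum-exp."""
    tissue = logits[..., 1:]
    m = tissue.max(axis=-1, keepdims=True)
    return (m + np.log(np.exp(tissue - m).sum(axis=-1, keepdims=True))).squeeze(-1)

def class_probs(logits):
    """Per-class probabilities via softmax over the six logits."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

logits = np.array([2.0, 0.5, 1.0, -1.0, 0.0, -2.0])  # background + 5 tissue classes
p = class_probs(logits)
lw = wound_bed_logit(logits)
```

A useful property of this choice: a two-way softmax over the background logit and the log-sum-exp of the tissue logits recovers exactly the summed tissue probability, so the aggregated logit is a consistent wound-vs-background score rather than an ad-hoc heuristic.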
The model yields pixel-wise segmentations by volume-rendering semantic probabilities along each ray:
$$\hat{S}_k(\mathbf{r}) = \int_{t_n}^{t_f} T(t)\,\sigma(\mathbf{r}(t))\,p_k(\mathbf{r}(t))\,dt.$$
3D wound masks are extracted by thresholding the SDF, $d(\mathbf{x}) \le 0$, or by applying a probabilistic occupancy threshold, $p_{\text{wound}}(\mathbf{x}) > \tau$. This dual 2D/3D representation enables straightforward projection into acquired viewpoints and robust spatial integration.
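A minimal sketch of the 3D mask extraction, assuming per-voxel SDF values and wound probabilities and an illustrative threshold of 0.5 (the paper's actual threshold is not stated here):

```python
import numpy as np

def wound_mask_3d(sdf, p_wound, tau=0.5):
    """Voxel-level wound mask: inside the surface (SDF <= 0) AND wound
    probability above the threshold tau (tau = 0.5 is an assumed value)."""
    return (sdf <= 0.0) & (p_wound > tau)

# Four illustrative voxels: only those both inside the surface and
# confidently wound-labeled survive.
sdf = np.array([-0.10, -0.05, 0.20, -0.30])
p   = np.array([ 0.90,  0.30, 0.80,  0.60])
mask = wound_mask_3d(sdf, p)
```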
3. Data Annotation and Supervision Protocols
Annotation begins with fine-tuning a SegFormer model on a small expert-annotated subset (1–4 views per patient), generating “noisy” 2D masks for all 50 frames per wound video. These serve only as supervisory signals: the network learns a single 3D field whose projections must align with all available 2D masks.
Two objectives drive the optimization:
- Volumetric RGB Reconstruction:
$$\mathcal{L}_{\text{RGB}} = \frac{1}{|\mathcal{R}|} \sum_{\mathbf{r} \in \mathcal{R}} \bigl\| \hat{C}(\mathbf{r}) - C(\mathbf{r}) \bigr\|_1,$$
with $C(\mathbf{r})$ the true observed pixel color.
- Segmentation Consistency:
$$\mathcal{L}_{\text{seg}} = -\frac{1}{|\mathcal{R}|} \sum_{\mathbf{r} \in \mathcal{R}} \sum_{k} w_k\, y_k(\mathbf{r}) \log \hat{S}_k(\mathbf{r}),$$
a weighted cross-entropy on rendered per-ray probabilities, mitigating class imbalance through class weights $w_k$.
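The segmentation objective can be sketched in NumPy as follows; the class weights shown are illustrative placeholders, not the paper's values:

```python
import numpy as np

def seg_loss(p_hat, y, w):
    """Weighted cross-entropy on rendered per-ray class probabilities.
    p_hat: (R, K) rendered probabilities; y: (R,) integer labels; w: (K,) class weights."""
    eps = 1e-8  # numerical floor before the log
    return -(w[y] * np.log(p_hat[np.arange(len(y)), y] + eps)).mean()

p_hat = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1]])
y = np.array([0, 1])
w = np.array([1.0, 2.0, 2.0])   # up-weight minority tissue classes (illustrative)
loss = seg_loss(p_hat, y, w)
```

Up-weighting rare tissue classes prevents the dominant background class from swamping the gradient, which is the class-imbalance mitigation the text describes.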
This protocol ensures that spatial consistency is enforced by grounding all 2D supervisory information within a single, continuous 3D representation.
4. Optimization, Regularization, and Training Regime
The overall training objective is
$$\mathcal{L} = \mathcal{L}_{\text{RGB}} + \lambda_{\text{seg}}\,\mathcal{L}_{\text{seg}} + \lambda_{\text{eik}}\,\mathcal{L}_{\text{eik}},$$
where $\mathcal{L}_{\text{eik}}$ is an Eikonal regularizer:
$$\mathcal{L}_{\text{eik}} = \mathbb{E}_{\mathbf{x}} \Bigl[ \bigl( \|\nabla_{\mathbf{x}} d(\mathbf{x})\| - 1 \bigr)^2 \Bigr].$$
Training proceeds in two stages:
- Geometry MLP alone ($\mathcal{L}_{\text{RGB}} + \lambda_{\text{eik}}\,\mathcal{L}_{\text{eik}}$) for 50,000 iterations.
- Semantic head attached, full objective for an additional 50,000 iterations. The Adam optimizer is used with a batch size of 1024 rays.
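The Eikonal regularizer enforces unit-norm SDF gradients. A numerical sketch follows, approximating the gradient with central finite differences (a real implementation would use autograd); a true SDF, such as the distance to a unit sphere, drives the penalty to near zero, while a mis-scaled field does not:

```python
import numpy as np

def eikonal_loss(sdf_fn, pts, h=1e-4):
    """Eikonal penalty E[(||grad d|| - 1)^2], gradient via central finite differences."""
    grads = np.stack([
        (sdf_fn(pts + h * e) - sdf_fn(pts - h * e)) / (2.0 * h)
        for e in np.eye(3)
    ], axis=-1)
    norm = np.linalg.norm(grads, axis=-1)
    return ((norm - 1.0) ** 2).mean()

# Exact SDF of the unit sphere: gradient norm is 1 everywhere away from the origin.
sphere_sdf = lambda p: np.linalg.norm(p, axis=-1) - 1.0
pts = np.random.default_rng(0).normal(size=(128, 3))
loss = eikonal_loss(sphere_sdf, pts)          # near zero for a true SDF
bad = eikonal_loss(lambda p: 2.0 * sphere_sdf(p), pts)  # gradient norm 2 -> penalty near 1
```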
5. Experimental Dataset, Preprocessing, and Evaluation
WoundNeRF was validated on 73 wound-videos from 35 patients, each yielding approximately 50 frames with known camera poses and a 3D mesh reconstructed using structure-from-motion. For ground truth, 1–4 frames per wound were manually annotated.
Preprocessing steps include undistortion, cropping centered on the wound, color correction, and pose refinement using COLMAP. Evaluation employs two complementary strategies:
- 3D Dice & Recall: Voxelization of predicted wound region, compared to a pseudo-ground-truth mesh built from expert masks.
- 2D Back-projection: Rendering predicted 3D masks back into GT-annotated camera views, enabling computation of 2D Dice (DSC) and Recall.
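Both evaluation strategies reduce to Dice and recall on binary masks (voxelized in 3D, rendered pixels in 2D). A minimal NumPy sketch, with illustrative mask shapes:

```python
import numpy as np

def dice_recall(pred, gt):
    """Dice coefficient and recall between two binary masks (pixels or voxels)."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    dice = 2.0 * inter / (pred.sum() + gt.sum())
    recall = inter / gt.sum()
    return dice, recall

# Tiny 2x3 example: 2 overlapping pixels out of 3 predicted and 3 ground-truth.
pred = np.array([[1, 1, 0], [0, 1, 0]])
gt   = np.array([[1, 0, 0], [0, 1, 1]])
d, r = dice_recall(pred, gt)   # both 2/3 here
```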
6. Quantitative Performance and Robustness Analysis
The performance of WoundNeRF is presented alongside two baselines: SegFormer (2D) and a heuristic fusion of 2D masks on a mesh (3D/2D). Metrics are reported for wound-bed DSC, granulation DSC, and slough DSC, with corresponding Recall.
| Method | Wound bed DSC | Recall | Granulation DSC | Recall | Slough DSC | Recall |
|---|---|---|---|---|---|---|
| 2D (SegFormer) | 0.851 | 0.819 | 0.738 | 0.689 | 0.670 | 0.609 |
| 3D/2D | 0.855 | 0.840 | 0.761 | 0.719 | 0.682 | 0.614 |
| Ours (w/o DO) | 0.851 | 0.859 | 0.767 | 0.764 | 0.691 | 0.658 |
| Ours | 0.857 | 0.893 | 0.775 | 0.786 | 0.686 | 0.666 |
Robustness experiments—boundary jitter, erosion/dilation, reduced frames—show markedly less degradation for WoundNeRF than rasterization-based methods, attributed to implicit spatial regularization. Qualitatively, segmentation boundaries are smoother and more consistent; fewer spurious holes appear compared to 3D/2D mesh fusion, and label flickering seen in 2D methods is eliminated.
7. Model Properties, Limitations, and Future Developments
WoundNeRF's strengths include its true multi-view consistency (a single, continuous SDF + semantic field), implicit spatial regularization via volume rendering, and robustness to annotation noise. Every prediction is rooted in a continuous function, bypassing ad-hoc fusion heuristics and improving wound area measurability.
Several limitations remain: model quality depends on the initial automatic 2D annotations, so large errors in supervision induce bias; training requires several hours on a single GPU, though inference is rapid post-training; and unobserved wound regions are "hallucinated" by the network. Potential avenues for enhancement include confidence-driven active sampling, view-planning extensions, and the integration of shape or photometric priors for improved generalization in sparse-view regimes.
A plausible implication is that SDF-based neural fields, with semantic decoding and volumetric consistency, will continue to supplant mesh-based and purely 2D segmentation pipelines for complex medical analysis, given their demonstrated superiority in robustness, accuracy, and spatial continuity (Chierchia et al., 23 Jan 2026).