Synthetic Video Desmoking Dataset
- The Synthetic Video Desmoking Dataset (STSVD) is a large-scale video resource built with physics-based smoke augmentation and precise, temporally consistent annotations.
- The dataset is generated via a wavelet-turbulence fluid model coupled with a cavity-pressure solver to simulate anatomically realistic smoke effects.
- STSVD offers balanced multi-class smoke types and high-resolution, supervised video sequences that enable advanced spatio-temporal benchmarking in surgical smoke removal.
A synthetic video desmoking dataset is a resource constructed by algorithmically compositing simulated smoke onto real, smoke-free laparoscopic video frames, paired with pixel-perfect and temporally consistent smoke annotations. The Synthetic Video Desmoking Dataset (STSVD) is the first large-scale video dataset for this domain that provides explicit smoke-type annotations and models fine-grained physical phenomena, supporting research on the recognition, segmentation, and removal of surgical smoke in endoscopic imagery (Liang et al., 2 Dec 2025).
1. Dataset Generation Pipeline
STSVD is created via physics-based synthetic augmentation of laparoscopic imaging data. Clean source videos at 720×1080 resolution are sampled from the Cholec80, M2CAI16, and Hamlyn datasets. For each clean clip, smoke is introduced using a wavelet-turbulence-driven fluid model, following Kim et al. (2008), augmented with a cavity-pressure solver to account for the pressurized environment typical of minimally invasive surgery.
At each frame $t$, a 2D smoke opacity mask $M_t$ is computed with a volumetric renderer implementing the radiative transfer equation under a single-scattering assumption:

$$I_t(x) = J_t(x)\,e^{-\beta_t d_t(x)} + A\left(1 - e^{-\beta_t d_t(x)}\right)$$

where $J_t$ is the underlying clean frame, $\beta_t$ the per-frame scattering coefficient, $d_t$ the local smoke depth, and $A$ the atmospheric light term that models atmospheric scattering. This model is discretized by rendering 8-bit masks, then composited with the clean video:

$$\tilde{I}_t(x) = \psi\big(\alpha_t(x)\,A + (1 - \alpha_t(x))\,J_t(x)\big)$$

Here, $\alpha_t(x) = 1 - e^{-\beta_t d_t(x)}$ is derived from the per-pixel opacity, and $\psi$ introduces post-processing (motion blur, chromatic shift) to simulate camera artifacts. The origin of diffusion smoke is tightly aligned to cautery tool tips detected by a CNN-based detector, ensuring anatomical plausibility. Because the rendering pipeline guarantees per-pixel mask-to-image registration, no manual alignment or correction is required.
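The scattering-and-compositing step above can be sketched as follows. This is a minimal illustration of the single-scattering model, not the authors' released code; the function name, the scalar atmospheric-light parameter, and the omission of the post-processing step $\psi$ are all assumptions:

```python
import numpy as np

def composite_smoke(clean, depth, beta, atmo=1.0):
    """Composite synthetic smoke onto a clean frame with the
    single-scattering model I = J*exp(-beta*d) + A*(1 - exp(-beta*d)).

    clean : HxWx3 float array in [0, 1], the smoke-free frame J_t
    depth : HxW float array, local smoke depth d_t
    beta  : scalar per-frame scattering coefficient
    atmo  : atmospheric light A (assumed scalar here)
    """
    # Transmission t(x) = exp(-beta * d(x)); per-pixel opacity alpha = 1 - t
    transmission = np.exp(-beta * depth)
    alpha = 1.0 - transmission
    smoked = clean * transmission[..., None] + atmo * alpha[..., None]
    # Discretize the opacity into an 8-bit mask, as in the dataset release
    mask8 = np.clip(alpha * 255.0, 0, 255).astype(np.uint8)
    return smoked, mask8
```

Pixels with zero smoke depth keep the clean color and a zero mask value; as depth grows, the pixel converges to the atmospheric light.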
2. Dataset Composition
STSVD contains 120 video sequences, each 100 frames in length (3.3 seconds at 30 fps), at a fixed resolution of 720×1080. All sequences are generated for supervised training, without official train/validation/test splits in the release. Each video is classified as one of three smoke types: Diffusion, Ambient, or Entangled, with each type evenly represented:
| Smoke Type | Videos | Frames | % of Dataset |
|---|---|---|---|
| Diffusion | 40 | 4,000 | 33.3% |
| Ambient | 40 | 4,000 | 33.3% |
| Entangled | 40 | 4,000 | 33.3% |
This balanced breakdown enables direct comparative analysis across all smoke categories.
3. Labeling and Annotation Protocol
Annotation in STSVD is fully automated, a byproduct of the rendering pipeline. For each frame $t$, two 8-bit PNG opacity masks (quantized from the rendered float-valued opacities) are provided:
- $M^{\mathrm{diff}}_t$: per-pixel opacity for diffusion smoke
- $M^{\mathrm{amb}}_t$: per-pixel opacity for ambient smoke
Each video is labeled at the sequence level with its dominant smoke type (`diffusion`, `ambient`, or `entangled`). The masks are temporally indexed and adhere to a strict naming convention to guarantee frame alignment. Since all annotations are rendered computationally, traditional metrics such as manual inter-rater agreement or human correction logs are inapplicable; pixel-wise accuracy is inherent. For optional evaluation, Intersection-over-Union (IoU) may be computed per mask:

$$\mathrm{IoU} = \frac{|P \cap G|}{|P \cup G|}$$

where $P$ and $G$ are the binarized predicted and ground-truth masks.
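The optional IoU evaluation can be sketched as below; the function name and the binarization threshold are illustrative assumptions, since the source does not specify them:

```python
import numpy as np

def mask_iou(pred, gt, threshold=0):
    """Intersection-over-Union between two opacity masks.

    pred, gt : HxW uint8 opacity masks (0-255); pixels with value
    strictly above `threshold` are treated as smoke.
    """
    p = pred > threshold
    g = gt > threshold
    union = np.logical_or(p, g).sum()
    if union == 0:
        return 1.0  # both masks empty: define IoU as perfect agreement
    return float(np.logical_and(p, g).sum() / union)
```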
4. Descriptive Statistics and Visualization
While aggregate mask coverage statistics are not directly tabulated, frame-wise and sequence-level coverage can be formalized as

$$\alpha_{\mathrm{type}} = \frac{1}{T\,H\,W}\sum_{t=1}^{T} \sum_{i=1}^{H} \sum_{j=1}^{W} \mathbf{1}\big(M_{\mathrm{type}}(t,i,j) > 0\big)$$

This quantity gives the mean per-pixel smoke presence for each type across the dataset. Event durations can be inferred as runs of consecutive frames with nonzero mask coverage.
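The coverage statistic and event-duration inference can be sketched as follows; the `> 0` binarization mirrors the indicator in the formula, while the function names are illustrative:

```python
import numpy as np

def smoke_coverage(masks):
    """Mean per-pixel smoke presence alpha_type over a sequence.

    masks : T x H x W array of opacity masks for one smoke type.
    Returns the fraction of pixel-frames with nonzero opacity.
    """
    return float((masks > 0).mean())

def event_durations(masks):
    """Lengths of runs of consecutive frames with any smoke coverage."""
    active = (masks > 0).any(axis=(1, 2))  # per-frame: any smoke at all?
    runs, count = [], 0
    for frame_has_smoke in active:
        if frame_has_smoke:
            count += 1
        elif count:
            runs.append(count)
            count = 0
    if count:
        runs.append(count)
    return runs
```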
Exemplar frames with overlaid binary masks are presented in the main paper (Fig. 1) to differentiate between smoke types, and in Supplementary Fig. A.4, which displays multitype, anatomically diverse, and temporally consistent scenarios. By convention, the color mapping uses red for diffusion smoke and blue for ambient smoke.
5. Access, Distribution, and Licensing
STSVD is hosted at https://simon-leong.github.io/STSVD/. The release includes all videos and mask annotations. The license is restricted to "academic use only"; precise terms are detailed on the project website. There is no restriction on mask use for method development or benchmarking beyond the academic license itself.
6. Comparative Analysis With Existing Datasets
A distinguishing feature of STSVD lies in its video-based, physics-driven design, contrasting with prior synthetic smoke datasets, which are limited to image-level, single-type, low-resolution data:
| Dataset | Type | Resolution | Frames | # Smoke Types | Turbulence / Pressure |
|---|---|---|---|---|---|
| MARS-GAN | Image | 256×256 | 18,000 | 1 | no / no |
| PFAN | Image | 480×480 | 660 | 1 | no / no |
| PSv2rs | Image | 256×256 | 54,420 | 1 | no / no |
| STSVD | Video | 720×1080 | 12,000 | 3 | yes / yes |
STSVD uniquely offers:
- Temporal continuity (100-frame video clips), enabling spatio-temporal learning.
- Clinical-scale definition (720×1080), matching endoscopic system output.
- Coverage of three distinct smoke physical regimes: diffusion, ambient, and entangled.
- Simulation based on both turbulence and cavity pressure, for greater physical realism.
- Automatic, tool-tip–localized injection sites for accurate anatomical context.
Together, these factors establish STSVD as the first resource to enable supervised and benchmarked development of smoke-type-aware video desmoking algorithms (Liang et al., 2 Dec 2025).