
Synthetic Video Desmoking Dataset

Updated 9 December 2025
  • Synthetic Video Desmoking Dataset (STSVD) is a large-scale video resource that uses physics-based smoke augmentation and precise, temporally consistent annotations.
  • The dataset is generated via a wavelet-turbulence fluid model coupled with a cavity-pressure solver to simulate anatomically realistic smoke effects.
  • STSVD offers balanced multi-class smoke types and high-resolution, supervised video sequences that enable advanced spatio-temporal benchmarking in surgical smoke removal.

A synthetic video desmoking dataset is a resource constructed by algorithmically compositing simulated smoke onto real, smoke-free laparoscopic video frames, paired with pixel-perfect and temporally consistent smoke annotations. The Synthetic Video Desmoking Dataset (STSVD) is the first large-scale video dataset for this domain that provides explicit smoke-type annotations and models fine-grained physical phenomena, supporting research on the recognition, segmentation, and removal of surgical smoke in endoscopic imagery (Liang et al., 2 Dec 2025).

1. Dataset Generation Pipeline

STSVD is created by physics-based synthetic augmentation of laparoscopic imaging data. Clean source videos at 720×1080 resolution are sampled from the Cholec80, M2CAI16, and Hamlyn datasets. For each clean clip, smoke is introduced using a wavelet-turbulence-driven fluid model, following Kim et al. (2008), augmented with a cavity-pressure solver to account for the pressurized environment typical of minimally invasive surgery.

At each frame $t$, a 2D smoke opacity mask $M(t)$ is computed with a volumetric renderer that implements the radiative transfer equation under a single-scattering assumption:

$$I_\text{smoke}(x,y,t) = I_c(x,y,t)\,e^{-\beta\,\ell(x,y,t)} + L_a\bigl(1 - e^{-\beta\,\ell(x,y,t)}\bigr)$$

where $I_c$ is the underlying clean frame, $\beta$ the per-frame scattering coefficient, $\ell(x,y,t)$ the local smoke depth, and $L_a$ is set to $[255,255,255]$ to model atmospheric scattering. Equivalently, $I_\text{smoke} = I_c + (L_a - I_c)\bigl(1 - e^{-\beta\,\ell}\bigr)$, so the smoke acts as a per-pixel blend toward $L_a$. This model is discretized by rendering 8-bit masks, which are then composited with the clean video:

$$I_s(t) = I_c(t) + \omega\,\mathrm{Aug}\left((L_a - I_c(t))\,\frac{M(t)}{255}\right)$$

Here, $\omega$ is derived from per-pixel opacity, and $\mathrm{Aug}$ applies post-processing (motion blur, chromatic shift) to simulate camera artifacts. The origin of diffusion smoke is aligned with cautery tool tips located by a CNN-based detector, which keeps the smoke source anatomically plausible. Because masks and frames come from the same rendering pipeline, mask-to-image registration is exact at the pixel level and no manual alignment or correction is required.
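The compositing rule above can be sketched in a few lines of NumPy. This is an illustrative sketch, not the authors' implementation: the function names, the identity placeholder for $\mathrm{Aug}$, the scalar or per-pixel form of $\omega$, and the mapping from smoke depth to an 8-bit mask are all assumptions made for the example.

```python
import numpy as np

L_A = np.array([255.0, 255.0, 255.0])  # atmospheric light L_a, as stated in the text


def opacity_mask(beta, depth):
    """8-bit mask from 1 - exp(-beta * depth). Assumed discretization, not from the source."""
    alpha = 1.0 - np.exp(-beta * np.asarray(depth, dtype=np.float64))
    return np.clip(np.rint(255.0 * alpha), 0, 255).astype(np.uint8)


def identity_aug(layer):
    """Hypothetical stand-in for Aug (motion blur, chromatic shift); returns its input."""
    return layer


def composite_smoke(clean, mask, omega=1.0, aug=identity_aug):
    """Compute I_s = I_c + omega * Aug((L_a - I_c) * M / 255).

    clean: (H, W, 3) uint8 clean frame I_c
    mask:  (H, W) uint8 opacity mask M
    omega: scalar or (H, W) weight (assumed form; the source derives it from opacity)
    """
    clean_f = clean.astype(np.float64)
    alpha = mask.astype(np.float64)[..., None] / 255.0  # broadcast over colour channels
    smoke_layer = aug((L_A - clean_f) * alpha)
    omega_arr = np.asarray(omega, dtype=np.float64)
    if omega_arr.ndim == 2:
        omega_arr = omega_arr[..., None]
    out = clean_f + omega_arr * smoke_layer
    return np.clip(np.rint(out), 0, 255).astype(np.uint8)
```

With `omega=1` and the identity augmentation, a mask value of 0 leaves the clean pixel unchanged and a mask value of 255 drives it to pure white, matching the two limits of the radiative transfer equation.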

2. Dataset Composition

STSVD contains 120 video sequences, each 100 frames long (about 3.3 seconds at 30 fps), at a fixed resolution of 720×1080. All sequences are generated for supervised training; the release does not define official train/validation/test splits. Each video is classified as one of three smoke types (Diffusion, Ambient, or Entangled), and the three types are evenly represented:

| Smoke Type | Videos | Frames | % of Dataset |
|------------|--------|--------|--------------|
| Diffusion  | 40     | 4,000  | 33.3%        |
| Ambient    | 40     | 4,000  | 33.3%        |
| Entangled  | 40     | 4,000  | 33.3%        |

This balanced breakdown enables direct comparative analysis across all smoke categories.
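Since the release ships without official splits, users need to define their own. One option, shown here as a sketch under stated assumptions rather than a recommended protocol, is a video-level split stratified by smoke type. The video identifiers, the 70/15/15 fractions, and the function name are hypothetical.

```python
import random
from collections import defaultdict


def stratified_split(labels, fractions=(0.7, 0.15, 0.15), seed=0):
    """Split video ids into train/val/test while preserving smoke-type balance.

    labels: dict mapping video id -> smoke type (ids are hypothetical here).
    Returns a dict mapping split name -> sorted list of video ids.
    """
    by_type = defaultdict(list)
    for vid, smoke_type in sorted(labels.items()):
        by_type[smoke_type].append(vid)

    rng = random.Random(seed)  # fixed seed so the split is reproducible
    splits = {"train": [], "val": [], "test": []}
    for smoke_type in sorted(by_type):
        vids = by_type[smoke_type]
        rng.shuffle(vids)
        n_train = round(fractions[0] * len(vids))
        n_val = round(fractions[1] * len(vids))
        splits["train"] += vids[:n_train]
        splits["val"] += vids[n_train:n_train + n_val]
        splits["test"] += vids[n_train + n_val:]
    return {name: sorted(ids) for name, ids in splits.items()}


# Hypothetical ids mirroring the 40/40/40 composition in the table above.
labels = {f"{t}_{i:03d}": t
          for t in ("diffusion", "ambient", "entangled")
          for i in range(40)}
splits = stratified_split(labels)
```

Splitting by video rather than by frame keeps all 100 frames of a clip in the same partition, which avoids leaking near-duplicate frames between training and evaluation.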

3. Labeling and Annotation Protocol

Annotation in STSVD is automated and produced directly by the rendering pipeline. For each frame $t$, two opacity masks are provided as 8-bit PNG files, each encoding continuous per-pixel opacity:

  • $M_{\rm diff}(t)$: per-pixel opacity for diffusion smoke
  • $M_{\rm amb}(t)$: per-pixel opacity for ambient smoke

Each video is labeled at the sequence level with its dominant smoke type: diffusion, ambient, or entangled. The masks are temporally indexed and follow a strict naming convention to guarantee alignment with the frames. Since all annotations are rendered computationally, traditional quality measures such as inter-rater agreement or human correction logs do not apply; pixel-wise accuracy is inherent. For optional evaluation, Intersection-over-Union (IoU) may be computed per mask:

$$\mathrm{IoU}(M_{\rm pred},M_{\rm gt}) = \frac{\sum_{x,y} \mathbf{1}\{M_{\rm pred}(x,y)>0.5 \wedge M_{\rm gt}(x,y)>0.5\}}{\sum_{x,y} \mathbf{1}\{M_{\rm pred}(x,y)>0.5 \vee M_{\rm gt}(x,y)>0.5\}}$$
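The IoU definition above translates directly into NumPy. In this minimal sketch, the function name is hypothetical, masks are assumed to be scaled to [0, 1] (for example, 8-bit values divided by 255) so that the 0.5 threshold is meaningful, and the return value of 1.0 for two empty masks is a convention chosen for the example, since the formula is undefined in that case.

```python
import numpy as np


def mask_iou(pred, gt, threshold=0.5):
    """IoU between two opacity masks after binarizing at `threshold`.

    pred, gt: arrays of the same shape with values assumed to lie in [0, 1].
    """
    pred_bin = np.asarray(pred) > threshold
    gt_bin = np.asarray(gt) > threshold
    union = np.logical_or(pred_bin, gt_bin).sum()
    if union == 0:
        # Both masks empty: the ratio is 0/0, so treat it as perfect
        # agreement (an assumption of this sketch, not from the source).
        return 1.0
    intersection = np.logical_and(pred_bin, gt_bin).sum()
    return float(intersection / union)
```

For a 2×2 example where the prediction covers the top row and the ground truth covers the left column, one pixel overlaps and three pixels are covered in total, giving an IoU of 1/3.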

4. Descriptive Statistics and Visualization

Aggregate mask coverage statistics are not tabulated, but frame-wise and sequence-level coverage can be formalized as $\alpha_{\mathrm{type}} = \frac{1}{T\,H\,W}\sum_{t=1}^T \sum_{i=1}^H \sum_{j=1}^W \mathbf{1}(M_{\mathrm{type}}(t,i,j) > 0)$, where $T$ is the number of frames and $H \times W$ the frame size. This quantity is the fraction of pixels with nonzero opacity for a given smoke type, averaged over the $T$ frames considered; taking $T$ over all frames of that type gives a dataset-level value. Event duration can be inferred from runs of consecutive frames with nonzero mask coverage.
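Both quantities can be computed from a stack of masks with a few NumPy operations. The sketch below assumes the masks of one sequence have already been loaded into a `(T, H, W)` array; the function names are hypothetical and no file layout is implied.

```python
import numpy as np


def coverage(masks):
    """Fraction of (frame, pixel) entries with nonzero opacity, i.e. alpha_type."""
    return float((np.asarray(masks) > 0).mean())


def smoke_events(masks):
    """Return (start, end) frame-index pairs, inclusive, for each run of
    consecutive frames that contain at least one nonzero mask pixel."""
    arr = np.asarray(masks)
    present = (arr > 0).reshape(arr.shape[0], -1).any(axis=1)
    events, start = [], None
    for t, flag in enumerate(present):
        if flag and start is None:
            start = t                       # a new smoke event begins
        elif not flag and start is not None:
            events.append((start, t - 1))   # the event ended on the previous frame
            start = None
    if start is not None:
        events.append((start, len(present) - 1))
    return events
```

An event's duration in seconds is then `(end - start + 1) / fps`, using the 30 fps frame rate stated for the dataset.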

Exemplar frames with overlaid binary masks appear in the main paper (Fig. 1), where they differentiate the smoke types, and in Supplementary Fig. A.4, which shows multitype, anatomically diverse, and temporally consistent scenarios. By convention, the overlays use red for diffusion smoke and blue for ambient smoke.
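An overlay following this colour convention could be produced along the following lines. This is a sketch and not the code behind the paper's figures: the function name, the RGB channel order, the blending strength, and the choice to binarize masks at any nonzero value are assumptions.

```python
import numpy as np

RED = np.array([255.0, 0.0, 0.0])   # diffusion smoke (RGB order assumed)
BLUE = np.array([0.0, 0.0, 255.0])  # ambient smoke


def overlay_masks(frame, m_diff, m_amb, strength=0.5):
    """Tint pixels covered by each mask: red for diffusion, blue for ambient.

    frame:  (H, W, 3) uint8 image
    m_diff: (H, W) diffusion mask, m_amb: (H, W) ambient mask
    Pixels covered by both masks are tinted red first, then blue.
    """
    out = frame.astype(np.float64)
    for mask, color in ((m_diff, RED), (m_amb, BLUE)):
        region = np.asarray(mask) > 0  # binary mask: any nonzero opacity
        out[region] = (1.0 - strength) * out[region] + strength * color
    return np.clip(np.rint(out), 0, 255).astype(np.uint8)
```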

5. Access, Distribution, and Licensing

STSVD is hosted at https://simon-leong.github.io/STSVD/. The release includes the videos and all mask annotations. The license restricts use to “Academic use only”; the precise terms are given on the project website. Within those terms, the masks may be used freely for method development and benchmarking.

6. Comparative Analysis With Existing Datasets

A distinguishing feature of STSVD lies in its video-based, physics-driven design, contrasting with prior synthetic smoke datasets, which are limited to image-level, single-type, low-resolution data:

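| Dataset  | Type  | Resolution | Frames | # Smoke Types | Turbulence / Pressure |
|----------|-------|------------|--------|---------------|-----------------------|
| MARS-GAN | Image | 256×256    | 18,000 | 1             | no / no               |
| PFAN     | Image | 480×480    | 660    | 1             | no / no               |
| PSv2rs   | Image | 256×256    | 54,420 | 1             | no / no               |
| STSVD    | Video | 720×1080   | 12,000 | 3             | yes / yes             |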

STSVD uniquely offers:

  • Temporal continuity (100-frame video clips), enabling spatio-temporal learning.
  • Clinical-scale definition (720×1080), matching endoscopic system output.
  • Coverage of three distinct smoke physical regimes: diffusion, ambient, and entangled.
  • Simulation based on both turbulence and cavity pressure, for greater physical realism.
  • Automatic, tool-tip–localized injection sites for accurate anatomical context.

Together, these factors establish STSVD as the first resource to enable supervised and benchmarked development of smoke-type-aware video desmoking algorithms (Liang et al., 2 Dec 2025).
