Synthetic Video Desmoking Dataset
- The Synthetic Video Desmoking Dataset (STSVD) is a large-scale video resource built with physics-based smoke augmentation and precise, temporally consistent annotations.
- The dataset is generated via a wavelet-turbulence fluid model coupled with a cavity-pressure solver to simulate anatomically realistic smoke effects.
- STSVD offers balanced multi-class smoke types and high-resolution, supervised video sequences that enable advanced spatio-temporal benchmarking in surgical smoke removal.
A synthetic video desmoking dataset is a resource constructed by algorithmically compositing simulated smoke onto real, smoke-free laparoscopic video frames, paired with pixel-perfect and temporally consistent smoke annotations. The Synthetic Video Desmoking Dataset (STSVD) is the first large-scale video dataset for this domain that provides explicit smoke-type annotations and models fine-grained physical phenomena, supporting research on the recognition, segmentation, and removal of surgical smoke in endoscopic imagery (Liang et al., 2 Dec 2025).
1. Dataset Generation Pipeline
STSVD is created via physics-based synthetic augmentation of laparoscopic imaging data. Clean source videos at 720×1080 resolution are sampled from the Cholec80, M2CAI16, and Hamlyn datasets. For each clean clip, smoke is introduced using a wavelet-turbulence-driven fluid model, following Kim et al. (2008), augmented with a cavity-pressure solver to account for the pressurized environment typical of minimally invasive surgery.
At each frame $t$, a 2D smoke opacity mask $M_t$ is computed with a volumetric renderer implementing the radiative transfer equation under a single-scattering assumption:

$$I_t(x) = J_t(x)\,e^{-\beta_t d_t(x)} + A\left(1 - e^{-\beta_t d_t(x)}\right)$$

where $J_t$ is the underlying clean frame, $\beta_t$ the per-frame scattering coefficient, $d_t$ the local smoke depth, and $A$ the atmospheric light term that models atmospheric scattering. This model is discretized by rendering 8-bit masks, then composited with the clean video:

$$\tilde{I}_t(x) = \psi\big(\alpha_t(x)\,A + (1 - \alpha_t(x))\,J_t(x)\big)$$

Here, $\alpha_t(x) = 1 - e^{-\beta_t d_t(x)}$ is derived from the per-pixel opacity, and $\psi$ introduces post-processing (motion blur, chromatic shift) to simulate camera artifacts. The origin of diffusion smoke is tightly aligned to cautery tool tips detected by a CNN-based detector, ensuring anatomical plausibility. Because the rendering pipeline guarantees per-pixel mask-to-image registration, no manual alignment or correction is required.
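The scattering-and-compositing step above can be sketched as follows. This is a minimal illustration of the single-scattering model, not the authors' released code; the function name, the scalar atmospheric-light parameter, and the omission of the post-processing step $\psi$ are all assumptions:

```python
import numpy as np

def composite_smoke(clean, depth, beta, atmo=1.0):
    """Composite synthetic smoke onto a clean frame with the
    single-scattering model I = J*exp(-beta*d) + A*(1 - exp(-beta*d)).

    clean : HxWx3 float array in [0, 1], the smoke-free frame J_t
    depth : HxW float array, local smoke depth d_t
    beta  : scalar per-frame scattering coefficient
    atmo  : atmospheric light A (assumed scalar here)
    """
    # Transmission t(x) = exp(-beta * d(x)); per-pixel opacity alpha = 1 - t
    transmission = np.exp(-beta * depth)
    alpha = 1.0 - transmission
    smoked = clean * transmission[..., None] + atmo * alpha[..., None]
    # Discretize the opacity into an 8-bit mask, as in the dataset release
    mask8 = np.clip(alpha * 255.0, 0, 255).astype(np.uint8)
    return smoked, mask8
```

Pixels with zero smoke depth keep the clean color and a zero mask value; as depth grows, the pixel converges to the atmospheric light.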
2. Dataset Composition
STSVD contains 120 video sequences, each 100 frames in length (3.3 seconds at 30 fps), at a fixed resolution of 720×1080. All sequences are generated for supervised training, without official train/validation/test splits in the release. Each video is classified as one of three smoke types: Diffusion, Ambient, or Entangled, with each type evenly represented:
| Smoke Type | Videos | Frames | % of Dataset |
|---|---|---|---|
| Diffusion | 40 | 4,000 | 33.3% |
| Ambient | 40 | 4,000 | 33.3% |
| Entangled | 40 | 4,000 | 33.3% |
This balanced breakdown enables direct comparative analysis across all smoke categories.
3. Labeling and Annotation Protocol
Annotation in STSVD is fully automated, a byproduct of the rendering pipeline. For each frame $t$, two 8-bit PNG opacity masks (quantized from the rendered float-valued opacities) are provided:
- $M^{\mathrm{diff}}_t$: per-pixel opacity for diffusion smoke
- $M^{\mathrm{amb}}_t$: per-pixel opacity for ambient smoke
Each video is labeled at the sequence level with its dominant smoke type (`diffusion`, `ambient`, or `entangled`). The masks are temporally indexed and adhere to a strict naming convention to guarantee frame alignment. Since all annotations are rendered computationally, traditional metrics such as manual inter-rater agreement or human correction logs are inapplicable; pixel-wise accuracy is inherent. For optional evaluation, Intersection-over-Union (IoU) may be computed per mask:

$$\mathrm{IoU} = \frac{|P \cap G|}{|P \cup G|}$$

where $P$ and $G$ are the binarized predicted and ground-truth masks.
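The optional IoU evaluation can be sketched as below; the function name and the binarization threshold are illustrative assumptions, since the source does not specify them:

```python
import numpy as np

def mask_iou(pred, gt, threshold=0):
    """Intersection-over-Union between two opacity masks.

    pred, gt : HxW uint8 opacity masks (0-255); pixels with value
    strictly above `threshold` are treated as smoke.
    """
    p = pred > threshold
    g = gt > threshold
    union = np.logical_or(p, g).sum()
    if union == 0:
        return 1.0  # both masks empty: define IoU as perfect agreement
    return float(np.logical_and(p, g).sum() / union)
```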
4. Descriptive Statistics and Visualization
While aggregate mask coverage statistics are not directly tabulated, frame-wise and sequence-level coverage can be formalized as

$$\alpha_{\mathrm{type}} = \frac{1}{T\,H\,W}\sum_{t=1}^{T} \sum_{i=1}^{H} \sum_{j=1}^{W} \mathbf{1}\big(M_{\mathrm{type}}(t,i,j) > 0\big)$$

This quantity gives the mean per-pixel smoke presence for each type across the dataset. Event durations can be inferred as runs of consecutive frames with nonzero mask coverage.
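The coverage statistic and event-duration inference can be sketched as follows; the `> 0` binarization mirrors the indicator in the formula, while the function names are illustrative:

```python
import numpy as np

def smoke_coverage(masks):
    """Mean per-pixel smoke presence alpha_type over a sequence.

    masks : T x H x W array of opacity masks for one smoke type.
    Returns the fraction of pixel-frames with nonzero opacity.
    """
    return float((masks > 0).mean())

def event_durations(masks):
    """Lengths of runs of consecutive frames with any smoke coverage."""
    active = (masks > 0).any(axis=(1, 2))  # per-frame: any smoke at all?
    runs, count = [], 0
    for frame_has_smoke in active:
        if frame_has_smoke:
            count += 1
        elif count:
            runs.append(count)
            count = 0
    if count:
        runs.append(count)
    return runs
```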
Exemplar frames with overlaid binary masks are presented in the main paper (Fig. 1) to differentiate between smoke types, and in Supplementary Fig. A.4, which displays multitype, anatomically diverse, and temporally consistent scenarios. By convention, the color mapping uses red for diffusion smoke and blue for ambient smoke.
5. Access, Distribution, and Licensing
STSVD is hosted at https://simon-leong.github.io/STSVD/. The release includes all videos and mask annotations. The license is restricted to "academic use only"; precise terms are detailed on the project website. There is no restriction on mask use for method development or benchmarking beyond the academic license itself.
6. Comparative Analysis With Existing Datasets
A distinguishing feature of STSVD lies in its video-based, physics-driven design, contrasting with prior synthetic smoke datasets, which are limited to image-level, single-type, low-resolution data:
| Dataset | Type | Resolution | Frames | # Smoke Types | Turbulence / Pressure |
|---|---|---|---|---|---|
| MARS-GAN | Image | 256×256 | 18,000 | 1 | no / no |
| PFAN | Image | 480×480 | 660 | 1 | no / no |
| PSv2rs | Image | 256×256 | 54,420 | 1 | no / no |
| STSVD | Video | 720×1080 | 12,000 | 3 | yes / yes |
STSVD uniquely offers:
- Temporal continuity (100-frame video clips), enabling spatio-temporal learning.
- Clinical-scale definition (720×1080), matching endoscopic system output.
- Coverage of three distinct smoke physical regimes: diffusion, ambient, and entangled.
- Simulation based on both turbulence and cavity pressure, for greater physical realism.
- Automatic, tool-tip–localized injection sites for accurate anatomical context.
Together, these factors establish STSVD as the first resource to enable supervised and benchmarked development of smoke-type-aware video desmoking algorithms (Liang et al., 2 Dec 2025).