- The paper introduces the Multimodal Laryngoscopic Video Analyzing System (MLVAS) that automates vocal fold assessments by combining audio keyword spotting and advanced video segmentation.
- It employs a two-stage segmentation pipeline using U-Net and diffusion model refinement, which significantly improves segmentation accuracy and reduces false positives.
- Empirical results on clinical datasets demonstrate robust performance in unilateral Vocal Fold Paralysis detection, achieving higher IoU and lower false alarm rates.
Overview of Multimodal Laryngoscopic Video Analysis for Assisted Diagnosis of Vocal Fold Paralysis
The paper presents the Multimodal Laryngoscopic Video Analyzing System (MLVAS), which integrates audio and video modalities to assist in the diagnosis of Vocal Fold Paralysis (VFP). The system streamlines clinical assessment by automatically extracting the relevant segments and metrics from raw laryngeal videostroboscopic recordings. This automation matters given the prevalence and potential morbidity of VFP, which demand precise and reliable diagnostic tools.
System Architecture
The MLVAS framework comprises two core modules: an audio processing module and a video processing module.
- Audio Processing: Uses a keyword spotting (KWS) technique based on spectrogram analysis to identify phonation cycles, indicated by the pronunciation of "e", thereby isolating the relevant vocalization segments within the video stream. This step reduces the time clinicians spend reviewing non-informative content by directing attention specifically to the phonation phases.
- Video Processing: Uses a YOLOv5-based vocal fold detector to confirm that the vocal folds are visible, and applies an HSV fluctuation analysis to extract the stroboscopic portions of the video, which are essential for accurate vocal fold assessment.
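The paper's KWS module operates on spectrograms; as a much simpler illustration of how phonation segments can be isolated from the audio track, the sketch below uses short-time RMS energy with a fixed threshold. The function name, frame sizes, and threshold are illustrative assumptions, not the paper's method:

```python
import numpy as np

def phonation_segments(signal, sr, frame_ms=25, hop_ms=10, threshold=0.05):
    """Return (start_s, end_s) segments whose RMS energy exceeds threshold.

    A simplified stand-in for spectrogram-based keyword spotting:
    voiced regions are located by short-time energy alone.
    """
    frame = int(sr * frame_ms / 1000)
    hop = int(sr * hop_ms / 1000)
    n = 1 + max(0, (len(signal) - frame) // hop)
    rms = np.array([
        np.sqrt(np.mean(signal[i * hop:i * hop + frame] ** 2)) for i in range(n)
    ])
    voiced = rms > threshold
    segments, start = [], None
    for i, v in enumerate(voiced):
        if v and start is None:
            start = i                      # segment opens at first voiced frame
        elif not v and start is not None:  # segment closes at last voiced frame
            segments.append((start * hop / sr, ((i - 1) * hop + frame) / sr))
            start = None
    if start is not None:                  # signal ends while still voiced
        segments.append((start * hop / sr, len(signal) / sr))
    return segments
```

In practice a KWS model would also verify that the voiced region contains the target "e" sound; energy thresholding alone cannot distinguish phonation from other loud events.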
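The HSV fluctuation analysis is not specified in detail in this summary; the sketch below assumes a simplified variant that tracks only the HSV value channel (per-pixel max over RGB) and flags windows whose temporal standard deviation is high, as stroboscopic illumination causes rapid brightness fluctuation. Window size and threshold are hypothetical:

```python
import numpy as np

def stroboscopic_frames(frames, window=5, threshold=0.05):
    """Flag frames inside high brightness-fluctuation windows.

    frames: (T, H, W, 3) RGB array in [0, 1]. Only the HSV value
    channel (per-pixel max over RGB) is used here -- a simplification
    of a full HSV fluctuation analysis.
    """
    value = frames.max(axis=3).mean(axis=(1, 2))   # mean V per frame
    flags = np.zeros(len(value), dtype=bool)
    for t in range(len(value) - window + 1):
        if value[t:t + window].std() > threshold:  # flickering window
            flags[t:t + window] = True
    return flags
```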
Technical Innovations
The MLVAS system introduces several technical novelties:
- Video Segmentation: The system deploys a two-stage glottis segmentation pipeline wherein an initial U-Net segmentation is refined using a diffusion model. This strategy effectively minimizes false positives, enhancing the quality and accuracy of segmentation outputs critical for subsequent clinical evaluations.
- VFP Metrics: The system introduces two new metrics, Left and Right Vocal Fold Dynamics (LVFDyn and RVFDyn), for unilateral VFP (UVFP) diagnosis. Each metric is computed from the angular deviation of the corresponding vocal fold relative to an estimated midline, so that comparing the two sides distinguishes left from right VFP.
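The diffusion-based refinement stage itself is too large for a short example, but its purpose in the pipeline — suppressing small false-positive blobs in the coarse U-Net mask — can be illustrated with a stand-in connected-component filter. This is explicitly not the paper's diffusion model, only a sketch of the same intent:

```python
from collections import deque
import numpy as np

def remove_small_components(mask, min_size=20):
    """Drop 4-connected components smaller than min_size pixels.

    Stand-in for a learned refinement stage: spurious small blobs in a
    coarse segmentation mask are treated as false positives and removed.
    """
    mask = mask.astype(bool)
    out = np.zeros_like(mask)
    seen = np.zeros_like(mask)
    h, w = mask.shape
    for sy in range(h):
        for sx in range(w):
            if mask[sy, sx] and not seen[sy, sx]:
                comp, q = [], deque([(sy, sx)])
                seen[sy, sx] = True
                while q:                      # BFS over one component
                    y, x = q.popleft()
                    comp.append((y, x))
                    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ny, nx = y + dy, x + dx
                        if 0 <= ny < h and 0 <= nx < w and mask[ny, nx] and not seen[ny, nx]:
                            seen[ny, nx] = True
                            q.append((ny, nx))
                if len(comp) >= min_size:     # keep only large components
                    for y, x in comp:
                        out[y, x] = True
    return out
```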
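A minimal sketch of an angle-deviation dynamics metric in the spirit of LVFDyn/RVFDyn, assuming per-frame fold endpoints and a midline angle are already available; the paper's exact formulation may differ, and all names here are illustrative:

```python
import numpy as np

def fold_dynamics(fold_points, midline_angle_deg):
    """Angle-deviation dynamics of one vocal fold across frames.

    fold_points: (T, 2, 2) array of (anterior, posterior) xy points per
    frame. Returns the range (max - min) of the fold's angle relative
    to the midline, in degrees; a paralyzed fold barely moves, so its
    dynamics score stays near zero.
    """
    d = fold_points[:, 1] - fold_points[:, 0]      # per-frame direction
    angles = np.degrees(np.arctan2(d[:, 1], d[:, 0])) - midline_angle_deg
    return angles.max() - angles.min()
```

Comparing the left and right scores then suggests which side is impaired: the side with markedly smaller dynamics is the paralysis candidate.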
Empirical Validation
The efficacy of the MLVAS was substantiated through rigorous experiments using the publicly available BAGLS dataset and a novel SYSU dataset comprising real-world clinical cases. The segmentation pipeline achieved notable improvements in Intersection over Union (IoU) scores, and significant reductions in False Alarm Rates (FAR) were observed owing to the diffusion-based refinement.
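The two evaluation quantities above can be stated concretely. IoU is standard; FAR has several common definitions, and the frame-level one sketched below is an assumption rather than the paper's exact formula:

```python
import numpy as np

def iou(pred, gt):
    """Intersection over Union of two boolean segmentation masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    union = np.logical_or(pred, gt).sum()
    return np.logical_and(pred, gt).sum() / union if union else 1.0

def false_alarm_rate(pred_pos, gt_pos):
    """Fraction of negative samples predicted positive.

    One common FAR definition (FP / (FP + TN)); the paper's definition
    may differ.
    """
    neg = ~gt_pos.astype(bool)
    return (pred_pos.astype(bool) & neg).sum() / neg.sum() if neg.sum() else 0.0
```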
Moreover, the proposed system demonstrated robust performance in UVFP detection across varied settings, outperforming baseline models in accuracy, precision, recall, and F-score metrics. Statistical analyses confirmed the significance of improvements contributed by the refined segmentation techniques and the dual modality approach.
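For reference, the four classification metrics reported can be computed from confusion-matrix counts as follows (a generic sketch, not tied to the paper's evaluation code):

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F-score from confusion counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return accuracy, precision, recall, f1
```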
Implications and Future Directions
From a practical standpoint, automated extraction of clinically relevant segments and metrics gives clinicians a diagnostic workflow that is both precise and efficient. Combining audio and visual cues as complementary evidence yields a more complete picture of vocal fold activity and thus a more nuanced assessment of VFP.
Theoretically, this research advances the domain of computational pathology by exemplifying the synergistic power of multimodal machine perception in medical diagnostics. Future work may expand upon this foundation by enhancing the robustness of audio-visual synchronization algorithms, exploring additional pathological conditions of the larynx, and refining real-time applicability across diverse clinical environments.
In conclusion, the Multimodal Laryngoscopic Video Analyzing System (MLVAS) represents a significant stride in leveraging high-fidelity data processing technologies for enhanced laryngeal diagnostics, ultimately contributing to improved patient outcomes.