- The paper introduces the Multimodal Laryngoscopic Video Analyzing System (MLVAS) that automates vocal fold assessments by combining audio keyword spotting and advanced video segmentation.
- It employs a two-stage segmentation pipeline using U-Net and diffusion model refinement, which significantly improves segmentation accuracy and reduces false positives.
- Empirical results on clinical datasets demonstrate robust performance in unilateral Vocal Fold Paralysis detection, achieving higher IoU and lower false alarm rates.
Overview of Multimodal Laryngoscopic Video Analysis for Assisted Diagnosis of Vocal Fold Paralysis
The paper presents the Multimodal Laryngoscopic Video Analyzing System (MLVAS), which integrates audio and video modalities to assist in the diagnosis of Vocal Fold Paralysis (VFP). The system streamlines clinical assessment by automatically extracting the relevant segments and metrics from raw laryngeal videostroboscopic recordings. This automation matters given the prevalence and potential morbidity of VFP, which demand precise and reliable diagnostic tools.
System Architecture
The MLVAS framework comprises two core modules: an audio processing module and a video processing module.
- Audio Processing: Uses a keyword spotting (KWS) technique based on spectrogram analysis to identify phonation cycles, indicated by the pronunciation of "e", thereby isolating the relevant vocalization segments within the video stream. This step reduces the time clinicians spend reviewing non-informative content by directing attention specifically to the phonation phases.
- Video Processing: Uses a YOLOv5-based vocal fold detector to confirm that the vocal folds are visible, and applies an HSV fluctuation analysis to extract the stroboscopic portions of the video, which are essential for accurate vocal fold assessment.
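The paper's KWS module operates on spectrograms; as a much simpler illustration of how phonation segments can be isolated from the audio track, the sketch below uses short-time RMS energy with a fixed threshold. The function name, frame sizes, and threshold are illustrative assumptions, not the paper's method:

```python
import numpy as np

def phonation_segments(signal, sr, frame_ms=25, hop_ms=10, threshold=0.05):
    """Return (start_s, end_s) segments whose RMS energy exceeds threshold.

    A simplified stand-in for spectrogram-based keyword spotting:
    voiced regions are located by short-time energy alone.
    """
    frame = int(sr * frame_ms / 1000)
    hop = int(sr * hop_ms / 1000)
    n = 1 + max(0, (len(signal) - frame) // hop)
    rms = np.array([
        np.sqrt(np.mean(signal[i * hop:i * hop + frame] ** 2)) for i in range(n)
    ])
    voiced = rms > threshold
    segments, start = [], None
    for i, v in enumerate(voiced):
        if v and start is None:
            start = i                      # segment opens at first voiced frame
        elif not v and start is not None:  # segment closes at last voiced frame
            segments.append((start * hop / sr, ((i - 1) * hop + frame) / sr))
            start = None
    if start is not None:                  # signal ends while still voiced
        segments.append((start * hop / sr, len(signal) / sr))
    return segments
```

In practice a KWS model would also verify that the voiced region contains the target "e" sound; energy thresholding alone cannot distinguish phonation from other loud events.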
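The HSV fluctuation analysis is not specified in detail in this summary; the sketch below assumes a simplified variant that tracks only the HSV value channel (per-pixel max over RGB) and flags windows whose temporal standard deviation is high, as stroboscopic illumination causes rapid brightness fluctuation. Window size and threshold are hypothetical:

```python
import numpy as np

def stroboscopic_frames(frames, window=5, threshold=0.05):
    """Flag frames inside high brightness-fluctuation windows.

    frames: (T, H, W, 3) RGB array in [0, 1]. Only the HSV value
    channel (per-pixel max over RGB) is used here -- a simplification
    of a full HSV fluctuation analysis.
    """
    value = frames.max(axis=3).mean(axis=(1, 2))   # mean V per frame
    flags = np.zeros(len(value), dtype=bool)
    for t in range(len(value) - window + 1):
        if value[t:t + window].std() > threshold:  # flickering window
            flags[t:t + window] = True
    return flags
```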
Technical Innovations
The MLVAS system introduces several technical novelties:
- Video Segmentation: The system deploys a two-stage glottis segmentation pipeline wherein an initial U-Net segmentation is refined using a diffusion model. This strategy effectively minimizes false positives, enhancing the quality and accuracy of segmentation outputs critical for subsequent clinical evaluations.
- VFP Metrics: The system introduces two new metrics, Left and Right Vocal Fold Dynamics (LVFDyn and RVFDyn), for unilateral VFP (UVFP) diagnosis. Each metric is computed from the angular deviation of the corresponding vocal fold relative to an estimated midline, so that comparing the two sides distinguishes left from right VFP.
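The diffusion-based refinement stage itself is too large for a short example, but its purpose in the pipeline — suppressing small false-positive blobs in the coarse U-Net mask — can be illustrated with a stand-in connected-component filter. This is explicitly not the paper's diffusion model, only a sketch of the same intent:

```python
from collections import deque
import numpy as np

def remove_small_components(mask, min_size=20):
    """Drop 4-connected components smaller than min_size pixels.

    Stand-in for a learned refinement stage: spurious small blobs in a
    coarse segmentation mask are treated as false positives and removed.
    """
    mask = mask.astype(bool)
    out = np.zeros_like(mask)
    seen = np.zeros_like(mask)
    h, w = mask.shape
    for sy in range(h):
        for sx in range(w):
            if mask[sy, sx] and not seen[sy, sx]:
                comp, q = [], deque([(sy, sx)])
                seen[sy, sx] = True
                while q:                      # BFS over one component
                    y, x = q.popleft()
                    comp.append((y, x))
                    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ny, nx = y + dy, x + dx
                        if 0 <= ny < h and 0 <= nx < w and mask[ny, nx] and not seen[ny, nx]:
                            seen[ny, nx] = True
                            q.append((ny, nx))
                if len(comp) >= min_size:     # keep only large components
                    for y, x in comp:
                        out[y, x] = True
    return out
```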
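A minimal sketch of an angle-deviation dynamics metric in the spirit of LVFDyn/RVFDyn, assuming per-frame fold endpoints and a midline angle are already available; the paper's exact formulation may differ, and all names here are illustrative:

```python
import numpy as np

def fold_dynamics(fold_points, midline_angle_deg):
    """Angle-deviation dynamics of one vocal fold across frames.

    fold_points: (T, 2, 2) array of (anterior, posterior) xy points per
    frame. Returns the range (max - min) of the fold's angle relative
    to the midline, in degrees; a paralyzed fold barely moves, so its
    dynamics score stays near zero.
    """
    d = fold_points[:, 1] - fold_points[:, 0]      # per-frame direction
    angles = np.degrees(np.arctan2(d[:, 1], d[:, 0])) - midline_angle_deg
    return angles.max() - angles.min()
```

Comparing the left and right scores then suggests which side is impaired: the side with markedly smaller dynamics is the paralysis candidate.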
Empirical Validation
The efficacy of the MLVAS was substantiated through rigorous experiments using the publicly available BAGLS dataset and a novel SYSU dataset comprising real-world clinical cases. The segmentation pipeline achieved notable improvements in Intersection over Union (IoU) scores, and significant reductions in False Alarm Rates (FAR) were observed owing to the diffusion-based refinement.
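The two evaluation quantities above can be stated concretely. IoU is standard; FAR has several common definitions, and the frame-level one sketched below is an assumption rather than the paper's exact formula:

```python
import numpy as np

def iou(pred, gt):
    """Intersection over Union of two boolean segmentation masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    union = np.logical_or(pred, gt).sum()
    return np.logical_and(pred, gt).sum() / union if union else 1.0

def false_alarm_rate(pred_pos, gt_pos):
    """Fraction of negative samples predicted positive.

    One common FAR definition (FP / (FP + TN)); the paper's definition
    may differ.
    """
    neg = ~gt_pos.astype(bool)
    return (pred_pos.astype(bool) & neg).sum() / neg.sum() if neg.sum() else 0.0
```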
Moreover, the proposed system demonstrated robust performance in UVFP detection across varied settings, outperforming baseline models in accuracy, precision, recall, and F-score metrics. Statistical analyses confirmed the significance of improvements contributed by the refined segmentation techniques and the dual modality approach.
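For reference, the four classification metrics reported can be computed from confusion-matrix counts as follows (a generic sketch, not tied to the paper's evaluation code):

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F-score from confusion counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return accuracy, precision, recall, f1
```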
Implications and Future Directions
From a practical standpoint, automated extraction of clinically relevant segments and metrics gives clinicians a diagnostic workflow that is both precise and efficient. Combining audio and visual cues as complementary evidence yields a more complete picture of vocal fold activity and thus a more nuanced assessment of VFP.
Theoretically, this research advances the domain of computational pathology by exemplifying the synergistic power of multimodal machine perception in medical diagnostics. Future work may expand upon this foundation by enhancing the robustness of audio-visual synchronization algorithms, exploring additional pathological conditions of the larynx, and refining real-time applicability across diverse clinical environments.
In conclusion, the Multimodal Laryngoscopic Video Analyzing System (MLVAS) represents a significant stride in leveraging high-fidelity data processing technologies for enhanced laryngeal diagnostics, ultimately contributing to improved patient outcomes.