
End-to-end Audiovisual Speech Recognition

Published 18 Feb 2018 in cs.CV (arXiv:1802.06424v2)

Abstract: Several end-to-end deep learning approaches have been recently presented which extract either audio or visual features from the input images or audio signals and perform speech recognition. However, research on end-to-end audiovisual models is very limited. In this work, we present an end-to-end audiovisual model based on residual networks and Bidirectional Gated Recurrent Units (BGRUs). To the best of our knowledge, this is the first audiovisual fusion model which simultaneously learns to extract features directly from the image pixels and audio waveforms and performs within-context word recognition on a large publicly available dataset (LRW). The model consists of two streams, one for each modality, which extract features directly from mouth regions and raw waveforms. The temporal dynamics in each stream/modality are modeled by a 2-layer BGRU and the fusion of multiple streams/modalities takes place via another 2-layer BGRU. A slight improvement in the classification rate over an end-to-end audio-only and MFCC-based model is reported in clean audio conditions and low levels of noise. In the presence of high levels of noise, the end-to-end audiovisual model significantly outperforms both audio-only models.

Citations (242)

Summary

  • The paper introduces an innovative dual-stream architecture that processes raw video and audio concurrently using ResNets and BGRUs.
  • The methodology employs separate initial training for each modality followed by joint end-to-end optimization, achieving 98.0% accuracy in clean conditions and a 14.1% boost at -5 dB noise.
  • The results underscore the model’s robustness in challenging acoustic settings, paving the way for more resilient real-world speech recognition applications.

End-to-End Audiovisual Speech Recognition

This paper addresses a notable gap in audiovisual speech recognition: end-to-end models that effectively integrate both visual and auditory information. While traditional systems rely on separate stages for feature extraction and classification, the proposed method uses deep learning to perform both tasks jointly through residual networks (ResNets) and Bidirectional Gated Recurrent Units (BGRUs).

Methodology and Architecture

The study introduces a dual-stream architecture, with one stream dedicated to each modality. The visual stream processes raw pixel data from a mouth region of interest (ROI) using a 34-layer ResNet, followed by a 2-layer BGRU that captures the temporal dynamics of speech. The audio stream applies an 18-layer ResNet directly to raw waveforms, so features are learned from the audio signal itself rather than from handcrafted representations, and is followed by a matching 2-layer BGRU. The two streams are then merged through an additional 2-layer BGRU, fusing the audio and video representations before classification.
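The data flow through the two streams and the fusion stage can be sketched in NumPy with stand-in components. The feature extractors and BGRUs below are mock projections that only reproduce the tensor shapes, and every dimension (frame count, ROI size, samples per frame, feature and hidden sizes) is an illustrative assumption rather than a value taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

T = 29          # video frames per clip (assumed)
D_VIS = 512     # visual ResNet-34 feature size (assumed)
D_AUD = 512     # audio ResNet-18 feature size (assumed)
H = 256         # GRU hidden size per direction (assumed)

def mock_resnet(x_frames, out_dim):
    """Stand-in for a ResNet: one feature vector per time step."""
    W = rng.standard_normal((x_frames.shape[-1], out_dim)) * 0.01
    return x_frames @ W

def mock_bgru(feats, hidden):
    """Stand-in for a 2-layer BGRU: forward and backward running
    averages, concatenated per time step."""
    Wf = rng.standard_normal((feats.shape[-1], hidden)) * 0.01
    Wb = rng.standard_normal((feats.shape[-1], hidden)) * 0.01
    fwd = np.tanh(np.cumsum(feats @ Wf, axis=0)
                  / np.arange(1, len(feats) + 1)[:, None])
    bwd = np.tanh(np.cumsum((feats @ Wb)[::-1], axis=0)[::-1]
                  / np.arange(len(feats), 0, -1)[:, None])
    return np.concatenate([fwd, bwd], axis=-1)      # (T, 2 * hidden)

# Visual stream: mouth-ROI frames -> ResNet-34 -> 2-layer BGRU
mouth_rois = rng.standard_normal((T, 96 * 96))      # flattened 96x96 ROIs (assumed)
vis_feats = mock_bgru(mock_resnet(mouth_rois, D_VIS), H)

# Audio stream: raw waveform chunks aligned per frame -> ResNet-18 -> BGRU
wave_chunks = rng.standard_normal((T, 640))         # ~640 samples per frame (assumed)
aud_feats = mock_bgru(mock_resnet(wave_chunks, D_AUD), H)

# Fusion: concatenate per-step features, pass through another 2-layer BGRU,
# classify from the last time step over the 500 LRW words
fused = mock_bgru(np.concatenate([vis_feats, aud_feats], axis=-1), H)
logits = fused[-1] @ (rng.standard_normal((2 * H, 500)) * 0.01)
print(vis_feats.shape, aud_feats.shape, fused.shape, logits.shape)
```

The point of the sketch is the shape bookkeeping: both streams emit one feature vector per time step, so fusion reduces to concatenating along the feature axis before the final BGRU.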

Data and Experimental Setup

The model is trained on the Lip Reading in the Wild (LRW) database, the largest publicly available lipreading dataset, covering a vocabulary of 500 words. The dataset's extensive speaker diversity and challenging visual attributes, such as head-pose variation, make it a demanding benchmark. Training proceeds in two phases: each modality stream is first optimized separately, and the streams are then integrated through joint end-to-end training.
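The two-phase training strategy can be outlined in plain Python. The function names, epoch counts, and the ordering of the audio and video pretraining steps below are illustrative assumptions, not details from the paper:

```python
def pretrain_stream(name, epochs):
    """Placeholder: train one modality stream (ResNet + BGRU)
    on its own before any fusion takes place."""
    return [f"{name}-epoch-{e}" for e in range(epochs)]

def joint_finetune(epochs):
    """Placeholder: train both streams plus the fusion BGRU
    together, end to end."""
    return [f"joint-epoch-{e}" for e in range(epochs)]

schedule = []
schedule += pretrain_stream("audio", 2)   # phase 1a: audio stream alone
schedule += pretrain_stream("video", 2)   # phase 1b: visual stream alone
schedule += joint_finetune(3)             # phase 2: joint end-to-end optimization
print(schedule)
```

Separate pretraining gives each stream a sensible feature space before the fusion layers see both modalities, which typically stabilizes the joint optimization.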

Results and Implications

The system achieves 98.0% classification accuracy in clean audio conditions, a 0.3% improvement over both the end-to-end audio-only model and an MFCC-based model. This gain is marginal, as expected in clean conditions where the visual stream contributes little. Under high noise levels, however, the audiovisual model substantially outperforms the audio-only systems, with up to a 14.1% increase in accuracy at -5 dB. This robustness to noise makes the method particularly valuable for real-world applications where audio quality is compromised.
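To make a noise level such as -5 dB concrete, here is a minimal sketch of mixing noise into a clean signal at a target signal-to-noise ratio. The sampling rate, test tone, and white noise are stand-ins for whatever audio and noise type an evaluation actually uses:

```python
import numpy as np

def mix_at_snr(clean, noise, snr_db):
    """Scale `noise` so the mixture clean + noise has the requested SNR in dB."""
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2)
    scale = np.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10)))
    return clean + scale * noise

rng = np.random.default_rng(0)
sr = 16000                                               # sampling rate (assumed)
clean = np.sin(2 * np.pi * 220 * np.arange(sr) / sr)     # 1 s test tone
noise = rng.standard_normal(sr)                          # white-noise stand-in

noisy = mix_at_snr(clean, noise, snr_db=-5.0)

# Verify the realized SNR of the mixture
achieved = 10 * np.log10(np.mean(clean ** 2) / np.mean((noisy - clean) ** 2))
print(round(achieved, 6))  # → -5.0
```

At -5 dB the noise carries more power than the speech, which is exactly the regime where the visual stream provides the information the audio stream can no longer recover.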

Conclusion and Future Directions

This work marks a significant step in integrating audiovisual modalities for speech recognition, particularly under conditions that challenge audio-only models, and points toward more resilient multimodal systems that interpret speech accurately across diverse environments. Future research could extend the system from isolated word classification to full sentence recognition. Adaptive fusion mechanisms that dynamically weight each modality according to contextual noise levels could further improve performance. Overall, the paper offers a substantial contribution to audiovisual fusion research, indicating promising pathways for continued advances.
