Enhancing Lip Reading with Multi-Scale Video and Multi-Encoder

Published 8 Apr 2024 in cs.CV | (2404.05466v2)

Abstract: Automatic lip-reading (ALR) aims to transcribe spoken content from a speaker's silent lip motion captured on video. Current mainstream lip-reading approaches use only a single visual encoder to model input videos at a single scale. In this paper, we propose to enhance lip reading by incorporating multi-scale video data and multiple encoders. Specifically, we first propose a novel multi-scale lip motion extraction algorithm based on the size of the speaker's face, together with an Enhanced ResNet3D visual front-end (VFE), to extract lip features at different scales. For the multi-encoder design, in addition to the mainstream Transformer and Conformer, we also incorporate the recently proposed Branchformer and E-Branchformer as visual encoders. In the experiments, we explore the influence of different video scales and encoders on ALR system performance and fuse the transcripts produced by all ALR systems using recognizer output voting error reduction (ROVER). Our proposed approach placed second in the ICME 2024 ChatCLR Challenge Task 2, with a 21.52% reduction in character error rate (CER) relative to the official baseline on the evaluation set.
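The abstract's multi-scale extraction step, cropping lip regions whose size scales with the detected face, can be sketched as below. This is a minimal illustrative sketch, not the paper's exact algorithm: the function name, the scale ratios, and the assumption that each crop is a square centered on the mouth and sized as a fraction of face height are all hypothetical.

```python
def multi_scale_lip_crops(face_box, mouth_center, scales=(0.4, 0.6, 0.8)):
    """Return one square lip-region crop box per scale.

    face_box: (x, y, w, h) of the detected face in pixels.
    mouth_center: (cx, cy) of the mouth in pixels.
    scales: hypothetical fractions of face height used as crop side lengths.
    """
    x, y, w, h = face_box
    cx, cy = mouth_center
    crops = []
    for s in scales:
        side = int(round(s * h))  # crop size proportional to face size
        half = side // 2
        # (left, top, width, height) of the square crop centered on the mouth
        crops.append((cx - half, cy - half, side, side))
    return crops
```

Each resulting crop would then be resized to the visual front-end's input resolution, so larger-scale crops include more facial context around the lips while smaller-scale crops focus tightly on lip motion.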
