
Survey on Deep Neural Networks in Speech and Vision Systems

Published 16 Aug 2019 in cs.CV, cs.LG, cs.NE, cs.SD, eess.AS, eess.SP, and stat.ML | (1908.07656v2)

Abstract: This survey presents a review of state-of-the-art deep neural network architectures, algorithms, and systems in vision and speech applications. Recent advances in deep artificial neural network algorithms and architectures have spurred rapid innovation and development of intelligent vision and speech systems. With availability of vast amounts of sensor data and cloud computing for processing and training of deep neural networks, and with increased sophistication in mobile and embedded technology, the next-generation intelligent systems are poised to revolutionize personal and commercial computing. This survey begins by providing background and evolution of some of the most successful deep learning models for intelligent vision and speech systems to date. An overview of large-scale industrial research and development efforts is provided to emphasize future trends and prospects of intelligent vision and speech systems. Robust and efficient intelligent systems demand low-latency and high fidelity in resource-constrained hardware platforms such as mobile devices, robots, and automobiles. Therefore, this survey also provides a summary of key challenges and recent successes in running deep neural networks on hardware-restricted platforms, i.e. within limited memory, battery life, and processing capabilities. Finally, emerging applications of vision and speech across disciplines such as affective computing, intelligent transportation, and precision medicine are discussed. To our knowledge, this paper provides one of the most comprehensive surveys on the latest developments in intelligent vision and speech applications from the perspectives of both software and hardware systems. Many of these emerging technologies using deep neural networks show tremendous promise to revolutionize research and development for future vision and speech systems.

Citations (187)

Summary

  • The paper surveys diverse DNN architectures—including CNNs, RNNs, GANs, and VAEs—to highlight their effectiveness in processing speech and vision data.
  • The paper presents state-of-the-art applications with performance improvements in image classification, speech recognition, and generative modeling.
  • The paper addresses deployment challenges on resource-limited hardware by discussing model compression, pruning, and efficient architectural design.

An Expert Review of "Survey on Deep Neural Networks in Speech and Vision Systems"

The research article "Survey on Deep Neural Networks in Speech and Vision Systems" by M. Alam et al. is a comprehensive survey exploring the advancements, applications, and challenges of deep neural networks (DNNs) in the domains of speech and vision systems. The paper meticulously dissects various neural network architectures, including convolutional neural networks (CNNs), deep belief networks (DBNs), generative adversarial networks (GANs), variational autoencoders (VAEs), and recurrent neural networks (RNNs), detailing their contributions to processing human-centric data.

Overview of Deep Learning Architectures

DNNs have emerged as powerful tools due to their ability to learn hierarchical representations from large volumes of data, bypassing traditional 'hand-engineered' features. CNNs remain pivotal for vision tasks, excelling in object detection and image classification, with architectures like AlexNet, GoogLeNet, and ResNet setting benchmarks in performance on datasets such as ImageNet. Similarly, RNNs, particularly LSTMs, have become integral in speech recognition, overcoming limitations in modeling sequential data with temporal dependencies.
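The skip connections that let architectures like ResNet go deep can be illustrated with a minimal sketch (a fully-connected residual block in NumPy, purely illustrative; real ResNet blocks use convolutions and batch normalization):

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def residual_block(x, w1, w2):
    """A ResNet-style residual block: the input is added back to the
    transformed output, so the block only needs to learn a residual
    F(x) on top of the identity, which eases gradient flow in very
    deep networks."""
    out = relu(x @ w1)
    out = out @ w2
    return relu(out + x)  # the identity "skip connection"

rng = np.random.default_rng(0)
x = rng.standard_normal(8)
w1 = rng.standard_normal((8, 8)) * 0.1
w2 = rng.standard_normal((8, 8)) * 0.1
y = residual_block(x, w1, w2)
```

Note that with all weights zero the block reduces exactly to `relu(x)`, i.e. the identity path survives — the property that makes very deep stacks trainable.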

GANs and VAEs represent significant strides in generative modeling, enabling applications like image generation and improvements in image and speech synthesis quality. The paper notes innovations such as the Wasserstein GAN (WGAN), which addresses GAN training instability, offering insights into their practical deployment.
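The WGAN's stability fix can be summarized in two lines of math: the critic maximizes the gap between its mean scores on real and generated samples, with weights clipped to keep it roughly Lipschitz. A minimal sketch (illustrative only; the function and parameter names are ours, and later WGAN variants replace clipping with a gradient penalty):

```python
import numpy as np

def wgan_critic_loss(critic_real, critic_fake):
    """WGAN critic objective: maximize E[critic(real)] - E[critic(fake)].
    Written here as a loss to minimize (hence the negation). There is no
    log/sigmoid as in the original GAN loss, which avoids the saturating
    gradients blamed for training instability."""
    return -(np.mean(critic_real) - np.mean(critic_fake))

def clip_weights(weights, c=0.01):
    """Weight clipping from the original WGAN paper: a crude way to keep
    the critic approximately 1-Lipschitz after each update."""
    return np.clip(weights, -c, c)
```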

State-of-the-Art Applications

The authors explore real-world applications, such as speech-to-text systems and automatic image recognition, highlighting both achievements and obstacles in hardware-constrained environments. In speech recognition, deep automatic speech recognition (ASR) models have achieved breakthroughs, with architectures such as Deep Speech 2 approaching human-level accuracy in specific contexts.
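End-to-end ASR models in the Deep Speech family emit a per-frame distribution over characters plus a "blank" symbol, and a simple greedy CTC decode recovers the transcript by collapsing repeats and dropping blanks. A toy sketch (the numeric label-to-letter mapping is an assumption for the example, not the survey's):

```python
import itertools

def ctc_greedy_decode(frame_labels, blank=0):
    """Greedy CTC decoding: collapse consecutive repeated labels,
    then remove blank symbols. frame_labels is the per-frame argmax
    of the acoustic model's output distribution."""
    collapsed = [k for k, _ in itertools.groupby(frame_labels)]
    return [k for k in collapsed if k != blank]

# Assumed mapping for illustration: 0 = blank, 1..26 = letters a..z.
# Frame-level output spelling 'hh-e-ll-l-o' decodes to 'hello':
frames = [8, 8, 0, 5, 0, 12, 12, 0, 12, 15]
decoded = ctc_greedy_decode(frames)  # [8, 5, 12, 12, 15]
```

Note the blank between the two `l` runs: without it, the repeat-collapse step would merge them into a single letter, which is exactly why CTC needs the blank symbol.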

In vision applications, advancements in CNN-based models have catalyzed substantial improvements in facial recognition, scene labeling, and pose estimation tasks. The paper presents performance data, emphasizing reductions in error rates across different tasks and datasets, underlining the evolving capabilities of DNNs.

Challenges in Implementation

Despite impressive strides, the survey acknowledges hindrances in deploying complex deep learning models on resource-restricted hardware like mobile devices. The paper points out the computational and memory limitations of such systems, necessitating innovations in model compression and hardware-software co-design. Techniques like model pruning, quantization, and efficient architectures such as MobileNets are discussed as potential solutions to enhance the feasibility of deploying sophisticated DNNs in low-power environments.
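The two compression techniques named above can each be sketched in a few lines; this is a minimal NumPy illustration of magnitude pruning and symmetric int8 quantization, not the specific pipelines the survey reviews:

```python
import numpy as np

def magnitude_prune(w, sparsity=0.5):
    """Zero out the given fraction of smallest-magnitude weights.
    Sparse weights compress well and can skip multiply-accumulates
    on hardware that exploits sparsity."""
    k = int(sparsity * w.size)
    if k == 0:
        return w.copy()
    threshold = np.sort(np.abs(w).ravel())[k - 1]
    pruned = w.copy()
    pruned[np.abs(pruned) <= threshold] = 0.0
    return pruned

def quantize_int8(w):
    """Symmetric linear quantization of float weights to int8,
    returning the scale needed to dequantize: a 4x memory saving
    over float32 at some precision cost."""
    max_abs = np.abs(w).max()
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.round(w / scale).astype(np.int8)
    return q, scale
```

For example, pruning `[1.0, -0.1, 0.2, 3.0]` at 50% sparsity zeroes the two entries with the smallest magnitudes, keeping `1.0` and `3.0`.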

Theoretical and Practical Implications

The research underscores the importance of developing robust algorithms capable of learning from smaller datasets, addressing overfitting issues prevalent in deep models. It calls for future breakthroughs in handling high-dimensional data efficiently, particularly in 3D and 4D image processing, to further exploit the potential of DNNs in clinical applications.

From a practical perspective, the paper suggests that continued advancements in DNNs will further refine their role in numerous fields, including transportation, behavioral science, and medicine. The integration of DNNs in self-driving vehicles, precision medicine, and human-computer interaction frameworks is foreseen as a transformative influence on these disciplines.

The paper concludes with a cautious note on the interpretability and trust in AI systems, advocating that DNNs should be seen as complementary tools to human expertise rather than replacements. A balanced approach, recognizing the power and limitations of current AI systems, is essential as the shift towards intelligent systems continues.

Concluding Remarks

This survey provides a thorough analysis of the state of DNNs, both in terms of architectural developments and their application in speech and vision systems. As research and technology evolve, it will be crucial to address the computational challenges and maximize the utility of DNNs across diversified domains, fostering even broader adoption of AI-driven intelligent systems.
