
Visual Wake Words Dataset

Published 12 Jun 2019 in cs.CV and eess.IV (arXiv:1906.05721v1)

Abstract: The emergence of Internet of Things (IoT) applications requires intelligence on the edge. Microcontrollers provide a low-cost compute platform to deploy intelligent IoT applications using machine learning at scale, but have extremely limited on-chip memory and compute capability. To deploy computer vision on such devices, we need tiny vision models that fit within a few hundred kilobytes of memory footprint in terms of peak usage and model size on device storage. To facilitate the development of microcontroller friendly models, we present a new dataset, Visual Wake Words, that represents a common microcontroller vision use-case of identifying whether a person is present in the image or not, and provides a realistic benchmark for tiny vision models. Within a limited memory footprint of 250 KB, several state-of-the-art mobile models achieve accuracy of 85-90% on the Visual Wake Words dataset. We anticipate the proposed dataset will advance the research on tiny vision models that can push the pareto-optimal boundary in terms of accuracy versus memory usage for microcontroller applications.


Summary

  • The paper introduces a novel dataset designed for person detection on microcontrollers, addressing memory and compute limitations.
  • It benchmarks several CNN architectures, showing that optimized models can achieve up to 90% accuracy despite ultra-low resource usage.
  • The study highlights the potential of tailored edge AI solutions, setting a precedent for efficient, low-power vision applications in IoT.

An Analysis of the Visual Wake Words Dataset for Microcontroller Vision

The paper entitled "Visual Wake Words Dataset" addresses a pivotal facet of integrating intelligence into IoT devices through the deployment of computer vision models on microcontrollers. Despite the constrained compute capabilities, memory, and energy resources of microcontrollers, they present an economically viable solution for numerous IoT applications requiring low-power and low-latency inference. The authors propose the Visual Wake Words Dataset to benchmark vision models aimed at identifying the presence of a person within an image — a typical microcontroller application. This dataset is derived from the publicly available COCO dataset, providing a realistic benchmark for tiny vision models under stringent memory requirements.

Core Contributions

The paper delineates the challenges inherent in deploying CNNs on microcontrollers, whose memory (typically 100--320 KB of SRAM) and flash storage (256 KB--1 MB) are severely limited. Accordingly, the paper targets vision models that fit within a 250 KB memory footprint while achieving high inference accuracy at under 60 million multiply-add operations per inference.
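A simple way to reason about the SRAM side of this constraint is to estimate peak activation memory layer by layer. The sketch below is a hedged simplification (not the paper's methodology): it assumes int8 activations and that only the current layer's input and output tensors are resident at once; the layer shapes are illustrative, loosely inspired by a small MobileNet-style stem.

```python
# Estimate peak activation memory for a sequential CNN, assuming int8
# activations (1 byte each) and that only the input and output tensors
# of the current layer must be live simultaneously.
def peak_activation_memory(layer_shapes):
    """layer_shapes: list of (height, width, channels) tensor shapes,
    with layer_shapes[0] being the input image."""
    peak = 0
    for i in range(1, len(layer_shapes)):
        h_in, w_in, c_in = layer_shapes[i - 1]
        h_out, w_out, c_out = layer_shapes[i]
        live = h_in * w_in * c_in + h_out * w_out * c_out  # bytes
        peak = max(peak, live)
    return peak

# Illustrative shapes (hypothetical, not from the paper).
shapes = [(96, 96, 3), (48, 48, 8), (24, 24, 16), (12, 12, 32)]
print(peak_activation_memory(shapes) / 1024, "KB")  # 45.0 KB
```

Under these assumptions the peak is dominated by the earliest, highest-resolution layers, which is why input resolution is such a strong lever for fitting a model into a few hundred kilobytes of SRAM.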

The central contribution is the Visual Wake Words dataset itself, which labels each image by the presence or absence of a person, a binary task analogous to audio wake-word detection. The authors underscore the inadequacy of existing benchmarks (e.g., ImageNet and CIFAR-10) for microcontroller applications, citing the unnecessary breadth of ImageNet's 1000 classes and the low 32x32 resolution of CIFAR-10 images. The new dataset therefore enables the development of small yet capable models suited to microcontrollers.
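Since the dataset is derived from COCO, the relabeling step can be sketched as a simple rule over COCO-style annotations: an image is labeled "person" if at least one person annotation exceeds a minimum fraction of the image area. This is a hedged reconstruction, not the authors' exact script; the 0.5% threshold and the synthetic annotation dicts below are illustrative (against the real dataset one would load annotations via `pycocotools`).

```python
def visual_wake_word_label(annotations, img_w, img_h, person_cat_id=1,
                           min_area_frac=0.005):
    """Return 1 if any 'person' annotation covers more than
    min_area_frac of the image area, else 0. Annotations are
    COCO-style dicts with 'category_id' and 'area' fields."""
    img_area = img_w * img_h
    for ann in annotations:
        if ann["category_id"] == person_cat_id:
            if ann["area"] / img_area > min_area_frac:
                return 1
    return 0

# Synthetic COCO-style annotations for a 640x480 image.
anns = [{"category_id": 1, "area": 5000},    # person, ~1.6% of image
        {"category_id": 18, "area": 90000}]  # non-person, ignored
print(visual_wake_word_label(anns, 640, 480))  # 1
```

The area threshold matters: without it, images containing only tiny, barely visible people would be labeled positive, making the benchmark noisier than the "is someone in front of the camera" use-case it is meant to represent.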

Experimental Evaluation

The authors conduct extensive experiments using MobileNet V1, MobileNet V2, MNasNet, and ShuffleNet architectures to benchmark their models on both the ImageNet and the newly developed Visual Wake Words dataset. The experiments elucidate the trade-offs between accuracy and parameters such as peak memory usage, model size, and computational complexity (multiply-adds per inference).

Their results show that models scoring below 60% accuracy on ImageNet can nonetheless reach up to 90% accuracy on the Visual Wake Words person-classification task when appropriately constrained. This demonstrates that neural networks tailored to a narrow task can serve microcontroller-based vision workloads both effectively and economically.

Significance and Future Directions

The introduction of the Visual Wake Words dataset has important implications for edge AI. It promotes highly specialized models that fit within the severe constraints of microcontroller systems, pushing the boundaries of ultra-low-power inference. By focusing on the critical intersection of accuracy and resource efficiency, it sets a precedent for practical edge AI applications and encourages a rethinking of vision-model design paradigms.

Looking forward, this work invites further study of compression techniques, model architecture optimization, and alternative quantization methods to fit increasingly capable models within the restrictive confines of microcontroller hardware. The authors also suggest that extending the dataset toward other on-device vision tasks, such as object detection, could broaden IoT adoption and open further research avenues in AI-powered edge computing.
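As one concrete instance of the quantization direction, post-training int8 quantization cuts model storage roughly 4x relative to float32. The sketch below shows symmetric per-tensor quantization in its simplest form; it is a minimal illustration under those assumptions, not the authors' pipeline (production toolchains such as TensorFlow Lite add per-channel scales and calibration):

```python
def quantize_int8(weights):
    """Symmetric per-tensor int8 quantization: map floats into
    [-127, 127] using a single scale derived from the max magnitude."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return scale, q

def dequantize(scale, q):
    """Recover approximate float weights from the quantized values."""
    return [scale * v for v in q]

weights = [0.51, -0.24, 0.08, -0.635]
scale, q = quantize_int8(weights)
recovered = dequantize(scale, q)
# Storage drops from 4 bytes/weight (float32) to 1 byte/weight (int8),
# at the cost of rounding error bounded by half the scale.
```

On a microcontroller this matters twice over: int8 weights shrink the flash footprint toward the 256 KB floor, and integer arithmetic is far cheaper than floating point on cores that lack an FPU.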

In summary, this paper lays foundational work that can significantly catalyze the advancement in deploying AI models on microcontrollers, fortifying the role of AI in the evolving landscape of ubiquitous IoT applications.
