Applying Spatiotemporal Attention to Identify Distracted and Drowsy Driving with Vision Transformers

Published 22 Jul 2022 in cs.CV and cs.AI | (2207.12148v1)

Abstract: A 20% rise in car crashes in 2021 compared to 2020 has been observed as a result of increased distraction and drowsiness. Drowsy and distracted driving are the cause of 45% of all car crashes. As a means to decrease drowsy and distracted driving, detection methods using computer vision can be designed to be low-cost, accurate, and minimally invasive. This work investigated the use of the vision transformer to outperform state-of-the-art accuracy from 3D-CNNs. Two separate transformers were trained for drowsiness and distractedness. The drowsy video transformer model was trained on the National Tsing-Hua University Drowsy Driving Dataset (NTHU-DDD) with a Video Swin Transformer model for 10 epochs on two classes -- drowsy and non-drowsy simulated over 10.5 hours. The distracted video transformer was trained on the Driver Monitoring Dataset (DMD) with Video Swin Transformer for 50 epochs over 9 distraction-related classes. The accuracy of the drowsiness model reached 44% and a high loss value on the test set, indicating overfitting and poor model performance. Overfitting indicates limited training data and applied model architecture lacked quantifiable parameters to learn. The distracted model outperformed state-of-the-art models on DMD reaching 97.5%, indicating that with sufficient data and a strong architecture, transformers are suitable for unfit driving detection. Future research should use newer and stronger models such as TokenLearner to achieve higher accuracy and efficiency, merge existing datasets to expand to detecting drunk driving and road rage to create a comprehensive solution to prevent traffic crashes, and deploying a functioning prototype to revolutionize the automotive safety industry.

Abstract PDF Upgrade to Chat

Citations (1)

View on Semantic Scholar

Summary

The paper introduces vision transformer-based models employing spatiotemporal attention to detect distracted and drowsy driving.
It utilizes the Video Swin Transformer and benchmark datasets (NTHU-DDD, DMD) to evaluate performance, achieving 97.5% accuracy in distraction detection.
The study highlights overfitting challenges in drowsiness detection while emphasizing the potential of transformer models for enhancing automotive safety.

Spatiotemporal Attention for Unfit Driving Detection Using Vision Transformers

Introduction

The application of computer vision techniques in the automotive industry, specifically for detecting distracted and drowsy driving, is explored through the deployment of vision transformers. In response to the increasing rate of traffic accidents involving distracted or drowsy drivers—comprising 45% of car crashes—this study proposes the utilization of vision transformers as a low-cost, accurate, and minimally invasive detection mechanism. Previous methods employing EEGs, alcohol monitors, and lane detection systems are critiqued for being costly and impractical for comprehensive detection across various unfit driving categories. Vision transformers, as introduced in recent literature, hold significant promise for addressing these challenges by leveraging sophisticated attention mechanisms and adaptable processing architectures.

Figure 1: Difference in processing methods for convolutional neural networks and transformers. CNNs extract edges while the transformer extracts the most important spatial information from the input data.

Methodology

The study investigates two distinct scenarios using video-based classification models: drowsiness detection and distraction recognition. Utilizing the Video Swin Transformer architecture, known for its advanced handling of spatiotemporal data through shifted windows and tubelet embeddings, the research employs datasets NTHU-DDD and DMD for model training and evaluation.

Drowsy Driving Detection

For detecting drowsiness, the Video Swin Transformer processes video sequences from the National Tsing Hua University Drowsy Driving Dataset (NTHU-DDD). The data undergoes preprocessing stages including dimensionality reduction via CNN-based feature maps and encoding through transformer layers to capture per-pixel details. The model is trained over 10 epochs on a Tesla K80 GPU, striving for an optimal balance between computational efficiency and prediction accuracy.

Figure 2: Illustration of the vision transformer architecture adapted from \cite{Dosovitskiy2021}, inferencing on spatiotemporal data to output a prediction of drowsy/alert.

Distracted Driving Detection

In the distraction recognition scenario, the Driver Monitoring Dataset (DMD) serves as the benchmark, embodying nine classes related to driver distractions. The Video Swin Transformer utilizes a comprehensive patch extraction method, forming embeddings through overlapping spatial-temporal windows. This setup considers higher GFLOPs, accommodating the architecture's complexity and higher data fidelity requirements. Model training on a Tesla T4 GPU underscores the computational demands while delivering superior accuracy.

Results

The experiments reveal distinct outcomes for the two scenarios tested. The drowsy driving transformer struggled to surpass existing CNN approaches on NTHU-DDD, achieving only 44% accuracy—indicative of overfitting and insufficient data representation (exemplified by high loss values).

Figure 3: Training accuracy of the vision transformer on NTHU-DDD compared to the previous state-of-the-art with a CNN-based architecture.

Figure 4: Training and validation cross-entropy loss over 10 epochs for detecting drowsy driving on NTHU-DDD with Video Swin Transformer.

Conversely, for distracted driving detection on DMD, the Video Swin Transformer achieved a notable 97.5% accuracy, a marginal yet significant improvement over the prior state-of-the-art results, confirming the model's adequacy for distraction-related categorizations.

Figure 5: Training accuracy of the Video Swin Transformer on DMD compared to the previous state-of-the-art \cite{canas_dmd}.

Conclusion

This study presents an initial exploration into deploying vision transformers for detecting various forms of unfit driving, with promising outcomes particularly for distraction recognition. The capability of vision transformers to outperform traditional CNN methods in certain contexts underscores their potential in automotive safety systems. Nonetheless, challenges remain in extending such success across all categories of unfit driving, particularly for drowsiness detection. Future work should focus on enriching datasets, incorporating newer transformer variants like TokenLearner, and integrating models into scalable real-world solutions, including mobile device deployments. As automation and driver monitoring systems advance, these techniques may substantially impact automotive safety measures and legal requirements mandating driver monitoring systems.

Markdown Report Issue