- The paper introduces a hybrid Vision Transformer and Vision Mamba model that effectively captures spatial and temporal gaze patterns for autism diagnosis.
- It employs advanced feature extraction and attention-based multimodal fusion, integrating eye movement with facial and speech cues for enhanced performance.
- The framework achieves high diagnostic metrics, including 96% accuracy and 95% F1-score, while ensuring transparency through explainable AI components.
The paper "Hybrid Vision Transformer-Mamba Framework for Autism Diagnosis via Eye-Tracking Analysis" presents a novel approach to diagnosing Autism Spectrum Disorder (ASD) using advanced deep learning techniques, specifically Vision Transformers (ViT) and Vision Mamba models. The approach leverages eye-tracking data to identify distinctive gaze patterns associated with ASD, highlighting the potential for scalable and interpretable screening tools, especially in resource-constrained environments.
Methodology and Contributions
The research proposes a hybrid model combining ViT and Vision Mamba to analyze both spatial and temporal dynamics in eye-tracking data. Key contributions of this study include:
- ViT-Mamba Model Development: The hybrid model captures spatial fixation maps and long-range temporal attention. The ViT component extracts spatial features through self-attention, while the Vision Mamba component handles sequential gaze dynamics, modeling the time-dependent eye movement patterns crucial for accurate diagnosis (a minimal architecture sketch follows this list).
- Saliency4ASD Dataset Enhancement: The authors enrich the dataset with a broader range of gaze samples, improving the model's generalization and diagnostic robustness across varied demographic groups.
- Advanced Feature Extraction: Instead of relying on traditional handcrafted features, the framework learns meaningful spatiotemporal gaze representations linked to ASD directly from the data (a fixation-map preprocessing sketch appears after this list).
- Multimodal Data Fusion: Attention-based fusion strategies integrate eye movement with facial-expression and speech cues, bolstering the model's diagnostic accuracy and reliability (see the cross-attention sketch below).
- Explainability Mechanisms: Explainable AI components are embedded in the model to make its predictions transparent, an essential requirement for ethical use in clinical diagnostics (a simple saliency example closes the sketches below).
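The paper itself does not publish code, but the overall wiring of such a hybrid can be illustrated. The PyTorch sketch below is a minimal, assumed layout: a ViT-style encoder summarizes each fixation map, and a sequential module models gaze dynamics over time. All class names, dimensions, and the GRU stand-in for the Mamba block are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch only: a ViT-style spatial encoder per frame, followed by
# a sequential temporal module and a binary classification head.
import torch
import torch.nn as nn

class SpatialViTEncoder(nn.Module):
    """Encodes one fixation map (1 x H x W) via patch embedding + self-attention."""
    def __init__(self, img_size=224, patch=16, dim=256, depth=4, heads=8):
        super().__init__()
        n_patches = (img_size // patch) ** 2
        self.patch_embed = nn.Conv2d(1, dim, kernel_size=patch, stride=patch)
        self.pos = nn.Parameter(torch.zeros(1, n_patches, dim))
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)

    def forward(self, x):                       # x: (B, 1, H, W)
        tokens = self.patch_embed(x).flatten(2).transpose(1, 2)   # (B, N, dim)
        return self.encoder(tokens + self.pos).mean(dim=1)        # (B, dim)

class ViTMambaClassifier(nn.Module):
    """Spatial encoding per time step, then temporal modeling, then a 2-way head."""
    def __init__(self, dim=256):
        super().__init__()
        self.spatial = SpatialViTEncoder(dim=dim)
        # GRU used here only as a runnable stand-in for the Mamba block; with
        # the mamba_ssm package installed, mamba_ssm.Mamba(d_model=dim) could
        # replace it to obtain a genuine selective state-space module.
        self.temporal = nn.GRU(dim, dim, batch_first=True)
        self.head = nn.Linear(dim, 2)           # ASD vs. typically developing

    def forward(self, seq):                     # seq: (B, T, 1, H, W)
        B, T = seq.shape[:2]
        feats = self.spatial(seq.flatten(0, 1)).view(B, T, -1)    # (B, T, dim)
        out, _ = self.temporal(feats)
        return self.head(out[:, -1])            # logits from the last time step
```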
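For the feature-extraction stage, a common preprocessing step (assumed here, since the paper does not detail its pipeline) is converting raw gaze coordinates into a smoothed fixation density map that the spatial encoder can consume. The grid size and Gaussian width below are illustrative choices:

```python
# Hypothetical preprocessing: raw (x, y) gaze samples -> smoothed density map.
import numpy as np
from scipy.ndimage import gaussian_filter

def fixation_map(gaze_xy, height=224, width=224, sigma=8.0):
    """gaze_xy: (N, 2) array of pixel coordinates; returns an (H, W) density map."""
    grid = np.zeros((height, width), dtype=np.float32)
    for x, y in gaze_xy:
        xi, yi = int(round(x)), int(round(y))
        if 0 <= xi < width and 0 <= yi < height:
            grid[yi, xi] += 1.0                  # accumulate raw fixation counts
    grid = gaussian_filter(grid, sigma=sigma)    # smooth counts into a density
    return grid / grid.max() if grid.max() > 0 else grid
```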
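The attention-based fusion could take the form of cross-attention, with gaze features querying the facial and speech streams. The layout and dimensions below are assumptions for illustration, not the paper's published design:

```python
# Hedged sketch of attention-based multimodal fusion via cross-attention.
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, gaze, face, speech):       # each: (B, T_modality, dim)
        context = torch.cat([face, speech], dim=1)       # keys/values from other cues
        fused, _ = self.attn(gaze, context, context)     # gaze queries face + speech
        return self.norm(gaze + fused)                   # residual connection + norm
```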
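Finally, one lightweight way to expose what drives a prediction, shown here only as a stand-in for the paper's unspecified explainability components, is an input-gradient saliency map over the classifier sketched above:

```python
# Illustrative explainability step: per-pixel |gradient| of the ASD logit with
# respect to the input gaze-map sequence. The paper's actual mechanism may
# differ (e.g., attention rollout or Grad-CAM variants).
import torch

def saliency(model, seq, target_class=1):
    """seq: (1, T, 1, H, W) gaze-map sequence; returns a (T, 1, H, W) saliency volume."""
    seq = seq.clone().requires_grad_(True)
    logits = model(seq)
    logits[0, target_class].backward()           # gradient of the target logit
    return seq.grad.abs().squeeze(0)             # high values = influential pixels
```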
Results
The proposed ViT-Mamba framework was evaluated on the Saliency4ASD dataset and achieved 0.96 accuracy, 0.95 F1-score, 0.97 sensitivity, and 0.94 specificity, surpassing existing methodologies at distinguishing individuals with ASD. The combination of high sensitivity and specificity underscores the model's clinical potential for early, non-invasive, and scalable ASD screening (the short function below shows how these metrics relate).
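For reference, all four reported metrics follow directly from confusion-matrix counts. The function below uses the standard definitions; the example counts in the usage line are hypothetical and merely land near the paper's reported values:

```python
# Standard classification metrics from confusion-matrix counts.
def metrics(tp, fp, tn, fn):
    sensitivity = tp / (tp + fn)                 # recall on the ASD class
    specificity = tn / (tn + fp)                 # recall on the control class
    precision   = tp / (tp + fp)
    accuracy    = (tp + tn) / (tp + fp + tn + fn)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return accuracy, f1, sensitivity, specificity

print(metrics(tp=97, fp=6, tn=94, fn=3))         # ~ (0.955, 0.956, 0.970, 0.940)
```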
Implications and Future Directions
The proposed framework has significant practical implications in the field of ASD diagnostics, especially for remote or underserved clinical settings where traditional diagnostic resources are limited. The integration of eye-tracking technology with advanced AI models not only enhances diagnostic accuracy but also provides a pathway for automated screening systems that could alleviate the burden on clinical practitioners and facilitate early intervention strategies.
Theoretically, the study advances the application of ViTs and Vision Mamba in the domain of healthcare, showcasing their efficacy in modeling complex behavioral data. Future research could explore the extension of this model to incorporate additional physiological and cognitive data, further refining its diagnostic capabilities.
AI methods for neurodevelopmental disorders are likely to keep evolving, potentially leading to more personalized and effective healthcare solutions. Researchers might investigate broader applications of such deep learning frameworks across different conditions and demographics, ensuring adaptability and inclusiveness in diagnostic practice.
In conclusion, this paper represents a substantial step forward in the use of AI for ASD diagnosis. By blending cutting-edge technical approaches with practical applicability, it paves the way for meaningful changes in how ASD is detected and managed globally.