AOR-Net: Multimodal Driver Action Recognition
- The paper introduces the AOR-Net framework that leverages multi-level reasoning and dynamic feature fusion through its novel Mixture-of-Thoughts module.
- It integrates a Chain-of-Action Prompting module with a textual prototype bank to align driver, object, and relation features for robust action recognition.
- Empirical evaluations on the DAOS dataset reveal significant improvements in Top-1 and Mean-1 accuracy, especially in distinguishing similar in-cabin actions.
The Action-Object-Relation Network (AOR-Net) is a driver action recognition framework designed for multimodal, multi-view in-cabin behavior monitoring. AOR-Net models the logical dependencies among driver actions, object interactions, and their relationships, addressing the challenge of distinguishing visually similar driver actions by leveraging contextual object cues. The architecture integrates multi-level reasoning, prompt-based feature refinement, and dynamic selection of relevant features via a novel Mixture-of-Thoughts module. Evaluated on the DAOS dataset, AOR-Net demonstrates superior action recognition performance compared to prior methods, notably in object-rich and object-scarce scenarios (Li et al., 17 Jan 2026).
1. Architectural Overview
AOR-Net is architected atop an Open-VCLIP backbone (CLIP using ViT-B/32, pre-trained on K400), extended by three key modules:
- Chain-of-Action Prompting (CoA) Module: Implements sequential multi-level reasoning in three stages:
- Action-level reasoning (global and spatial tokens extraction)
- Object-level reasoning (region-of-interest features for object tokens)
- Relation-level reasoning (pairwise human–object token interactions)
- Textual Prototype Bank: Stores CLIP-encoded descriptors for action, object, and relation categories, generated with GPT-4o prompting.
- Mixture-of-Thoughts (MoT) Module: Dynamically aligns, weights, and fuses multi-level visual and textual features.
The processing pipeline consists of video encoding via CLIP, iterative refinement via CoA, cross-modal feature alignment against the textual prototype bank, dynamic feature fusion by MoT, and classification via fully connected layers.
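The pipeline above can be sketched end to end as a chain of stage functions. This is a minimal NumPy sketch with hypothetical stand-in functions (`clip_encode`, `coa_refine`, `mot_fuse`) and random placeholder weights; none of these names or sizes come from the paper, which does not publish this interface.

```python
import numpy as np

rng = np.random.default_rng(3)
d, n_frames, n_classes = 32, 8, 12   # illustrative sizes only

def clip_encode(video):
    """Stand-in for the CLIP visual encoder: one token per frame."""
    return rng.standard_normal((n_frames, d))

def coa_refine(tokens):
    """Stand-in for Chain-of-Action reasoning: collapse to a clip feature."""
    return tokens.mean(axis=0)

def mot_fuse(feature, prototypes):
    """Stand-in for Mixture-of-Thoughts fusion with textual prototypes."""
    return feature + prototypes.mean(axis=0)

prototype_bank = rng.standard_normal((5, d))            # CLIP-encoded text prototypes
W_cls = rng.standard_normal((d, n_classes)) / np.sqrt(d)  # final FC classifier

logits = mot_fuse(coa_refine(clip_encode(None)), prototype_bank) @ W_cls
assert logits.shape == (n_classes,)
```

The point of the sketch is the data flow, not the internals: each stage consumes the previous stage's output in the shared embedding space before the final fully connected classifier.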
2. Mathematical Formulation and Module Dynamics
Given a video clip of sampled frames, the objects detected per frame (up to a fixed maximum), and ground-truth action labels, the model operates entirely in a shared embedding space of fixed dimension. Within this space, the reasoning levels operate on the global class token and spatial tokens from CLIP, together with the object and relation token embeddings.
- Action-level: the global class token and spatial tokens are taken directly from the CLIP encoder; no further transformation is applied.
- Object-level: region-of-interest features for each detected object are projected into the shared embedding space as object tokens.
- Relation-level:
- Compose pairs: a relation token is formed for each human–object pair by combining the driver and object embeddings.
- Refine via cross-attention: the paired relation tokens are refined with a cross-attention operation.
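The relation-level steps can be illustrated concretely. The following NumPy sketch composes one token per human–object pair and refines it with single-head scaled dot-product cross-attention over the object tokens; the pair projection `W_pair`, the choice of attending over object tokens, and all sizes are illustrative assumptions, not the paper's exact design.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64                      # embedding dimension (illustrative)
n_obj = 4                   # detected objects in the clip

h = rng.standard_normal(d)              # driver (human) token
objs = rng.standard_normal((n_obj, d))  # object tokens

# Compose one relation token per human-object pair by projecting the
# concatenated pair back to d dims (a stand-in for the relation encoder).
W_pair = rng.standard_normal((2 * d, d)) / np.sqrt(2 * d)
relations = np.concatenate([np.tile(h, (n_obj, 1)), objs], axis=1) @ W_pair

def cross_attention(q, kv, dk):
    """Single-head scaled dot-product cross-attention (no learned maps)."""
    scores = q @ kv.T / np.sqrt(dk)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ kv

# Refine relation tokens by attending over the object tokens.
refined = cross_attention(relations, objs, d)
assert refined.shape == (n_obj, d)
```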
All modules share the textual prototype bank, which encodes the action, object, and relation categories via CLIP's text encoder, using contextually enriched GPT-4o prompts.
3. Chain-of-Action Prompting and Textual Prototype Bank
The Chain-of-Action prompting module enforces a hierarchical decomposition—progressing from global scene and action features, to specific object embeddings, to explicit modeling of human–object relations. The process is tightly coupled with the textual prototype bank:
- Prototype Generation: GPT-4o generates cabin-aware descriptions for each action, each object, and each action–object relation; these are encoded by CLIP's text encoder and stored as the action, object, and relation prototype sets.
- Cross-modal Alignment: Each new set of visual tokens (action, object, relation) is aligned against its textual prototypes via similarity matrices and Gumbel-Softmax–derived differentiable one-hot selection.
Algorithmic procedures ensure cabin-context sensitivity at all levels of reasoning, leveraging both language and visual information for robust representations.
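The cross-modal alignment step can be sketched as follows: cosine similarities between visual tokens and textual prototypes, then a hard Gumbel-Softmax selection of one prototype per token. This is a minimal NumPy illustration; in a training framework the hard one-hot would use the straight-through trick to remain differentiable, and all sizes here are made up.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n_tokens, n_protos = 32, 3, 5   # illustrative sizes

tokens = rng.standard_normal((n_tokens, d))   # visual tokens at one level
protos = rng.standard_normal((n_protos, d))   # CLIP-encoded text prototypes

def l2norm(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Cosine-similarity alignment matrix between tokens and prototypes.
sim = l2norm(tokens) @ l2norm(protos).T       # (n_tokens, n_protos)

def gumbel_softmax_hard(logits, tau, rng):
    """Hard (one-hot) Gumbel-Softmax sample; gradient-friendly via the
    straight-through estimator in an autodiff framework."""
    g = -np.log(-np.log(rng.uniform(size=logits.shape)))
    y = np.exp((logits + g) / tau)
    y /= y.sum(axis=-1, keepdims=True)
    onehot = np.zeros_like(y)
    onehot[np.arange(len(y)), y.argmax(axis=-1)] = 1.0
    return onehot

mask = gumbel_softmax_hard(sim, tau=0.5, rng=rng)
selected = mask @ protos     # each token picks one textual prototype
assert selected.shape == tokens.shape
```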
4. Mixture-of-Thoughts Module
The Mixture-of-Thoughts (MoT) module enhances feature salience and robustness by dynamically weighting and fusing features across reasoning levels:
- Alignment: similarity matrices are computed between the visual tokens and their textual prototypes at each reasoning level.
- Differentiable One-Hot Selection: Gumbel-Softmax yields hard, but gradient-friendly, alignment masks.
- Feature Construction: each feature vector is updated with its aligned textual prototype.
- Dynamic Weight Generation: all level features are flattened and concatenated, then passed through an MLP with softmax to produce per-level fusion weights.
- Final Feature Synthesis: the level features are fused as a weighted sum under these weights.
This process enables adaptive attention to the most informative action, object, or relational patterns, improving discriminative power under varying object-scene contexts. Empirical ablations confirm that Gumbel-Softmax selection (with a tuned temperature) outperforms plain softmax in feature selection.
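The weight-generation and fusion steps can be sketched in a few lines. The MLP weights below are random placeholders for learned parameters, and the hidden width is an arbitrary illustrative choice.

```python
import numpy as np

rng = np.random.default_rng(2)
d = 16
levels = {                      # per-level features (illustrative values)
    "action":   rng.standard_normal(d),
    "object":   rng.standard_normal(d),
    "relation": rng.standard_normal(d),
}

# Flatten and concatenate all level features, then map to one weight
# per level with a small MLP followed by a softmax.
concat = np.concatenate(list(levels.values()))          # (3d,)
W1 = rng.standard_normal((3 * d, 8)) / np.sqrt(3 * d)
W2 = rng.standard_normal((8, 3)) / np.sqrt(8)
hidden = np.maximum(concat @ W1, 0.0)                   # ReLU
logits = hidden @ W2
alphas = np.exp(logits - logits.max())
alphas /= alphas.sum()                                  # per-level weights

# Final feature: weighted sum of the level features.
fused = sum(a * f for a, f in zip(alphas, levels.values()))
assert np.isclose(alphas.sum(), 1.0)
```

Because the weights are produced from the concatenated features themselves, the fusion adapts per clip: object-rich clips can up-weight object and relation features, while object-scarce clips can fall back on the action-level feature.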
5. Loss Function and Optimization
AOR-Net is trained with the standard cross-entropy loss over video clips, $\mathcal{L} = -\sum_{c} y_c \log \hat{p}_c$, where $y$ is the one-hot ground-truth action label and $\hat{p}$ is the model's softmax prediction. Training employs the AdamW optimizer over 30 epochs, with a stepped learning-rate schedule.
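For concreteness, the per-clip cross-entropy reduces to the negative log softmax probability of the ground-truth class. A minimal, numerically stable NumPy version (with made-up three-class logits):

```python
import numpy as np

def cross_entropy(logits, label):
    """Cross-entropy for one clip: -log softmax probability of the
    ground-truth action class, computed in a numerically stable way."""
    z = logits - logits.max()                  # shift for stability
    log_probs = z - np.log(np.exp(z).sum())    # log softmax
    return -log_probs[label]

logits = np.array([2.0, 0.5, -1.0])   # illustrative 3-class scores
loss = cross_entropy(logits, label=0)  # ~0.2413
```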
6. Architectural Hyperparameters and Ablations
Key architectural parameters:
- CLIP ViT-B/32 backbone
- A fixed number of sampled video frames, each with its grid of CLIP spatial tokens
- Maximum 6 objects per clip
- Relation encoder: 5-layer MLP, hidden width 1024
- 12 attention heads (ViT-B standard)
- Gumbel-Softmax temperature (value tuned by ablation)
Ablation studies identify six as the optimal number of objects per clip. Relation encoder performance peaks with 5 layers and a 512 hidden width, as measured by Mean-1 accuracy. Adding CoA alone improves Top-1 accuracy, with a further gain when both the CoA and MoT modules are enabled.
7. Experimental Evaluation on DAOS
AOR-Net is evaluated on the DAOS dataset, which comprises 9,787 clips (74 hours), 36 fine-grained and 12 coarse action classes, 15 object classes, and 2.58 million object boxes. Data includes RGB, IR, and depth modalities from four synchronized camera views, with splits by driver (32 train, 6 validation, 6 test).
Performance metrics include Top-1, Top-5, and Mean-1 accuracy. On single-modality input (RGB, fine-grained), AOR-Net outperforms Open-VCLIP in both Top-1 and Mean-1 accuracy; with multimodal input (RGB+IR+Depth, coarse-grained), it again surpasses Open-VCLIP on both metrics. These results reflect the benefits of multi-level reasoning, dynamic feature fusion, and a focus on task-relevant object relations. Improvements are consistent across object-rich and object-scarce test conditions (Li et al., 17 Jan 2026).