AOR-Net: Multimodal Driver Action Recognition
- The paper introduces the AOR-Net framework that leverages multi-level reasoning and dynamic feature fusion through its novel Mixture-of-Thoughts module.
- It integrates a Chain-of-Action Prompting module with a textual prototype bank to align driver, object, and relation features for robust action recognition.
- Empirical evaluations on the DAOS dataset reveal significant improvements in Top-1 and Mean-1 accuracy, especially in distinguishing similar in-cabin actions.
The Action-Object-Relation Network (AOR-Net) is a driver action recognition framework designed for multimodal, multi-view in-cabin behavior monitoring. AOR-Net models the logical dependencies among driver actions, object interactions, and their relationships, addressing the challenge of distinguishing visually similar driver actions by leveraging contextual object cues. The architecture integrates multi-level reasoning, prompt-based feature refinement, and dynamic selection of relevant features via a novel Mixture-of-Thoughts module. Evaluated on the DAOS dataset, AOR-Net demonstrates superior action recognition performance compared to prior methods, notably in object-rich and object-scarce scenarios (Li et al., 17 Jan 2026).
1. Architectural Overview
AOR-Net is architected atop an Open-VCLIP backbone (CLIP using ViT-B/32, pre-trained on K400), extended by three key modules:
- Chain-of-Action Prompting (CoA) Module: Implements sequential multi-level reasoning in three stages:
- Action-level reasoning (global and spatial tokens extraction)
- Object-level reasoning (region-of-interest features for object tokens)
- Relation-level reasoning (pairwise human–object token interactions)
- Textual Prototype Bank: Stores CLIP-encoded descriptors for action, object, and relation categories, generated with GPT-4o prompting.
- Mixture-of-Thoughts (MoT) Module: Dynamically aligns, weights, and fuses multi-level visual and textual features.
The processing pipeline consists of video encoding via CLIP, iterative refinement via CoA, cross-modal feature alignment against the textual prototype bank, dynamic feature fusion by MoT, and classification via fully connected layers.
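The pipeline above can be sketched end to end as a chain of stage functions. This is a minimal NumPy sketch with hypothetical stand-in functions (`clip_encode`, `coa_refine`, `mot_fuse`) and random placeholder weights; none of these names or sizes come from the paper, which does not publish this interface.

```python
import numpy as np

rng = np.random.default_rng(3)
d, n_frames, n_classes = 32, 8, 12   # illustrative sizes only

def clip_encode(video):
    """Stand-in for the CLIP visual encoder: one token per frame."""
    return rng.standard_normal((n_frames, d))

def coa_refine(tokens):
    """Stand-in for Chain-of-Action reasoning: collapse to a clip feature."""
    return tokens.mean(axis=0)

def mot_fuse(feature, prototypes):
    """Stand-in for Mixture-of-Thoughts fusion with textual prototypes."""
    return feature + prototypes.mean(axis=0)

prototype_bank = rng.standard_normal((5, d))            # CLIP-encoded text prototypes
W_cls = rng.standard_normal((d, n_classes)) / np.sqrt(d)  # final FC classifier

logits = mot_fuse(coa_refine(clip_encode(None)), prototype_bank) @ W_cls
assert logits.shape == (n_classes,)
```

The point of the sketch is the data flow, not the internals: each stage consumes the previous stage's output in the shared embedding space before the final fully connected classifier.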
2. Mathematical Formulation and Module Dynamics
Given a video clip of sampled frames, the objects detected per frame (up to a fixed maximum), and ground-truth action labels, the model operates entirely in a shared embedding space of fixed dimension. Within this space, the reasoning levels operate on the global class token and spatial tokens from CLIP, together with the object and relation token embeddings.
- Action-level: the global class token and spatial tokens are taken directly from the CLIP encoder; no further transformation is applied.
- Object-level: region-of-interest features for each detected object are projected into the shared embedding space as object tokens.
- Relation-level:
- Compose pairs: a relation token is formed for each human–object pair by combining the driver and object embeddings.
- Refine via cross-attention: the paired relation tokens are refined with a cross-attention operation.
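The relation-level steps can be illustrated concretely. The following NumPy sketch composes one token per human–object pair and refines it with single-head scaled dot-product cross-attention over the object tokens; the pair projection `W_pair`, the choice of attending over object tokens, and all sizes are illustrative assumptions, not the paper's exact design.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64                      # embedding dimension (illustrative)
n_obj = 4                   # detected objects in the clip

h = rng.standard_normal(d)              # driver (human) token
objs = rng.standard_normal((n_obj, d))  # object tokens

# Compose one relation token per human-object pair by projecting the
# concatenated pair back to d dims (a stand-in for the relation encoder).
W_pair = rng.standard_normal((2 * d, d)) / np.sqrt(2 * d)
relations = np.concatenate([np.tile(h, (n_obj, 1)), objs], axis=1) @ W_pair

def cross_attention(q, kv, dk):
    """Single-head scaled dot-product cross-attention (no learned maps)."""
    scores = q @ kv.T / np.sqrt(dk)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ kv

# Refine relation tokens by attending over the object tokens.
refined = cross_attention(relations, objs, d)
assert refined.shape == (n_obj, d)
```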
All modules share the textual prototype bank, which encodes the action, object, and relation categories via CLIP's text encoder, using contextually enriched GPT-4o prompts.
3. Chain-of-Action Prompting and Textual Prototype Bank
The Chain-of-Action prompting module enforces a hierarchical decomposition—progressing from global scene and action features, to specific object embeddings, to explicit modeling of human–object relations. The process is tightly coupled with the textual prototype bank:
- Prototype Generation: GPT-4o generates cabin-aware descriptions for each action, each object, and each action–object relation; these are encoded by CLIP's text encoder and stored as the action, object, and relation prototype sets.
- Cross-modal Alignment: Each new set of visual tokens (action, object, relation) is aligned against its textual prototypes via similarity matrices and Gumbel-Softmax–derived differentiable one-hot selection.
Algorithmic procedures ensure cabin-context sensitivity at all levels of reasoning, leveraging both language and visual information for robust representations.
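The cross-modal alignment step can be sketched as follows: cosine similarities between visual tokens and textual prototypes, then a hard Gumbel-Softmax selection of one prototype per token. This is a minimal NumPy illustration; in a training framework the hard one-hot would use the straight-through trick to remain differentiable, and all sizes here are made up.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n_tokens, n_protos = 32, 3, 5   # illustrative sizes

tokens = rng.standard_normal((n_tokens, d))   # visual tokens at one level
protos = rng.standard_normal((n_protos, d))   # CLIP-encoded text prototypes

def l2norm(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Cosine-similarity alignment matrix between tokens and prototypes.
sim = l2norm(tokens) @ l2norm(protos).T       # (n_tokens, n_protos)

def gumbel_softmax_hard(logits, tau, rng):
    """Hard (one-hot) Gumbel-Softmax sample; gradient-friendly via the
    straight-through estimator in an autodiff framework."""
    g = -np.log(-np.log(rng.uniform(size=logits.shape)))
    y = np.exp((logits + g) / tau)
    y /= y.sum(axis=-1, keepdims=True)
    onehot = np.zeros_like(y)
    onehot[np.arange(len(y)), y.argmax(axis=-1)] = 1.0
    return onehot

mask = gumbel_softmax_hard(sim, tau=0.5, rng=rng)
selected = mask @ protos     # each token picks one textual prototype
assert selected.shape == tokens.shape
```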
4. Mixture-of-Thoughts Module
The Mixture-of-Thoughts (MoT) module enhances feature salience and robustness by dynamically weighting and fusing features across reasoning levels:
- Alignment: similarity matrices are computed between the visual tokens and their textual prototypes at each reasoning level.
- Differentiable One-Hot Selection: Gumbel-Softmax yields hard, but gradient-friendly, alignment masks.
- Feature Construction: each feature vector is updated with its aligned textual prototype.
- Dynamic Weight Generation: all level features are flattened and concatenated, then passed through an MLP with softmax to produce per-level fusion weights.
- Final Feature Synthesis: the level features are fused as a weighted sum under these weights.
This process enables adaptive attention to the most informative action, object, or relational patterns, improving discriminative power under varying object-scene contexts. Empirical ablations confirm that Gumbel-Softmax selection (with a tuned temperature) outperforms plain softmax in feature selection.
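The weight-generation and fusion steps can be sketched in a few lines. The MLP weights below are random placeholders for learned parameters, and the hidden width is an arbitrary illustrative choice.

```python
import numpy as np

rng = np.random.default_rng(2)
d = 16
levels = {                      # per-level features (illustrative values)
    "action":   rng.standard_normal(d),
    "object":   rng.standard_normal(d),
    "relation": rng.standard_normal(d),
}

# Flatten and concatenate all level features, then map to one weight
# per level with a small MLP followed by a softmax.
concat = np.concatenate(list(levels.values()))          # (3d,)
W1 = rng.standard_normal((3 * d, 8)) / np.sqrt(3 * d)
W2 = rng.standard_normal((8, 3)) / np.sqrt(8)
hidden = np.maximum(concat @ W1, 0.0)                   # ReLU
logits = hidden @ W2
alphas = np.exp(logits - logits.max())
alphas /= alphas.sum()                                  # per-level weights

# Final feature: weighted sum of the level features.
fused = sum(a * f for a, f in zip(alphas, levels.values()))
assert np.isclose(alphas.sum(), 1.0)
```

Because the weights are produced from the concatenated features themselves, the fusion adapts per clip: object-rich clips can up-weight object and relation features, while object-scarce clips can fall back on the action-level feature.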
5. Loss Function and Optimization
AOR-Net is trained with the standard cross-entropy loss over video clips, $\mathcal{L} = -\sum_{c} y_c \log \hat{p}_c$, where $y$ is the one-hot ground-truth action label and $\hat{p}$ is the model's softmax prediction. Training employs the AdamW optimizer over 30 epochs, with a stepped learning-rate schedule.
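For concreteness, the per-clip cross-entropy reduces to the negative log softmax probability of the ground-truth class. A minimal, numerically stable NumPy version (with made-up three-class logits):

```python
import numpy as np

def cross_entropy(logits, label):
    """Cross-entropy for one clip: -log softmax probability of the
    ground-truth action class, computed in a numerically stable way."""
    z = logits - logits.max()                  # shift for stability
    log_probs = z - np.log(np.exp(z).sum())    # log softmax
    return -log_probs[label]

logits = np.array([2.0, 0.5, -1.0])   # illustrative 3-class scores
loss = cross_entropy(logits, label=0)  # ~0.2413
```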
6. Architectural Hyperparameters and Ablations
Key architectural parameters:
- CLIP ViT-B/32 backbone
- A fixed number of sampled video frames, each with its grid of CLIP spatial tokens
- Maximum 6 objects per clip
- Relation encoder: 5-layer MLP, hidden width 1024
- 12 attention heads (ViT-B standard)
- Gumbel-Softmax temperature (value tuned by ablation)
Ablation studies identify six as the optimal number of objects per clip. Relation encoder performance peaks with 5 layers and a 512 hidden width, as measured by Mean-1 accuracy. Adding CoA alone improves Top-1 accuracy, with a further gain when both the CoA and MoT modules are enabled.
7. Experimental Evaluation on DAOS
AOR-Net is evaluated on the DAOS dataset, which comprises 9,787 clips (74 hours), 36 fine-grained and 12 coarse action classes, 15 object classes, and 2.58 million object boxes. Data includes RGB, IR, and depth modalities from four synchronized camera views, with splits by driver (32 train, 6 validation, 6 test).
Performance metrics include Top-1, Top-5, and Mean-1 accuracy. On single-modality input (RGB, fine-grained), AOR-Net outperforms Open-VCLIP in both Top-1 and Mean-1 accuracy; with multimodal input (RGB+IR+Depth, coarse-grained), it again surpasses Open-VCLIP on both metrics. These results reflect the benefits of multi-level reasoning, dynamic feature fusion, and a focus on task-relevant object relations. Improvements are consistent across object-rich and object-scarce test conditions (Li et al., 17 Jan 2026).