
Video Action Transformer Network (VATN)

Updated 9 February 2026
  • The paper introduces VATN, which integrates Transformer-based attention with a two-stage Faster R-CNN pipeline to aggregate features from spatiotemporal context.
  • VATN employs an I3D trunk for feature extraction and a high-resolution Transformer head that uses multi-head attention for contextual reasoning and precise action localization.
  • Experiments on the AVA benchmark show that VATN achieves 24.93 mAP, outperforming prior models and demonstrating effective emergent tracking and focus on key human regions.

The Video Action Transformer Network (VATN) is a model for spatiotemporal human action recognition and localization in video, integrating the Transformer attention mechanism with region-based video understanding. Developed as an Action Transformer, VATN adapts Transformer architectures to aggregate features from spatiotemporal context specifically centered around person proposals, enabling recognition and localization using only raw RGB video frames and supervised by bounding boxes and class labels. VATN advances the state-of-the-art on the Atomic Visual Actions (AVA) benchmark with significant gains over previous models using a Faster R-CNN-style pipeline (Girdhar et al., 2018).

1. Model Architecture and Overall Pipeline

VATN employs a two-stage Faster R-CNN-style pipeline for spatiotemporal action localization in video:

  1. Trunk Network: The input is a T-frame RGB clip of spatial resolution H × W (T = 64, H = W = 400), centered on a key frame. Feature extraction uses the I3D (Inflated 3D ConvNet) trunk up to the Mixed_4f block, pretrained on Kinetics-400. The output feature tensor has reduced temporal and spatial resolution:

T' = T/4,\quad H' = H/16,\quad W' = W/16,\quad D_{\text{trunk}} \approx 1024

The central temporal slice (t = T'/2) is input to the Region Proposal Network (RPN).

  2. Region Proposal Network (RPN): The RPN identifies R person proposals in the central frame, ranked by objectness; at full scale, R = 300 is used.
  3. Head Networks:
    • I3D-Head (Baseline): Proposals are extended across time to form tubes, and spatiotemporal RoIPooling yields T' × 7 × 7 features. These are processed by the remaining I3D layers (Mixed_5a–5c), followed by linear classification and bounding-box regression.
    • Action Transformer Head (VATN): Each query is derived from the central frame only, while the full (T', H', W') feature volume provides the keys and values for the Transformer. Multi-head, multi-layer attention aggregates contextual information for human action classification and localization.
  4. Outputs: For each proposal, the network produces multi-label classification scores (via sigmoid cross-entropy) for the C = 80 AVA classes, alongside class-agnostic bounding-box regression (smooth-L1).
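As a rough sketch of the tensor shapes flowing through the pipeline above, using the clip dimensions quoted in the text (variable names are purely illustrative):

```python
# Shape bookkeeping for the two-stage pipeline described above.
# Input clip: T x H x W x 3 RGB frames, centered on a key frame.
T, H, W = 64, 400, 400

# The I3D trunk up to Mixed_4f downsamples time by 4 and space by 16.
T_p, H_p, W_p, D_trunk = T // 4, H // 16, W // 16, 1024
print((T_p, H_p, W_p, D_trunk))   # (16, 25, 25, 1024)

# The central temporal slice of the feature map feeds the RPN,
# which returns R person proposals at full scale.
central_slice_index = T_p // 2     # slice 8 of 16
R = 300

# Each proposal yields C classification logits and 4 box-regression offsets.
C = 80
outputs_per_proposal = (C, 4)
```

The trunk's stride factors (4 in time, 16 in space) are what turn a 64 × 400 × 400 clip into the 16 × 25 × 25 context volume the Transformer head attends over.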

2. Transformer-Based Attention Mechanism

The core of the VATN head is the Transformer attention block, designed for contextual reasoning in video. For each proposal r:

  • Input Variables:
    • Query: Q^{(r)} \in \mathbb{R}^D
    • Keys: K \in \mathbb{R}^{T' \times H' \times W' \times D}
    • Values: V \in \mathbb{R}^{T' \times H' \times W' \times D}
  • Attention Computation:

a^{(r)}_{x,y,t} = \frac{Q^{(r)} (K_{x,y,t})^T}{\sqrt{D}}

\alpha^{(r)}_{x,y,t} = \text{Softmax}_{x,y,t}(a^{(r)})

A^{(r)} = \sum_{x,y,t} \alpha^{(r)}_{x,y,t}\, V_{x,y,t} \in \mathbb{R}^D

Multi-head attention utilizes learned projections W^Q_h, W^K_h, W^V_h \in \mathbb{R}^{D \times d_k} and an output projection W^O \in \mathbb{R}^{H d_k \times D}, where H here denotes the number of heads:

\text{head}_h = \text{Attention}(Q W^Q_h,\, K W^K_h,\, V W^V_h) \in \mathbb{R}^{d_k}

\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \dots, \text{head}_H)\, W^O \in \mathbb{R}^{D}

  • Layering: Each Transformer unit applies multi-head attention, followed by add & layer normalization, a position-wise 2-layer MLP with ReLU, dropout, and normalization:

Q^{(r)\prime} = \text{LayerNorm}\left(Q^{(r)} + \text{Dropout}(\text{MultiHead}(Q^{(r)}, K, V))\right)

\text{FFN}(x) = \max(0,\, x W_1 + b_1)\, W_2 + b_2

Q^{(r)\prime\prime} = \text{LayerNorm}\left(Q^{(r)\prime} + \text{Dropout}(\text{FFN}(Q^{(r)\prime}))\right)

Stacking L such layers, each with H heads, enriches the query vector for subsequent prediction.
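The unit described above can be sketched in a few lines of NumPy for a single proposal's query attending over the flattened spatiotemporal volume. Random weights and toy sizes stand in for learned parameters, and the per-head scaling uses √d_k as in standard multi-head attention; this is an illustration, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def layer_norm(x, eps=1e-5):
    return (x - x.mean()) / np.sqrt(x.var() + eps)

# Toy sizes (far smaller than the paper's 16 x 25 x 25 x 1024 volume).
Tp, Hp, Wp, D = 4, 5, 5, 16        # T', H', W', feature dim
n_heads, d_k = 2, D // 2

q = rng.standard_normal(D)                   # query for one proposal
kv = rng.standard_normal((Tp * Hp * Wp, D))  # flattened keys/values volume

# Random projections standing in for W^Q_h, W^K_h, W^V_h, W^O.
Wq = rng.standard_normal((n_heads, D, d_k)) / np.sqrt(D)
Wk = rng.standard_normal((n_heads, D, d_k)) / np.sqrt(D)
Wv = rng.standard_normal((n_heads, D, d_k)) / np.sqrt(D)
Wo = rng.standard_normal((n_heads * d_k, D)) / np.sqrt(n_heads * d_k)

def multi_head(q, kv):
    heads = []
    for h in range(n_heads):
        qh, kh, vh = q @ Wq[h], kv @ Wk[h], kv @ Wv[h]
        attn = softmax(kh @ qh / np.sqrt(d_k))  # one weight per (x, y, t) cell
        heads.append(attn @ vh)                 # attention-weighted sum of values
    return np.concatenate(heads) @ Wo

# One Transformer unit: attention, residual + norm, then FFN, residual + norm.
W1, b1 = rng.standard_normal((D, 2 * D)) / np.sqrt(D), np.zeros(2 * D)
W2, b2 = rng.standard_normal((2 * D, D)) / np.sqrt(2 * D), np.zeros(D)

q1 = layer_norm(q + multi_head(q, kv))
q2 = layer_norm(q1 + np.maximum(0, q1 @ W1 + b1) @ W2 + b2)
print(q2.shape)  # (16,)
```

Stacking L copies of this unit (with independent weights) gives the multi-layer head; dropout is omitted here for brevity.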

3. High-Resolution, Class-Agnostic Query Encoding

VATN's query representation for each proposal is constructed via a HighRes Query Preprocessor (QPr):

  1. Extract a 7 × 7 × D_trunk RoIPooled feature from the central frame.
  2. Apply a 1×1 convolution to reduce the depth to C_q channels.
  3. Flatten the 7 × 7 spatial grid to a vector of length 49 C_q.
  4. Use a learned linear layer to obtain a D-dimensional query vector for the Transformer.

Each query Q^{(r)} remains class-agnostic, representing only the individual it is centered on. Via classification supervision alone, the model is compelled to learn body parts, track individuals, and focus on semantically important regions (hands, faces, and objects) across space-time, without instance- or part-level supervision.
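The four QPr steps can be traced with a NumPy sketch; sizes are reduced for illustration (the paper uses D_trunk = 1024), and the 1×1 convolution is written as the per-cell linear map it is equivalent to:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes; D_trunk is 1024 in the actual trunk.
D_trunk, C_q, D = 64, 8, 32

# Step 1: 7x7 RoIPooled feature from the central frame for one proposal.
roi = rng.standard_normal((7, 7, D_trunk))

# Step 2: a 1x1 convolution is a per-cell linear map reducing depth to C_q.
W_reduce = rng.standard_normal((D_trunk, C_q)) / np.sqrt(D_trunk)
reduced = roi @ W_reduce            # (7, 7, C_q)

# Step 3: flatten the 7x7 spatial grid to a vector of length 49 * C_q.
flat = reduced.reshape(-1)          # (49 * C_q,)

# Step 4: learned linear layer producing the D-dimensional query.
W_query = rng.standard_normal((49 * C_q, D)) / np.sqrt(49 * C_q)
query = flat @ W_query
print(query.shape)  # (32,)
```

Keeping the 7 × 7 grid until the final flatten is what makes the query "high-resolution": spatial layout inside the box survives into the query vector.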

4. Spatiotemporal Positional Encoding

To mitigate the permutation invariance of the Transformer, VATN incorporates explicit position information:

For each feature cell (x,y,t)(x, y, t), the system computes normalized coordinates:

p_{xy} = \left(\frac{x}{H'} - \frac{1}{2},\; \frac{y}{W'} - \frac{1}{2}\right),\quad q_t = \frac{t - (T'/2)}{T'}

Spatial and temporal positions are separately embedded via 2-layer MLPs:

\ell^{\text{spatial}}(p_{xy}) \in \mathbb{R}^{d_p},\qquad \ell^{\text{temporal}}(q_t) \in \mathbb{R}^{d_t}

The concatenated positional embedding L_{x,y,t} is appended to each feature cell, giving:

F_{x,y,t} \leftarrow [F_{x,y,t};\, L_{x,y,t}] \in \mathbb{R}^{D_{\text{trunk}} + d_p + d_t}

Keys and values for the Transformer are derived via linear projection from this augmented feature map, and queries inherit spatial cues accordingly.
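The normalized-coordinate computation can be sketched directly; here the raw coordinates are concatenated onto the feature cells as a stand-in for the two small MLP embeddings (sizes are illustrative):

```python
import numpy as np

Tp, Hp, Wp = 4, 5, 5   # illustrative T', H', W'

# Normalized coordinates for every feature cell, per the definitions above:
# spatial offsets relative to the frame center, temporal offset from the key frame.
x, y, t = np.meshgrid(np.arange(Hp), np.arange(Wp), np.arange(Tp), indexing="ij")
p_x = x / Hp - 0.5
p_y = y / Wp - 0.5
q_t = (t - Tp / 2) / Tp

coords = np.stack([p_x, p_y, q_t], axis=-1)   # (Hp, Wp, Tp, 3)

# In the model these coordinates pass through the 2-layer MLPs and the
# resulting embedding L_{x,y,t} is appended to each feature cell; here the
# raw 3-vector plays that role.
D_trunk = 16
features = np.zeros((Hp, Wp, Tp, D_trunk))
augmented = np.concatenate([features, coords], axis=-1)
print(augmented.shape)  # (5, 5, 4, 19)
```

Note that q_t is exactly zero at the key-frame slice t = T'/2, so the temporal embedding encodes signed distance from the key frame.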

5. Loss Formulation

VATN uses the following multi-task loss for each proposal r:

  • Multi-label Classification:

\mathcal{L}_{\text{cls}} = -\sum_{c=1}^{C} \left[ y_c \log \sigma(s_c) + (1 - y_c) \log (1 - \sigma(s_c)) \right]

where s_c are the class logits, y_c \in \{0, 1\} are the ground-truth labels, and \sigma is the sigmoid function.

  • Bounding-Box Regression:

\mathcal{L}_{\text{reg}} = \sum_{i \in \{x, y, w, h\}} \text{Smooth}_{L_1}(t_i - t^*_i)

Only positive proposals contribute to regression loss.

  • Combined Loss:

\mathcal{L} = \frac{1}{N} \sum_{r=1}^{N} \left[ \mathcal{L}_{\text{cls}}^{(r)} + \lambda\, \mathcal{L}_{\text{reg}}^{(r)} \right]

with \lambda = 1 in practice.
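A minimal NumPy sketch of this combined loss over a small batch of proposals (random logits and labels; the positive/negative split is illustrative):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def bce_multilabel(logits, labels):
    # Per-class sigmoid cross-entropy, summed over the C classes.
    p = sigmoid(logits)
    return -np.sum(labels * np.log(p) + (1 - labels) * np.log(1 - p))

def smooth_l1(d):
    # Standard smooth-L1: quadratic for |d| < 1, linear beyond.
    d = np.abs(d)
    return np.where(d < 1, 0.5 * d**2, d - 0.5).sum()

rng = np.random.default_rng(0)
N, C, lam = 4, 80, 1.0

logits = rng.standard_normal((N, C))
labels = (rng.random((N, C)) < 0.05).astype(float)   # sparse multi-label targets
box_deltas = rng.standard_normal((N, 4)) * 0.1       # t_i - t_i^* per proposal
is_positive = np.array([1, 1, 0, 1])                 # only positives get reg loss

loss = np.mean([
    bce_multilabel(logits[r], labels[r])
    + lam * is_positive[r] * smooth_l1(box_deltas[r])
    for r in range(N)
])
print(float(loss))
```

The `is_positive` mask reflects the rule stated above: negative proposals contribute only to the classification term.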

6. Training Procedures and Hyperparameters

  • Initialization: I3D trunk pre-trained on Kinetics-400; all new layers initialized randomly. BatchNorm in I3D is frozen.
  • Data Augmentation: Random horizontal flips and random spatial crops to 400×400 to counteract overfitting.
  • Optimization: Synchronized SGD over 10 GPUs (effective batch size 30), initial learning rate 0.01 (warmup to 0.1, then cosine annealing over 500k iterations). Some experiments use shorter schedules (300k) with ground-truth boxes.
  • Transformer Configuration: D = 128, dropout rate 0.3, typically 2 heads × 3 layers.
  • Proposals: R = 300 (full scale); R = 64 for ablations.

7. Performance and Ablation Results

Quantitative Outcomes on AVA (v2.1)

| Head / Setting | Action classification mAP | Localization mAP (IoU ≥ 0.5) |
|---|---|---|
| I3D head (GT boxes, 64 proposals) | 23.4 | 92.9 |
| Transformer, LowRes query | 29.1 | 77.5 |
| Transformer, HighRes query | 27.6 | 87.7 |
| I3D head (RPN, 300 proposals) | 20.5 | |
| Transformer, HighRes query (RPN) | 24.4 | |
| Combined (regression + classification) | 24.9 | |

Test set performance: VATN achieves 24.93 mAP (test), outperforming the prior best ensemble-free RGB+flow result (21.08 mAP) by 3.85 points.

Ablation Studies

  • Regression: Switching from class-agnostic to class-specific regression reduces mAP (21.3 → 19.2).
  • Data Augmentation: Removing augmentation lowers mAP (21.3 → 16.6).
  • Pretraining: Training from scratch (no Kinetics) yields 19.1 mAP (vs. 21.3 with pretraining).
  • Depth/Width Trade-off (GT boxes): Best results are with 6 layers × 2 heads (29.1 mAP).

Emergent Tracking and Context

Without explicit supervision, the action transformer head learns to:

  • Track individuals over frames by clustering body pixel attentions.
  • Distinguish between nearby people as instance-specific keys emerge.
  • Emphasize hands, faces, and manipulated objects in its attention, supporting fine-grained action classification.

These properties emerge from repeated attention of each query over the full spatiotemporal feature volume, combined with only final action classification supervision; tracking and body-part segmentation are not directly supervised (Girdhar et al., 2018).

References

Girdhar, R., Carreira, J., Doersch, C., & Zisserman, A. (2018). Video Action Transformer Network. arXiv:1812.02707.
