Generalization of V-JEPA 2 Action Anticipation Beyond Kitchen Environments

Determine how well V-JEPA 2's human action anticipation, which the paper evaluates on the Epic-Kitchens-100 benchmark, generalizes to environments outside kitchens. This requires assessing performance on datasets drawn from non-kitchen domains and reporting comparable metrics (e.g., mean-class recall-at-5 for verb, noun, and action).

Background

The Epic-Kitchens-100 (EK100) benchmark consists of egocentric videos recorded in kitchen environments and is used in the paper to evaluate V-JEPA 2 on action anticipation with a 1-second horizon. V-JEPA 2 achieves strong performance on EK100 across the verb, noun, and action recall-at-5 metrics, and its performance improves with model size and input resolution.
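
To make the recall-at-5 metrics concrete, the sketch below computes a mean-class recall-at-k in plain Python: a sample counts as a hit if its true class is among the k highest-scoring predictions, recall is computed separately for each class that appears in the labels, and the per-class recalls are averaged. The function name `mean_class_recall_at_k` and the toy scores are illustrative assumptions, not code from the paper, and the paper's exact evaluation protocol (for example, how classes absent from the evaluation split are handled) may differ.

```python
from collections import defaultdict


def mean_class_recall_at_k(scores, labels, k=5):
    """Average, over classes present in `labels`, of the per-class recall@k.

    scores: list of per-sample score lists, one score per class index.
    labels: list of ground-truth class indices, one per sample.
    Hypothetical helper for illustration; not the paper's evaluation code.
    """
    hits_per_class = defaultdict(list)
    for row, label in zip(scores, labels):
        # Indices of the k highest-scoring classes for this sample.
        top_k = sorted(range(len(row)), key=lambda c: row[c], reverse=True)[:k]
        hits_per_class[label].append(label in top_k)
    # Recall for each class, then an unweighted mean across classes.
    per_class = [sum(hits) / len(hits) for hits in hits_per_class.values()]
    return sum(per_class) / len(per_class)


# Toy example with 3 classes and 4 samples (made-up numbers).
toy_scores = [
    [0.9, 0.1, 0.0],
    [0.2, 0.7, 0.1],
    [0.6, 0.3, 0.1],
    [0.1, 0.2, 0.7],
]
toy_labels = [0, 1, 1, 2]
print(mean_class_recall_at_k(toy_scores, toy_labels, k=1))
```

Averaging per class rather than per sample keeps frequent classes from dominating the score, which matters when comparing across datasets whose class distributions differ. The same function would be applied separately to verb, noun, and action predictions.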

However, EK100 is restricted to kitchen environments and uses a closed, well-defined vocabulary, raising concerns about how well results translate to other settings. The authors explicitly state that they do not know how well V-JEPA 2 generalizes to environments beyond kitchens, which leaves the model’s cross-domain robustness an unresolved question.

References

Third, the EK100 benchmark is limited to kitchen environments, with a closed well-defined vocabulary, and we do not know how well V-JEPA 2 generalizes to other environments.

V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning (2506.09985 - Assran et al., 11 Jun 2025) in Section: Prediction: Probe-based Action Anticipation, Limitations