Learning from Streaming Video with Orthogonal Gradients
This paper introduces a novel approach to representation learning from continuous video streams using orthogonal gradients. Traditional methods for video learning often rely on dividing videos into clips that are shuffled during training to satisfy the IID (independent and identically distributed) assumption, a staple of current training paradigms. However, when videos are only available as continuous input streams, the IID assumption is violated and performance degrades. The authors address this challenge with a geometric modification to standard optimizers: each gradient is made orthogonal to that of the preceding step, decorrelating consecutive batches during training.
Key Contributions
Performance Analysis in Sequential Learning: The paper begins by quantifying the substantial performance drop when switching from shuffled to sequential training on multiple tasks, including the one-video representation learning method DoRA, standard VideoMAE on multi-video datasets, and future video prediction.
Orthogonal Gradient Optimization: The authors propose a modification that can be applied to any optimizer — demonstrated with Stochastic Gradient Descent (SGD) and AdamW. The modification decorrelates gradients between consecutive training steps by projecting each new gradient onto the subspace orthogonal to the previous one, mitigating the performance drops that arise when training sequences are temporally correlated.
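The geometric idea can be sketched in a few lines. The snippet below is a minimal illustration, not the paper's implementation: it assumes the reference direction is simply the previous step's gradient (the paper may instead use a running average of past gradients), and the function name `orthogonal_gradient` is a hypothetical helper for illustration.

```python
import numpy as np

def orthogonal_gradient(grad, prev_grad, eps=1e-8):
    """Remove from `grad` the component parallel to `prev_grad`.

    Consecutive gradients computed from a temporally correlated stream
    tend to point in similar directions; keeping only the component of
    the new gradient orthogonal to the previous one decorrelates the
    updates. Hypothetical sketch of the idea, not the paper's code.
    """
    norm = np.linalg.norm(prev_grad)
    if norm < eps:  # no usable history yet: fall back to the raw gradient
        return grad
    unit = prev_grad / norm
    return grad - np.dot(grad, unit) * unit

# Toy check: a gradient parallel to the previous one is cancelled,
# while an already-orthogonal gradient passes through unchanged.
g_prev = np.array([1.0, 0.0])
print(orthogonal_gradient(np.array([2.0, 0.0]), g_prev))  # -> [0. 0.]
print(orthogonal_gradient(np.array([0.0, 3.0]), g_prev))  # -> [0. 3.]
```

In an optimizer such as SGD or AdamW, this projection would be applied to the raw gradient before the usual update rule, which is what makes the modification optimizer-agnostic.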
Empirical Evidence Across Scenarios: Through extensive evaluations across three video learning scenarios — learning from a single video, multi-video datasets, and future frame prediction — the orthogonal optimizer consistently outperforms the baseline AdamW optimizer, highlighting its effectiveness in these sequential learning tasks.
Insights from Various Experiments: Through ablations on batch processing strategies and on the effect of keeping only the orthogonal gradient component, the paper offers insight into how orthogonal gradients mitigate the adverse effects of temporal correlation in video streams.
Numerical Results
The orthogonal optimizer markedly improved learning from video streams. For instance, in the single-video DoRA setting with downstream ImageNet evaluation, plain AdamW training collapsed, whereas Orthogonal-AdamW improved performance by over 50%. In both the VideoMAE and future-prediction tasks, the orthogonal optimizer likewise outperformed standard optimization, indicating substantial benefits when processing streamed videos across task settings.
Implications and Future Developments
This research has both practical and theoretical implications. Practically, learning efficiently from streaming data is crucial for robotics, autonomous vehicles, and any domain that requires real-time processing of video streams. Theoretically, the approach challenges the traditional reliance on IID assumptions and points toward new paradigms for sequential and continual learning.
Future research may integrate orthogonal gradient optimization with more advanced learning paradigms and explore practical applications such as privacy-preserving on-device learning, which removes the need to store and shuffle data. Combining it with advanced video representation models, neural architecture search, or domain adaptation techniques could further broaden its utility across diverse settings.
In summary, this paper lays the groundwork for orthogonal gradients as a promising solution to the unique challenges posed by continuous video streams, and offers a new outlook on how future video models, mimicking human visual perception, might be conceived and developed.