
Are Vision Transformers More Data Hungry Than Newborn Visual Systems?

Published 5 Dec 2023 in cs.CV, cs.AI, cs.LG, and cs.NE | (2312.02843v1)

Abstract: Vision transformers (ViTs) are top-performing models on many computer vision benchmarks and can accurately predict human behavior on object recognition tasks. However, researchers question the value of using ViTs as models of biological learning because ViTs are thought to be more data hungry than brains, with ViTs requiring more training data to reach similar levels of performance. To test this assumption, we directly compared the learning abilities of ViTs and animals, by performing parallel controlled-rearing experiments on ViTs and newborn chicks. We first raised chicks in impoverished visual environments containing a single object, then simulated the training data available in those environments by building virtual animal chambers in a video game engine. We recorded the first-person images acquired by agents moving through the virtual chambers and used those images to train self-supervised ViTs that leverage time as a teaching signal, akin to biological visual systems. When ViTs were trained through the eyes of newborn chicks, the ViTs solved the same view-invariant object recognition tasks as the chicks. Thus, ViTs were not more data hungry than newborn visual systems: both learned view-invariant object representations in impoverished visual environments. The flexible and generic attention-based learning mechanism in ViTs combined with the embodied data streams available to newborn animals appears sufficient to drive the development of animal-like object recognition.


Summary

  • The paper demonstrates that vision transformers trained on time-based visual streams, replicating the rearing conditions of newborn chicks, can develop robust object recognition.
  • The paper compares architectures such as ViT-CoT and VideoMAE, revealing that larger models are not necessarily more data hungry in embodied visual tasks.
  • The paper challenges the perception of vision transformers as data hungry, suggesting that AI systems can learn in a biologically plausible manner.

Vision transformers (ViTs) represent a state-of-the-art approach in computer vision, outperforming many other models across tasks ranging from object recognition to visual navigation. ViTs also show deep computational similarities to human and animal brains: their image classifications and error patterns closely track those of biological visual systems. However, the research community has raised concerns about ViTs' dependence on vast amounts of training data, far more than biological systems seem to require. This apparent appetite has earned ViTs a reputation for being "data hungry," prompting skepticism about their efficiency relative to biological learning processes.

A recent study aims to close this gap by directly comparing the learning abilities of ViTs and a biological system, newborn chicks. The study uses a digital twin approach: virtual animal chambers were built in a video game engine to mirror the controlled visual environments in which the chicks were reared. First-person images recorded as an agent moved through the virtual space were then used to train self-supervised ViTs that use time as a teaching signal, closely modeling the learning conditions experienced by newborn chicks.
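To make the "time as a teaching signal" idea concrete, the sketch below shows one common way to implement it: an InfoNCE-style contrastive loss in which temporally adjacent frames of the egocentric video are treated as positive pairs. The encoder, batch layout, and temperature here are illustrative assumptions, not the paper's exact implementation.

    import torch
    import torch.nn.functional as F

    def time_contrastive_loss(encoder, frames_t, frames_t_plus_1, temperature=0.1):
        # frames_t and frames_t_plus_1 are (B, C, H, W) batches in which row i
        # of each tensor holds frames from adjacent timesteps of one video.
        z1 = F.normalize(encoder(frames_t), dim=1)         # (B, D) embeddings
        z2 = F.normalize(encoder(frames_t_plus_1), dim=1)  # (B, D) embeddings
        logits = z1 @ z2.T / temperature                   # (B, B) similarities
        targets = torch.arange(z1.size(0), device=z1.device)
        # Each frame should be most similar to its own temporal neighbor
        # (the diagonal of the matrix) and dissimilar to other frames.
        return F.cross_entropy(logits, targets)

Because nearby frames usually show the same object from slightly different viewpoints, minimizing such a loss pushes the network toward view-tolerant object features without any labels.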

The ViTs trained this way faced the same challenge of view-invariant object recognition posed to the chicks: identifying objects across novel viewpoints. When trained on these "through-the-eyes-of-a-chick" data streams, the ViTs solved the recognition tasks and developed animal-like object recognition. Both newborn chicks and ViTs learned robust visual features from the same impoverished environments, suggesting that, contrary to the data-hungry critique, transformers can learn about as efficiently as biological systems when exposed to rich temporal visual streams.
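The evaluation logic can be summarized with a simple linear-probe sketch: freeze the trained encoder, fit a linear classifier on embeddings of objects from familiar viewpoints, and test it on embeddings from held-out viewpoints. The function below is a hypothetical illustration of that protocol; the names and data shapes are placeholders, not the paper's test code.

    from sklearn.linear_model import LogisticRegression

    def view_invariance_accuracy(train_emb, train_labels, novel_emb, novel_labels):
        # train_emb: (N, D) frozen embeddings of objects seen from familiar
        # viewpoints; novel_emb: (M, D) embeddings of the same objects from
        # viewpoints the probe has never seen.
        probe = LogisticRegression(max_iter=1000).fit(train_emb, train_labels)
        # High accuracy on unseen viewpoints indicates that the encoder's
        # features generalize across views, i.e., are view invariant.
        return probe.score(novel_emb, novel_labels)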

The study went beyond a single ViT variant, testing different architectures: Vision Transformers trained with Contrastive Learning through Time (ViT-CoT) and Video Masked Autoencoders (VideoMAE). Each model was trained and evaluated on the visual tasks, and the results confirmed that both can learn efficiently in environments containing only a single object. The research also varied the number of training images and the model size, finding that larger ViTs were not necessarily more data hungry than smaller ones in this embodied visual context.
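For contrast with the temporal contrastive objective above, the sketch below illustrates the masked-autoencoding idea behind VideoMAE: hide most of a clip's spatiotemporal patches and train the model to reconstruct them. The mask ratio and the encoder/decoder names are stand-ins rather than the paper's configuration.

    import torch

    def random_patch_mask(num_patches, mask_ratio=0.9):
        # Return a boolean mask over a clip's patch tokens; True = hidden
        # from the encoder, so it must be reconstructed by the decoder.
        num_masked = int(num_patches * mask_ratio)
        perm = torch.randperm(num_patches)
        mask = torch.zeros(num_patches, dtype=torch.bool)
        mask[perm[:num_masked]] = True
        return mask

    # One training step, at the pseudocode level: the encoder sees only the
    # visible tokens, and a lightweight decoder predicts the pixels of the
    # masked patches.
    #   visible = tokens[~mask]
    #   loss = mse(decoder(encoder(visible), mask), pixels[mask])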

The findings bear on the long-standing debate about the development of cognition, proposing that a generic learning mechanism, rather than a collection of domain-specific systems, may suffice for acquiring high-level visual capacities. This suggests a path toward AI systems that learn more flexibly and rapidly, in ways closer to animal cognition.

While the results are compelling, the study acknowledges limitations, such as the models being trained passively, without the interactive data collection that is an essential aspect of active learning in biological systems. Addressing these constraints could open the way to "naturally intelligent" AI systems inspired directly by the learning mechanisms of animal cognition.
