AssistGPT: Teaching AI to Plan, Act, and Learn from Visual Tasks

This presentation introduces AssistGPT, a multi-modal AI assistant that addresses complex visual reasoning tasks through an innovative PEIL framework: Plan, Execute, Inspect, Learn. Unlike traditional approaches that convert visual content to text, AssistGPT uses natural language planning to orchestrate external models and APIs, managing visual information dynamically while learning from its own execution history. The system achieves state-of-the-art results on challenging benchmarks like A-OKVQA and NExT-QA, demonstrating how autonomous exploration and memory management enable AI to tackle intricate visual queries that go far beyond standard benchmark scenarios.
Script
Visual reasoning tasks present a paradox: they require both flexible decomposition of complex queries and dynamic handling of images and videos, yet most AI systems force visual content through a rigid text conversion pipeline that loses critical information before reasoning even begins.
Traditional multi-modal systems convert visual content to text using captioners and object detectors, discarding the very information needed for deeper reasoning. The authors of AssistGPT recognized that complex visual queries demand a different architecture entirely.
They designed a system where natural language itself becomes the planning mechanism.
The Planner decides which tool to invoke next based on how the reasoning has progressed so far. The Executor validates each call and runs the external model or API. But here's the crucial innovation: the Inspector maintains a memory of visual inputs and intermediate results, delivering exactly the right information to each tool at exactly the right moment, while the Learner explores on its own and catalogs which reasoning paths succeed.
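To make the Inspector's role concrete, here is a minimal sketch of the kind of memory it maintains. The names (`VisualRecord`, `Inspector`, `describe_all`) are illustrative assumptions, not the paper's actual API: the key idea shown is that the Planner reasons over symbolic handles and short summaries, never over raw pixels.

```python
from dataclasses import dataclass, field

@dataclass
class VisualRecord:
    """One entry in the Inspector's memory: a handle to a visual
    source plus a short text summary the Planner can reason over."""
    name: str         # symbolic handle, e.g. "image-1"
    source_type: str  # "image", "video", or "intermediate"
    summary: str      # brief metadata or caption, not the full content

@dataclass
class Inspector:
    """Hypothetical sketch: stores visual inputs and intermediate
    results, and serves the right record to whichever tool asks."""
    memory: dict = field(default_factory=dict)

    def record(self, name: str, source_type: str, summary: str) -> None:
        self.memory[name] = VisualRecord(name, source_type, summary)

    def describe_all(self) -> str:
        # The text the Planner actually sees: handles plus summaries.
        return "\n".join(f"{r.name} ({r.source_type}): {r.summary}"
                         for r in self.memory.values())

    def fetch(self, name: str) -> VisualRecord:
        # A tool retrieves the full record only when it needs it.
        return self.memory[name]
```

The design choice this illustrates: visual content stays in memory and is only summarized for the language-based Planner, so nothing is irreversibly flattened to text up front.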
This diagram reveals why the architecture matters. Unlike code-based planning that rigidly sequences API calls, or pure language planning that loses visual grounding, PEIL maintains a live dialogue between planning, execution, and inspection. The system doesn't just run a predetermined script. It adapts its reasoning path based on what it discovers in the visual content, creating a feedback loop that mirrors how humans approach complex visual problems.
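That feedback loop can be sketched in a few lines. This is a hedged reconstruction under stated assumptions, not the paper's implementation: `planner`, `executor`, and `inspector` are hypothetical callables standing in for the real components, and the step dictionary keys are invented for illustration.

```python
def peil_loop(question, planner, executor, inspector, max_steps=8):
    """Illustrative PEIL control loop: Plan, Execute, Inspect, repeat.

    planner(question, memory_summary, history) -> dict describing the
        next action, e.g. {"action": "caption", "inputs": [...],
        "output_name": "cap-1"} or {"action": "answer", "content": ...}
    executor(action, inputs) -> result of running the external tool
    inspector -> object with describe_all() and record(name, kind, result)
    """
    history = []  # prior (step, result) pairs the Planner conditions on
    for _ in range(max_steps):
        # Plan: pick the next action from the question, the Inspector's
        # memory summary, and everything tried so far.
        step = planner(question, inspector.describe_all(), history)
        if step["action"] == "answer":
            return step["content"]  # reasoning path is complete
        # Execute: invoke the chosen external tool on the named inputs.
        result = executor(step["action"], step["inputs"])
        # Inspect: file the intermediate result so later steps can use it.
        inspector.record(step["output_name"], "intermediate", result)
        history.append((step, result))
    return None  # no answer within the step budget
```

Note how the next plan step is computed after each execution rather than fixed up front, which is what lets the system adapt its path to what it finds in the visual content.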
On established benchmarks, AssistGPT doesn't just match existing systems; it surpasses them. More remarkably, the system tackles intricate visual queries that standard benchmarks never anticipated, precisely because it can plan, learn, and adapt rather than execute a fixed transformation pipeline.
AssistGPT demonstrates that the future of multi-modal AI lies not in converting everything to text, but in teaching systems to orchestrate reasoning dynamically, learning from their own visual explorations. Visit EmergentMind.com to learn more and create your own research videos.