The N-Body Problem: Parallel Execution from Single-Person Egocentric Video

Published 12 Dec 2025 in cs.CV | (2512.11393v1)

Abstract: Humans can intuitively parallelise complex activities, but can a model learn this from observing a single person? Given one egocentric video, we introduce the N-Body Problem: how N individuals, can hypothetically perform the same set of tasks observed in this video. The goal is to maximise speed-up, but naive assignment of video segments to individuals often violates real-world constraints, leading to physically impossible scenarios like two people using the same object or occupying the same space. To address this, we formalise the N-Body Problem and propose a suite of metrics to evaluate both performance (speed-up, task coverage) and feasibility (spatial collisions, object conflicts and causal constraints). We then introduce a structured prompting strategy that guides a Vision-LLM (VLM) to reason about the 3D environment, object usage, and temporal dependencies to produce a viable parallel execution. On 100 videos from EPIC-Kitchens and HD-EPIC, our method for N = 2 boosts action coverage by 45% over a baseline prompt for Gemini 2.5 Pro, while simultaneously slashing collision rates, object and causal conflicts by 55%, 45% and 55% respectively.

Abstract PDF Upgrade to Chat

Summary

The paper introduces a novel formulation of the N-Body Problem to synthesize multi-agent parallel executions from egocentric video while optimizing speed-up and task coverage.
The paper employs a structured prompting framework for vision-language models, achieving up to 91.3% action coverage and significantly reducing collision and conflict rates.
The paper demonstrates practical applications in collaborative robotics and digital twins by enabling rapid prototyping of feasible execution plans from unscripted video data.

The N-Body Problem: Multi-Agent Parallel Execution from Egocentric Video

Problem Formulation and Motivation

This paper formalizes the N-Body Problem, which is defined as synthesizing hypothetical multi-agent parallel executions from a single-person egocentric video. The canonical objective is to maximize speed-up and task coverage when “dividing” the original set of tasks among $N$ agents; however, purely greedy or naive segment assignment inevitably violates critical real-world constraints, including physical spatial collisions, simultaneous object usage, and causal ordering. The central insight is that unscripted egocentric videos encompass latent opportunities for concurrency, but robust parallelization requires holistic spatiotemporal, semantic, and causal reasoning—well beyond simple resource-constrained scheduling.

Figure 1: Egocentric input, predicted 2-body and 3-body parallel executions, with significant speed-up and spatial coverage visualized in 3D.

Methodology and Architectural Innovations

To tackle the N-Body Problem, the authors propose a suite of evaluation metrics for both performance (speed-up, coverage) and feasibility (collision, object conflict, causal correctness). The input is parsed into non-overlapping assignable segments, which must be allocated to agents such that spatial constraints (collision-free occupancy), object constraints (exclusive instance usage), and causality constraints (temporal dependencies among actions) are respected.

A key contribution is the development of a structured prompting framework for Vision-LLMs (VLMs), specifically Gemini 2.5 Pro, instructing the model to reason over 3D spatial environments, track objects, and maintain causal temporal structure. The spatial prompt divides the physical kitchen plane into zones and provides explicit guidance on zone occupancy durations, mitigating collision rates. Objective-driven prompting further instructs Gemini to optimize for speed-up and coverage while minimizing constraint violations. Notably, no model weights are fine-tuned; improvements derive solely from prompt engineering.

Experimental Results and Quantitative Analysis

Extensive experiments are conducted on 100 long-form egocentric videos (EPIC-KITCHENS and HD-EPIC), leveraging ground-truth annotations solely for evaluation. Major findings include:

For $N=2$ , the structured Gemini prompt achieves up to 91.3% action coverage, 1.6× speed-up, and reduces spatial collision rate to 7.7% (from 25.9% for naive baselines).
Constraint violation reductions are substantial: object conflict rate down 45%, causality violation rate dropped 55%, and action coverage up 45% compared to the Gemini base prompt.
Scaling to $N=3$ yields further speed-up (2.3×), with acceptable increases in constraint violations.
Figure 2: Distribution of evaluation metrics (coverage, collision, etc.) as Gemini2.5 prompts are elaborated over 100 videos.

Naive and heuristic scheduling baselines (half-half split, HEFT) lead to unacceptable rates of collision and constraint violations, demonstrating the ill-posedness of simple temporal division. Open-weight VLMs (Qwen2.5-VL-72B) are unable to solve the problem reliably and fail on a significant subset of inputs.

Qualitative and Analytical Insights

Qualitative results reveal interpretable parallel task allocations: typically, one agent is assigned stationary tasks in spatial hotspots (e.g., cooking), while others traverse to perform preparatory, cleaning, or fetching activities. Analysis shows one agent often handles a majority of cooking or cleaning, and walking distance tends to concentrate on mobile roles—indicative of emergent collaborative behavior learned via prompt structure.

Figure 3: Qualitative task splits for $N=2$ / $N=3$ agents; distinct roles and division of labor emerge naturally.

Figure 4: Analysis of agent cooking time, prep-step role assignment, and walking distance splits, highlighting emergent collaborative behaviors.

Implications and Theoretical/Practical Significance

The formulated N-Body Problem pushes video understanding from passive temporal segmentation toward generative simulation of plausible multi-agent collaboration, grounded in explicit physical and logical constraints. The findings quantitatively demonstrate that advanced VLMs—when carefully prompted—can synthesize feasible parallel execution plans from single-user demonstrations, opening new horizons for generalizing from human activity traces.

Practically, this enables rapid prototyping for collaborative robotics, digital twins, and task planning agents in environments where annotated multi-agent data is scarce. The methodology is extensible to other structured environments (factory, healthcare, smart home) and robust to long-term, unscripted activities. Theoretically, the work suggests that prompt engineering, not just model training, is a critical bottleneck for unlocking spatial-temporal reasoning capabilities in multimodal foundation models.

Limitations and Future Directions

While the approach yields strong empirical improvements, some challenges remain: precise temporal alignment in multi-agent cooking (e.g., waiting for food), discovery of implicit procedural dependencies, robustness to higher values of $N$ , and intrinsic spatial reasoning in VLMs without spatial prompts. Future research could focus on integrating explicit action graphs, unsupervised discovery of causal structure, or architectural enhancements that directly model agent-object-environment interactions without textual scaffolding. Evaluation techniques for real multi-agent execution (e.g., robotic agents) will be necessary for translation to embodied AI applications.

Figure 5: Speed-up versus collision rate trade-off visualized over diverse methods and spatial prompt variants.

Conclusion

This paper delivers a rigorous framework and empirical validation for the N-Body Problem, showing that with appropriate prompting, current VLMs can infer feasible, efficient multi-agent parallel executions from single-person egocentric video. The combination of new evaluation metrics, constraint-driven prompt engineering, and multi-modal reasoning sets a robust foundation for future research in agentic AI, collaborative planning, and human-centric video understanding (2512.11393).

Markdown Report Issue