
AgentIF-OneDay Benchmark

Updated 4 February 2026
  • AgentIF-OneDay is a comprehensive benchmark that evaluates AI agents’ instruction-following and generalization skills in authentic, multi-step daily tasks.
  • It integrates methodologies like file-based reasoning, attachment parsing, and dynamic scheduling to simulate professional, personal, and study scenarios.
  • The evaluation employs instance-level scoring and metrics such as success rate and factuality to highlight challenges in state consistency and latent rule extraction.

AgentIF-OneDay is a comprehensive benchmark for evaluating the instruction-following and generalization capabilities of AI agents operating in realistic, day-to-day scenarios involving diverse user tasks, dynamic environments, and multimodal information sources. Developed as a response to the limitations of traditional benchmarks that emphasize artificially complex or specialized tasks, AgentIF-OneDay uniquely focuses on the capacity of general-purpose agents to execute authentic workflows found in daily professional, personal, and study contexts. The benchmark integrates file-based reasoning, attachment parsing, iterative refinement, and dynamic scheduling, providing an incisive testbed for both LLM–driven and reinforcement learning (RL)–augmented agent frameworks (Chen et al., 28 Jan 2026, Fu et al., 13 Jan 2026).

1. Benchmark Motivation and Scope

AgentIF-OneDay is motivated by the observed disconnect between advanced agent performance on synthetic or high-difficulty tasks (e.g., code synthesis, planning, research) and real-world user needs, which often require handling unstructured files, long instructions, and interactive workflows. The benchmark seeks to address two principal gaps:

  • Evaluation of agentic workflows that mirror daily life, work, and learning—including file editing, spreadsheet analysis, and iterative content updates—rather than artificially complex challenge tasks.
  • Measurement of agent robustness in environments characterized by partial observability, dynamic task arrivals, and ambiguous or latent user constraints (Chen et al., 28 Jan 2026, Fu et al., 13 Jan 2026).

AgentIF-OneDay comprises 104 heterogeneous tasks with 767 scoring points. Task domains span professional work (59.6%), study/research (23.1%), and personal life (17.3%), and cover a range of attachment types (.pdf, .png, .xlsx, .csv, .html, .py, etc.), ensuring robust test coverage across multimodal inputs. Attachment requirements range from zero (42 tasks) to as many as ten files. Tasks are distributed across three pragmatic categories reflecting realistic interaction modes.

2. Task Taxonomy and Design Principles

The benchmark encodes three key, user-centric task categories:

  1. Open Workflow Execution: Tasks with explicit, multi-step natural-language instructions (typically exceeding five steps) intended to test agent memory, long-context reasoning, and adherence to explicit workflow constraints. Exemplars require managing dependencies between subtasks, verifying external information, and executing "verify-then-plan" chains.
  2. Latent Instruction Inference: Tasks that provide critical information only in attachments (e.g., PDFs, spreadsheets), requiring the agent to infer rules and constraints. For instance, a pricing optimization task asks the agent to parse a carrier plan PDF, deduce cost formulas, and optimize recommendations under implicit constraints.
  3. Iterative Refinement: Tasks modeling “in-the-loop” collaboration, in which the agent incrementally modifies or extends prior work products (e.g., repositioning tables in an SVG floor plan based on sequential constraint updates reflected in related spreadsheets).

The design reflects several principles:

  • Task Realism: All tasks are constructed from authentic user needs, avoiding artificial or adversarial complexity.
  • Attachment Diversity: Robust evaluation of multimodal perception and reasoning, including vision (e.g., .png floorplans), structured data (e.g., .xlsx, .csv), and typical work documents (.pdf, .ppt).
  • Long-Horizon Consistency: Many tasks require agents to maintain state and context across multiple inference rounds or iterative user feedback.

| Task Category | % of Tasks | Attachment Types |
|---|---|---|
| Open Workflow Execution | 53.8% | .pdf, .xlsx, .html, text, none |
| Latent Instruction Inference | 25.0% | .pdf, .ppt, .xlsx, .csv |
| Iterative Refinement | 21.2% | .svg, .xlsx, combination (visual + tabular files) |

3. Evaluation Pipeline and Metrics

Assessment in AgentIF-OneDay is instance-level and rubric-based, using binary criteria to reflect real-world successful or failed outcomes for each scoring point. Rubrics are partitioned into:

  • Bonus Criteria: Items that reward key agentic behaviors or correct outcomes.
  • Penalty Criteria: Items that flag critical errors (e.g., misunderstandings, hallucinations, file corruption).

For task $i$, let $S^+_i$ denote the satisfied bonus criteria, $S^-_i$ the triggered penalties, and $S^{\max}_i$ the total achievable bonus. The normalized per-task score is:

$$s_i = \frac{\max(0,\, S^+_i - S^-_i)}{S^{\max}_i}$$

The final benchmark score averages $s_i$ across all $N = 104$ tasks:

$$\mathrm{Score_{final}} = \frac{1}{N} \sum_{i=1}^{N} s_i$$
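As a minimal sketch, the rubric-based scoring above can be computed as follows (function names are illustrative, not taken from the benchmark's code):

```python
# Sketch of AgentIF-OneDay's rubric scoring (illustrative names, not official code).

def task_score(bonus_hit: int, penalties_hit: int, bonus_max: int) -> float:
    """Normalized per-task score: s_i = max(0, S+ - S-) / S_max."""
    return max(0, bonus_hit - penalties_hit) / bonus_max

def benchmark_score(tasks: list[tuple[int, int, int]]) -> float:
    """Final score: mean of s_i over all tasks."""
    scores = [task_score(b, p, m) for (b, p, m) in tasks]
    return sum(scores) / len(scores)

# Example: two tasks as (satisfied bonuses, triggered penalties, max bonuses).
tasks = [(6, 1, 8), (3, 0, 4)]
print(round(benchmark_score(tasks), 4))  # (5/8 + 3/4) / 2 = 0.6875
```

The `max(0, …)` clamp means penalties can erase a task's credit but never push it negative, so a single badly failed task cannot drag down unrelated tasks in the final average.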

To scale with future agent developments, AgentIF-OneDay employs an LLM-as-judge framework. Gemini-3-Pro-preview is selected for this role due to its strong multimodal reasoning abilities; it applies the rubrics and, where necessary, uses web search or HTML rendering to ground factual checks. Agreement between Gemini-3-Pro and human annotators on a held-out sample is 80.1%, outperforming earlier judges (Gemini-2.5-Pro, 73.9%; GPT-5.1, 63.8%). Sub-metrics include Instruction-Following, Factuality, Logic/Functionality, and Negative-Constraint Handling. Attachment-conditioned evaluation is explicitly tracked (Chen et al., 28 Jan 2026).
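Since each scoring point is a binary criterion, judge-human agreement reduces to simple percent agreement over labels. A minimal sketch (the label data here are made up; the benchmark reports 80.1% on its held-out sample):

```python
# Illustrative judge-human percent agreement on binary rubric criteria.
# The label sequences below are fabricated for demonstration.

def percent_agreement(judge: list[int], human: list[int]) -> float:
    """Fraction of scoring points where the LLM judge and the human
    annotator assign the same binary label."""
    assert len(judge) == len(human)
    matches = sum(j == h for j, h in zip(judge, human))
    return matches / len(judge)

judge = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]
human = [1, 0, 0, 1, 0, 1, 1, 1, 1, 1]
print(percent_agreement(judge, human))  # 0.8
```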

4. Dynamic Workplace Simulation and Exploration

A complementary perspective is provided by EvoEnv, a dynamic, partially observable evaluation environment that instantiates a “first-day” workplace for agents (Fu et al., 13 Jan 2026). Major elements include:

  • Streaming Task MDP: Tasks arrive as a Poisson process (rate $\lambda$), each with a sampled priority $p_i$ and deadline $d_i = t_i + \Delta_i$. The agent's scheduler $\pi$ maps the current belief state and active task queue to the next action, under both a resource/time budget and precedence constraints.
  • Partial Observability and Active Exploration: Agents observe only partial state; critical clues ($K$) for meta-tasks must be actively discovered (e.g., by navigating file systems or querying external entities). Information gain is formalized as:

$$IG(a \mid b_t) = H[b_t] - \mathbb{E}_{o \sim O(\cdot \mid b_t, a)}\big[H[b_{t+1}]\big]$$

where $b_t$ is the current belief state and $a$ is the prospective action.

  • Exploration-Exploitation Tradeoff: The agent optimally balances information-seeking (to mitigate hallucination) and downstream utility, with explicit penalization for high-entropy outputs.
  • Online Learning: Experience at each interaction checkpoint is distilled into updated policy parameters $\theta$ via teacher-student or RL-style objectives.
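For a discrete belief over candidate latent hypotheses, the information-gain criterion above can be sketched directly from the definition (the belief and observation model here are illustrative, not EvoEnv's actual ones):

```python
import math

# Sketch: expected information gain of an action over a discrete belief state.
# IG(a|b_t) = H[b_t] - E_{o ~ O(.|b_t,a)} [ H[b_{t+1}] ]

def entropy(belief: dict) -> float:
    """Shannon entropy H[b] in bits over a discrete belief."""
    return -sum(p * math.log2(p) for p in belief.values() if p > 0)

def posterior(belief: dict, likelihood: dict) -> dict:
    """Bayes update: b_{t+1}(s) proportional to P(o|s) * b_t(s)."""
    unnorm = {s: likelihood[s] * p for s, p in belief.items()}
    z = sum(unnorm.values())
    return {s: v / z for s, v in unnorm.items()}

def info_gain(belief: dict, obs_model: dict) -> float:
    """obs_model maps each observation o to its likelihood P(o|s) per state s."""
    expected_h = 0.0
    for o, lik in obs_model.items():
        p_o = sum(lik[s] * p for s, p in belief.items())  # marginal P(o)
        if p_o > 0:
            expected_h += p_o * entropy(posterior(belief, lik))
    return entropy(belief) - expected_h

# Two equally likely latent rules; a perfectly informative probe action.
belief = {"ruleA": 0.5, "ruleB": 0.5}
probe = {"obsA": {"ruleA": 1.0, "ruleB": 0.0},
         "obsB": {"ruleA": 0.0, "ruleB": 1.0}}
print(info_gain(belief, probe))  # prints 1.0: the probe fully resolves the belief
```

An action whose observations are uninformative (identical likelihoods across states) yields zero information gain, which is exactly what the exploration-exploitation tradeoff in the next bullet penalizes an agent for choosing when its belief is still high-entropy.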

Key performance metrics include:

  • Success Rate (SR): Fraction of fully completed tasks.
  • Checkpoint Score (CS): Mean fraction of completed checkpoints per task.
  • Average weighted completion rate (WCR) and flow time, as well as regret w.r.t. an optimal clairvoyant policy.
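A minimal sketch of the first two metrics, assuming per-task completion logs (the log fields `checkpoints_done` and `checkpoints_total` are hypothetical, not EvoEnv's schema):

```python
# Sketch: computing Success Rate (SR) and Checkpoint Score (CS) from task logs.
# The log format is hypothetical.

def success_rate(logs: list[dict]) -> float:
    """SR: fraction of tasks whose checkpoints were all completed."""
    done = sum(1 for t in logs if t["checkpoints_done"] == t["checkpoints_total"])
    return done / len(logs)

def checkpoint_score(logs: list[dict]) -> float:
    """CS: mean fraction of completed checkpoints per task."""
    fracs = [t["checkpoints_done"] / t["checkpoints_total"] for t in logs]
    return sum(fracs) / len(fracs)

logs = [
    {"checkpoints_done": 4, "checkpoints_total": 4},  # fully solved
    {"checkpoints_done": 2, "checkpoints_total": 4},  # partial credit
    {"checkpoints_done": 0, "checkpoints_total": 2},  # failed
]
print(round(success_rate(logs), 3), checkpoint_score(logs))  # 0.333 0.5
```

CS always upper-bounds SR, which is why the reported gap (SR=0.35 vs. CS=0.63 below) indicates agents frequently make partial progress but fail to finish tasks end to end.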

5. Experimental Findings and Analysis

Direct benchmarking on AgentIF-OneDay yields several indicative findings (Chen et al., 28 Jan 2026, Fu et al., 13 Jan 2026):

  1. Overall System Performance: Among four principal agent systems—Manus 1.5, Genspark, ChatGPT-Agent (Pro), Minimax-Agent (Pro)—normalized final scores cluster tightly (Manus 0.645, Genspark 0.635, ChatGPT-Agent 0.626, Minimax 0.562). API-based agents (Genspark, Manus) and RL-finetuned agents (ChatGPT-Agent, Minimax) achieved comparable performance.
  2. Instruction-Following and Attachment Handling: Highest Instruction-Following accuracy was 0.766 (Genspark) and strongest Factuality was 0.731 (Manus). Robustness on tasks requiring attachment-based reasoning remains a fundamental differentiator, with Genspark leading at 0.691 attachment-conditioned score.
  3. Category and Domain Specialization: Performance varies by interaction type. Minimax bests others on iterative refinement (0.717); Genspark leads on latent instruction extraction (0.719), and Manus on open workflow execution (0.661). Domain rankings vary similarly, with distinct systems excelling at work, life, or study tasks.
  4. Workflow and Temporal Dynamics: In EvoEnv, the top agent achieves SR=0.35 and CS=0.63. Performance degrades as task concurrency increases (from 2 to 6 simultaneous tasks), with harder tasks seeing less severe performance dips (SR drop of 3–8%) compared to easy tasks (20–50%). Continual online learning provides marginal gains on hard tasks but can be detrimental on easy scenarios, suggesting limitations in agent policy adaptation.
  5. Exploration Limitations: Human guidance experiments indicate a sharp upper bound: checkpoint score can surge from ∼0.24 to ∼0.83 on hard tasks when human hints are provided, highlighting exploration and execution deficiencies in existing agents.

Latency benchmarks indicate that best speed-quality tradeoffs are obtained by Genspark (484 s/task) and Manus (500 s/task), while more deliberative strategies (Minimax, 1,416 s/task) incur substantial delays.

6. Challenges, Limitations, and Research Recommendations

AgentIF-OneDay and associated dynamic environments surface several unsolved challenges:

  • Latent Rule Inference: AI agents often fail to mine implicit rules embedded in attachments that the task does not explicitly reference, underlining the need for stronger context-mining and multimodal reasoning modules.
  • State Consistency and Memory: Long-horizon iterative tasks require persistent, revision-aware memory; current approaches lose state across multiple refinement rounds.
  • Commoditization of Core Agentic Skills: Baseline capabilities—long-context understanding, file parsing, tool invocation—are now broadly comparable across modern LLM products; differentiation is shifting to product-level design, domain specialization, and integration.
  • Evaluation Alignment and Scalability: While LLM-as-judge is now feasible for large-scale evaluation, absolute alignment with human-grade judgments remains imperfect (highest observed agreement 80.1%).

Research priorities emerging from these findings include:

  • Investment in context-mining and latent rule-extraction algorithms, especially for attachment-based tasks.
  • Enhanced state-management for iterative and persistent workflows.
  • Tighter human-in-the-loop safeguards to mitigate hallucination and preserve factual integrity in ambiguous or partially specified scenarios.
  • Extension to multi-day (“OneWeek”) and calendar-integrated benchmarks for long-term planning and feedback cycles.

7. Impact and Future Directions

AgentIF-OneDay establishes a rigorously annotated, multimodal, and user-aligned benchmark for the next generation of general-purpose AI agents, providing both diagnostic breadth and depth across agentic capabilities essential for daily-life integration. Its synthetic-plus-human annotation pipeline enables scalable generation of new evaluation items and direct alignment with product-level user experience concerns (Chen et al., 28 Jan 2026, Fu et al., 13 Jan 2026).

A plausible implication is that as agentic competencies become commoditized, future competitive differentiation will arise through advanced multimodal integration, persistent and context-rich memory architectures, continual learning robustness, and product-user alignment. Scaling AgentIF-OneDay to encompass week-long, cross-task, and cross-user benchmarks represents a logical next step for both practical and research communities engaged in agent development and evaluation.
