
Proactive Agent Research Environment: Simulating Active Users to Evaluate Proactive Assistants

Published 1 Apr 2026 in cs.AI, cs.LG, and cs.MA | (2604.00842v1)

Abstract: Proactive agents that anticipate user needs and autonomously execute tasks hold great promise as digital assistants, yet the lack of realistic user simulation frameworks hinders their development. Existing approaches model apps as flat tool-calling APIs, failing to capture the stateful and sequential nature of user interaction in digital environments and making realistic user simulation infeasible. We introduce Proactive Agent Research Environment (Pare), a framework for building and evaluating proactive agents in digital environments. Pare models applications as finite state machines with stateful navigation and state-dependent action space for the user simulator, enabling active user simulation. Building on this foundation, we present Pare-Bench, a benchmark of 143 diverse tasks spanning communication, productivity, scheduling, and lifestyle apps, designed to test context observation, goal inference, intervention timing, and multi-app orchestration.

Summary

  • The paper introduces PARE, a formal evaluation framework that assesses proactive assistants via state-driven user simulations in multi-app environments.
  • It models interactions as a Stackelberg POMDP, capturing realistic mobile UI constraints and enforcing controlled, user-approved interventions.
  • Evaluation on 143 scenarios shows that large LLMs achieve around 40% success, highlighting persistent challenges in proactive digital assistant deployment.

Proactive Agent Research Environment: A Formal Evaluation Framework for Proactive Assistants

Introduction

The paper "Proactive Agent Research Environment: Simulating Active Users to Evaluate Proactive Assistants" (2604.00842) addresses a key problem in evaluating LLM-powered digital assistants: the inadequacy of current evaluation paradigms for proactive assistants. Most existing research platforms support only reactive agents responding to explicit human requests, relying on passively simulated users. To overcome this, the paper introduces the Proactive Agent Research Environment (PARE), a simulation framework that enables robust, ecologically valid evaluation of proactive agents via LLM-based user simulators interacting in a state- and event-driven, multi-app digital environment.

By formalizing user and assistant interaction as an asymmetric Stackelberg POMDP, PARE captures essential properties of mobile user interfaces: restricted, screen-by-screen user navigation (modeled as per-app FSMs), as opposed to unrestricted backend API access for the assistant. This foundation allows researchers to study complete agent–user interaction loops, including goal inference, proposal, verification, and execution, while modeling environment events, system notifications, and tool failures. The framework includes a large-scale, procedurally generated benchmark (PARE-Bench) with 143 multi-app scenarios, supporting systematic evaluation of various LLM-based proactive assistant architectures.

Figure 1: The PARE framework comprises an event-based environment with FSM-modeled app transitions; users navigate stateful interfaces while assistants interact via flat backend APIs.

The paper thoroughly situates PARE in the context of recent agent benchmarks and user simulators. Frameworks such as Generative Agents [park2023generativeagentsinteractivesimulacra], τ-bench [yao2024taubenchbenchmarktoolagentuserinteraction], and ToolSandbox [luToolSandboxStatefulConversational2025] have demonstrated that LLMs can simulate plausible human actions in interactive environments. However, these works have pivotal limitations: either restricting users to passive roles, omitting grounded state-altering action execution, or failing to capture realistic navigation constraints and event-driven flows seen in mobile devices.

Furthermore, previous work on LLM-based proactive agents (e.g., ProactiveAgent [luProactiveAgentShifting2024], ContextAgent [yangContextAgentContextAwareProactive2025]) evaluates models only via offline datasets or through abstract textual simulation, lacking state-aware user interaction and unable to measure real goal completion outcomes. PARE addresses these gaps by enforcing environment dynamics that mirror actual software constraints, supporting both agent and user actions that are tightly coupled to the actual app states and event streams.

Framework Architecture

PARE extends the Agent Research Environment (ARE) [froger2025arescalingagentenvironments] by incorporating active user simulators as first-class agents. A pivotal feature is the asymmetric action and observation space: the user is forced to operate via explicit FSM transitions that correspond to real application screens and allow only limited action sets at each step, while the assistant observes all user actions and notifications and can execute flat APIs for any app (Figure 2).

Figure 2: Comparison between non-FSM agents with direct API access and FSM-based user simulation requiring realistic navigation and screen transitions.

This design is instantiated across a portfolio of mobile apps (e.g., Email, Messaging, Calendar, Shopping), each with detailed FSMs specifying screen-level state and legal user transitions. For example, sending a message as a user requires navigating through several UI states (open app → search → compose), while the assistant can immediately issue the API call.

The user and assistant communicate and negotiate tasks through a shared message interface (PASAgentUserInterface), which enforces a strict protocol: the assistant may execute state-altering actions only with explicit user acceptance, respecting user autonomy and enabling controlled evaluation of intervention quality.
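To make the FSM abstraction concrete, the following sketch shows a state-dependent action space for the message-sending flow described above. The class, states, and transitions are invented for illustration and are not taken from PARE's implementation:

```python
# Illustrative per-app finite state machine: the user simulator only sees
# the actions legal on the current screen, as in PARE's user-side model.
class AppFSM:
    def __init__(self, name, transitions, initial):
        self.name = name
        self.transitions = transitions  # {state: {action: next_state}}
        self.state = initial

    def available_actions(self):
        # State-dependent action set exposed to the user simulator.
        return sorted(self.transitions.get(self.state, {}))

    def step(self, action):
        nxt = self.transitions.get(self.state, {}).get(action)
        if nxt is None:
            raise ValueError(f"{action!r} not available in state {self.state!r}")
        self.state = nxt
        return nxt

# Sending a message requires screen-by-screen navigation:
messages = AppFSM(
    "Messages",
    transitions={
        "home": {"open_app": "inbox"},
        "inbox": {"search_contact": "conversation"},
        "conversation": {"type_message": "draft"},
        "draft": {"send": "conversation"},
    },
    initial="home",
)
for action in ["open_app", "search_contact", "type_message", "send"]:
    messages.step(action)
```

An assistant with flat backend API access would instead issue something like `send_message(contact, text)` in a single step, which is exactly the asymmetry the framework is built to expose.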

Benchmark and Scenario Generation

To enable systematic evaluation at scale, the authors introduce PARE-Bench, a benchmark of 143 scenarios spanning communication, scheduling, productivity, and lifestyle tasks. Each scenario is auto-generated in four pipeline stages: scenario description, application state/data seeding, event flow specification, and validation. A scenario is only admitted after passing execution-based verification, which ensures it is both syntactically and semantically fit (Figure 3).

Figure 3: The LLM-based scenario generation pipeline for PARE: description, state seeding, event flow construction, and validation with oracle-based feedback and deduplication.

Each scenario specifies initial and terminal states, a timed sequence of exogenous events (e.g., new emails, notifications), and an oracle for automated verification of goal completion. PARE supports the injection of tool failures and notification noise for robustness analysis, allowing for controlled ablations and stress testing of agents.
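A scenario of this shape can be sketched as a small record type. The field names, event schema, and toy oracle below are hypothetical, not the benchmark's actual format:

```python
from dataclasses import dataclass

# Hypothetical PARE-style scenario record: seeded initial state, a timed
# stream of exogenous events, and an oracle checking goal completion.
@dataclass
class Event:
    time: int      # turn at which the event fires
    app: str
    payload: dict

@dataclass
class Scenario:
    description: str
    initial_state: dict
    events: list

    def oracle(self, final_state):
        # Toy goal condition: "soap" ends up on the shopping list.
        return "soap" in final_state.get("shopping_list", [])

scenario = Scenario(
    description="Add soap when the roommate reports it's out",
    initial_state={"shopping_list": ["milk"]},
    events=[Event(time=2, app="Messages",
                  payload={"from": "roommate", "text": "we're out of soap"})],
)
```

Because the oracle inspects only the terminal state, goal completion can be verified automatically regardless of which path (user navigation or assistant API calls) produced it.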

User Simulator and Proactive Assistant Agents

The user simulator is a strict, context-sensitive ReAct-based agent, receiving the current app, state, and available actions each turn. It must navigate FSM-defined screens, process truncated notifications (mirroring mobile OS behaviors), and respond to assistant proposals with high selectivity, behavior enforced through rigorous prompting.

The proactive assistant is implemented as a two-phase observer–executor: it alternately monitors user actions/environment events, proposes concrete plans when sufficiently confident, and, upon user acceptance, transitions to an executor subagent with full API access for multi-app task completion. Critically, no action is taken without user approval. All decision cycles follow a thought–action–observation pattern.

Figure 4: Sequence diagram showing a proactive scenario: assistant observes event → infers constraint → proposes plan to user → executes only after user confirmation.
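This observe-then-execute protocol reduces to a single consent-gated turn function. The sketch below uses assumed names and stub components, not the paper's API:

```python
# Consent-gated observer-executor turn: observe, propose only when
# confident, and execute only after explicit user acceptance.
def assistant_turn(observations, policy, user, executor, threshold=0.8):
    confidence, plan = policy.propose(observations)
    if confidence < threshold:
        return "observed"            # keep monitoring, no intervention
    if not user.accepts(plan):       # consent gate: no silent execution
        return "rejected"
    executor.run(plan)               # executor has flat API access
    return "executed"

class StubPolicy:
    """Confident only once a concrete need appears in the observations."""
    def propose(self, obs):
        need = any("out of soap" in o for o in obs)
        return (0.9 if need else 0.1), {"tool": "shopping.add", "item": "soap"}

class StubUser:
    def accepts(self, plan):
        return plan["tool"] == "shopping.add"

class StubExecutor:
    def __init__(self):
        self.executed = []
    def run(self, plan):
        self.executed.append(plan)
```

Separating the confidence check from execution is what lets the benchmark score proposal timing and acceptance independently of task completion.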

Experimental Evaluation

Experimental Configuration

Seven LLMs are evaluated: four frontier models (Claude 4.5 Sonnet, GPT-5, Gemini 3 Pro/Flash) and three compact open-weight models (Qwen 3 4B, Llama 3.2 3B, Gemma 3 4B). The main metrics are:

  • Task Success Rate (R_Succeed): Fraction of user goals completed.
  • Proposal Acceptance Rate (R_Accept): Fraction of assistant proposals accepted.
  • Proposal Rate: Proposals per turn (lower is better).
  • Read Actions: Number of information-gathering queries per scenario.
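Given per-episode logs, these four metrics reduce to simple ratios. The log schema below is illustrative, not the benchmark's actual format:

```python
# Aggregate the four reported metrics over a batch of episode logs.
# Each episode records turns taken, whether the oracle passed, the number
# of read (information-gathering) actions, and the proposals it made.
def summarize(episodes):
    n = len(episodes)
    turns = sum(e["turns"] for e in episodes)
    proposals = sum(len(e["proposals"]) for e in episodes)
    accepted = sum(sum(p["accepted"] for p in e["proposals"]) for e in episodes)
    return {
        "success_rate": sum(e["goal_met"] for e in episodes) / n,
        "acceptance_rate": accepted / max(proposals, 1),
        "proposal_rate": proposals / turns,   # lower is better
        "read_actions": sum(e["reads"] for e in episodes) / n,
    }
```

Note that acceptance rate and proposal rate pull in opposite directions: an agent can inflate acceptance by proposing rarely, which is why the benchmark reports both.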

Frontier models achieve markedly higher success (Gemini 3 Flash: 42.1%, Claude: 42.0%)—but still only complete about 40% of proactive tasks, indicating substantial headroom for future work. Compact models underperform and exhibit elevated inconsistency and passivity.

Robustness to Failures and Noise

The framework supports stress tests under tool failures and event noise:

Figure 5: Success, proposal, and acceptance rates under increasing tool failure probability; larger models are robust, while small models degrade rapidly.

Figure 6: Evaluation with increasing notification noise; top models maintain proposal quality, whereas smaller models’ success drops sharply.

Across these perturbations, only Claude maintains stable success under 40% tool-failure probability and high noise; smaller models both under- and over-react (missed opportunities as well as high false-positive proposal rates). Model ablations show that quality differences between large models manifest in proposal selectivity and precision, not merely in generation fluency.
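Sweeps like those in Figures 5 and 6 need only a thin wrapper around each tool. The sketch below is an assumed implementation of such failure injection, not PARE's:

```python
import random

# Failure-injection wrapper: each wrapped tool call fails with probability
# p, so the agent under test must detect the error and retry or replan.
def with_failures(tool, p, rng=None):
    rng = rng or random.Random(0)   # seeded for reproducible sweeps
    def wrapped(*args, **kwargs):
        if rng.random() < p:
            raise RuntimeError(f"{tool.__name__} failed (injected)")
        return tool(*args, **kwargs)
    return wrapped

def send_message(contact, text):
    return f"sent to {contact}"

reliable = with_failures(send_message, p=0.0)
flaky = with_failures(send_message, p=1.0)
```

Sweeping p from 0 to 0.4 and re-running the benchmark at each setting reproduces the shape of the robustness curves, with notification noise injected analogously on the event stream.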

Theoretical Model

PARE formalizes user–agent joint action as a Stackelberg POMDP: the user (leader) acts first, modifying the observable global state, followed by the assistant (follower), which proposes plans or gathers information with full observability of user actions, but not vice versa. Success is thus a function of user/assistant policy composition, proposal timing, and user selectivity, a configuration captured in PARE and missing from prior paradigms.
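The turn structure can be written out as follows; the notation is an assumed rendering of the setup described above, not reproduced from the paper:

```latex
% Leader (user) acts on a truncated, FSM-local observation; the follower
% (assistant) conditions on the full history, including the user's action.
\begin{aligned}
a^{u}_t &\sim \pi^{u}\!\left(\cdot \mid o^{u}_t\right), \qquad o^{u}_t = O^{u}(s_t) \\
a^{a}_t &\sim \pi^{a}\!\left(\cdot \mid h_t\right), \qquad h_t = \left(o^{a}_{1:t},\, a^{u}_{1:t}\right) \\
s_{t+1} &\sim T\!\left(s_t,\, a^{u}_t,\, a^{a}_t,\, e_t\right)
\end{aligned}
```

Here e_t denotes the exogenous events (emails, notifications) injected by the scenario, and success is the scenario oracle evaluated at the terminal state, G(s_T) ∈ {0, 1}.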

Key Findings

  1. Action Asymmetry: Modeling the user as a state-constrained actor (FSM per app, truncated observation) while granting the assistant direct access to backend APIs enables realistic evaluation of intervention and user experience, capturing distinctions missed in pure API call–based environments.
  2. Intervention Quality and Timing: Proposal timing is critical. Large models succeed via higher acceptance and fewer, more precise, context-aware interventions. Weaker models either intervene prematurely or fail to act at all.
  3. User–Model Interplay: Systematic user simulation ablation shows that PARE preserves ranking stability across varying user model personalities (strict, permissive), but absolute success rates are sensitive to user idiosyncrasies, emphasizing the need for careful protocol specification in benchmarking.
  4. Robustness: Only state-of-the-art LLMs maintain performance under significant simulated error/noise; open compact models (Qwen 4B, Gemma 3/4B) frequently fail to distinguish relevant signals, confirming an open research challenge in small-model proactive assistant deployment.
  5. Observe–Execute Decomposition: Decoupling proactive task proposal (observation/decision) from execution is essential for effective, privacy-preserving assistant deployment, allowing on-device models to perform continuous monitoring with cloud execution only upon explicit user consent.

Implications and Future Directions

PARE marks a significant methodological advance for agent-oriented evaluation. Practically, it enables the development of privacy-preserving, autonomous, yet user-aligned digital assistants, supporting edge deployment and systematic safety auditing. The API-level FSM abstraction maintains a natural privacy boundary (agents "observe" only actions, not full screen content).

Theoretically, its POMDP formulation opens research directions in RL-based policy optimization for Stackelberg settings, goal inference from state/action trajectories, realistic modeling of mixed-initiative human–agent workflows, and systematic analysis of the interplay between agent intervention, user trust, and long-term adoption.

A current limitation is the absence of direct visual grounding; extending the framework toward UI screenshot–driven multimodal interaction (e.g., [zhang2023appagentmultimodalagentssmartphone, hong2024cogagentvisuallanguagemodel]) would bridge the gap to true end-user deployment. Further, deeper modeling of user diversity and theory-of-mind (e.g., [zhou2026tomsweusermentalmodeling]) would increase generalizability.

Conclusion

This work sets a rigorous new bar for proactive assistant evaluation by introducing an environment and benchmark that operationalize realistic user and agent capabilities, contextually rich scenarios, robust execution and evaluation pipelines, and extensive diagnostic metrics. The persistent success gap between large and small models, and the finding that execution capability (not intent inference) is the principal bottleneck for smaller models, indicates substantial open challenges and opportunities in agent architecture design, user–agent theory, and scalable, privacy-aware deployment of next-generation digital assistants.


Explain it Like I'm 14

What is this paper about?

This paper is about making and testing “proactive” digital assistants—helpers that don’t wait for you to ask, but watch what you’re doing and offer to help at the right time. The authors build a realistic simulator called PARE (Proactive Agent Research Environment) and a big set of challenges called PARE‑Bench to see how well different AI assistants can notice what a user needs, suggest helpful actions, and carry them out across multiple apps (like Messages, Email, Calendar, and To‑Do lists).

Imagine you’re writing a shopping list and your roommate texts that you’re out of soap. A proactive assistant would notice that and add “soap” to your list without you typing a request. This paper creates a safe, realistic “practice world” to teach and test assistants like that.

What were the main goals?

The paper focuses on a few clear questions:

  • How can we simulate real users in a realistic way so proactive assistants can be tested fairly?
  • Can an assistant:
    • Watch what a user is doing and understand their goal?
    • Decide the right moment to offer help (not too often, not too late)?
    • Complete tasks that involve several different apps?
  • Which AI models are best at this, and how consistent and robust are they?

How did the researchers do it?

They built a virtual phone environment that imitates how people actually use apps and how assistants would work behind the scenes. Here are the key ideas, explained simply:

  • Apps as “state machines”: Think of each app (like Messages) as a series of screens and buttons. You can only do certain things depending on which screen you’re on—just like real life. This is called a finite state machine (FSM). For example, to send a text, a user must: 1) Open the Messages app, 2) Find a conversation, 3) Open it, 4) Type a message, 5) Send it.

A human (or simulated user) must go step by step. The assistant, however, can use “backstage” tools to jump straight to the action if the user approves.

  • Asymmetric abilities (realistic difference between user and assistant):
    • The user is “on the phone,” tapping around screens and seeing only what a user would see.
    • The assistant has a flat list of tools across all apps (a backstage pass) and can do things directly—after asking permission.
  • Turn-based interaction (like taking turns in a game): 1) The user acts first (taps around). 2) The assistant watches, gathers info, and either waits or proposes a plan (“I can add soap to your shopping list. OK?”). 3) The user accepts or rejects. 4) If accepted, the assistant executes the plan; if not, it goes back to watching.
  • Two-part assistant design:
    • Observe mode: the assistant only reads and thinks about what’s happening, deciding whether to propose help.
    • Execute mode: after the user says “yes,” the assistant acts quickly and uses any needed tools across apps.
  • Scenarios with events: The environment includes realistic events like incoming emails, messages, and reminders. This tests if assistants can spot what matters and ignore distractions.
  • Benchmark and evaluation: They made PARE‑Bench with 143 tasks across communication, scheduling, productivity, and lifestyle apps. They measured:
    • Success rate: Did the assistant’s help actually complete the user’s goal?
    • Acceptance rate: How often did users accept the assistant’s proposals?
    • Proposal rate: How often did the assistant propose something (too many proposals can be annoying)?
    • Info gathering: How much did the assistant read/inspect before acting?

What did they find?

  • Top models are good—but far from perfect: The best large models succeeded only about 42% of the time on average across tasks. That means there’s a lot of room to improve proactive assistants.
  • Quality beats quantity of proposals: The best-performing assistant (Claude 4.5 Sonnet) proposed help less often but got the highest acceptance rate. Another strong model (Gemini 3 Flash) did similarly well but tended to propose more often and got accepted slightly less. Proposing too often without enough context hurts usefulness.
  • Reading more helps: Better assistants tend to “look around” more—checking messages, calendars, and emails before proposing a plan. This careful observation leads to smarter, more acceptable suggestions.
  • Smaller models struggle and are inconsistent: Smaller open-source models were less successful and less reliable across repeated runs. Some had decent proposal acceptance but still failed at actually completing tasks, suggesting the execution step is hard for them.
  • Robustness to problems and noise varies:
    • Tool failures (like functions occasionally breaking) didn’t faze the best models much, but smaller models degraded more.
    • Noisy environments (lots of irrelevant notifications) also affected some models more than others. The best models stayed steadier.

Why does this matter?

  • Realistic training and testing: Proactive assistants should respect how people actually use phones—screen by screen—so they help at the right time and in the right way. PARE makes that possible by simulating both the user and the assistant realistically.
  • Better user experience and safety: The assistant must ask before doing things. This keeps humans in control and avoids unwanted changes. The framework encourages assistants that are helpful but not bossy.
  • Privacy-aware design: Because proactive assistants observe user activity, the authors highlight using on-device models and only exposing higher-level “actions” (like “user opened ‘Messages’”) instead of raw screen content.
  • A shared testbed for progress: PARE and PARE‑Bench give researchers and developers a common playground to compare ideas, improve assistants, and make them more reliable, respectful, and useful.

In short, this work builds a realistic “practice world” where proactive assistants can learn to be good helpers—spotting needs, asking permission, and getting tasks done across apps—while keeping the user in charge.

Knowledge Gaps

Below is a single, concise list of what remains missing, uncertain, or unexplored in the paper, framed to be actionable for future researchers.

  • Human-grounded evaluation is absent: results rely entirely on an LLM-based user simulator (GPT-5-mini), with no user studies to validate proposal acceptability, perceived helpfulness, interruption cost, or autonomy preservation in real use.
  • Validity of the simulated user’s acceptance/rejection behavior is untested: no calibration against human judgments, inter-rater reliability, or evidence that the simulator’s thresholds for “strict acceptance” align with users’ true preferences.
  • Ecological realism of notifications and event flows is not validated: Poisson noise injection and scripted events are not matched to empirical distributions of mobile notification timing, content, and user receptivity from real-world datasets.
  • Visual/GUI grounding is missing: the environment abstracts apps as FSMs and APIs without modeling visual layout, affordances, or perception errors that GUI agents face (e.g., misclicks, occlusions), limiting transfer to real mobile interfaces.
  • FSM coverage and fidelity are unclear: the completeness of app FSMs (edge cases, optional flows, permissions, multi-step confirmations, multi-threaded state changes) and how they scale to complex, modern apps is not evaluated.
  • Concurrency and asynchrony are under-modeled: the fixed Stackelberg turn-taking with a max of 10 turns does not capture overlapping actions, preemption, background processes, or continuous monitoring common in mobile usage.
  • Intervention timing is not directly measured: despite claiming to assess timing, no explicit metrics quantify whether proposals arrived too early/late, interrupted at poor moments, or respected user busy/idle states.
  • Cost-of-errors is not assessed: there is no quantitative treatment of false positives/negatives, error severity (e.g., sending wrong messages, scheduling mistakes), or recovery behaviors; acceptance rate alone cannot capture harm.
  • Safety and consent granularity are limited: the binary accept/reject model lacks micro-permissioning (e.g., partial plan approval), escalation, rollback, or safeguards for sensitive domains (finance, health, privacy).
  • Personalization and long-term adaptation are unaddressed: the agent does not learn user preferences, routines, or thresholds for proactivity over time, nor does the benchmark include longitudinal scenarios for personalization.
  • POMDP formalization lacks learning/optimization methods: the paper provides a Stackelberg POMDP framing but no concrete algorithms for policy learning, confidence calibration, reward shaping, or optimal proposal decision-making.
  • No comparison to reactive baselines: the paper does not quantify the benefit (or harm) of proactivity versus high-quality reactive agents on the same tasks, leaving the net value of proactivity unmeasured.
  • Observation asymmetry design is untested: while assistants see full serialized content and users see truncated notifications, the paper does not study how varying observation budgets (for either party) affect proposal quality and success.
  • Execution bottlenecks are not diagnosed in detail: smaller models underperform in execution, but the paper does not analyze failure modes (tool selection, argument filling, cross-app orchestration, error recovery) to guide targeted improvements.
  • Robustness dimensions are narrow: evaluation covers tool failure probability and noise density, but omits latency, network instability, permission prompts, API rate limits, partial tool unavailability, and inconsistent app states.
  • Scenario generation quality and bias are not audited: LLM-generated scenarios are “verified by authors,” but there is no formal diversity analysis, difficulty calibration, bias checks, or reproducibility assessment of the generation pipeline.
  • Benchmark coverage is insufficiently characterized: the paper does not report per-domain/task difficulty distributions, ablate multi-app versus single-app scenarios, or identify the subset of tasks driving most failures.
  • Proposal content quality is not measured beyond acceptance: clarity, specificity, justification, and user-understandable explanations are not scored, limiting insights into why users accept/reject proposals.
  • Limited analysis of iteration budgets: the choice of 5 observe and 10 execute iterations is treated as a parameter, but the paper does not study how iteration budgets impact proposal quality, execution reliability, or user experience.
  • Assistant action visibility to the user is unrealistic: users cannot observe assistant actions (only outcomes/messages), whereas real assistants often provide progress/status; the impact of transparency on trust and acceptance is unexplored.
  • On-device constraints are not modeled: although the paper argues for on-device deployment, it does not evaluate compute/memory/latency trade-offs, battery impact, or performance under realistic edge hardware limitations.
  • Privacy guarantees are not formalized: API-level abstraction is claimed as a privacy boundary, but there is no threat model, data retention policy, or empirical audit of what sensitive information remains exposed to the assistant.
  • Generalization to richer app ecosystems is unknown: the benchmark uses a fixed set of apps; how the framework handles OS-level features (deep links, intents), third-party app diversity, and dynamic app updates is not demonstrated.
  • No learning from user feedback: the assistant does not adapt its future proposal policies based on past accept/reject outcomes, leaving open how to improve calibration and reduce user annoyance over time.
  • Limited error handling and rollback: the system does not model plan interruption, partial failure recovery, undo/redo semantics, or user-requested corrections after execution, which are critical for safe proactive assistance.
  • Missing ablation of the Observe–Execute architecture: there is no comparison against unified agents or alternative decompositions (e.g., triage/planning/execution modules) to justify the chosen architecture.
  • User-model sensitivity analysis is incomplete: although referenced, the effect of different user simulators on reported metrics is not shown here; conclusively determining benchmark robustness to user-model choice remains open.
  • External validity to real mobile platforms is unverified: no deployment or case study demonstrates that gains in the simulated environment translate to effectiveness on Android/iOS with real users and apps.

Practical Applications

Below are actionable, real-world applications derived from the paper’s framework, methods, and benchmark. Each item notes target sectors, plausible tools/products/workflows that could emerge, and key assumptions or dependencies that affect feasibility.

Immediate Applications

These are deployable now or with modest integration effort.

  • Proactive-assistant CI/CD benchmarking and regression testing
    • Sector: Software, Mobile OS, Productivity Apps
    • Tools/products/workflows: Integrate the PARE framework and its 143-scenario benchmark into continuous integration pipelines; set acceptance-success-proposal-rate gates; nightly stress tests with noise and tool-failure sweeps; dashboards tracking Success@k and proposal quality.
    • Assumptions/dependencies: FSMs or wrappers for in-scope apps; flat APIs or realistic mocks; engineering time to model app states; compute budget for multi-run evaluation.
  • Proposal timing and “annoyance” calibration
    • Sector: HCI/UX, Assistant Teams
    • Tools/products/workflows: Tune Observe→Execute thresholds; optimize proposal-rate vs. acceptance-rate curves; run A/B tests by varying user-simulator strictness and notification verbosity; deploy guardrails to avoid over-proposing.
    • Assumptions/dependencies: Validity of LLM-based user acceptance as a proxy for real users; policy for maximum proposal frequency.
  • Multi-app orchestration validation for assistants
    • Sector: Productivity (email, calendar, tasks), Enterprise SaaS
    • Tools/products/workflows: End-to-end orchestration tests (e.g., calendar + email + notes) using the flat API executor; oracle-based verification for success conditions; regression suites for cross-app flows.
    • Assumptions/dependencies: Connectors to each app; consistent identifiers across app data (contacts, events, threads).
  • SDK for app teams to declare FSMs of their UIs
    • Sector: Mobile/desktop app developers
    • Tools/products/workflows: “FSM Authoring Studio” with schema, codegen, and validation; stubs that map UI screens → user-only actions → backend APIs; instrumented navigation logging.
    • Assumptions/dependencies: Developer time to author and maintain FSMs; UI churn management; alignment with platform accessibility trees.
  • On-device model feasibility testing and sizing
    • Sector: Edge AI, Mobile Platforms
    • Tools/products/workflows: Side-by-side runs of small open-weight models vs. frontier models on PARE; measure trade-offs in success and acceptance; caching/RAG experiments to boost small models.
    • Assumptions/dependencies: Device hardware constraints; latency/energy budgets; privacy policies mandating on-device inference.
  • Data generation for training propose-vs-execute policies
    • Sector: ML Ops, Model Training
    • Tools/products/workflows: Collect simulation trajectories with acceptance feedback; SFT/behavior cloning on Observe and Execute phases; RL with acceptance and success as reward signals.
    • Assumptions/dependencies: Rights to use generated data; sim-to-real transfer effectiveness; careful de-duplication and scenario diversity.
  • Robustness testing harness (tool failures and event noise)
    • Sector: Reliability/SRE for Assistants
    • Tools/products/workflows: “Noise/Failure Stress Harness” to quantify resilience; ensure graceful degradation and alternative plan search when tools fail; log-based root cause analysis.
    • Assumptions/dependencies: Controllable failure injection in the test env; clear fallback policies when APIs fail.
  • Privacy-by-design auditing at the API/action level
    • Sector: Compliance, Trust & Safety
    • Tools/products/workflows: Test that assistants rely on API-level action observation rather than screen scraping; produce auditable event logs for proposals, acceptances, executions.
    • Assumptions/dependencies: Mapping between policy requirements and logged events; alignment with platform permissions and data minimization principles.
  • Procurement and vendor benchmarking using common metrics
    • Sector: Enterprise IT, Public Sector
    • Tools/products/workflows: RFP-style bake-offs using PARE-Bench; standardized reporting (success rate, acceptance rate, proposal rate under noise/failure); leaderboards and SLAs tied to acceptance thresholds.
    • Assumptions/dependencies: Agreement on metrics and test sets; representative scenarios for target use cases.
  • HCI research and classroom sandbox
    • Sector: Academia, Education
    • Tools/products/workflows: Course labs on proactive agents and mixed-initiative design; studies on interruptibility, autonomy, and trust using configurable user-simulator strictness and event verbosity.
    • Assumptions/dependencies: Instructor adoption; access to compatible LLMs; IRB considerations for human studies if combined with user trials.
  • Accessibility-oriented pathway reduction analysis
    • Sector: Accessibility, Inclusive Design
    • Tools/products/workflows: Use FSM traces to quantify steps saved by proactive interventions; identify friction points and opportunities for just-in-time assistance.
    • Assumptions/dependencies: Appropriately designed scenarios/personas; domain expert review to ensure accessibility relevance.
  • Internal policy testing for consent-gated execution
    • Sector: Product Policy, Governance
    • Tools/products/workflows: Enforce Observe→Propose→User-accept→Execute workflow; verify consent logging and “no silent execution” rules; test escalation and rollback.
    • Assumptions/dependencies: System UX surfaces for user consent; audit log storage and review workflows.

Long-Term Applications

These require further research, scaling, standardization, or ecosystem development.

  • OS-level proactive assistants with consent-gated execution
    • Sector: Mobile/Desktop OS, Consumer Software
    • Tools/products/workflows: System-wide observers (read-only), cross-app flat APIs for execution, and native consent sheets; background monitoring with user-controlled schedules.
    • Assumptions/dependencies: Platform APIs that expose consistent, secure backends; sustained success/acceptance above target thresholds; regulatory buy-in.
  • Certification and standards for proactive agents
    • Sector: Policy/Regulation, Standards Bodies
    • Tools/products/workflows: PARE-like evaluation suites for certification; noise/failure stress criteria; reporting standards for proposal/acceptance/exec logs and human-in-control requirements.
    • Assumptions/dependencies: Multi-stakeholder consensus; third-party auditors; versioned public benchmarks.
  • Automatic FSM extraction from real apps via GUI grounding
    • Sector: Tooling, DevOps
    • Tools/products/workflows: Record-and-generalize user flows into FSMs from accessibility trees, visual UI grounding, and usage telemetry; continuous updates as UIs evolve.
    • Assumptions/dependencies: High-quality GUI-grounding models; platform hooks to record flows; privacy-preserving data collection.
  • Personalized user simulators calibrated to real behavior
    • Sector: ML/HCI, Personalization
    • Tools/products/workflows: Preference models for acceptance thresholds, interruptibility, and autonomy; on-device federated fine-tuning of user simulators; persona libraries for target demographics.
    • Assumptions/dependencies: Consent-based telemetry; differential privacy; robust generalization across personas.
  • Training robust small on-device proactive models
    • Sector: ML, Edge AI
    • Tools/products/workflows: Distill Observe and Execute skills from large models into compact ones; RL with acceptance/success rewards; retrieval-augmented small models specialized per app domain.
    • Assumptions/dependencies: Efficient distillation pipelines; device acceleration (NPUs); careful latency-utility trade-offs.
  • Cross-ecosystem IoT and energy management orchestration
    • Sector: IoT, Energy
    • Tools/products/workflows: FSMs for appliances and home hubs; proactive suggestions for demand response, off-peak scheduling, “just-in-time” automation with consent.
    • Assumptions/dependencies: Interoperable standards (e.g., Matter); robust security; user trust and override controls.
  • Healthcare appointment/medication and portal workflows
    • Sector: Healthcare
    • Tools/products/workflows: FSMs for patient portals (scheduling, refills, forms); consent-gated proactive reminders and prefill; end-to-end state tracking for adherence.
    • Assumptions/dependencies: HIPAA and regional compliance; provider partnerships; rigorous safety/efficacy validation.
  • Finance assistants for bills, disputes, and budgeting
    • Sector: Finance/Fintech
    • Tools/products/workflows: FSMs for banking portals; proactive bill pay with approval; fraud dispute flows; budget alerts and cross-account reconciliation.
    • Assumptions/dependencies: Compliance (KYC/AML, PSD2/Open Banking); robust identity and risk controls; clear liability frameworks.
  • Enterprise workflow copilots across SaaS suites
    • Sector: Enterprise Software
    • Tools/products/workflows: Orchestrate email, calendar, CRM, ticketing; proactive triage and scheduling; admin policies controlling proposal scopes and audit trails.
    • Assumptions/dependencies: API access across vendors; data governance; change management and IT buy-in.
  • Education copilots across LMS and calendars
    • Sector: EdTech
    • Tools/products/workflows: FSMs for LMS/assignment portals; proactive study plans; parent/teacher oversight with consent and logs.
    • Assumptions/dependencies: LMS integrations; privacy (FERPA, GDPR-K); user trust and transparency.
  • Human-robot collaboration and operator assistants
    • Sector: Robotics, Industrial
    • Tools/products/workflows: Map machine states to FSMs; proactive step suggestions and checklists; consent-gated execution of safe subroutines.
    • Assumptions/dependencies: Safety certification; reliable state estimation; strict override and e-stop protocols.
  • Government and public-service assistants
    • Sector: Public Sector
    • Tools/products/workflows: Proactive guidance for complex forms and appointments (tax, visas, benefits); consent gating and transparent logs; accessibility-focused variants.
    • Assumptions/dependencies: Modernized APIs; public-sector security and privacy compliance; multilingual support.
  • Security and audit-focused assistant sandboxes
    • Sector: Cybersecurity
    • Tools/products/workflows: Controlled environments to test agents for data exfiltration risks, unauthorized execution, and prompt/tool injection; red-team benchmarks for proactive agents.
    • Assumptions/dependencies: Standard threat models; integration with SOC tooling; secure logging and forensic pipelines.

Cross-Cutting Assumptions and Risks

  • The ecological validity of LLM-simulated users may differ from real users; mixed-methods validation (user studies + simulation) will be important.
  • FSM coverage must be kept current with UI changes; automatic extraction or close dev collaboration is key.
  • Flat backend APIs for cross-app execution are required for assistants; OS support and permissions are a gating factor.
  • Privacy and consent are central: on-device inference and API-level observation reduce exposure, but policy and UX must enforce human control.
  • Current success rates (≈42% for top models in the paper) indicate more model and system work is needed before broad deployment in safety-critical domains.
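The reporting metrics referenced throughout (proposal rate, acceptance rate, success rate) can be computed from per-episode logs. A minimal sketch, assuming a hypothetical episode record format with `proposals`, `accepted`, and `succeeded` fields:

```python
def summarize_episodes(episodes):
    """Compute proposal, acceptance, and task-success rates from episode records.

    Each episode is a dict with hypothetical keys:
      proposals: number of proposals the assistant made
      accepted:  number of those proposals the simulated user accepted
      succeeded: whether oracle validation passed for the task
    """
    n = len(episodes)
    total_proposals = sum(e["proposals"] for e in episodes)
    total_accepted = sum(e["accepted"] for e in episodes)
    return {
        # Fraction of episodes in which the assistant proposed at all.
        "proposal_rate": sum(e["proposals"] > 0 for e in episodes) / n,
        # Accepted proposals over all proposals made.
        "acceptance_rate": total_accepted / total_proposals if total_proposals else 0.0,
        # Fraction of episodes passing oracle validation.
        "success_rate": sum(e["succeeded"] for e in episodes) / n,
    }
```

Standardized reporting of this kind is what RFP-style bake-offs and leaderboards would consume.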

Glossary

  • Acceptance Rate: The proportion of the assistant’s proposals that the user accepts; used as a primary evaluation metric. "we evaluate proactive assistants on Acceptance Rate ($R_\textrm{Accept}$) and Task Execution Success ($R_\textrm{Succeed}$)."
  • Agent Research Environment (ARE): A platform for building and evaluating agent environments with apps, scenarios, and verification; this work extends it for proactive settings. "Agent Research Environment (ARE) platform"
  • BaseAgent: The core agent class in ARE that implements the Reasoning → Action → Observation loop used by both user simulators and assistants. "Both agents are built on ARE's BaseAgent"
  • dual-control environment: An evaluation setting where both the agent and a user (or user simulator) can take actions, enabling interaction-focused testing. "creating a dual-control environment for technical support scenarios."
  • ecologically valid: Describes simulations that closely mirror real-world conditions and constraints. "ecologically valid user simulator agents"
  • event-based environment: An environment where asynchronous events drive state changes and notifications to agents and users. "an event-based environment that models Stateful App transitions."
  • Execute Mode: The assistant’s operational phase in which it carries out approved plans using the full set of app APIs. "Execute Mode"
  • Finite State Machine (FSM): A formal model of app/user states and transitions used to constrain user actions to realistic, screen-by-screen navigation. "finite state machines (FSMs)"
  • flat API: An interface exposing all app tools uniformly to the assistant, without state-dependent restrictions, for efficient access. "flat API structure"
  • goal inference: The process by which a proactive assistant deduces user objectives from observed actions and context. "goal inference"
  • grounded execution semantics: The requirement that actions have concrete, state-changing effects in the environment, enabling realistic evaluation. "grounded execution semantics"
  • mixed-initiative interface design: Principles for systems where control alternates or is shared between user and agent, balancing autonomy and user control. "mixed-initiative interface design"
  • multi-app orchestration: Coordinating actions across multiple applications to complete complex tasks. "multi-app orchestration"
  • observation space: The formal set of observations available to an agent or user at a given time in the environment. "observation space $\mathcal{O}$"
  • Observe Mode: The assistant’s monitoring phase in which it gathers information and decides whether to propose help. "Observe Mode"
  • oracle events: Scenario-defined events specifying expected proactive behaviors, used for evaluation and validation. "oracle events"
  • oracle validation: Scenario checks that verify whether the assistant’s actions meet predefined success criteria. "oracle validation"
  • plan-then-execute pattern: An agent design approach that separates planning (proposal) from action execution to preserve user control. "plan-then-execute pattern"
  • Poisson process: A stochastic process used here to schedule random noise events (e.g., spurious notifications). "using a Poisson process"
  • ReAct: An agent framework that interleaves explicit reasoning with actions and observations in a loop. "ReAct"
  • Stackelberg POMDP: A partially observable decision process with a leader–follower structure used to formalize agent–user interactions. "Stackelberg POMDP"
  • Stackelberg turn-based loop: A leader–follower turn structure where the user acts first and the assistant responds. "Stackelberg turn-based loop"
  • Stackelberg turn-taking structure: The specific interaction order in which the user moves, then the assistant plans and possibly acts. "Stackelberg turn-taking structure"
  • state-dependent action space: A constrained set of actions available to the user based on the current app/UI state. "state-dependent action space"
  • stateful app interfaces: App interfaces that expose different actions depending on internal UI states, mirroring real device behavior. "stateful app interfaces"
  • tool failure simulation: Injected failures of tools/APIs to test assistant robustness under unreliable conditions. "tool failure simulation"
  • user-environment interface: The restricted interface through which the user simulator accesses apps and tools according to current UI state. "user-environment interface"
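The FSM, state-dependent action space, and Poisson-process terms above can be illustrated with a short sketch. The app, its screens, and the noise rate are hypothetical, not taken from the paper:

```python
import random

# A stateful app as a finite state machine: from each screen, only the
# listed actions are legal, and each maps to the next screen. The user
# simulator is confined to this state-dependent action space, whereas
# the assistant would see a flat API.
INBOX_FSM = {
    "home":    {"open_inbox": "inbox"},
    "inbox":   {"open_message": "message", "back": "home"},
    "message": {"reply": "compose", "back": "inbox"},
    "compose": {"send": "inbox", "back": "message"},
}

class StatefulApp:
    def __init__(self, fsm, start="home"):
        self.fsm, self.state = fsm, start

    def available_actions(self):
        # The user's action space depends on the current UI state.
        return sorted(self.fsm[self.state])

    def act(self, action):
        if action not in self.fsm[self.state]:
            raise ValueError(f"{action!r} not available in {self.state!r}")
        self.state = self.fsm[self.state][action]
        return self.state

def poisson_noise_times(rate, horizon, rng=None):
    """Sample timestamps of noise events (e.g. spurious notifications)
    on [0, horizon) from a Poisson process, by accumulating exponential
    inter-arrival gaps."""
    rng = rng or random.Random(0)
    t, times = 0.0, []
    while True:
        t += rng.expovariate(rate)
        if t >= horizon:
            return times
        times.append(t)
```

Constraining the user to `available_actions()` is what makes navigation screen-by-screen rather than flat, and the sampled noise times are when spurious events would be injected into the episode.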
