
DynaWeb: Scalable RL for Web Agents

Updated 5 February 2026
  • DynaWeb is a model-based reinforcement learning framework that leverages LLMs to simulate web interactions and predict page transitions for safe agent training.
  • It constructs a data-driven web world model from millions of transition tuples, enabling synthetic 'dream' rollouts for scalable policy optimization.
  • Experimental benchmarks on WebArena and WebVoyager show significant improvements in task success rates, underscoring its potential for scalable web agent development.

DynaWeb is a framework for model-based reinforcement learning (MBRL) of autonomous web agents that leverages LLMs to simulate web interactions. DynaWeb constructs a “web world model”—a data-driven simulator that predicts page transitions and state changes—allowing web navigation policies to be trained through synthetic “dreams” rather than expensive, high-risk interaction with the live internet. It addresses the prohibitive costs, partial observability, and environmental risk inherent in online RL for web-based agents by combining high-fidelity simulation, expert demonstration interleaving, and scalable policy optimization techniques. Experimental results on WebArena and WebVoyager benchmarks demonstrate statistically significant improvements in agent task success rate relative to prior methods, establishing DynaWeb as a scalable RL paradigm for general-purpose web agent development (Ding et al., 29 Jan 2026).

1. Problem Setting and Motivation

DynaWeb formulates autonomous web navigation as a partially observed Markov decision process (POMDP)

$$(\mathcal{S}, \mathcal{A}, \mathcal{O}, \mathcal{T}, \mathcal{R})$$

where:

  • $\mathcal{S}$: the (hidden) full state of the browser and web;
  • $\mathcal{A}$: atomic browser actions (click, type, scroll, go_back, stop);
  • $\mathcal{O}$: the observation space, with emission map $\Omega: \mathcal{S} \rightarrow \mathcal{O}$ yielding observations $o_t$ (typically represented as an accessibility tree);
  • $\mathcal{T}$: state-action transition dynamics;
  • $\mathcal{R}$: task-completion reward.

Direct online RL is inefficient and hazardous due to page non-determinism, the potential for irreversible or costly actions, and the reality that large-scale data collection may be operationally or ethically infeasible. DynaWeb’s core innovation is to circumvent these barriers by learning a high-fidelity world model $p_\phi$ to simulate web transitions, thereby enabling large-scale RL via “imagination” (Ding et al., 29 Jan 2026).
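The POMDP formulation above can be sketched as a minimal agent-facing interface; the names `WebPOMDP`, `Observation`, and `Action` are illustrative, not from the paper:

```python
from dataclasses import dataclass
from typing import Callable

# Illustrative types: the agent never sees the hidden state S, only
# observations such as a serialized accessibility tree o_t.
Observation = str   # e.g. an accessibility-tree string
Action = str        # an atomic browser action like "click [42]"

@dataclass
class WebPOMDP:
    """Agent-facing view of the web POMDP: the true state is hidden,
    so transitions are expressed observation-to-observation (T composed
    with the emission map), and rewards score task completion."""
    transition: Callable[[Observation, Action], Observation]
    reward: Callable[[Observation], float]

    def step(self, obs: Observation, action: Action) -> tuple[Observation, float]:
        next_obs = self.transition(obs, action)
        return next_obs, self.reward(next_obs)
```

In training, this interface would be backed either by a real browser or by the learned world model $p_\phi$, which is the substitution DynaWeb exploits.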

2. Architecture and MBRL Workflow

DynaWeb comprises two principal LLM-based components:

  • The web world model $p_\phi$, a generative simulator trained on millions of real transition tuples $(I, o_t, a_t, o_{t+1})$, where $I$ denotes system instructions.
  • The agent policy $\pi_\theta$, itself an LLM mapping the observation-action history and user query $q$ to a candidate action $a_t$ and chain-of-thought (CoT) rationale $h_t$.

At each simulated step, $p_\phi$ predicts both the next accessibility tree (via $\Delta$-patches applied to $o_t$) and a natural-language reasoning trace, given $I$, $o_t$, and $a_t$. Rollouts are generated by alternating between policy sampling and world model prediction:

$$(h_t, a_t) \sim \pi_\theta(\cdot \mid I, q, o_{1:t}, h_{1:t-1}, a_{1:t-1})$$

$$(r, \Delta) \sim p_\phi(r, \Delta \mid I, o_t, a_t)$$

$$\hat{o}_{t+1} = \mathrm{Patch}(o_t, \Delta)$$

Resulting “imagined” trajectories are used for policy optimization.
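The alternation above can be sketched as a single "dream" rollout loop; `policy`, `world_model`, and `patch` are stand-ins for $\pi_\theta$, $p_\phi$, and the Patch operator, and all names here are illustrative rather than the paper's API:

```python
def dream_rollout(policy, world_model, patch, instr, query, o1, horizon=5):
    """Alternate policy sampling and world-model prediction for up to
    `horizon` steps, producing one imagined trajectory.

    policy(instr, query, observations, history) -> (h_t, a_t)
    world_model(instr, o_t, a_t)                -> (r_t, delta)
    patch(o_t, delta)                           -> o_{t+1}
    """
    obs = [o1]          # observation sequence o_{1:t}
    history = []        # (rationale, action) pairs h_{1:t-1}, a_{1:t-1}
    trajectory = []
    for _ in range(horizon):
        h_t, a_t = policy(instr, query, obs, history)   # sample from pi_theta
        r_t, delta = world_model(instr, obs[-1], a_t)   # predict with p_phi
        next_obs = patch(obs[-1], delta)                # apply the Delta-patch
        trajectory.append((obs[-1], h_t, a_t, r_t))
        history.append((h_t, a_t))
        obs.append(next_obs)
        if a_t == "stop":                               # episode terminator
            break
    return trajectory
```

The loop caps rollout length at the horizon $H$, matching the paper's finding that short dreams limit compounding simulation error.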

Expert trajectory interleaving is a key stabilizing mechanism: in each update batch, 50% of rollouts are sampled from a dataset of real expert demonstrations rather than world model predictions. This anchors the agent’s learning to authentic web behaviors and mitigates model drift and compounding simulation errors (Ding et al., 29 Jan 2026).
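A minimal sketch of this batch construction, treating trajectories as opaque objects and exposing the interleaving fraction as a parameter; `mix_batch` and its signature are hypothetical, not from the paper:

```python
import random

def mix_batch(dreamed, expert_pool, batch_size, expert_frac=0.5, rng=None):
    """Build an update batch with `expert_frac` real expert demonstrations
    (the paper uses ~50%) and the remainder drawn from dreamed rollouts."""
    rng = rng or random.Random(0)
    n_expert = int(batch_size * expert_frac)
    n_dream = batch_size - n_expert
    batch = rng.sample(expert_pool, n_expert) + rng.sample(dreamed, n_dream)
    rng.shuffle(batch)  # avoid ordering bias within the batch
    return batch
```

Anchoring half of each batch to real demonstrations bounds how far the policy can drift toward behaviors that only succeed in simulation.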

3. Policy Optimization via GSPO

DynaWeb utilizes Group Sequence Policy Optimization (GSPO) to maximize expected task-completion reward:

$$J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\, r(\tau) \,\right]$$

GSPO is applied to batches of $G$ trajectories $\{\hat{\tau}^i\}$, using a length-normalized, sequence-level importance ratio per trajectory:

$$s^i(\theta) = \left(\frac{\pi_\theta(y^i \mid q, o_1)}{\pi_{\theta_{\mathrm{old}}}(y^i \mid q, o_1)}\right)^{1/|y^i|}$$

with the objective:

$$\mathcal{J}_{\mathrm{GSPO}}(\theta) = \mathbb{E}\left[\frac{1}{G}\sum_{i=1}^{G} \min\left(s^i(\theta)\,\hat{A}^i,\ \mathrm{clip}\left(s^i(\theta),\, 1-\varepsilon,\, 1+\varepsilon\right)\hat{A}^i\right)\right]$$

where $\hat{A}^i$ is the estimated advantage of trajectory $i$.
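The ratio and clipped objective can be computed from per-token log-probabilities; this is a sketch with illustrative names, not the paper's implementation:

```python
import math

def gspo_objective(logp_new, logp_old, advantages, eps=0.2):
    """GSPO surrogate objective.

    logp_new / logp_old: per-trajectory lists of per-token log-probs
    under pi_theta and pi_theta_old; advantages: one A-hat per trajectory.
    """
    total = 0.0
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        # s^i = (prod pi_new / prod pi_old)^(1/|y^i|), done in log space:
        # the geometric mean of per-token likelihood ratios.
        s = math.exp((sum(lp_new) - sum(lp_old)) / len(lp_new))
        clipped = min(max(s, 1.0 - eps), 1.0 + eps)
        total += min(s * adv, clipped * adv)
    return total / len(advantages)
```

The $1/|y^i|$ exponent keeps the ratio's scale comparable across trajectories of very different token lengths, which stabilizes clipping relative to raw product-of-ratios objectives.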

Ablation studies indicate that a “dream” rollout horizon of $H \approx 4$–$5$ best balances trajectory diversity and model fidelity, while expert interleaving at 40–50% is close to optimal. Purely synthetic rollouts degrade performance by propagating simulation bias, while insufficient “dreaming” underutilizes the world model’s sample-generation capability (Ding et al., 29 Jan 2026).

4. Experimental Evaluation and Results

DynaWeb is evaluated on the WebArena and WebVoyager benchmarks:

  • WebArena: 812 tasks (Reddit, GitLab, Maps, CMS, Shopping) in isolated Docker instances.
  • WebVoyager: 643 live-browser tasks across 15 real-world sites (e.g., Amazon, BBC News, Coursera, Google Maps).

Benchmarks compare DynaWeb to:

  • Baseline vanilla LLMs (Llama-3.1-8B-Instruct);
  • Proprietary commercial models (GPT-4o);
  • Supervised finetuning (NNetNav, Go-Browse);
  • Offline RL (WebRL);
  • Inference-time lookahead (ITL).

DynaWeb outperforms all baselines on both benchmarks. On WebArena, average Success Rate (SR) increases from 26.7% (WebRL) to 31.0% (DynaWeb), a 16.1% relative gain; on WebVoyager, SR rises from 32.6% (WebRL) to 38.7%. Highest per-domain results are achieved across Reddit (43.8%), GitLab (28.7%), CMS (31.5%), and Shopping (33.2%) (Ding et al., 29 Jan 2026).

Substituting the finetuned $p_\phi$ with a non-finetuned general-purpose LLM drops SR on WebArena from 31.0% to 20.9% and on WebVoyager from 35.4% to 28.6%, confirming that environment-specific world modeling is indispensable.

5. Relation to Template-based and Extraction Systems

Preceding DynaWeb, frameworks for dynamic web content management and data extraction, such as Vcache (Goyal et al., 2010) and the similarity-based extraction/integration system of DynaWeb [Editor's term: “DynaWeb (Data extraction)”; (C et al., 2013)], operated on fundamentally different principles.

  • Vcache: Decomposes dynamic HTML pages into reusable templates (with <gap> and <loop> tags) and instance-specific bindings. Key features include brute-force and statistical fragmentor algorithms, cache management on the client (by URL or hash), and a language-agnostic architecture. Seen as a blueprint for caching dynamic documents, it eschews string-based similarity in favor of control-flow alignment and achieves large reductions in bandwidth and latency with high cache hit rates, but does not synthesize or simulate user-web interactions (Goyal et al., 2010).
  • DynaWeb (Data extraction): Implements web data crawling (WDES) and record integration (WDICS) using URL-structure and cosine similarity for offline content analysis from search engine result pages. It offers robust precision/recall on structured data mining but is not an RL or interactive agent framework (C et al., 2013).

DynaWeb’s model-based RL paradigm is orthogonal: rather than restructuring or mining static/dynamic documents, it trains interactive agents to navigate and act in new web environments using simulated experience generated by an LLM-driven environment model (Ding et al., 29 Jan 2026).

6. Limitations and Future Directions

DynaWeb’s current limitations center on the fidelity of the web world model:

  • Hallucinated or inaccurate page transitions can occur on highly dynamic or unseen sites (e.g., arXiv, GitHub).
  • Long-horizon rollouts exacerbate simulation drift and error accumulation, requiring careful ablation of horizon length.
  • The present model does not robustly handle multi-tab, multi-agent, or arbitrarily rich UI events; all browser actions are limited to atomic primitives.
  • There is no explicit uncertainty estimation for rollout termination.

Future work includes extending world-model coverage to richer UI actions, training with expanded corpora of real interactions, and incorporating uncertainty-aware partial rollouts. Scaling DynaWeb to multi-agent or concurrent browsing environments is another promising direction (Ding et al., 29 Jan 2026).

A plausible implication is that advances in world model fidelity and agent grounding will accelerate the deployment of safe, scalable, and general-purpose web agents trained entirely in silico, enabling data- and safety-constrained domains to benefit from RL advances without exposing live production systems to risk.

7. Summary Table: DynaWeb vs. Prior Dynamic Web Systems

| Framework | Primary Paradigm | Core Mechanism |
| --- | --- | --- |
| DynaWeb (RL) | Model-based RL for agents | LLM world model + imagination |
| Vcache | Template caching | Fragmentor + template/binding split |
| DynaWeb (Data extract) | Data extraction/integration | WDES/WDICS + similarity filters |

Each method addresses distinct problems: agentic RL via simulated dreaming (Ding et al., 29 Jan 2026), efficient caching via automatic template decomposition (Goyal et al., 2010), and data mining from SERPs (C et al., 2013). Their combination delineates the landscape of dynamic web content management and intelligent agent interaction.
