NestBrowse: Nested Browser-Use Learning
- NestBrowse is a browser-driven framework that decouples high-level planning from low-level page interactions, minimizing actions for efficient multi-step task execution.
- It employs a two-level nested policy structure with outer-loop planning and inner-loop evidence extraction to handle both static and dynamic web content.
- Empirical benchmarks show competitive performance with reduced context overhead and enhanced privacy through client-side clickstream modeling.
Nested Browser-Use Learning (NestBrowse) is a browser-driven framework designed for agentic information-seeking with deep, dynamic web content acquisition. By minimizing the set of browser actions and decoupling high-level agentic planning from low-level evidence extraction, NestBrowse enables efficient and robust multi-step task execution, surpassing conventional API-based approaches. The architecture operates on nested interaction loops, is trained with imitation learning objectives, and integrates with modern LLMs and privacy-preserving client-side models.
1. Motivation and Core Challenges
Information-seeking agents have exhibited strong performance in open-domain and multi-hop QA when constrained to APIs that primarily support surface-level search and URL fetch operations. These systems cannot interact with dynamic content that requires user actions such as clicks and form fills, and they suffer from context overflow because raw page snapshots are injected into the model input. A single page of raw HTML can run to 100K–200K tokens, enough to fill or exceed the 128K–256K context windows of recent LLMs. Moreover, ReAct-style function-calling agents become inefficient and error-prone when confronted with the highly redundant, verbose structure of real web pages.
NestBrowse addresses these problems by introducing a minimal and sufficient browser toolkit, which supports search, visit, click, and fill actions. Crucially, it separates high-level planning (outer loop) from fine-grained, goal-specific page exploration (inner loop), handling both static and dynamic content. This nested decomposition lets LLM agents keep their context tractable, interact more deeply with pages, and compress each tool response to only what is relevant for the current goal (Li et al., 29 Dec 2025).
A closely related formulation is found in client-side user modeling: clickstreams are treated as “Action Paths” consisting of nested, time-ordered sequences that annotate multi-tab branching and backtracking decisions with explicit special tokens. In that setting, the nested structure has been reported effective for classifying user strategies and predicting future actions while preserving privacy (Ou et al., 2021).
2. Formal Framework and Definition
NestBrowse adopts a two-level nested policy structure:
- Browser toolkit: 𝒯 = {search, visit, click, fill}; 𝒯_page ⊂ 𝒯 includes page-transitional actions (visit, click).
- Outer loop: At step t, the agent context c_t is extended with the tuple (a_t, η_t, r_t), where a_t is the tool, η_t its arguments, and r_t the tool response. The selection is modeled as (a_t, η_t) ∼ p_θ(· | c_t).
- Nested execution: If a_t ∈ 𝒯_page, the agent runs an inner loop that extracts only goal-relevant content from segments of the page text using f_θ; otherwise, it returns the base execution result Exec_base(a_t, η_t).
- Multi-task loss: A joint learning objective combines an outer-loop loss and an inner-loop loss, balancing outer-loop agentic reasoning with inner-loop evidence extraction.
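As a rough illustration of the multi-task objective, the sketch below combines two imitation-style losses as a weighted sum of mean negative log-likelihoods. The weighted-sum form, the weights `lam_out` and `lam_in`, and the function names are assumptions made for illustration; the source does not spell out the exact formula.

```python
import math
from typing import Sequence


def mean_nll(token_probs: Sequence[float]) -> float:
    """Mean negative log-likelihood of the demonstrated (reference) tokens."""
    return -sum(math.log(p) for p in token_probs) / len(token_probs)


def joint_loss(
    outer_probs: Sequence[float],
    inner_probs: Sequence[float],
    lam_out: float = 1.0,  # hypothetical weight on the outer-loop (planning) loss
    lam_in: float = 1.0,   # hypothetical weight on the inner-loop (extraction) loss
) -> float:
    """Assumed form: lam_out * L_out + lam_in * L_in."""
    return lam_out * mean_nll(outer_probs) + lam_in * mean_nll(inner_probs)


# Probabilities the model assigned to the reference tokens of each trace.
loss = joint_loss(outer_probs=[0.5, 0.5], inner_probs=[0.25])
```

Raising `lam_in` relative to `lam_out` would, under this assumed form, push training toward better evidence extraction at the expense of planning, and vice versa.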
For clickstream behavior modeling, an Action Path is an ordered sequence of (URL, dwell time) pairs, punctuated by <COI> markers for tab-branching, and classified at end-of-action with tokens for targeted, purposive, and explorative browsing (Ou et al., 2021).
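The Action Path described above can be pictured as a flat token sequence. The following is a minimal sketch, assuming a hypothetical `url@dwell` token spelling and hypothetical end-of-action token names; only the `<COI>` marker and the three behavior classes come from the source.

```python
from dataclasses import dataclass
from typing import List, Union

COI = "<COI>"  # tab-branching marker named in the source

# Hypothetical spellings for the end-of-action class tokens.
END_TOKENS = {
    "targeted": "<TARGETED>",
    "purposive": "<PURPOSIVE>",
    "explorative": "<EXPLORATIVE>",
}


@dataclass(frozen=True)
class Visit:
    url: str
    dwell_seconds: float


Event = Union[Visit, str]  # a page visit or the COI marker


def encode_action_path(events: List[Event], label: str) -> List[str]:
    """Serialize an ordered list of visits and COI markers into tokens."""
    tokens: List[str] = []
    for event in events:
        if isinstance(event, Visit):
            tokens.append(f"{event.url}@{round(event.dwell_seconds)}s")
        elif event == COI:
            tokens.append(COI)
        else:
            raise ValueError(f"unknown event: {event!r}")
    tokens.append(END_TOKENS[label])  # classification token closes the path
    return tokens


path = encode_action_path(
    [Visit("a.example", 3.0), COI, Visit("b.example", 10.0)], "targeted"
)
# path == ["a.example@3s", "<COI>", "b.example@10s", "<TARGETED>"]
```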
3. Algorithmic Procedures and Implementation
The core algorithm operates as follows:
    Initialize context c_0 ← [<UserGoal=g>]
    t ← 0
    while not Termination(c_t):
        (a_t, η_t) ← p_θ(· | c_t)
        if a_t ∈ {visit, click}:
            RawPage ← BrowserBaseExec(a_t, η_t).text
            Segments ← SegmentPage(RawPage)
            Workspace ← ∅
            for i in 1..N:
                Δ ← f_θ(Segments[i], g_t)
                Workspace ← Workspace ∪ Δ
            r_t ← Workspace
        else:
            r_t ← BrowserBaseExec(a_t, η_t)
        c_{t+1} ← Append(c_t, <tool_call a_t, η_t>, <tool_resp r_t>)
        t ← t + 1
    Answer ← p_θ(Answer | c_t)
    Return <answer>Answer</answer>
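The pseudocode can be exercised end to end with stand-ins for the learned components. In this minimal sketch, `scripted_policy`, `fake_browser`, and `keyword_extract` are hypothetical stubs standing in for p_θ, BrowserBaseExec, and f_θ, and line-based segmentation is an assumption; the final answer-generation step is omitted.

```python
import re
from typing import Callable, List, Optional, Tuple

PAGE_TOOLS = {"visit", "click"}  # page-transitional actions trigger the inner loop
ToolCall = Tuple[str, str]       # (tool name, argument string)


def segment_page(text: str) -> List[str]:
    """Assumed segmentation: one segment per non-empty line."""
    return [line for line in text.splitlines() if line.strip()]


def nested_browse(
    goal: str,
    policy: Callable[[List[str]], Optional[ToolCall]],
    browser: Callable[[str, str], str],
    extract: Callable[[str, str], str],
    max_calls: int = 100,
) -> List[str]:
    context = [f"<UserGoal={goal}>"]
    for _ in range(max_calls):
        call = policy(context)
        if call is None:  # the policy signals termination
            break
        tool, args = call
        raw = browser(tool, args)
        if tool in PAGE_TOOLS:
            # Inner loop: keep only goal-relevant text from each segment.
            workspace = []
            for segment in segment_page(raw):
                useful = extract(segment, goal)
                if useful:
                    workspace.append(useful)
            response = "\n".join(workspace)
        else:
            response = raw  # search/fill results pass through unchanged
        context.append(f"<tool_call>{tool}({args})</tool_call>")
        context.append(f"<tool_response>{response}</tool_response>")
    return context


# --- Hypothetical stubs used only to run the loop ---
SCRIPT = [("search", "capital of France"), ("visit", "https://example.org/france")]
PAGES = {
    SCRIPT[0]: "1. https://example.org/france",
    SCRIPT[1]: "Cookie banner: accept all\n"
               "Paris is the capital of France.\n"
               "Subscribe to our newsletter",
}


def scripted_policy(context: List[str]) -> Optional[ToolCall]:
    step = (len(context) - 1) // 2  # each step appends two context entries
    return SCRIPT[step] if step < len(SCRIPT) else None


def fake_browser(tool: str, args: str) -> str:
    return PAGES[(tool, args)]


def keyword_extract(segment: str, goal: str) -> str:
    keywords = {w for w in re.findall(r"[a-z]+", goal.lower()) if len(w) > 3}
    return segment if any(k in segment.lower() for k in keywords) else ""


trace = nested_browse(
    "What is the capital of France?", scripted_policy, fake_browser, keyword_extract
)
```

With these stubs, the search response enters the context unchanged, while the visited page is reduced to the single goal-relevant line; the cookie banner and newsletter prompt never reach the outer context.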
Technical implementation features include:
- Browser engine: Headless Playwright for page loads/rendering/interaction; HTML is converted to semantic DOM snapshots and then LLM-readable text.
- Toolkit minimalism: Search yields top-10 hits; visit and click trigger inner-loop extraction; fill acts on forms (usually omitting a nested loop).
- Mark-up conventions: Outer loop (ReAct-style tags: <think>, <tool_call>, <tool_response>); inner loop (<useful_info>).
- Models: Qwen3-4B and Qwen3-30B-A3B (128K context, max 100 tool calls).
- Prompting: Separate system/user prompts for outer/inner loops.
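The mark-up conventions listed above lend themselves to simple tag parsing. The sketch below assumes matching closing tags and a JSON payload with `name` and `arguments` keys inside `<tool_call>`; neither detail is confirmed by the source, and `parse_tagged` is a hypothetical helper.

```python
import json
import re
from typing import List, Tuple

# Matches <tag>...</tag> for the tags named in the mark-up conventions.
TAG_RE = re.compile(
    r"<(tool_call|tool_response|useful_info)>(.*?)</\1>", re.DOTALL
)


def parse_tagged(text: str) -> List[Tuple[str, str]]:
    """Return (tag, body) pairs in order of appearance."""
    return [(m.group(1), m.group(2).strip()) for m in TAG_RE.finditer(text)]


# Outer-loop output: free-form reasoning followed by a tool call.
outer = (
    "I should open the first hit.\n"
    '<tool_call>{"name": "visit", "arguments": {"url": "https://example.org"}}'
    "</tool_call>"
)
# Inner-loop output: only the goal-relevant evidence.
inner = "<useful_info>\nParis is the capital of France.\n</useful_info>"

calls = [json.loads(body) for tag, body in parse_tagged(outer) if tag == "tool_call"]
evidence = [body for tag, body in parse_tagged(inner) if tag == "useful_info"]
```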
Client-side clickstream systems employ browser extensions to record anonymized behavioral tuples (local IndexedDB logs), with an over-parameterized GRU for sequence modeling and next-action prediction (Ou et al., 2021).
GUI-based browser agents (BUI-BERT) in multi-modal setups encode screenshot pixels, detected sub-word locations, past actions, and Transformer memory into the input sequence. Action heads predict movement, clicks, or text entries; page transitions are orchestrated via recurrent memory tokens, with cross-entropy loss on each step (Iki et al., 2022).
4. Empirical Benchmarks and Analysis
NestBrowse is evaluated on deep web QA benchmarks:
| Model / Toolkit | BrowseComp | BrowseComp-zh | GAIA | XBench |
|---|---|---|---|---|
| OpenAI-o3 (browser) | 49.7 | 58.1 | 70.5 | 66.7 |
| NestBrowse-4B (browser) | 22.4 | 28.4 | 68.9 | 74.0 |
| NestBrowse-30B-A3B | 31.6 | 42.6 | 75.7 | 75.0 |

NestBrowse-30B-A3B consistently outperforms all known open-source agents and is competitive with closed-source browser agents. Notably, even smaller agents (4B) deliver near-state-of-the-art results with appropriate architecture.
Ablation studies confirm that both toolkit minimalism and goal-driven extraction are necessary for high accuracy. Context compression is efficient, injecting only ≈5K tokens of goal-relevant data per page, keeping overall context far below the agent’s 128K token capacity. Inner-loop extraction, judged by GPT-4.1, raises raw snapshot retention from 50% to 80% and goal-relevant extraction accuracy from 65% to 90%.
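A back-of-the-envelope calculation makes the context savings concrete. It uses only figures quoted in this article (128K context, about 5K tokens injected per page, 100K–200K tokens per raw page); the helper `pages_that_fit` and the choice to ignore prompt and reasoning tokens are simplifying assumptions.

```python
CONTEXT_LIMIT = 128_000          # agent context capacity in tokens
COMPRESSED_PER_PAGE = 5_000      # approximate goal-relevant tokens injected per page
RAW_PER_PAGE = (100_000, 200_000)  # raw HTML size range per page


def pages_that_fit(limit: int, per_page: int, reserved: int = 0) -> int:
    """Whole pages that fit in the context after reserving some tokens."""
    return max(0, (limit - reserved) // per_page)


compressed_pages = pages_that_fit(CONTEXT_LIMIT, COMPRESSED_PER_PAGE)  # 25
raw_pages_best = pages_that_fit(CONTEXT_LIMIT, RAW_PER_PAGE[0])        # 1
raw_pages_worst = pages_that_fit(CONTEXT_LIMIT, RAW_PER_PAGE[1])       # 0
```

Under these assumptions, roughly 25 compressed pages fit in one context window, whereas raw injection fits at most one page and sometimes none.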
In lab studies of user browsing behaviors, a GRU model achieves 100% accuracy on behavior classification from annotated clickstreams and >60% top-1 prediction of next actions when given 60–80% of the session context (Ou et al., 2021). Five distinct graph motifs (concentrated cluster, hesitation leaf, directed ring, breadth star, intersected overlap) encapsulate the diversity of real nested browsing.
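The prefix-based evaluation of next-action prediction can be sketched as follows. A bigram frequency table stands in for the GRU here purely to keep the example dependency-free; it is not the model used by Ou et al., and the cut rule in `split_session` and the toy sessions are illustrative assumptions.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Optional, Tuple


def split_session(session: List[str], fraction: float) -> Tuple[List[str], str]:
    """Reveal the first `fraction` of a session; the next action is the target."""
    cut = max(1, min(len(session) - 1, int(len(session) * fraction)))
    return session[:cut], session[cut]


def train_bigram(sessions: List[List[str]]) -> Dict[str, Counter]:
    """Count next-action frequencies per current action (stand-in for a GRU)."""
    table: Dict[str, Counter] = defaultdict(Counter)
    for session in sessions:
        for prev, nxt in zip(session, session[1:]):
            table[prev][nxt] += 1
    return table


def predict_next(table: Dict[str, Counter], prefix: List[str]) -> Optional[str]:
    counts = table.get(prefix[-1])
    return counts.most_common(1)[0][0] if counts else None


def top1_accuracy(
    table: Dict[str, Counter], sessions: List[List[str]], fraction: float
) -> float:
    """Fraction of sessions (each of length >= 2) whose next action is predicted."""
    hits = 0
    for session in sessions:
        prefix, target = split_session(session, fraction)
        hits += predict_next(table, prefix) == target
    return hits / len(sessions)


# Toy sessions, invented for illustration only.
train = [
    ["home", "search", "results", "item"],
    ["home", "search", "results", "item"],
    ["home", "news"],
]
test = [["home", "search", "results", "item"]]
accuracy = top1_accuracy(train_bigram(train), test, fraction=0.7)
```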
5. Architecture Variants: Multi-modal and Client-side Extensions
Unified vision-and-language browser agents (BUI-BERT) extend NestBrowse principles to GUI manipulation. Pixel grids (ResNet18 features), OCR-emulated subword detection, previous action encoding, and memory buffer tokens are fused into the input of a Transformer-based model. Task execution unfolds over multi-step action sequences: mouse moves, clicks, text entry, and scrolls, as annotated in Selenium-backed virtual web tasks (Iki et al., 2022).
Ablation analyses show that both atomic-action pre-training and recurrent memory are critical for mastering nested, cross-page routines. Accuracy on seen multi-page tasks approaches 75%, but generalization to unseen UI layouts and question types is weak; the agent tends to memorize patterns of interaction rather than abstract planning. This suggests future work should prioritize domain-agnostic structure signals (DOM element types, direct screenshot vision) and RL-style exploration to improve zero-shot robustness.
In purely client-side environments, action path modeling—with url2vec embeddings and bias-injected GRUs—enables real-time privacy-preserving prediction and interpretation of browsing habits. Such systems neither upload nor centrally process user URLs or page content, maintaining privacy guarantees while supporting user interface augmentation (Ou et al., 2021).
6. Key Insights, Limitations, and Future Directions
NestBrowse's nested loop paradigm offers:
- Efficiency: Context compression enables deep exploration without exceeding LLM limits; minimal toolkits restrict the action space yet preserve full interaction coverage.
- Flexibility: Agents handle interactive utilities (calculators, tabular filters), with small-scale models delivering competitive results.
- Robustness: Decoupled interaction flow circumvents typical failure modes of raw HTML injection and redundant context overload.
Limitations are apparent in the current text-only implementation, with multimodal content (images, charts) unaddressed. Vision-language inner loops, direct screenshot parsing, and hybrid RL+SL methods may address the generalization gaps found in BUI-BERT tasks, improving planning abstraction and domain transferability (Iki et al., 2022, Li et al., 29 Dec 2025). On-device continual learning represents a promising avenue for privacy-preserving, adaptive clickstream modeling (Ou et al., 2021).
A plausible implication is that hierarchical policy architectures, with explicit separation between high-level planning and low-level action controllers, could further enhance model scalability and transfer capacity. The foundational structure of NestBrowse aligns with emerging trends in browser-based LLM agents as well as client-side behavioral modeling, bridging the gap between human-level information seeking and LLM-driven browsing automation.