M²-Miner: Automated Mobile GUI Data Mining
- The paper introduces a fully automated multi-agent MCTS framework that leverages Vision-LMMs for intelligent GUI agent training.
- The paper employs novel intent recycling and process-based reward assignment to improve data quality and diversity while reducing annotation costs.
- The paper validates its approach on mobile benchmarks, achieving state-of-the-art performance with exponential search speedup and significant cost reduction.
M²-Miner is a fully automated mobile GUI agent data-mining framework that leverages a collaborative multi-agent formulation of Monte Carlo Tree Search (MCTS) to harvest large-scale, richly annotated intent-trajectory pairs for intelligent GUI agent training. By integrating Vision-LMM-guided expansion, multistage ranking, process-based reward assignment, and novel intent recycling, M²-Miner addresses persistent limitations of manual and conventional mining techniques, namely annotation cost, data quality, and intent diversity, enabling state-of-the-art GUI agents across standard mobile interaction benchmarks (Lv et al., 5 Feb 2026).
1. Motivation and Core Challenges
Mobile GUI agent training conventionally depends on extensive annotation of user-behavior trajectories, typically represented as intent-trajectory pairs that map natural-language goals to executable interaction traces. Manual annotation suffers from prohibitive cost, limited coverage, and inconsistent data quality, while existing automated mining solutions have traded off construction expense, data quality (spurious or infeasible trajectories), and intent diversity against one another rather than addressing all three. M²-Miner introduces a fully automated, multi-agent pipeline that systematically mitigates these issues via an MCTS-based data-mining paradigm with agent-based guidance, process-driven evaluation, and curriculum-style retraining.
2. System Architecture: Multi-Agent Collaborative MCTS
The M²-Miner pipeline instantiates three specialized agents in an MCTS control loop, embedding their computation at critical phases for search guidance, prioritization, and reward assignment:
- InferAgent governs the expansion phase, proposing candidate GUI operations per state-intent pair using an ensemble of two Vision-LMMs (Qwen2.5-VL-7B and Qwen2.5-VL-72B). The first model generates a “seed” action; conditioned on this, the second yields diverse alternatives. This agent replaces random expansion with intent-aware, contextually grounded suggestions, significantly increasing the probability of sampling correct paths.
- OrchestraAgent merges duplicate or semantically equivalent candidate actions (using MLLM-based YES/NO queries), followed by tournament-style ranking to prioritize operations relative to the specified intent. Higher-priority actions receive elevated initial Q/N values in the UCT (Upper Confidence Bound for Trees) calculation, focusing exploration budgets on high-yield branches.
- JudgeAgent replaces full-path rollout with immediate, process-based evaluation at each expansion. It assigns fine-grained rewards to child states using a dual-model architecture: the Outcome Reward Model detects terminality (success, in-progress, impossible), while the Process Reward Model computes normalized scores (via softmax of validity logits) to reflect action plausibility. Rewards are backpropagated as in canonical MCTS, enabling fast convergence to promising trajectories.
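JudgeAgent's dual-model scoring can be sketched as follows. This is a minimal illustration, not the paper's implementation: the logit-pair shape, the outcome label names, and the `evaluate` signature are all assumptions.

```python
import math

def process_reward(valid_logit, invalid_logit):
    # Softmax over a hypothetical (valid, invalid) logit pair -> the
    # probability the action is valid, used as the normalized process reward.
    m = max(valid_logit, invalid_logit)          # stabilize the exponentials
    e_valid = math.exp(valid_logit - m)
    e_invalid = math.exp(invalid_logit - m)
    return e_valid / (e_valid + e_invalid)

def evaluate(outcome_label, valid_logit, invalid_logit):
    # The Outcome Reward Model's label decides terminality
    # (success / impossible end the branch); the Process Reward Model's
    # softmax score rewards plausible in-progress actions.
    terminal = outcome_label in ("success", "impossible")
    if outcome_label == "success":
        return 1.0, terminal
    if outcome_label == "impossible":
        return 0.0, terminal
    return process_reward(valid_logit, invalid_logit), terminal
```

The returned reward is then backpropagated up the tree exactly as in canonical MCTS.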
The high-level pseudocode for M²-Miner's loop is as follows (segment):

```
function M2Miner(I₀, s₀):
    T ← initialize tree with root (s₀, I₀)
    while budget remains and no success path found do
        v ← SELECT(T)
        Aₖ ← InferAgent.generate(v)
        A_sorted ← OrchestraAgent.merge_and_rank(Aₖ)
        for a in A_sorted do
            s′ ← execute_action(v.s, a)
            (reward, terminal) ← JudgeAgent.evaluate(s′)
            add child node v′ = (s′, …) to v
            BACKPROPAGATE(v′, reward)
            if terminal == success then break
        end for
    end while
    return successful trajectory if found
```
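A minimal runnable translation of this loop might look like the following sketch. The three agents are reduced to plain callables, selection is a crude greedy stand-in rather than UCT, and every stub signature is an assumption made for illustration.

```python
def m2miner_loop(root_state, intent, infer, orchestra, judge, execute, budget=50):
    # Each node records state, parent link, incoming action, and Q/N stats.
    tree = [{"state": root_state, "parent": None, "action": None,
             "Q": 0.0, "N": 0}]
    for _ in range(budget):
        # Crude stand-in for SELECT: best average reward so far
        # (the real pipeline uses UCT with agent-seeded priors).
        node = max(tree, key=lambda v: v["Q"] / (v["N"] + 1))
        # InferAgent proposes candidates; OrchestraAgent merges/ranks them.
        for action in orchestra(infer(node["state"], intent), intent):
            child_state = execute(node["state"], action)
            reward, success = judge(child_state, intent)   # process reward
            child = {"state": child_state, "parent": node, "action": action,
                     "Q": reward, "N": 1}
            tree.append(child)
            v = node
            while v is not None:                 # backpropagate the reward
                v["Q"] += reward
                v["N"] += 1
                v = v["parent"]
            if success:                          # reconstruct the trajectory
                path = []
                while child["action"] is not None:
                    path.append(child["action"])
                    child = child["parent"]
                return path[::-1]
    return None
```

With integer states and toy callables, the loop returns the root-to-leaf action sequence of the first branch the judge accepts, or `None` when the budget is exhausted.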
3. Intent Recycling and Data Richness Enhancement
Conventional approaches mine a single trajectory per intent, discarding non-primary paths, which limits data richness. M²-Miner introduces an intent recycling strategy engineered to maximize coverage:
- After MCTS completion, all root-to-node paths are enumerated.
- A custom Recycling Filter, implemented as an MLLM prompt, discards redundant or illogical paths.
- For each retained path, an MLLM generates a descriptive natural-language intent.
- JudgeAgent verifies the intent’s alignment with the path’s terminal state.
- Accepted intent-trajectory pairs augment the dataset without incurring additional exploration cost.
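The enumeration step above can be sketched directly. The `Node` structure here is a hypothetical stand-in for the search tree's nodes, not the paper's data model.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    # Hypothetical tree node: the action that produced this state,
    # plus the child nodes expanded beneath it.
    action: str
    children: list = field(default_factory=list)

def enumerate_paths(root):
    # Depth-first walk collecting every root-to-node path; each path is a
    # candidate trajectory to be screened by the Recycling Filter.
    paths = []
    def dfs(node, prefix):
        path = prefix + [node]
        if len(path) > 1:            # skip the trivial root-only "path"
            paths.append(path)
        for child in node.children:
            dfs(child, path)
    dfs(root, [])
    return paths
```

Each enumerated path would then be filtered, re-described by an MLLM, and verified by JudgeAgent before entering the dataset.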
This strategy has been observed to yield a 2–3× increase in unique, high-quality intent-trajectory pairs, as quantified by trajectory counts and illustrated by t-SNE visualizations of intent diversity (Lv et al., 5 Feb 2026).
4. Progressive Model-in-the-Loop Retraining
M²-Miner employs a curriculum learning regime that alternates between dataset mining and agent retraining, progressing through a warm-up phase and three sequential mining stages:
- Warm-up: Agents (InferAgent, JudgeAgent) are initialized on public datasets (AndroidControl, AITZ, GUI-Odyssey, AMEX), achieving basic GUI understanding.
- Basic Intents: Generation and collection of simple home-screen tasks, retraining agents on aggregated data.
- Complex Intents: Mining of multi-condition, longer-horizon tasks, followed by retraining.
- Recycled Intents: Application of the intent recycling strategy to maximize data diversity.
Each curriculum phase raises both the Mining Success Ratio (MSR) and classifier AUC (JudgeAgent), with MSR increasing from ∼30% post-warm-up to >75% after full training. A plausible implication is that curriculum-based agent retraining is essential for practical scaling of automated mining pipelines.
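The alternation between mining and retraining described above can be expressed as a simple loop. `mine` and `retrain` are hypothetical callables standing in for the MCTS pipeline and the fine-tuning step; this is a structural sketch, not the paper's training code.

```python
def curriculum_pipeline(agents, stages, mine, retrain):
    # After each stage's mining pass, the agents are retrained on the
    # aggregated dataset before attempting the harder next stage.
    dataset = []
    for stage in stages:
        dataset += mine(agents, stage)      # harvest intent-trajectory pairs
        agents = retrain(agents, dataset)   # fine-tune on everything so far
    return agents, dataset
```

The key design choice mirrored here is that each stage's data is *accumulated*, so later retraining passes see the full curriculum rather than only the newest tier.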
5. Experimental Evaluation and Results
Extensive evaluation was conducted on a dedicated VM cluster executing Android API-36 emulators, with the system comprising an MCTS controller, multi-agent layer, execution handler, and environment interface.
Dataset statistics:
- 20K screenshots, 2.6K mined trajectories, mean length 7.8.
- Annotation cost: \$466 total (\$0.02/screenshot), an 18× reduction over manual techniques.
- Human-reviewed Data Quality Accuracy (DQA): 71% (n=100 sample).
Benchmarks:
- AndroidControl (AC-Low, AC-High), AITZ, GUI-Odyssey, CAGUI.
Metrics:
- TP (action-type accuracy)
- SR (step success rate: action type and parameter match)
- MSR (mining success ratio)
- DQA (trajectory correctness)
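Under these definitions, per-step TP and SR can be computed as below. The tuple layout of `steps` is an assumption made for illustration.

```python
def step_metrics(steps):
    # TP counts a step correct when the predicted action *type* matches;
    # SR additionally requires the action parameters to match.
    # steps: list of (pred_type, pred_params, gold_type, gold_params).
    tp_hits = sr_hits = 0
    for pred_type, pred_params, gold_type, gold_params in steps:
        if pred_type == gold_type:
            tp_hits += 1
            if pred_params == gold_params:
                sr_hits += 1
    n = len(steps)
    return 100 * tp_hits / n, 100 * sr_hits / n   # (TP %, SR %)
```

By construction SR ≤ TP, which matches the benchmark table below (e.g. 97.5 vs 93.5 on AC-Low).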
| Benchmark | TP (%) | SR (%) | Note |
|---|---|---|---|
| AC-Low | 97.5 | 93.5 | SOTA |
| AC-High | 81.8 | 72.9 | Best SR |
| AITZ | 81.3 | 69.4 | SOTA |
| GUI-Odyssey | 90.5 | 79.3 | SOTA |
| CAGUI (zero-shot) | 88.8 | 70.2 | +15 pp over public fine-tunes |
M²-Miner outperforms both automated-mining baselines (OS-Genesis-7B, GUI-Owl-7B) and private-data approaches (UI-TARS-7B) (Lv et al., 5 Feb 2026).
Ablation studies showed that integrating OrchestraAgent and JudgeAgent with vanilla MCTS+InferAgent provides an exponential speedup (×64 at depth 9). Inclusion of semantic and preference data in trajectories increases TP/SR modestly, and expanding the training data mix with auto-mined samples provides an additional TP boost (+3.9 pp) and SR gain (+5.8 pp).
6. Implementation Details
Core MCTS Formulation
- States, S: All possible GUI screenshots.
- Actions, A(s): Discrete GUI operations suitable for Android environments.
- Transition: Deterministic, based on ADB instrumentation.
- Reward:
  - Terminal: 1 if success, 0 if failure.
  - Intermediate: process reward from JudgeAgent.
- UCT Selection: $UCT(v) = \frac{Q(v)}{N(v)} + c\sqrt{\frac{\ln N(\mathrm{parent}(v))}{N(v)}}$, with exploration constant $c$ tunable.
- Phases:
  - Selection: standard UCT.
  - Expansion: replaces random action proposal with the agent ensemble.
  - Simulation: replaced by immediate reward assignment from JudgeAgent.
  - Backpropagation: standard, using JudgeAgent-derived rewards.
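For concreteness, the UCT selection rule with OrchestraAgent-seeded statistics might be implemented as in the sketch below; the child dictionary layout and seeded `Q`/`N` fields are assumptions for illustration.

```python
import math

def uct(q, n, parent_n, c=1.414):
    # Standard UCT: exploitation term (mean reward) plus exploration bonus,
    # with c the tunable exploration constant.
    if n == 0:
        return float("inf")       # unvisited children are tried first
    return q / n + c * math.sqrt(math.log(parent_n) / n)

def select_child(children, parent_n, c=1.414):
    # OrchestraAgent's tournament ranking enters the search through the
    # elevated initial Q/N values each high-priority child is seeded with.
    return max(children, key=lambda ch: uct(ch["Q"], ch["N"], parent_n, c))
```

Seeding a high-priority child with a larger initial `Q`/`N` raises its early UCT score, which is how the ranking focuses the exploration budget without altering the selection formula itself.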
Infrastructure
- Android VM clusters, orchestrated by the multi-agent MCTS controller.
- API-36 emulator via ADB as a deterministic transition oracle.
- Integration of Vision-LMMs (Qwen2.5-VL-7B/72B), MLLM-based action equivalence/ranking, and model-based path judgment.
7. Significance, Limitations, and Future Directions
M²-Miner establishes a data-mining paradigm that leverages MCTS and collaborative agent interaction for low-cost, high-diversity, high-quality dataset generation in mobile GUI domains. The framework consistently yields state-of-the-art GUI agents on canonical benchmarks, outperforming automated and private-data mining baselines. Cost-effectiveness is demonstrated by the reduction of per-screenshot cost from $\sim\$0.36$ (manual annotation) to $\$0.02$ (M²-Miner), and the released M2-Miner-Agent dataset and codebase are positioned to support further community development (Lv et al., 5 Feb 2026).
Anticipated extensions include adaptation to desktop and web-based GUIs with dynamic layouts, integration of multimodal (voice/gesture) feedback for enriched intent annotation, and the adoption of continuous learning schemes powered by live user logs to mine evolving interaction patterns. A plausible implication is that the agent-guided, MCTS-centered approach will generalize to broader HCI domains as model-centric mining methodologies mature.