M²-Miner: Automated Mobile GUI Data Mining
- The paper introduces a fully automated multi-agent MCTS framework that leverages Vision-LMMs for intelligent GUI agent training.
- The paper employs novel intent recycling and process-based reward assignment to improve data quality and diversity while reducing annotation costs.
- The paper validates its approach on mobile benchmarks, achieving state-of-the-art performance with exponential search speedup and significant cost reduction.
M²-Miner is a fully automated mobile GUI agent data-mining framework that leverages a collaborative multi-agent formulation of Monte Carlo Tree Search (MCTS) to harvest large-scale, richly annotated intent-trajectory pairs for intelligent GUI agent training. By integrating Vision-LMM-guided expansion, multistage ranking, process-based reward assignment, and novel intent recycling, M²-Miner addresses persistent limitations of manual and conventional mining techniques, namely annotation cost, data quality, and intent diversity, enabling state-of-the-art GUI agents across standard mobile interaction benchmarks (Lv et al., 5 Feb 2026).
1. Motivation and Core Challenges
Mobile GUI agent training conventionally depends on extensive annotation of user-behavior trajectories, typically represented as intent-trajectory pairs that map natural-language goals to executable interaction traces. Manual annotation suffers from prohibitive cost, limited coverage, and inconsistent data quality, while existing automated mining solutions have traded off construction expense, data quality (spurious or infeasible trajectories), and intent diversity against one another rather than addressing all three. M²-Miner introduces a fully automated, multi-agent pipeline that systematically mitigates these issues via an MCTS-based data-mining paradigm with agent-based guidance, process-driven evaluation, and curriculum-style retraining.
2. System Architecture: Multi-Agent Collaborative MCTS
The M²-Miner pipeline instantiates three specialized agents in an MCTS control loop, embedding their computation at critical phases for search guidance, prioritization, and reward assignment:
- InferAgent governs the expansion phase, proposing candidate GUI operations per state-intent pair using an ensemble of two Vision-LMMs (Qwen2.5-VL-7B and Qwen2.5-VL-72B). The first model generates a “seed” action; conditioned on this, the second yields diverse alternatives. This agent replaces random expansion with intent-aware, contextually grounded suggestions, significantly increasing the probability of sampling correct paths.
- OrchestraAgent merges duplicate or semantically equivalent candidate actions (using MLLM-based YES/NO queries), followed by tournament-style ranking to prioritize operations relative to the specified intent. Higher-priority actions receive elevated initial Q/N values in the UCT (Upper Confidence Bound for Trees) calculation, focusing exploration budgets on high-yield branches.
- JudgeAgent replaces full-path rollout with immediate, process-based evaluation at each expansion. It assigns fine-grained rewards to child states using a dual-model architecture: the Outcome Reward Model detects terminality (success, in-progress, impossible), while the Process Reward Model computes normalized scores (via softmax of validity logits) to reflect action plausibility. Rewards are backpropagated as in canonical MCTS, enabling fast convergence to promising trajectories.
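JudgeAgent's dual-model scoring can be sketched as follows. This is a minimal illustration, not the paper's implementation: the logit-pair shape, the outcome label names, and the `evaluate` signature are all assumptions.

```python
import math

def process_reward(valid_logit, invalid_logit):
    # Softmax over a hypothetical (valid, invalid) logit pair -> the
    # probability the action is valid, used as the normalized process reward.
    m = max(valid_logit, invalid_logit)          # stabilize the exponentials
    e_valid = math.exp(valid_logit - m)
    e_invalid = math.exp(invalid_logit - m)
    return e_valid / (e_valid + e_invalid)

def evaluate(outcome_label, valid_logit, invalid_logit):
    # The Outcome Reward Model's label decides terminality
    # (success / impossible end the branch); the Process Reward Model's
    # softmax score rewards plausible in-progress actions.
    terminal = outcome_label in ("success", "impossible")
    if outcome_label == "success":
        return 1.0, terminal
    if outcome_label == "impossible":
        return 0.0, terminal
    return process_reward(valid_logit, invalid_logit), terminal
```

The returned reward is then backpropagated up the tree exactly as in canonical MCTS.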
The high-level pseudocode for M²-Miner's loop is as follows (segment):

```
function M2Miner(I₀, s₀):
    T ← initialize tree with root (s₀, I₀)
    while budget remains and no success path found do
        v ← SELECT(T)
        Aₖ ← InferAgent.generate(v)
        A_sorted ← OrchestraAgent.merge_and_rank(Aₖ)
        for a in A_sorted do
            s′ ← execute_action(v.s, a)
            (reward, terminal) ← JudgeAgent.evaluate(s′)
            add child node v′ = (s′, …) to v
            BACKPROPAGATE(v′, reward)
            if terminal == success then break
        end for
    end while
    return successful trajectory if found
```
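A minimal runnable translation of this loop might look like the following sketch. The three agents are reduced to plain callables, selection is a crude greedy stand-in rather than UCT, and every stub signature is an assumption made for illustration.

```python
def m2miner_loop(root_state, intent, infer, orchestra, judge, execute, budget=50):
    # Each node records state, parent link, incoming action, and Q/N stats.
    tree = [{"state": root_state, "parent": None, "action": None,
             "Q": 0.0, "N": 0}]
    for _ in range(budget):
        # Crude stand-in for SELECT: best average reward so far
        # (the real pipeline uses UCT with agent-seeded priors).
        node = max(tree, key=lambda v: v["Q"] / (v["N"] + 1))
        # InferAgent proposes candidates; OrchestraAgent merges/ranks them.
        for action in orchestra(infer(node["state"], intent), intent):
            child_state = execute(node["state"], action)
            reward, success = judge(child_state, intent)   # process reward
            child = {"state": child_state, "parent": node, "action": action,
                     "Q": reward, "N": 1}
            tree.append(child)
            v = node
            while v is not None:                 # backpropagate the reward
                v["Q"] += reward
                v["N"] += 1
                v = v["parent"]
            if success:                          # reconstruct the trajectory
                path = []
                while child["action"] is not None:
                    path.append(child["action"])
                    child = child["parent"]
                return path[::-1]
    return None
```

With integer states and toy callables, the loop returns the root-to-leaf action sequence of the first branch the judge accepts, or `None` when the budget is exhausted.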
3. Intent Recycling and Data Richness Enhancement
Conventional approaches mine a single trajectory per intent, discarding non-primary paths, which limits data richness. M²-Miner introduces an intent recycling strategy engineered to maximize coverage:
- After MCTS completion, all root-to-node paths are enumerated.
- A custom Recycling Filter, implemented as an MLLM prompt, discards redundant or illogical paths.
- For each retained path, an MLLM generates a descriptive natural-language intent.
- JudgeAgent verifies the intent’s alignment with the path’s terminal state.
- Accepted intent-trajectory pairs augment the dataset without incurring additional exploration cost.
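The enumeration step above can be sketched directly. The `Node` structure here is a hypothetical stand-in for the search tree's nodes, not the paper's data model.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    # Hypothetical tree node: the action that produced this state,
    # plus the child nodes expanded beneath it.
    action: str
    children: list = field(default_factory=list)

def enumerate_paths(root):
    # Depth-first walk collecting every root-to-node path; each path is a
    # candidate trajectory to be screened by the Recycling Filter.
    paths = []
    def dfs(node, prefix):
        path = prefix + [node]
        if len(path) > 1:            # skip the trivial root-only "path"
            paths.append(path)
        for child in node.children:
            dfs(child, path)
    dfs(root, [])
    return paths
```

Each enumerated path would then be filtered, re-described by an MLLM, and verified by JudgeAgent before entering the dataset.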
This strategy has been observed to yield a 2–3× increase in unique, high-quality intent-trajectory pairs, as quantified by trajectory counts and illustrated by t-SNE visualizations of intent diversity (Lv et al., 5 Feb 2026).
4. Progressive Model-in-the-Loop Retraining
M²-Miner employs a curriculum learning regime that alternates between dataset mining and agent retraining, progressing through a warm-up phase and three sequential mining stages:
- Warm-up: Agents (InferAgent, JudgeAgent) are initialized on public datasets (AndroidControl, AITZ, GUI-Odyssey, AMEX), achieving basic GUI understanding.
- Basic Intents: Generation and collection of simple home-screen tasks, retraining agents on aggregated data.
- Complex Intents: Mining of multi-condition, longer-horizon tasks, followed by retraining.
- Recycled Intents: Application of the intent recycling strategy to maximize data diversity.
Each curriculum phase raises both the Mining Success Ratio (MSR) and classifier AUC (JudgeAgent), with MSR increasing from ∼30% post-warm-up to >75% after full training. A plausible implication is that curriculum-based agent retraining is essential for practical scaling of automated mining pipelines.
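The alternation between mining and retraining described above can be expressed as a simple loop. `mine` and `retrain` are hypothetical callables standing in for the MCTS pipeline and the fine-tuning step; this is a structural sketch, not the paper's training code.

```python
def curriculum_pipeline(agents, stages, mine, retrain):
    # After each stage's mining pass, the agents are retrained on the
    # aggregated dataset before attempting the harder next stage.
    dataset = []
    for stage in stages:
        dataset += mine(agents, stage)      # harvest intent-trajectory pairs
        agents = retrain(agents, dataset)   # fine-tune on everything so far
    return agents, dataset
```

The key design choice mirrored here is that each stage's data is *accumulated*, so later retraining passes see the full curriculum rather than only the newest tier.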
5. Experimental Evaluation and Results
Extensive evaluation was conducted on a dedicated VM cluster executing Android API-36 emulators, with the system comprising an MCTS controller, multi-agent layer, execution handler, and environment interface.
Dataset statistics:
- 20K screenshots, 2.6K mined trajectories, mean length 7.8.
- Annotation cost: \$466 total (\$0.02/screenshot), an 18× reduction over manual techniques.
- Human-reviewed Data Quality Accuracy (DQA): 71% (n=100 sample).
Benchmarks:
- AndroidControl (AC-Low, AC-High), AITZ, GUI-Odyssey, CAGUI.
Metrics:
- TP (action-type accuracy)
- SR (step success rate: action type and parameter match)
- MSR (mining success ratio)
- DQA (trajectory correctness)
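Under these definitions, per-step TP and SR can be computed as below. The tuple layout of `steps` is an assumption made for illustration.

```python
def step_metrics(steps):
    # TP counts a step correct when the predicted action *type* matches;
    # SR additionally requires the action parameters to match.
    # steps: list of (pred_type, pred_params, gold_type, gold_params).
    tp_hits = sr_hits = 0
    for pred_type, pred_params, gold_type, gold_params in steps:
        if pred_type == gold_type:
            tp_hits += 1
            if pred_params == gold_params:
                sr_hits += 1
    n = len(steps)
    return 100 * tp_hits / n, 100 * sr_hits / n   # (TP %, SR %)
```

By construction SR ≤ TP, which matches the benchmark table below (e.g. 97.5 vs 93.5 on AC-Low).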
| Benchmark | TP (%) | SR (%) | Note |
|---|---|---|---|
| AC-Low | 97.5 | 93.5 | SOTA |
| AC-High | 81.8 | 72.9 | Best SR |
| AITZ | 81.3 | 69.4 | SOTA |
| GUI-Odyssey | 90.5 | 79.3 | SOTA |
| CAGUI (zero-shot) | 88.8 | 70.2 | +15 pp over public fine-tunes |
M²-Miner outperforms both automated-mining baselines (OS-Genesis-7B, GUI-Owl-7B) and private-data approaches (UI-TARS-7B) (Lv et al., 5 Feb 2026).
Ablation studies showed that integrating OrchestraAgent and JudgeAgent with vanilla MCTS+InferAgent provides an exponential speedup (×64 at depth 9). Inclusion of semantic and preference data in trajectories increases TP/SR modestly, and expanding the training data mix with auto-mined samples provides an additional TP boost (+3.9 pp) and SR gain (+5.8 pp).
6. Implementation Details
Core MCTS Formulation
- States, S: All possible GUI screenshots.
- Actions, A(s): Discrete GUI operations suitable for Android environments.
- Transition: Deterministic, based on ADB instrumentation.
- Reward:
  - Terminal: 1 if success, 0 if failure.
  - Intermediate: process reward from JudgeAgent.
- UCT Selection: $UCT(v) = \frac{Q(v)}{N(v)} + c\sqrt{\frac{\ln N(\mathrm{parent}(v))}{N(v)}}$, with exploration constant $c$ tunable.
- Phases:
  - Selection: standard UCT.
  - Expansion: replaces random action proposal with the agent ensemble.
  - Simulation: replaced by immediate reward assignment from JudgeAgent.
  - Backpropagation: standard, using JudgeAgent-derived rewards.
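For concreteness, the UCT selection rule with OrchestraAgent-seeded statistics might be implemented as in the sketch below; the child dictionary layout and seeded `Q`/`N` fields are assumptions for illustration.

```python
import math

def uct(q, n, parent_n, c=1.414):
    # Standard UCT: exploitation term (mean reward) plus exploration bonus,
    # with c the tunable exploration constant.
    if n == 0:
        return float("inf")       # unvisited children are tried first
    return q / n + c * math.sqrt(math.log(parent_n) / n)

def select_child(children, parent_n, c=1.414):
    # OrchestraAgent's tournament ranking enters the search through the
    # elevated initial Q/N values each high-priority child is seeded with.
    return max(children, key=lambda ch: uct(ch["Q"], ch["N"], parent_n, c))
```

Seeding a high-priority child with a larger initial `Q`/`N` raises its early UCT score, which is how the ranking focuses the exploration budget without altering the selection formula itself.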
Infrastructure
- Android VM clusters, orchestrated by the multi-agent MCTS controller.
- API-36 emulator via ADB as a deterministic transition oracle.
- Integration of Vision-LMMs (Qwen2.5-VL-7B/72B), MLLM-based action equivalence/ranking, and model-based path judgment.
7. Significance, Limitations, and Future Directions
M²-Miner establishes a data-mining paradigm that leverages MCTS and collaborative agent interaction for low-cost, high-diversity, high-quality dataset generation in mobile GUI domains. The framework consistently yields state-of-the-art GUI agents on canonical benchmarks, outperforming automated and private-data mining baselines. Cost-effectiveness is demonstrated by the reduction of per-screenshot cost from $\sim\$0.36$ (manual annotation) to $\$0.02$ (M²-Miner), and the released M2-Miner-Agent dataset and codebase are positioned to support further community development (Lv et al., 5 Feb 2026).
Anticipated extensions include adaptation to desktop and web-based GUIs with dynamic layouts, integration of multimodal (voice/gesture) feedback for enriched intent annotation, and the adoption of continuous learning schemes powered by live user logs to mine evolving interaction patterns. A plausible implication is that the agent-guided, MCTS-centered approach will generalize to broader HCI domains as model-centric mining methodologies mature.