Tool Benefit Score in Multimodal Models
- Tool Benefit Score is a metric that quantifies the net benefit of vision-tool interactions by measuring the accuracy difference between tool-enabled and text-only reasoning.
- It integrates with reinforcement learning, particularly in AdaTooler-V, to modulate rewards and adaptively encourage effective tool use while mitigating unnecessary calls.
- Empirical results demonstrate that TBS improves visual reasoning performance and inference efficiency by aligning tool invocation with task-specific accuracy gains.
The Tool Benefit Score (TBS) is a per-instance metric employed to quantify the net value of tool use—specifically, vision-tool interactions—in multimodal LLMs (MLLMs) engaged in complex visual reasoning tasks. It is formally defined for each problem instance as the difference in expected accuracy between a reference model allowed vision-tool calls during chain-of-thought (CoT) reasoning and the same model restricted to textual reasoning alone. TBS serves as a foundational component in the training and operation of AdaTooler-V, where it both diagnoses when tool invocations help or hinder performance and modulates reinforcement learning objectives to optimize tool-use adaptively (Wang et al., 18 Dec 2025).
1. Formal Definition and Computation
For each query $q$, the Tool Benefit Score is defined as:

$$\mathrm{TBS}(q) = \mathrm{Acc}_{\mathrm{tool}}(q) - \mathrm{Acc}_{\mathrm{text}}(q),$$

where
- $\mathrm{Acc}_{\mathrm{tool}}(q)$ is the average accuracy of the reference model (Qwen2.5-VL-72B-Instruct) on $q$ with interleaved vision tool invocations,
- $\mathrm{Acc}_{\mathrm{text}}(q)$ is the corresponding accuracy with vision tools disabled, i.e., pure text-based CoT.
Empirically,

$$\mathrm{Acc}_{*}(q) = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}[\text{run } i \text{ correct}],$$

where $\mathbb{1}[\cdot]$ is an indicator of correctness for the $i$-th run. $\mathrm{TBS}(q)$ thus resides in $[-1, 1]$, with no further clipping or normalization. TBS is precomputed for each RL training sample via $N = 8$ runs in both tool-enabled and tool-disabled regimes to minimize sampling noise.
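As a minimal sketch of this precomputation (function and variable names here are illustrative, not from the paper; per-run correctness is assumed to be recorded as binary labels):

```python
def tool_benefit_score(tool_correct, text_correct):
    """Per-query Tool Benefit Score: mean accuracy over runs with
    vision tools enabled, minus mean accuracy with text-only CoT."""
    acc_tool = sum(tool_correct) / len(tool_correct)
    acc_text = sum(text_correct) / len(text_correct)
    return acc_tool - acc_text  # lies in [-1, 1] by construction

# Example: 8 runs per regime, as in the precomputation described above.
tbs = tool_benefit_score(
    tool_correct=[1, 1, 1, 0, 1, 1, 0, 1],  # 6/8 correct with tools
    text_correct=[1, 0, 0, 1, 0, 1, 0, 0],  # 3/8 correct text-only
)
print(tbs)  # 0.375
```

A positive value here would flag the query as one where tool calls should be rewarded during RL; a negative value would flag the opposite.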
2. Theoretical Motivation and Significance
TBS directly addresses the tendency of previous MLLMs to invoke vision tools indiscriminately, which can result in:
- Increased inference costs,
- Overly complex or suboptimal reasoning chains,
- A diversion from salient visual inputs.
$\mathrm{TBS}(q)$ acts as a problem-dependent indicator: $\mathrm{TBS}(q) > 0$ signals that vision tool use yields net gains, while $\mathrm{TBS}(q) < 0$ indicates harm from such invocations. Unlike static or global tool-use policies, TBS supports per-instance assessment, enabling reinforcement learners to reward tool usage only when it is historically associated with accuracy improvements and to penalize it when detrimental. This approach directly counteracts "blind" tool-use patterns and aligns the reward structure with task-specific needs (Wang et al., 18 Dec 2025).
3. Integration into AT-GRPO Reinforcement Learning
Within AdaTooler-V’s RL framework, the AT-GRPO algorithm incorporates TBS in its reward function. For trajectory $\tau$ on query $q$, the reward is composed as:

$$R(\tau) = R_{\mathrm{base}}(\tau) + \lambda \cdot R_{\mathrm{tool}}(\tau),$$

where
- $R_{\mathrm{base}}(\tau)$ is the base reward (correctness, formatting),
- $R_{\mathrm{tool}}(\tau)$ is a tool-use reward modulated by $\mathrm{TBS}(q)$ and subject to a Gaussian decay in the number of tool calls,
- $n$ is the number of tool calls, $N_{\max}$ is the allowed maximum (context-budgeted), and $\alpha$ (set to $2$) governs decay sensitivity. $\lambda$ trades off tool and base reward, with ablation confirming optimal performance at $\lambda = 0.6$ on key benchmarks.
Within minibatches, returns are normalized group-relatively:

$$\hat{A}_i = \frac{R(\tau_i) - \operatorname{mean}\bigl(\{R(\tau_j)\}_{j=1}^{G}\bigr)}{\operatorname{std}\bigl(\{R(\tau_j)\}_{j=1}^{G}\bigr)},$$

and policy updates follow clipped GRPO with a KL penalty. This integration ensures that tool use is modulated not only by its marginal benefit but also by its efficiency with respect to the tool budget.
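The reward composition and normalization can be sketched as follows. This is a toy illustration under stated assumptions, not the paper’s implementation: the exact functional form of $R_{\mathrm{tool}}$ is not reproduced here, so the sketch assumes it scales with the precomputed TBS and decays as $\exp(-(n/N_{\max})^{\alpha})$; all names and default values (`n_max=5`, `lam=0.6`) are illustrative.

```python
import math

def tool_reward(tbs, n_calls, n_max=5, alpha=2):
    """Assumed form of the tool-use reward: scaled by the precomputed
    TBS and damped by a Gaussian decay in the number of tool calls."""
    return tbs * math.exp(-((n_calls / n_max) ** alpha))

def at_grpo_reward(r_base, tbs, n_calls, lam=0.6, n_max=5):
    """Composite reward R(tau) = R_base(tau) + lambda * R_tool(tau),
    with lambda = 0.6 as the ablation-selected default."""
    return r_base + lam * tool_reward(tbs, n_calls, n_max=n_max)

def normalize_returns(rewards):
    """Group-relative return normalization within a minibatch (GRPO-style):
    subtract the group mean, divide by the group standard deviation."""
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    return [(r - mean) / (std if std > 0 else 1.0) for r in rewards]

# For a query with positive TBS, a few well-placed tool calls earn a
# larger bonus than an excessive number of calls:
group = [at_grpo_reward(1.0, tbs=0.3, n_calls=n) for n in (1, 3, 8)]
advantages = normalize_returns(group)
```

Note the sign behavior: for $\mathrm{TBS}(q) < 0$ the bonus becomes a penalty on every tool call, which is what drives the adaptive suppression of tool use described below.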
4. Hyperparameter Choices and Tuning
The principal hyperparameters for TBS-guided RL are:
- $\alpha = 2$ for the Gaussian decay,
- $\lambda$, explored over several settings and showing peak performance at $0.6$,
- $N_{\max}$, typically set between 4–6, determined by context length or task constraints.
Performance was robust to $N_{\max}$ provided it exceeded typical tool demand, and the TBS mechanism’s efficacy persisted across a range of $\lambda$ values.
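To make the role of $\alpha$ and $N_{\max}$ concrete, the short sketch below tabulates the decay factor on the tool reward as the call count approaches and exceeds budget. It assumes the Gaussian form $\exp(-(n/N_{\max})^{\alpha})$ discussed above; the function name and the printed table are illustrative only.

```python
import math

def gaussian_decay(n_calls, n_max, alpha=2):
    """Decay factor on the tool-use reward; alpha = 2 gives a Gaussian
    profile that is gentle within budget and steep beyond it."""
    return math.exp(-((n_calls / n_max) ** alpha))

# With N_max = 5: nearly full credit for a few calls, sharp falloff past budget.
for n in range(8):
    print(f"n={n}: decay={gaussian_decay(n, n_max=5):.3f}")
```

This also suggests why performance is insensitive to $N_{\max}$ once it exceeds typical demand: within budget the factor stays near 1, so only over-budget trajectories are meaningfully penalized.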
5. Empirical Outcomes and Analysis
In empirical evaluation:
- SFT+AT-GRPO (with TBS) outperformed SFT+GRPO and vanilla GRPO, yielding improvements of +2.2 pp on V*, +2.4 pp average, and +4 pp over standard GRPO.
- Complete tool-use ablation degraded accuracy on V* from 89.8% to 84.4% and on VSI-Bench from 46.7% to 39.9%, confirming the necessity of selective tool invocation.
- Training dynamics (Figure 1(b)) exhibited a reduction in response lengths, indicating that the model adaptively suppressed tool use when $\mathrm{TBS}(q) < 0$.
This suggests TBS is instrumental both for accuracy maximization and inference cost reduction through avoidance of wasteful tool calls.
6. Qualitative Examples of TBS Guidance
Concrete examples illustrate TBS’s modulatory effect:
- High-resolution image tasks (V*): Large positive $\mathrm{TBS}$ (≈ +0.2–0.4) led to multiple tool invocations, as AdaTooler-V exploited cropping/zooming to recover fine details missed by baselines.
- Multi-image spatial puzzles: Negative $\mathrm{TBS}$ (≈ −0.1) penalized tool use, reducing invocations to zero and improving efficiency with no loss in accuracy (sometimes a net gain).
- Video causal reasoning (Video-Holmes): Moderately positive $\mathrm{TBS}$ (≈ +0.15) encouraged strategic tool use early in the CoT, bolstering accuracy by facilitating frame-level analysis.
These case studies indicate that TBS-driven policies adapt tool frequency and timing to the actual utility displayed in historical completions.
7. Significance and Broader Implications
The Tool Benefit Score paradigm presents a scalable, generalizable mechanism for per-query tool-use selection in multimodal reasoning. By weaving $\mathrm{TBS}$ into AT-GRPO’s reward, AdaTooler-V achieves a dual optimization of reasoning quality and inference efficiency, sidestepping both underuse and overuse of vision tools. A plausible implication is that such adaptive reward scaling could be extended to other MLLM tool-use scenarios beyond vision (e.g., audio, structured data), provided domain-specific analogues of TBS are devised.
TBS thus enables a principled, data-driven approach to selective tool invocation, providing both theoretical foundation and empirical validation for more efficient, higher-performing multimodal LLMs (Wang et al., 18 Dec 2025).