Tool Benefit Score in Multimodal Models

Updated 3 February 2026
  • Tool Benefit Score is a metric that quantifies the net benefit of vision-tool interactions by measuring the accuracy difference between tool-enabled and text-only reasoning.
  • It integrates with reinforcement learning, particularly in AdaTooler-V, to modulate rewards and adaptively encourage effective tool use while mitigating unnecessary calls.
  • Empirical results demonstrate that TBS improves visual reasoning performance and inference efficiency by aligning tool invocation with task-specific accuracy gains.

The Tool Benefit Score (TBS) is a per-instance metric employed to quantify the net value of tool use—specifically, vision-tool interactions—in multimodal LLMs (MLLMs) engaged in complex visual reasoning tasks. It is formally defined for each problem instance as the difference in expected accuracy between a reference model allowed vision-tool calls during chain-of-thought (CoT) reasoning and the same model restricted to textual reasoning alone. TBS serves as a foundational component in the training and operation of AdaTooler-V, where it both diagnoses when tool invocations help or hinder performance and modulates reinforcement learning objectives to optimize tool-use adaptively (Wang et al., 18 Dec 2025).

1. Formal Definition and Computation

For each query $q_i$, the Tool Benefit Score $\Delta S_i$ is defined as:

$$\Delta S_i = S^{+}(q_i) - S^{-}(q_i)$$

where

  • $S^{+}(q_i)$ is the average accuracy of the reference model (Qwen2.5-VL-72B-Instruct) on $q_i$ with interleaved vision-tool invocations,
  • $S^{-}(q_i)$ is the corresponding accuracy with vision tools disabled, i.e., pure text-based CoT.

Empirically,

$$S^{+}(q_i)=\frac{1}{8}\sum_{r=1}^{8}\mathbf{1}\{\text{correct}_r^{+}\},\quad S^{-}(q_i)=\frac{1}{8}\sum_{r=1}^{8}\mathbf{1}\{\text{correct}_r^{-}\}$$

where $\mathbf{1}\{\cdot\}$ is an indicator of correctness for the $r$-th run. $\Delta S_i$ thus lies in $[-1, 1]$, with no further clipping or normalization. TBS is precomputed for each RL training sample via 8 runs in each regime (tool-enabled and tool-disabled) to minimize sampling noise.
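Under these definitions, the per-instance score can be sketched as follows. The two boolean lists of run-level correctness are hypothetical inputs, standing in for the 8 tool-enabled and 8 text-only reference-model runs:

```python
# Sketch of the per-instance Tool Benefit Score, assuming correctness
# indicators for 8 runs in each regime have already been collected
# (the input lists here are hypothetical).

def tool_benefit_score(correct_with_tools, correct_without_tools):
    """Return Delta S_i = S+(q_i) - S-(q_i) for one query."""
    assert len(correct_with_tools) == len(correct_without_tools) == 8
    s_plus = sum(correct_with_tools) / 8      # tool-enabled accuracy
    s_minus = sum(correct_without_tools) / 8  # text-only CoT accuracy
    return s_plus - s_minus                   # lies in [-1, 1]

# Example: tools helped in 6/8 runs vs. 3/8 runs without tools.
delta_s = tool_benefit_score(
    [1, 1, 1, 0, 1, 1, 1, 0],
    [0, 1, 0, 0, 1, 0, 1, 0],
)
# delta_s = 0.75 - 0.375 = 0.375
```

Averaging over 8 runs per regime is what keeps the score a usable per-instance signal rather than a single noisy sample.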

2. Theoretical Motivation and Significance

TBS directly addresses the tendency of previous MLLMs to invoke vision tools indiscriminately, which can result in:

  • Increased inference costs,
  • Overly complex or suboptimal reasoning chains,
  • A diversion from salient visual inputs.

$\Delta S_i$ acts as a problem-dependent indicator: $\Delta S_i > 0$ signals that vision-tool use yields a net gain, while $\Delta S_i < 0$ indicates harm from such invocations. Unlike static or global tool-use policies, TBS supports per-instance assessment, enabling reinforcement learners to reward tool usage only when historically associated with accuracy improvements and penalize it when detrimental. This approach directly counteracts "blind" tool-use patterns and aligns the reward structure with task-specific needs (Wang et al., 18 Dec 2025).

3. Integration into AT-GRPO Reinforcement Learning

Within AdaTooler-V’s RL framework, the AT-GRPO algorithm incorporates TBS in its reward function. For trajectory ii, the reward is composed as:

$$R_i = R_i^o + \alpha R_i^t$$

where

  • $R_i^o$ is the base reward (correctness, formatting),
  • $R_i^t$ is the tool-use reward:

$$R_i^t = \Delta S_i \exp\left(-\gamma \left(\frac{n_{\text{tool},i}}{n_{\max}}\right)^2\right)$$

Here $n_{\text{tool},i}$ is the number of tool calls, $n_{\max}$ is the allowed maximum (context-budgeted), and $\gamma$ (set to $2$) governs decay sensitivity. $\alpha$ trades off the tool and base rewards; ablations confirm peak performance at $\alpha = 0.6$ on key benchmarks.
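A minimal sketch of this reward composition, assuming `base_reward` and `delta_s` have been computed elsewhere, with $\gamma = 2$ and $\alpha = 0.6$ as reported:

```python
import math

# Sketch of the AT-GRPO reward composition described above.
# base_reward (correctness/formatting) and delta_s (TBS) are
# assumed precomputed; gamma=2 and alpha=0.6 follow the paper.

def tool_reward(delta_s, n_tool, n_max, gamma=2.0):
    """R_i^t = Delta S_i * exp(-gamma * (n_tool / n_max)^2)."""
    return delta_s * math.exp(-gamma * (n_tool / n_max) ** 2)

def total_reward(base_reward, delta_s, n_tool, n_max,
                 alpha=0.6, gamma=2.0):
    """R_i = R_i^o + alpha * R_i^t."""
    return base_reward + alpha * tool_reward(delta_s, n_tool, n_max, gamma)

# With a positive TBS, tool calls earn a bonus, but the Gaussian
# decay shrinks it as usage approaches the budget n_max.
r_few = total_reward(1.0, 0.4, n_tool=1, n_max=4)
r_many = total_reward(1.0, 0.4, n_tool=4, n_max=4)
# r_few > r_many: heavy tool use is discounted even when TBS > 0.
```

Note the sign behavior: a negative $\Delta S_i$ makes every tool call reduce the reward, so the decay term only attenuates, never flips, the benefit signal.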

Within minibatches, returns are normalized:

$$A_i=\frac{R_i-\mu_R}{\sigma_R}$$

and policy updates follow clipped GRPO with a KL penalty. This integration ensures that tool use is modulated not only by its marginal benefit but also by its efficiency relative to the tool budget.
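The normalization step itself is a standard group-wise standardization; a minimal sketch, assuming the group's rewards are not all identical (so $\sigma_R > 0$):

```python
# Minimal sketch of minibatch return normalization in AT-GRPO,
# assuming the rewards in a group are not all identical.

def normalized_advantages(rewards):
    """A_i = (R_i - mu_R) / sigma_R over one minibatch/group."""
    n = len(rewards)
    mu = sum(rewards) / n
    sigma = (sum((r - mu) ** 2 for r in rewards) / n) ** 0.5
    return [(r - mu) / sigma for r in rewards]

# Rewards within a group are centered and scaled before the
# clipped GRPO policy update.
advantages = normalized_advantages([0.9, 1.2, 0.4, 1.5])
```

After normalization the advantages have zero mean and unit variance within the group, so the TBS-shaped reward differences, not their absolute scale, drive the update.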

4. Hyperparameter Choices and Tuning

The principal hyperparameters for TBS-guided RL are:

  • $\gamma = 2$ for the Gaussian decay,
  • $\alpha$ explored in $\{0.2, 0.4, 0.6, 0.8\}$, with peak performance at $0.6$,
  • $n_{\max}$ typically set between 4 and 6, determined by context length or task constraints.

Performance was robust to $n_{\max}$ provided it exceeded typical tool demand. The TBS mechanism's efficacy persisted across a range of $\alpha$ values, with stable gains for $\alpha \in [0.4, 0.8]$.

5. Empirical Outcomes and Analysis

In empirical evaluation:

  • SFT+AT-GRPO (with TBS) outperformed SFT+GRPO and vanilla GRPO, yielding improvements of +2.2 pp on V*, +2.4 pp on average, and +4 pp over standard GRPO.
  • Complete tool-use ablation degraded accuracy on V* from 89.8% to 84.4% and on VSI-Bench from 46.7% to 39.9%, confirming the necessity of selective tool invocation.
  • Training dynamics (Figure 1(b)) exhibited a reduction in response lengths, indicating that the model adaptively suppressed tool use when $\Delta S_i < 0$.

This suggests TBS is instrumental both for accuracy maximization and inference cost reduction through avoidance of wasteful tool calls.

6. Qualitative Examples of TBS Guidance

Concrete examples illustrate TBS’s modulatory effect:

  • High-resolution image tasks (V*): a large positive $\Delta S$ ($\approx$ +0.2 to +0.4) led to multiple tool invocations, as AdaTooler-V exploited cropping/zooming to recover fine details missed by baselines.
  • Multi-image spatial puzzles: a negative $\Delta S$ ($\approx$ -0.1) penalized tool use, reducing invocations to zero and improving efficiency with no loss in accuracy (and sometimes a net gain).
  • Video causal reasoning (Video-Holmes): a moderately positive $\Delta S$ ($\approx$ +0.15) encouraged strategic tool use early in the CoT, bolstering accuracy by facilitating frame-level analysis.

These case studies indicate that TBS-driven policies adapt tool frequency and timing to the actual utility displayed in historical completions.

7. Significance and Broader Implications

The Tool Benefit Score paradigm presents a scalable, generalizable mechanism for per-query tool-use selection in multimodal reasoning. By weaving $\Delta S$ into AT-GRPO's reward, AdaTooler-V achieves a dual optimization of reasoning quality and inference efficiency, sidestepping both underuse and overuse of vision tools. A plausible implication is that such adaptive reward scaling could be extended to other MLLM tool-use scenarios beyond vision (e.g., audio, structured data), provided domain-specific analogues of TBS are devised.

TBS thus enables a principled, data-driven approach to selective tool invocation, providing both theoretical foundation and empirical validation for more efficient, higher-performing multimodal LLMs (Wang et al., 18 Dec 2025).
