Performance of BAPO on larger-scale LLMs

Determine how Boundary-Aware Policy Optimization (BAPO) performs when applied to larger-scale Large Language Models (exceeding 14B parameters) in agentic search settings, including whether its reliability benefits persist at greater model scales.

Background

The paper introduces Boundary-Aware Policy Optimization (BAPO), an RL framework that augments correctness-based rewards with boundary-aware incentives and an adaptive reward modulator to improve the reliability of agentic search models. BAPO is evaluated on multi-hop question answering benchmarks and demonstrates improved precision and overall reliability while maintaining competitive accuracy.
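The reward structure described above can be sketched in code. This is a minimal, hypothetical illustration of combining a correctness-based reward with a boundary-aware incentive scaled by an adaptive modulator; the function name, signal definitions, and reward magnitudes are assumptions for illustration, not the paper's actual implementation.

```python
# Hypothetical sketch of a BAPO-style shaped reward. The exact terms,
# signals, and modulator schedule are assumptions, not the paper's method.

def bapo_reward(correct: bool, answered: bool, within_boundary: bool,
                modulator: float) -> float:
    """Combine a correctness reward with a boundary-aware incentive.

    correct:         whether the model's final answer is right
    answered:        whether the model committed to an answer (vs. abstained)
    within_boundary: whether the query is judged to lie inside the model's
                     knowledge/search boundary (assumed signal)
    modulator:       adaptive weight in [0, 1] scaling the boundary term
    """
    # Correctness-based component: reward right answers, penalize
    # confident wrong answers (the behavior that hurts precision).
    base = 1.0 if correct else 0.0
    if answered and not correct:
        base -= 1.0
    # Boundary-aware incentive: reward answering inside the boundary
    # and abstaining outside it.
    boundary = 0.0
    if within_boundary == answered:
        boundary = 0.5
    return base + modulator * boundary
```

Under this toy scheme, the adaptive modulator lets training interpolate between pure correctness optimization (modulator near 0) and strongly boundary-aware behavior (modulator near 1), which is one plausible reading of how the reliability incentive is balanced against accuracy.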

However, the experiments are constrained to models of up to 14B parameters due to computational resource limits. The authors explicitly note that BAPO's scalability to larger model sizes has not yet been evaluated, leaving open the question of how it performs, and whether its reliability gains hold, at greater scales.

References

"It remains to be seen how the proposed method performs on larger-scale LLMs."

BAPO: Boundary-Aware Policy Optimization for Reliable Agentic Search (2601.11037 - Liu et al., 16 Jan 2026) in Limitations