Robust and calibrated stop/continue criteria across retrievers, corpora, and LLM backbones

Develop stop/continue criteria for retrieval–reasoning procedures in multi-hop question answering that remain calibrated across different retrievers, corpora, and large language model (LLM) backbones, and evaluate these criteria under controlled variations of hop depth and retrieval noise to assess their reliability and transferability.

Background

The survey highlights that most multi-hop QA systems rely on static hop, token, or latency budgets to decide when to stop retrieving, while adaptive stopping based on confidence or sufficiency estimates is evaluated only on a narrow set of benchmarks and is often poorly calibrated under distribution shift. Without robust stop/continue criteria, systems both over-search (retrieving further hops after the evidence is already sufficient, wasting tokens and latency) and under-search (stopping before the evidence chain is complete, yielding unsupported answers), harming reliability and efficiency.
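As a concrete point of reference, the sketch below illustrates the kind of adaptive rule the survey contrasts with static budgets: a confidence-gated stop decision with a hop budget as a fallback. This is a minimal illustration, not a method from the paper; it assumes the backbone can report a per-hop evidence-sufficiency probability, and all names (`should_stop`, `sufficiency_prob`, the 0.85 threshold) are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class StopDecision:
    stop: bool
    confidence: float  # sufficiency estimate at decision time
    reason: str

def should_stop(sufficiency_prob: float,
                hop: int,
                max_hops: int = 6,
                threshold: float = 0.85) -> StopDecision:
    """Adaptive stop/continue rule: stop once the model's estimated
    evidence-sufficiency probability clears a (calibrated) threshold,
    with a hard hop budget as a safety net. The open problem is that a
    threshold tuned for one retriever/corpus/backbone combination
    rarely transfers to another."""
    if sufficiency_prob >= threshold:
        return StopDecision(True, sufficiency_prob, "evidence judged sufficient")
    if hop >= max_hops:
        return StopDecision(True, sufficiency_prob, "hop budget exhausted")
    return StopDecision(False, sufficiency_prob, "continue retrieving")
```

In this framing, over-search corresponds to a threshold set too high for the backbone's confidence distribution (the loop keeps retrieving past sufficiency), while under-search corresponds to overconfident probabilities clearing the threshold too early.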

The authors call for methods whose stop/continue decisions generalize across different retrievers, corpora, and LLM backbones, together with evaluation protocols that deliberately vary hop depth and retrieval noise to stress-test calibration. This problem sits within the broader axis of procedural design choices for retrieval–reasoning agents and directly impacts effectiveness, efficiency, and faithfulness.
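The evaluation protocol the authors call for can be pictured as a stress grid over hop depth and retrieval noise, scoring calibration in each cell. Below is a hedged sketch under assumed interfaces: `run_pipeline` (returning a stop-time confidence and answer correctness per question) and the distractor pool are placeholders invented for illustration, while `expected_calibration_error` is a standard ECE computation.

```python
import random
from typing import Callable, Sequence, Tuple

def inject_retrieval_noise(passages: Sequence[str],
                           distractors: Sequence[str],
                           noise_rate: float,
                           rng: random.Random) -> list:
    """Replace a fraction of retrieved passages with distractors to
    simulate a controlled level of retrieval noise. A caller would
    apply this inside its retrieval step (e.g., within run_pipeline)."""
    return [rng.choice(distractors) if rng.random() < noise_rate else p
            for p in passages]

def expected_calibration_error(confidences: Sequence[float],
                               correct: Sequence[bool],
                               n_bins: int = 10) -> float:
    """ECE over stop-time confidences: a criterion is calibrated if,
    among decisions made at confidence ~p, about a fraction p of the
    final answers are correct."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        bins[min(int(conf * n_bins), n_bins - 1)].append((conf, ok))
    total, ece = len(confidences), 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(ok for _, ok in b) / len(b)
        ece += (len(b) / total) * abs(avg_conf - accuracy)
    return ece

def stress_test(run_pipeline: Callable[[int, float], Tuple[float, bool]],
                questions_per_cell: int = 200,
                hop_depths=(2, 3, 4),
                noise_rates=(0.0, 0.2, 0.4)) -> dict:
    """Sweep a grid of (hop depth, noise rate) cells and report ECE per
    cell; a transferable criterion should stay calibrated across cells
    rather than only at the configuration it was tuned on."""
    results = {}
    for depth in hop_depths:
        for noise in noise_rates:
            runs = [run_pipeline(depth, noise)
                    for _ in range(questions_per_cell)]
            confs = [c for c, _ in runs]
            correct = [ok for _, ok in runs]
            results[(depth, noise)] = expected_calibration_error(confs, correct)
    return results
```

Repeating the same sweep with different retrievers, corpora, and LLM backbones plugged into `run_pipeline` would directly probe the transferability the survey asks for.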

References

"An important open problem is to develop stop/continue criteria that remain calibrated across retrievers, corpora, and LLM backbones, and to evaluate them under controlled variations of hop depth and retrieval noise."

Retrieval–Reasoning Processes for Multi-hop Question Answering: A Four-Axis Design Framework and Empirical Trends (2601.00536 - Ji et al., 2 Jan 2026) in Section RQ4: Open Problems and Future Directions, Challenge 4