- The paper introduces a retrial mechanism that boosts LLM reasoning accuracy by allowing simple reattempts without explicit feedback.
- The study evaluates this strategy against complex approaches like Tree-of-Thoughts and Reflexion on benchmarks such as Game of 24, HumanEval, and HotpotQA.
- Results indicate that task type, model strength, and sampling temperature critically influence the success and cost-efficiency of retrial-enhanced reasoning.
This paper introduces a simple yet effective mechanism called "retrials without feedback" to enhance the reasoning capabilities of LLMs (2504.12951). It challenges the trend of increasingly complex and computationally expensive reasoning frameworks, such as Tree-of-Thoughts (ToT) (Yao et al., 2023) and Reflexion (Shinn et al., 2023), which often rely on iterative refinement through self-evaluation and verbalized feedback. These complex methods incur significant costs due to the added computation and expanding context windows required for feedback generation and processing.
The core idea is straightforward: if a simple verifier identifies an LLM's answer as incorrect, the LLM is prompted to solve the problem again from scratch, with no feedback on why the previous attempt failed. This repeats until a correct solution is found or a predefined computational budget (e.g., cost or number of attempts) is exhausted.
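The loop described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: `generate` and `verify` are hypothetical callables standing in for the LLM call and the task verifier, and the flat `cost_per_attempt` is a simplifying assumption (real cost depends on the tokens consumed by each call).

```python
from typing import Callable, Optional, Tuple


def solve_with_retrials(
    problem: str,
    generate: Callable[[str], str],      # hypothetical: one fresh LLM attempt
    verify: Callable[[str, str], bool],  # hypothetical: task-specific checker
    cost_per_attempt: float,             # assumed flat cost per attempt (USD)
    budget: float,                       # total budget (USD)
) -> Tuple[Optional[str], int, float]:
    """Retry from scratch until the verifier accepts or the budget runs out.

    Returns (answer or None, attempts made, money spent).
    """
    spent = 0.0
    attempts = 0
    # Only start an attempt if it still fits within the budget.
    while spent + cost_per_attempt <= budget:
        # Each attempt sees only the problem: no history, no feedback.
        answer = generate(problem)
        attempts += 1
        spent += cost_per_attempt
        if verify(problem, answer):
            return answer, attempts, spent
    return None, attempts, spent
```

Because `generate` receives only the original problem, the context does not grow across attempts, which is the property that keeps each retry as cheap as the first.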
The authors apply the retrial mechanism to simple prompting strategies, such as standard Input-Output (IO) and Chain-of-Thought (CoT) (Wei et al., 2022) prompting, and compare the results against more sophisticated methods like ToT and Reflexion. They evaluate these approaches on three benchmark tasks: Game of 24 (mathematical reasoning), HumanEval (code generation), and HotpotQA (multi-hop question answering), using GPT-4o-mini and LLaMA-3.3-70B as base LLMs. Quality is measured by task success rate or accuracy, and efficiency by computational cost in USD.
Key findings indicate that:
- Cost-Efficiency: Simpler methods, particularly CoT, augmented with the retrial mechanism, often achieve comparable or superior performance to more complex methods like ToT and Reflexion within a fixed budget. For instance, on Game of 24, CoT with retrials reached high accuracy at a fraction of the cost reported for state-of-the-art refinement methods using more powerful models.
- Task and Model Dependency: The relative effectiveness and cost-efficiency of each method vary with the task and the base LLM. Stronger base models (like GPT-4o-mini) appear to yield larger gains for simpler methods such as IO prompting.
- Temperature Effects: Increasing the sampling temperature generally improved the success rate for CoT with retrials (especially on GPT-4o-mini), suggesting that encouraging diverse attempts through higher temperature synergizes well with the retrial mechanism.
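One way to build intuition for why retrials help, and why diverse attempts matter, is an idealized model that is not taken from the paper: assume each attempt succeeds independently with the same probability `p`. The chance of at least one success in `k` attempts is then `1 - (1 - p)^k`. The function name below is hypothetical, and real attempts are not truly independent, particularly at low temperature where samples tend to repeat.

```python
def success_within_k(p: float, k: int) -> float:
    """Probability of at least one success in k independent attempts.

    Assumes every attempt succeeds with the same probability p and that
    attempts are independent (an idealization, not a result from the paper).
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be in [0, 1]")
    if k < 0:
        raise ValueError("k must be non-negative")
    return 1.0 - (1.0 - p) ** k


if __name__ == "__main__":
    # Illustrative per-attempt success rates, not measured values.
    for p in (0.1, 0.2, 0.5):
        print(p, [round(success_within_k(p, k), 3) for k in (1, 5, 10)])
```

Under this model, a method that succeeds only occasionally per attempt can still reach a high cumulative success rate within a modest number of retries. If attempts are highly correlated, the gain is smaller, which is consistent with the observation that higher temperature pairs well with retrials.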
The paper concludes that the significant computational overhead of complex reasoning and refinement strategies might not always be justified. The simple act of allowing an LLM to retry after failure, without explicit feedback, can be a surprisingly powerful and cost-effective way to boost performance, raising the question of whether "retrials are all you need" for many reasoning tasks where answer verification is feasible.
A primary limitation is that the method requires a reliable way to verify an answer's correctness during generation. This is straightforward for tasks like Game of 24 or HumanEval (checking equations or running unit tests) but difficult for others like HotpotQA, where the ground truth is hidden. Future work includes exploring larger budgets, optimizing the retrial process itself, and extending the method to tasks without easy verification.
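To illustrate what a simple verifier can look like for a task such as Game of 24, here is a hedged sketch. It assumes the model's answer is a plain arithmetic expression string and that a valid answer uses each given number exactly once with `+`, `-`, `*`, `/` and parentheses. The function name and the answer format are assumptions for illustration, not the paper's code.

```python
import ast
import operator
from collections import Counter
from fractions import Fraction
from typing import List, Sequence

# Only the four basic binary operators are allowed.
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def _eval(node: ast.AST, used: List[int]) -> Fraction:
    """Evaluate a restricted arithmetic AST exactly, recording literals used."""
    if (
        isinstance(node, ast.Constant)
        and isinstance(node.value, int)
        and not isinstance(node.value, bool)
    ):
        used.append(node.value)
        return Fraction(node.value)
    if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
        left = _eval(node.left, used)
        right = _eval(node.right, used)
        return _OPS[type(node.op)](left, right)
    raise ValueError("unsupported expression")


def verify_game_of_24(expression: str, numbers: Sequence[int]) -> bool:
    """Return True if the expression uses each number exactly once and equals 24."""
    try:
        tree = ast.parse(expression, mode="eval")
        used: List[int] = []
        value = _eval(tree.body, used)
    except (SyntaxError, ValueError, ZeroDivisionError):
        return False
    return Counter(used) == Counter(numbers) and value == 24
```

For example, `verify_game_of_24("(10 - 4) * (13 - 9)", [4, 9, 10, 13])` accepts the answer, while an expression that reuses or omits a number is rejected. Exact `Fraction` arithmetic avoids floating-point rounding on solutions that pass through non-integer intermediates, and parsing with `ast` rather than calling `eval` keeps arbitrary model output from being executed.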