Are Retrials All You Need? Enhancing Large Language Model Reasoning Without Verbalized Feedback

Published 17 Apr 2025 in cs.CL, cs.AI, and cs.LG | (2504.12951v1)

Abstract: Recent advancements in LLMs have catalyzed the development of general-purpose autonomous agents, demonstrating remarkable performance in complex reasoning tasks across various domains. This surge has spurred the evolution of a plethora of prompt-based reasoning frameworks. A recent focus has been on iterative reasoning strategies that refine outputs through self-evaluation and verbalized feedback. However, these strategies require additional computational complexity to enable models to recognize and correct their mistakes, leading to a significant increase in their cost. In this work, we introduce the concept of "retrials without feedback", an embarrassingly simple yet powerful mechanism for enhancing reasoning frameworks by allowing LLMs to retry problem-solving attempts upon identifying incorrect answers. Unlike conventional iterative refinement methods, our method does not require explicit self-reflection or verbalized feedback, simplifying the refinement process. Our findings indicate that simpler retrial-based approaches often outperform more sophisticated reasoning frameworks, suggesting that the benefits of complex methods may not always justify their computational costs. By challenging the prevailing assumption that more intricate reasoning strategies inherently lead to better performance, our work offers new insights into how simpler, more efficient approaches can achieve optimal results. So, are retrials all you need?


Authors (2)

Summary

The paper introduces a retrial mechanism that boosts LLM reasoning accuracy by allowing simple reattempts without explicit feedback.
The study evaluates this strategy against complex approaches like Tree-of-Thoughts and Reflexion on benchmarks such as Game of 24, HumanEval, and HotpotQA.
Results indicate that task type, model strength, and sampling temperature critically influence the success and cost-efficiency of retrial-enhanced reasoning.

This paper introduces a simple yet effective mechanism called "retrials without feedback" to enhance the reasoning capabilities of LLMs (2504.12951). It challenges the trend of increasingly complex and computationally expensive reasoning frameworks, such as Tree-of-Thoughts (ToT) (Yao et al., 2023) and Reflexion (Shinn et al., 2023), which often rely on iterative refinement through self-evaluation and verbalized feedback. These complex methods incur significant costs due to the added computation and expanding context windows required for feedback generation and processing.

The core idea proposed is straightforward: if an LLM's generated answer to a problem is identified as incorrect (using a simple verifier), the LLM is prompted to simply try solving the problem again from scratch, without any specific feedback on why the previous attempt failed. This process repeats until a correct solution is found or a predefined computational budget (e.g., cost or number of attempts) is exhausted.
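
The loop described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the names `solve_with_retrials`, `generate`, `verify`, and `RetrialResult` are hypothetical, `generate` stands in for any LLM call that returns an answer together with its cost, and the budget is tracked both as a cost ceiling and as a maximum number of attempts.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class RetrialResult:
    answer: Optional[str]  # verified answer, or None if the budget ran out
    attempts: int          # number of attempts actually made
    cost: float            # total cost spent, in the same unit as the budget


def solve_with_retrials(
    problem: str,
    generate: Callable[[str], Tuple[str, float]],  # assumed: returns (answer, cost)
    verify: Callable[[str, str], bool],            # assumed: external correctness check
    budget: float,
    max_attempts: int = 10,
) -> RetrialResult:
    """Retry from scratch until an answer verifies or the budget is exhausted."""
    spent = 0.0
    attempts = 0
    while attempts < max_attempts and spent < budget:
        # Each attempt sees only the original problem: no feedback and no
        # history of earlier failures is added to the prompt.
        answer, cost = generate(problem)
        attempts += 1
        spent += cost
        if verify(problem, answer):
            return RetrialResult(answer, attempts, spent)
    return RetrialResult(None, attempts, spent)


if __name__ == "__main__":
    # Stub generator that replays canned answers in place of a real model.
    canned = iter(["5", "7", "4"])
    result = solve_with_retrials(
        "2 + 2 = ?",
        generate=lambda p: (next(canned), 0.01),
        verify=lambda p, a: a == "4",
        budget=1.0,
    )
    print(result)
```

Because the prompt never grows between attempts, the cost of each attempt stays roughly constant, in contrast to feedback-based refinement, where the context expands with every round.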

The authors compare this retrial mechanism applied to simple prompting strategies like standard Input-Output (IO) and Chain-of-Thought (CoT) (Wei et al., 2022) against more sophisticated methods like ToT and Reflexion. They evaluate these approaches on three benchmark tasks: Game of 24 (mathematical reasoning), HumanEval (code generation), and HotpotQA (multi-hop question answering), using GPT-4o-mini and LLaMA-3.3-70B as base LLMs. Performance is measured by task success rate or accuracy (quality) and computational cost in USD (efficiency).

Key findings indicate that:

Cost-Efficiency: Simpler methods, particularly CoT, augmented with the retrial mechanism, often achieve comparable or superior performance to more complex methods like ToT and Reflexion within a fixed budget. For instance, on Game of 24, CoT with retrials reached high accuracy at a fraction of the cost reported for state-of-the-art refinement methods using more powerful models.
Task and Model Dependency: The relative effectiveness and cost-efficiency of each method vary with the task and the base LLM. With a stronger base model (like GPT-4o-mini), simpler methods such as IO prompting appear to gain more.
Temperature Effects: Increasing the sampling temperature generally improved the success rate for CoT with retrials (especially on GPT-4o-mini), suggesting that encouraging diverse attempts through higher temperature synergizes well with the retrial mechanism.
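
One way to see why retrials and temperature might interact is a back-of-the-envelope calculation. It rests on a simplifying assumption that is not taken from the paper: each attempt succeeds independently with the same probability $p$. Under that assumption, the probability $P_k$ that at least one of $k$ attempts is correct is:

```latex
P_k = 1 - (1 - p)^k
% Example: p = 0.3, k = 5
% P_5 = 1 - 0.7^5 = 1 - 0.16807 \approx 0.83
```

Independence is the fragile part of this picture. At very low temperature, repeated attempts tend to be nearly identical, so a retry adds little new information and the actual gain falls below this estimate. More diverse samples bring the attempts closer to the independent case, which is consistent with the temperature trend reported above.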

The paper concludes that the significant computational overhead of complex reasoning and refinement strategies might not always be justified. The simple act of allowing an LLM to retry after failure, without explicit feedback, can be a surprisingly powerful and cost-effective way to boost performance, raising the question of whether "retrials are all you need" for many reasoning tasks where answer verification is feasible.

A primary limitation noted is that this method requires a reliable way to verify the correctness of an answer during the generation process. Verification is straightforward for tasks like Game of 24 or HumanEval (checking equations or running unit tests) but difficult for tasks like HotpotQA, where the ground truth is hidden. Future work includes exploring larger budgets, optimizing the retrial process itself, and extending the method to tasks without easy verification.
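
To make "checking equations" concrete, here is a hedged sketch of what a Game of 24 verifier could look like. The paper's actual verifier is not shown in this summary, so the function name `verify_game_of_24` and its interface are assumptions. It applies the usual rule of the game: the expression must use each of the four given numbers exactly once, combine them with `+`, `-`, `*`, `/` and parentheses, and evaluate to 24.

```python
import ast
from collections import Counter
from fractions import Fraction

# Only the four basic arithmetic operators are allowed.
_OPS = {
    ast.Add: lambda a, b: a + b,
    ast.Sub: lambda a, b: a - b,
    ast.Mult: lambda a, b: a * b,
    ast.Div: lambda a, b: a / b,
}


def _evaluate(node, used):
    """Evaluate a parsed expression exactly, recording every integer literal."""
    if isinstance(node, ast.Constant) and type(node.value) is int:
        used.append(node.value)
        return Fraction(node.value)
    if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
        left = _evaluate(node.left, used)
        right = _evaluate(node.right, used)
        return _OPS[type(node.op)](left, right)
    raise ValueError("unsupported expression")


def verify_game_of_24(numbers, expression):
    """Return True if `expression` uses each number exactly once and equals 24."""
    used = []
    try:
        tree = ast.parse(expression, mode="eval")
        value = _evaluate(tree.body, used)
    except (SyntaxError, ValueError, ZeroDivisionError):
        return False
    # Fractions avoid floating-point error in cases like 6 / (1 - 3 / 4).
    return value == 24 and Counter(used) == Counter(numbers)


if __name__ == "__main__":
    print(verify_game_of_24([4, 9, 10, 13], "(10 - 4) * (13 - 9)"))
    print(verify_game_of_24([4, 9, 10, 13], "12 + 12"))
```

A check like this needs no reference solution, only the puzzle's own rules, which is what makes tasks of this kind a natural fit for retrials. No comparably cheap check exists for open-ended question answering.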
