Taming Imperfect Process Verifiers: A Sampling Perspective on Backtracking
Abstract: Test-time algorithms that combine the generative power of LLMs with process verifiers that assess the quality of partial generations offer a promising lever for eliciting new reasoning capabilities, but the algorithmic design space and computational scaling properties of such approaches are still opaque, and their benefits are far from apparent when one accounts for the cost of learning a high-quality verifier. Our starting point is the observation that seemingly benign errors in a learned verifier can lead to catastrophic failures for standard decoding techniques due to error amplification during the course of generation. We then ask: can this be improved with more sophisticated decoding strategies? We introduce a new process-guided test-time sampling algorithm, VGB, which uses theoretically grounded backtracking to achieve provably better robustness to verifier errors. VGB interprets autoregressive generation as a random walk on a tree of partial generations, with transition probabilities guided by the process verifier and base model; crucially, backtracking occurs probabilistically. This process generalizes the seminal Sinclair-Jerrum random walk (Sinclair & Jerrum, 1989) from the literature on approximate counting and sampling in theoretical computer science, and a conceptual contribution of our work is to highlight parallels with this literature. Empirically, we demonstrate on both synthetic and real language modeling tasks that VGB outperforms baselines on a variety of metrics.
Explain it Like I'm 14
Plain-language summary of “Taming Imperfect Process Verifiers: A Sampling Perspective on Backtracking”
Overview
This paper is about helping LLMs (like chatbots) reason better when solving multi-step problems. It studies a strategy where the model generates an answer step by step, and a “process verifier” checks each step along the way. The big idea is: if the verifier sometimes makes mistakes (which is common), a simple way of using it can go badly wrong on long problems. The authors propose a new method, called VGB, that occasionally backtracks—like retracing your steps in a maze—to avoid letting small verifier mistakes grow into big failures.
Key questions the paper asks
- Can we use a step-by-step checker (a “process verifier”) to guide generation without getting derailed when the checker is imperfect?
- Why do some common decoding methods (the ways we pick the next token or chunk during generation) break down as answers get longer?
- Is there a smarter way to sample (pick) the next steps that is provably more robust to verifier errors?
How the method works, in everyday terms
Think of generating an answer like walking through a branching maze:
- Each partial answer is a spot in the maze.
- Moving forward adds a new step (token or chunk).
- A process verifier is like a tour guide who gives a score for your current spot, trying to predict how good the final path will be if you keep going.
The problem: the tour guide (the verifier) isn’t perfect. If you always trust it for every forward step, its small errors can snowball, especially in long mazes. This is known as the “curse of horizon”—tiny mistakes at each step can multiply over many steps.
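The snowballing can be made concrete with a back-of-the-envelope calculation. This is a toy model, not the paper's formal analysis: suppose each step's acceptance is distorted by a small multiplicative factor, and watch what compounding does over a long horizon.

```python
def compounded_error(per_step_error: float, horizon: int) -> float:
    """Toy model of the curse of horizon: if each step's score is off
    by a factor of (1 + per_step_error), the total distortion after
    `horizon` steps grows multiplicatively."""
    return (1 + per_step_error) ** horizon

# A 5% per-step error is benign for short generations...
print(compounded_error(0.05, 10))   # ~1.63x distortion
# ...but catastrophic for long ones (roughly 17,000x at 200 steps).
print(compounded_error(0.05, 200))
```

This is why methods that always trust the verifier for every forward move degrade on long problems even when the verifier looks accurate step by step.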
The proposed solution, VGB (Value-Guided sampling with Backtracking):
- Treat generation as a “random walk” on the tree of partial answers.
- At each step, you decide probabilistically to:
  - Move forward (add a step), guided by both the base model and the verifier’s score.
  - Or backtrack (erase the last step), guided by the verifier’s score for your current spot.
  - Or occasionally stay put (this makes the walk stable).
- This “stochastic backtracking” (randomly retracing when needed) is inspired by a classic technique from theoretical computer science (the Sinclair–Jerrum walk), originally used to sample solutions fairly without letting errors explode.
In simpler words: instead of only marching forward using a sometimes-wrong guide, you sometimes step back. This keeps the overall process balanced and prevents small rating mistakes from controlling the whole result.
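One step of such a walk can be sketched in a few lines. This is a simplified illustration, not the paper's exact transition kernel: the hypothetical `p_base` and `V` stand in for the base model and the process verifier.

```python
import random

def vgb_step(seq, p_base, V, vocab, rng=random):
    """One step of a toy verifier-guided random walk with stochastic
    backtracking (a simplified sketch, not the paper's exact rule).

    seq     -- current partial generation (list of tokens)
    p_base  -- p_base(token, seq): base-model probability of token given seq
    V       -- V(seq): process-verifier score of a partial generation
    vocab   -- iterable of candidate next tokens
    """
    # Lazy step: stay put half the time to keep the walk stable.
    if rng.random() < 0.5:
        return seq

    # Forward moves are weighted by base model x verifier score.
    moves, weights = [], []
    for tok in vocab:
        child = seq + [tok]
        moves.append(child)
        weights.append(p_base(tok, seq) * V(child))

    # Backtracking (erasing the last step) is weighted by the
    # verifier's score at the current spot.
    if seq:
        moves.append(seq[:-1])
        weights.append(V(seq))

    total = sum(weights)
    if total == 0:
        return seq  # nowhere sensible to go; stay put
    return rng.choices(moves, weights=weights, k=1)[0]
```

Because backtracking is itself a weighted random choice rather than a hard rule, a single bad verifier score nudges the walk instead of permanently committing it to a branch.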
Main findings and why they matter
What the authors show theoretically:
- If the verifier’s errors are bounded (even if it isn’t perfect), VGB avoids the error amplification that derails standard methods.
- When you have access to the final task score for full answers (the “outcome-level reward”), VGB can provably sample from the right distribution over good answers (i.e., it’s aiming at the right target, not just “greedy” best answers).
- Even when you don’t have that final reward or the verifier only has average-case accuracy, VGB still provides good coverage of the right kinds of answers and remains robust.
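The “right distribution” in the second bullet is commonly the base model’s distribution reweighted (tilted) by the outcome reward; the paper’s precise target may differ, but a minimal illustration on a toy answer space conveys the idea:

```python
def tilted_target(p_base, reward, sequences):
    """Reward-tilted target over full answers:
    pi(y) is proportional to p_base(y) * reward(y)."""
    weights = {y: p_base(y) * reward(y) for y in sequences}
    z = sum(weights.values())
    return {y: w / z for y, w in weights.items()}

# Toy example: two answers with equal base probability,
# one judged twice as good by the outcome-level reward.
pi = tilted_target(lambda y: 0.5,
                   lambda y: {"good": 2.0, "ok": 1.0}[y],
                   ["good", "ok"])
# pi -> {"good": 2/3, "ok": 1/3}
```

Sampling from this target, rather than greedily maximizing the reward, is what preserves diversity over good answers.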
What they show empirically:
- On grammar tasks (like generating balanced parentheses), Python test-case generation, and designed synthetic problems, VGB beats common baselines on multiple metrics (accuracy, diversity, coherence).
- In constrained text generation (where the output must obey certain rules), VGB gives more coherent results than standard locally constrained decoding.
Why this is important:
- It shows that “backtracking” at test time isn’t just a hack—it can be made principled and provably helpful.
- It connects modern LLM decoding to classical ideas in sampling and Markov chains, opening doors to more robust reasoning strategies.
Implications and potential impact
- More reliable reasoning: Models can maintain quality over longer answers because VGB limits the “curse of horizon” where small, repeated verifier errors would otherwise grow.
- Better test-time strategies: Instead of retraining large models, smarter decoding (like VGB) can squeeze more reasoning ability out of the same base model.
- Bridges to theory: The paper links LLM sampling to well-studied mathematical techniques (Markov chain Monte Carlo, Sinclair–Jerrum random walks), suggesting future designs could borrow even more powerful tools from theory.
- Practical trade-offs: VGB uses extra computation at test time (since it may backtrack and explore), but in return it improves robustness and quality—useful in math, code generation, and any multi-step reasoning tasks.
In short: if your guide isn’t perfect, don’t blindly push forward—sometimes stepping back is the smart, provably better move. VGB makes that idea precise and shows it works.