
Reasoner-Critic Architecture

Updated 1 February 2026
  • Reasoner-Critic Architecture is a framework that decouples solution generation (reasoner) from evaluation (critic) to improve decision-making and transparency.
  • It utilizes dual-model or role-switching designs where explicit feedback guides iterative refinement, yielding enhanced outcomes in tasks like grading and code generation.
  • Empirical results show that scaling critic capacity significantly boosts performance and interpretability, with measurable gains in accuracy and F1.

A Reasoner-Critic Architecture is a computational framework that decouples the process of generating solutions (“reasoning”) from the process of evaluating, critiquing, and refining those solutions (“criticism”) in complex machine learning, reinforcement learning, or reasoning tasks. Such architectures can be instantiated as dual-model systems, as a single model alternating between roles via distinct prompt modes, or as more elaborate multi-agent systems with explicit role specialization. Across implementations, the core principle is that explicit, often verbal, feedback from a dedicated Critic model guides or refines the Reasoner’s outputs, producing both higher final performance and greater interpretability than pure preference optimization or single-model chains of thought. The decoupling can be realized via supervised fine-tuning, reinforcement learning with verbal feedback, hybrid gradient objectives, preference optimization, or retrospective evaluation, depending on the task and application domain (Li et al., 26 Feb 2025).

1. Dual-Model Reasoner-Critic Frameworks

A canonical example is the DARS (Dual-model Reflective Scoring) pipeline (Li et al., 26 Feb 2025), where the Reasoner (R) and Critic (C) are implemented as two independently fine-tuned LLMs (e.g., LLaMA-3B backbones). The Reasoner produces an initial rationale for a complex task (e.g., open-ended student answer grading). The Critic then inspects the rationale, either providing explicit reflection instructions that highlight errors or omissions, or emitting a “[STOP]” token when satisfied. The Reasoner consumes these reflection instructions to refine its output. Iteration continues until Critic affirmation, yielding a corrected, critique-aligned final rationale. The architecture is summarized by:

# The Reasoner drafts a rationale; the Critic either emits "[STOP]"
# or a reflection instruction that the Reasoner consumes on the next pass.
y_r = Reasoner.generate(x)
while True:
    feedback = Critic.generate(x, y_r)
    if feedback == "[STOP]":
        break
    y_r = Reasoner.generate(x, history=[y_r, feedback])

Empirically, this setup achieves substantial gains over single-model or preference-driven approaches, e.g., +5% accuracy, +11% F1, and +2% QWK (quadratic weighted kappa) on short-answer grading tasks. Scaling experiments reveal that enlarging the Critic yields greater downstream gains than a corresponding increase in Reasoner size. The framework can generalize to chain-of-thought mathematical reasoning, multi-hop QA, and other domains (Li et al., 26 Feb 2025).
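The refinement loop above can be exercised end to end with stand-in components. The following sketch uses stub Reasoner/Critic classes (not the DARS models; in practice each would wrap a fine-tuned LLM) and adds an explicit round budget so the loop cannot run unbounded:

```python
# Minimal, runnable sketch of the Reasoner-Critic refinement loop.
# StubReasoner and StubCritic are illustrative placeholders, not DARS.

class StubReasoner:
    def __init__(self):
        self.version = 0

    def generate(self, x, history=None):
        # A real Reasoner would decode a rationale, conditioning on
        # (draft, feedback) pairs in `history`; the stub just revises.
        self.version += 1
        return f"rationale v{self.version} for {x!r}"

class StubCritic:
    def __init__(self, accept_after=3):
        self.accept_after = accept_after
        self.calls = 0

    def generate(self, x, y_r):
        # A real Critic would inspect y_r; the stub accepts the
        # `accept_after`-th draft it sees.
        self.calls += 1
        if self.calls >= self.accept_after:
            return "[STOP]"
        return f"Reflection: revise step {self.calls}"

def refine(reasoner, critic, x, max_rounds=8):
    """Iterate Reasoner -> Critic until '[STOP]' or the round budget."""
    y_r = reasoner.generate(x)
    for _ in range(max_rounds):
        feedback = critic.generate(x, y_r)
        if feedback == "[STOP]":
            break
        y_r = reasoner.generate(x, history=[y_r, feedback])
    return y_r

final = refine(StubReasoner(), StubCritic(accept_after=3), "grade answer")
print(final)  # rationale v3 for 'grade answer'
```

The `max_rounds` budget corresponds to the termination criteria discussed later (budget or depth limits) and guards against a Critic that never affirms.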

2. Architectures, Objectives, and Training Procedures

Common architectural motifs include:

  • Explicit Role Decoupling: Two independently parameterized models (DARS, ReaCritic, Critic-V), or two specialized heads operating under a scheduling policy (Stepwise Think-Critique, Critique-Coder, Critic-CoT).
  • Shared Model/Prompt-based Role Switching: A single LLM alternates between reasoner and critic modes (e.g., Critique-Coder (Ruan et al., 26 Sep 2025), LLaVA-Critic-R1 (Wang et al., 31 Aug 2025), Stepwise Think-Critique (Xu et al., 17 Dec 2025)), typically via differently structured prompts and context.
  • Multi-Agent Pipelines: CRV (Critique–Rethink–Verify) uses cascaded LLMs for critique, rewrite, and verification, aligning source data with the cognitive capacity of smaller reasoning models (Cai et al., 14 Apr 2025).
  • Feedback Integration: Critic feedback may be natural language instructions, structured tags, binary scalar judgments, or stepwise label sequences; these are explicitly consumed as new input contexts for the Reasoner, triggering further refinement or termination.
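For the shared-model motif, role switching reduces to prompt templating: the same LLM is queried with a reasoner-mode or critic-mode prompt. The sketch below illustrates the pattern; the template wording is an assumption for illustration, not taken from Critique-Coder or LLaVA-Critic-R1:

```python
# Prompt-based role switching: one model plays both roles via templates.
# Template text is illustrative, not from any cited system.

REASONER_TEMPLATE = (
    "You are solving a task.\nTask: {task}\n"
    "{feedback_block}Think step by step and give a solution."
)
CRITIC_TEMPLATE = (
    "You are reviewing a proposed solution.\nTask: {task}\n"
    "Solution: {solution}\n"
    "Point out concrete errors, or reply [STOP] if it is correct."
)

def reasoner_prompt(task, feedback=None):
    # On refinement rounds, prior critique is spliced into the context.
    block = f"Reviewer feedback: {feedback}\n" if feedback else ""
    return REASONER_TEMPLATE.format(task=task, feedback_block=block)

def critic_prompt(task, solution):
    return CRITIC_TEMPLATE.format(task=task, solution=solution)
```

The same completion endpoint serves both calls; only the structured context differs, which is what "prompt-based role switching" amounts to operationally.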

Typical training procedures combine or sequence supervised fine-tuning—often on contrastively synthesized or reflection-annotated data—with reinforcement learning objectives. Losses may include:

  • Reasoner Loss: Maximum likelihood over (input, rationale) and (input, rationale, feedback, refined rationale) pairs
  • Critic Loss: Cross-entropy for reflection instruction or STOP token; DPO-style preference-optimization over ranked critiques
  • Joint Steps: Teacher-forcing, cross-entropy, per-token credit assignment; RL with verbal feedback is realized via clipped PPO, GRPO, or customized advantage signals
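The Reasoner's maximum-likelihood term can be made concrete as a teacher-forced negative log-likelihood over the target rationale. The toy model below is a stand-in for the LLM's next-token distribution (real training would backpropagate through it):

```python
import math

# Teacher-forced NLL sketch for the Reasoner loss: sum of
# -log p(token | context + gold prefix) over the target rationale.
# `logprob` is a placeholder for the model's next-token log-probability.

def sequence_nll(logprob, context, target):
    """Negative log-likelihood of `target` under teacher forcing."""
    nll, prefix = 0.0, list(context)
    for tok in target:
        nll -= logprob(prefix, tok)  # condition on the gold prefix
        prefix.append(tok)
    return nll

# Degenerate "model": uniform over a 4-token vocabulary, ignoring context.
uniform = lambda prefix, tok: math.log(0.25)

loss = sequence_nll(uniform, context=["<x>"], target=["a", "b", "c"])
print(loss)  # 3 tokens * -log(0.25) ≈ 4.159
```

The same machinery covers the (input, rationale, feedback, refined rationale) pairs: the feedback tokens simply join the conditioning prefix.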

Notably, several frameworks emphasize staged training: e.g., Critique-RL (Xi et al., 28 Oct 2025) first maximizes discriminability of the critic using direct ground-truth-aligned reward, then introduces refinement signal conditional on the actor’s post-critique improvement.
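The staged schedule can be summarized as a reward switch: stage 1 scores only the critic's discriminability, stage 2 adds a refinement bonus conditioned on post-critique improvement. The function and weights below are illustrative assumptions in the spirit of Critique-RL, not its actual reward:

```python
# Illustrative two-stage critic reward (weights are assumptions):
# stage 1 rewards verdict correctness against ground truth; stage 2
# adds a bonus only if the actor improves after consuming the critique.

def critic_reward(stage, verdict_correct, improved_after_critique,
                  w_disc=1.0, w_refine=0.5):
    reward = w_disc * (1.0 if verdict_correct else -1.0)
    if stage == 2:
        reward += w_refine * (1.0 if improved_after_critique else 0.0)
    return reward
```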

3. Algorithmic Patterns and Inference-Time Operation

Inference-time operation of Reasoner-Critic systems is cyclic and interactive. The general loop is as follows:

  1. The Reasoner generates an initial chain-of-thought or solution candidate.
  2. The Critic evaluates and produces structured feedback: either a critique to correct, a binary signal to halt, or a ranked set of diagnoses on individual steps.
  3. The Reasoner consumes the feedback for targeted refinement.
  4. This loop repeats until Critic affirmation or a termination criterion (e.g., budget, depth) is reached.

The Critic’s output can directly shape the Reasoner’s input context—either by textual concatenation (e.g., “Reflection: ...”), attention over new tokens, or explicit prompt engineering.
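Two of these context-shaping formats, plain textual concatenation and structured tags, can be sketched directly; the delimiters and tag names are illustrative choices, not a fixed standard:

```python
# Feedback-integration sketches: how critique enters the Reasoner's
# next input context. Delimiters/tags are illustrative assumptions.

def concat_context(x, draft, feedback):
    # Plain textual concatenation, "Reflection: ..." style.
    return f"{x}\nDraft: {draft}\nReflection: {feedback}\nRevise the draft."

def tagged_context(x, draft, feedback):
    # Structured-tag variant for easier downstream parsing.
    return (f"<task>{x}</task>\n<draft>{draft}</draft>\n"
            f"<critique>{feedback}</critique>")

print(concat_context("Grade this", "B+", "Justify the grade"))
```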

The following table summarizes representative Reasoner-Critic paradigms:

| Paradigm | Role Implementation | Feedback Type | Training Regime |
| --- | --- | --- | --- |
| DARS | Two models | Natural-language critique | SFT on synthesized data |
| Stepwise Think-Critique | Single LLM | Interleaved critique steps | Joint RL + SFT |
| Critique-Coder | Single LLM (prompted) | Critique classification | RL (GRPO), CRL batch mixing |
| Critique-RL | Actor/Critic LLMs | NL feedback + binary labels | Two-stage RL |
| Critic-V | Two VLMs | Preference-optimized NL critique | DPO, rules via VEST |

4. Empirical Outcomes and Scaling Laws

Across domains—student answer assessment, code generation, RL for resource allocation, reasoning in vision-LLMs—Reasoner-Critic architectures consistently outperform single-model or pure reward-model-trained systems.

Scaling studies show that increasing Critic capacity generally yields better downstream accuracy than increasing Reasoner capacity, particularly in RL or dense feedback regimes (Li et al., 26 Feb 2025; You et al., 16 May 2025; Cai et al., 14 Apr 2025).

5. Transparency, Robustness, and Interpretability

The primary motivation for Reasoner-Critic architectures lies in improving system transparency and interpretability:

  • Explicit Feedback Loops: Verbal or structured critiques localize errors, making reasoning paths and model decision boundaries auditable.
  • Stepwise Labeling/Refinement: System-2-style “slow thinking” via stepwise critique (e.g., Critic-CoT (Zheng et al., 2024), Stepwise Think-Critique (Xu et al., 17 Dec 2025)) enhances robustness: invalid or faulty reasoning steps are filtered or corrected in situ.
  • Retrospective and Dense Supervision: CriticSearch demonstrates that retrospectively evaluating entire reasoning traces and providing per-turn reward is effective for stabilizing RL training under sparse outcome supervision (Zhang et al., 15 Nov 2025).

High-accuracy critics also enable downstream applications: filtering and majority voting (e.g., RefCritic (Tang et al., 20 Jul 2025)), targeted test-time scaling (“best-of-K via critique” (Xu et al., 17 Dec 2025)), and interpretability via saliency or purpose maps (A2CR (Guo et al., 2023)).
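"Best-of-K via critique" reduces to sampling K candidates and keeping the one the critic scores highest. The sketch below uses a stub scoring function (a real critic would be an LLM judge producing a scalar or ranking):

```python
# Best-of-K test-time scaling sketch: sample K candidates, score each
# with a critic, keep the argmax. The scorer here is a stub.

def best_of_k(candidates, critic_score):
    """Return the candidate with the highest critic score."""
    return max(candidates, key=critic_score)

# Stub critic: prefers shorter rationales, purely for illustration.
cands = ["long rambling answer ...", "tight answer", "medium length answer"]
best = best_of_k(cands, critic_score=lambda c: -len(c))
print(best)  # tight answer
```

Critic-filtered majority voting (as in RefCritic) follows the same shape: filter candidates below a score threshold, then vote over the survivors.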

6. Limitations, Open Challenges, and Future Prospects

Limitations persist, including:

  • Compute Overhead: Training and inference with dual models double FLOPs relative to classic approaches (Li et al., 26 Feb 2025).
  • Task/Domain Specificity: Most frameworks are task-specific; generalization to arbitrary reasoning domains remains an open question.
  • Human-in-the-Loop Extensions: Several methods (REFINER (Paul et al., 2023), Critique-RL) highlight the ease of accommodating human feedback at inference, but robust integration into scalable pipelines is not fully explored.
  • Joint Optimization: Training Reasoner and Critic jointly is delicate; most successful implementations either fix one component (frozen critic in CriticSearch) or adopt staged optimization (two-phase RL in Critique-RL or RefCritic).

Potential extensions include module sharing (via adapters), knowledge transfer for cross-task generalization, and unified architectures where stepwise reasoning and critique are both deeply interleaved (Stepwise Think-Critique).

7. Representative Implementations and Benchmarks

Key systems discussed above include DARS, Stepwise Think-Critique, Critique-Coder, Critique-RL, Critic-V, CRV, CriticSearch, RefCritic, REFINER, and A2CR, each evaluated on domain-specific benchmarks spanning short-answer grading, code generation, and multimodal reasoning.

In summary, the Reasoner-Critic architecture paradigm—whether via explicit model pairs or flexible single-model alternation—enables structured reasoning improvement through successive, interpretable, and actionable critique-and-refinement cycles. This design is increasingly central in scaling model transparency, robustness, and sample efficiency across diverse reasoning and decision-making benchmarks.
