Reasoner-Critic Architecture
- Reasoner-Critic Architecture is a framework that decouples solution generation (reasoner) from evaluation (critic) to improve decision-making and transparency.
- It utilizes dual-model or role-switching designs where explicit feedback guides iterative refinement, yielding enhanced outcomes in tasks like grading and code generation.
- Empirical results show that scaling critic capacity significantly boosts performance and interpretability, with measurable gains in accuracy and F1.
A Reasoner-Critic Architecture is a computational framework that decouples the process of generating solutions (“reasoning”) from the process of evaluating, critiquing, and refining those solutions (“criticism”) in complex machine learning, reinforcement learning, or reasoning tasks. Such architectures can be instantiated as dual-model systems, as alternating roles within a single model via alternating prompt modes, or as more elaborate multi-agent systems with explicit role specialization. Across implementations, the core principle is that explicit, often verbal, feedback from a dedicated Critic model guides or refines the Reasoner’s outputs, producing both higher final performance and increased interpretability compared to pure preference-optimization or single-model chains-of-thought. This decoupling can be realized via supervised fine-tuning, reinforcement learning with verbal feedback, hybrid gradient objectives, preference optimization, or retrospective evaluation, depending on task and application domain (Li et al., 26 Feb 2025).
1. Dual-Model Reasoner-Critic Frameworks
A canonical example is the DARS (Dual-model Reflective Scoring) pipeline (Li et al., 26 Feb 2025), where the Reasoner and the Critic are implemented as two independently fine-tuned LLMs (e.g., LLaMA-3B backbones). The Reasoner produces an initial rationale for a complex task (e.g., open-ended student answer grading). The Critic then inspects the rationale, either providing explicit reflection instructions that highlight errors or omissions, or emitting a “[STOP]” token when satisfied. The Reasoner consumes these reflection instructions to refine its output. Iteration continues until Critic affirmation, yielding a corrected, critique-aligned final rationale. The architecture is summarized by:
```python
y_r = Reasoner.generate(x)
while True:
    feedback = Critic.generate(x, y_r)
    if feedback == "[STOP]":
        break
    y_r = Reasoner.generate(x, history=[y_r, feedback])
```
Empirically, this setup achieves substantial gains over single-model or preference-driven approaches, e.g., +5% ACC, +11% F1, +2% QWK on short-answer tasks. Scaling experiments reveal that larger Critic models yield greater downstream gains than corresponding increases in Reasoner size. This framework can generalize to chain-of-thought mathematical reasoning, multi-hop QA, or other domains (Li et al., 26 Feb 2025).
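The refinement protocol can be made concrete with toy stand-ins for the two models; a minimal runnable sketch, in which `StubReasoner` and `StubCritic` are hypothetical placeholders rather than components of DARS:

```python
# Minimal runnable sketch of a DARS-style reasoner-critic loop.
# StubReasoner / StubCritic are toy stand-ins for fine-tuned LLMs.

class StubReasoner:
    def generate(self, x, history=None):
        # A real Reasoner would produce a rationale; this stub just
        # tracks how many critique rounds it has absorbed.
        rounds = 0 if history is None else len(history) // 2
        return f"rationale for {x!r} (revision {rounds})"

class StubCritic:
    def __init__(self, required_revisions=2):
        self.required = required_revisions

    def generate(self, x, y_r):
        # A real Critic emits natural-language reflection instructions;
        # this stub stops once enough revisions have been made.
        if f"revision {self.required}" in y_r:
            return "[STOP]"
        return "Reflection: address the missing grading criterion."

def reason_with_critique(x, reasoner, critic, max_rounds=5):
    y_r = reasoner.generate(x)
    history = []
    for _ in range(max_rounds):  # budget-based termination
        feedback = critic.generate(x, y_r)
        if feedback == "[STOP]":
            break
        history += [y_r, feedback]
        y_r = reasoner.generate(x, history=history)
    return y_r

final = reason_with_critique("student answer", StubReasoner(), StubCritic())
```

The budget cap (`max_rounds`) illustrates the practical need for a termination criterion beyond Critic affirmation alone.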
2. Architectures, Objectives, and Training Procedures
Common architectural motifs include:
- Explicit Role Decoupling: Two independently parameterized models (DARS, ReaCritic, Critic-V), or two specialized heads operating under a scheduling policy (Stepwise Think-Critique, Critique-Coder, Critic-CoT).
- Shared Model/Prompt-based Role Switching: A single LLM alternates between reasoner and critic modes (e.g., Critique-Coder (Ruan et al., 26 Sep 2025), LLaVA-Critic-R1 (Wang et al., 31 Aug 2025), Stepwise Think-Critique (Xu et al., 17 Dec 2025)), typically via differently structured prompts and context.
- Multi-Agent Pipelines: CRV (Critique–Rethink–Verify) uses cascaded LLMs for critique, rewrite, and verification, aligning source data with the cognitive capacity of smaller reasoning models (Cai et al., 14 Apr 2025).
- Feedback Integration: Critic feedback may be natural language instructions, structured tags, binary scalar judgments, or stepwise label sequences; these are explicitly consumed as new input contexts for the Reasoner, triggering further refinement or termination.
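These feedback variants can be represented under a single interface; a sketch with illustrative field names, not drawn from any cited system:

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class CritiqueFeedback:
    # One record covering the feedback variants listed above.
    text: Optional[str] = None               # natural-language instruction
    stop: bool = False                       # binary halt signal ("[STOP]")
    step_labels: Optional[List[int]] = None  # per-step 1/0 correctness tags

    def to_context(self) -> str:
        """Render feedback as new input context for the Reasoner."""
        if self.stop:
            return "[STOP]"
        parts = []
        if self.text:
            parts.append(f"Reflection: {self.text}")
        if self.step_labels is not None:
            tags = ", ".join(f"step {i}: {'ok' if ok else 'fix'}"
                             for i, ok in enumerate(self.step_labels, 1))
            parts.append(f"Stepwise labels: {tags}")
        return "\n".join(parts)
```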
Typical training procedures combine or sequence supervised fine-tuning—often on contrastively synthesized or reflection-annotated data—with reinforcement learning objectives. Losses may include:
- Reasoner Loss: Maximum likelihood over (input, rationale) and (input, rationale, feedback, refined rationale) pairs
- Critic Loss: Cross-entropy for reflection instruction or STOP token; DPO-style preference-optimization over ranked critiques
- Joint Steps: Teacher-forcing, cross-entropy, per-token credit assignment; RL with verbal feedback is realized via clipped PPO, GRPO, or customized advantage signals
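The DPO-style preference objective over ranked critiques, for instance, reduces to a logistic loss on log-probability margins between policy and reference models; a generic formulation, not code from any cited paper:

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Direct Preference Optimization loss for one (chosen, rejected)
    critique pair: -log sigmoid(beta * (policy margin - reference margin))."""
    margin = ((logp_chosen - ref_logp_chosen)
              - (logp_rejected - ref_logp_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# The loss shrinks as the policy prefers the chosen critique more strongly.
zero_margin = dpo_loss(-10.0, -10.0, -10.0, -10.0)   # margin 0 -> log 2
large_margin = dpo_loss(-5.0, -15.0, -10.0, -10.0)   # margin +10
```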
Notably, several frameworks emphasize staged training: e.g., Critique-RL (Xi et al., 28 Oct 2025) first maximizes discriminability of the critic using direct ground-truth-aligned reward, then introduces refinement signal conditional on the actor’s post-critique improvement.
3. Algorithmic Patterns and Inference-Time Operation
Inference-time operation of Reasoner-Critic systems is cyclic and interactive. The general loop is as follows:
- The Reasoner generates an initial chain-of-thought or solution candidate.
- The Critic evaluates and produces structured feedback: either a critique to correct, a binary signal to halt, or a ranked set of diagnoses on individual steps.
- The Reasoner consumes the feedback for targeted refinement.
- This loop repeats until Critic affirmation or a termination criterion (e.g., budget, depth) is reached.
The Critic’s output can directly shape the Reasoner’s input context—either by textual concatenation (e.g., “Reflection: ...”), attention over new tokens, or explicit prompt engineering.
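Textual concatenation, the simplest of these mechanisms, amounts to templating the prior rationale and the critique into the Reasoner's next context window; the prompt wording below is illustrative only:

```python
def build_refinement_prompt(x, rationale, feedback):
    """Concatenate task input, prior rationale, and critic feedback into
    the Reasoner's next input context (illustrative template)."""
    return (
        f"Task: {x}\n"
        f"Previous rationale: {rationale}\n"
        f"Reflection: {feedback}\n"
        f"Revise the rationale to address the reflection."
    )

prompt = build_refinement_prompt(
    "grade this answer", "score 3/5 because ...", "justify the deduction")
```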
The following table summarizes representative Reasoner-Critic paradigms:
| Paradigm | Role Implementation | Feedback Type | Training Regime |
|---|---|---|---|
| DARS | Two models | Natural-language critique | SFT on synthesized data |
| Stepwise Think-Critique | Single LLM | Interleaved critique steps | Joint RL + SFT |
| Critique-Coder | Single LLM (prompted) | Critique classification | RL (GRPO), CRL batch mixing |
| Critique-RL | Actor/Critic LLMs | NL feedback + binary labels | Two-stage RL |
| Critic-V | Two VLMs | Preference-optimized NL critique | DPO, rules via VEST |
4. Empirical Outcomes and Scaling Laws
Across domains—student answer assessment, code generation, RL for resource allocation, reasoning in vision-language models—Reasoner-Critic architectures consistently outperform single-model or pure reward-model-trained systems.
- DARS (Reflect w/ Critic) achieves +11% F1 gain over SFT/DPO baselines (Li et al., 26 Feb 2025).
- Critique-Coder (with 20% CRL data) outperforms RL-only variants by +2–7 points across code and logic benchmarks (Ruan et al., 26 Sep 2025).
- ReaCritic (large transformer-based critic) yields +35–40% final reward in RL for high-dimensional HetNets, outperforming shallow MLP critics (You et al., 16 May 2025).
- CriticSearch demonstrates up to +16.7% EM improvement on multi-hop QA by providing dense turn-level credit assignment (Zhang et al., 15 Nov 2025).
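Dense turn-level credit assignment of the kind CriticSearch reports can be illustrated by combining retrospective per-turn critic scores with a discounted share of the trajectory-level outcome; a generic sketch, not the paper's exact rule:

```python
def turn_level_returns(turn_scores, outcome, gamma=0.9):
    """Combine retrospective per-turn critic scores with a final outcome
    reward, discounting the outcome back through earlier turns."""
    n = len(turn_scores)
    returns = []
    for t, score in enumerate(turn_scores):
        # Each turn gets its local critic score plus the discounted outcome.
        returns.append(score + (gamma ** (n - 1 - t)) * outcome)
    return returns

rets = turn_level_returns([0.2, -0.1, 0.5], outcome=1.0)
```

This yields a non-sparse learning signal at every turn, which is the property credited with stabilizing RL training under sparse outcome supervision.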
Scaling studies show that increasing Critic capacity generally yields better downstream accuracy than increasing Reasoner capacity, particularly in RL or dense feedback regimes (Li et al., 26 Feb 2025; You et al., 16 May 2025; Cai et al., 14 Apr 2025).
5. Transparency, Robustness, and Interpretability
The primary motivation for Reasoner-Critic architectures lies in improving system transparency and interpretability:
- Explicit Feedback Loops: Verbal or structured critiques localize errors, making reasoning paths and model decision boundaries auditable.
- Stepwise Labeling/Refinement: System-2-style “slow thinking” via stepwise critique (e.g., Critic-CoT (Zheng et al., 2024), Stepwise Think-Critique (Xu et al., 17 Dec 2025)) enhances robustness: invalid or faulty reasoning steps are filtered or corrected in situ.
- Retrospective and Dense Supervision: CriticSearch demonstrates that retrospectively evaluating entire reasoning traces and providing per-turn reward is effective for stabilizing RL training under sparse outcome supervision (Zhang et al., 15 Nov 2025).
High-accuracy critics also enable downstream applications: filtering and majority voting (e.g., RefCritic (Tang et al., 20 Jul 2025)), targeted test-time scaling (“best-of-K via critique” (Xu et al., 17 Dec 2025)), and interpretability via saliency or purpose maps (A2CR (Guo et al., 2023)).
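Best-of-K selection via critique, for example, reduces to scoring K sampled candidates with the critic and keeping the top-rated one; a minimal sketch in which the scoring function is a placeholder for a critic LLM's scalar judgment:

```python
def best_of_k(x, candidates, critic_score):
    """Return the candidate the critic rates highest. critic_score is any
    callable mapping (input, candidate) -> float; here it stands in for a
    critic model's scalar judgment."""
    return max(candidates, key=lambda y: critic_score(x, y))

# Toy critic: prefer answers that include a justification keyword.
toy_critic = lambda x, y: float("proof" in y)
best = best_of_k("show 2+2=4",
                 ["it's 4", "proof: 2+2=4 by arithmetic"],
                 toy_critic)
```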
6. Limitations, Open Challenges, and Future Prospects
Limitations persist, including:
- Compute Overhead: Training and inference with dual models double FLOPs relative to classic approaches (Li et al., 26 Feb 2025).
- Task/Domain Specificity: Most frameworks are task-specific; generalization to arbitrary reasoning domains remains an open question.
- Human-in-the-Loop Extensions: Several methods (REFINER (Paul et al., 2023), Critique-RL) highlight the ease of accommodating human feedback at inference, but robust integration into scalable pipelines is not fully explored.
- Joint Optimization: Training Reasoner and Critic jointly is delicate; most successful implementations either fix one component (frozen critic in CriticSearch) or adopt staged optimization (two-phase RL in Critique-RL or RefCritic).
Potential extensions include module sharing (via adapters), knowledge transfer for cross-task generalization, and unified architectures where stepwise reasoning and critique are both deeply interleaved (Stepwise Think-Critique).
7. Representative Implementations and Benchmarks
Key systems and associated research groups include:
- DARS: Dual-Model Reflective Scoring for student answer grading (Li et al., 26 Feb 2025).
- ReaCritic: Transformer-based critic for DRL and heterogeneous networks (You et al., 16 May 2025).
- Critique-Coder: Unified policy/critic for code generation and general reasoning (Ruan et al., 26 Sep 2025).
- Critic-CoT: Stepwise chain-of-thought critic, iterated diagnosis and filtering (Zheng et al., 2024).
- Stepwise Think-Critique: Interleaved reasoning and self-critique for math (Xu et al., 17 Dec 2025).
- Critic-V: DPO-trained VLM critics for VQA and multimodal tasks (Zhang et al., 2024).
- RefCritic: Long-chain critic with refinement reward, process-level step detection (Tang et al., 20 Jul 2025).
- CRV: Critique-Rethink-Verify pipeline for small reasoning LLMs (Cai et al., 14 Apr 2025).
- OpenREAD, CriticSearch: End-to-end RL and retrospective credit assignment for tool-augmented and vision-language domains (Zhang et al., 1 Dec 2025; Zhang et al., 15 Nov 2025).
In summary, the Reasoner-Critic architecture paradigm—whether via explicit model pairs or flexible single-model alternation—enables structured reasoning improvement through successive, interpretable, and actionable critique-and-refinement cycles. This design is increasingly central in scaling model transparency, robustness, and sample efficiency across diverse reasoning and decision-making benchmarks.