
Consistency Group Relative Policy Optimization (Con-GRPO)

Updated 8 February 2026
  • Con-GRPO is a reinforcement learning algorithm that optimizes consistency across semantically equivalent queries using group-based reward computation.
  • It employs within-group normalized advantage estimation and a clipped PPO surrogate loss to enforce output agreement in retrieval-augmented generation systems.
  • Empirical benchmarks show significant gains in lexical and LLM-judge consistency as well as accuracy on both short- and long-form QA tasks.

Consistency Group Relative Policy Optimization (Con-GRPO) is a reinforcement learning (RL) algorithmic framework developed to directly optimize consistency objectives across groups of semantically equivalent inputs. The primary context of its application is language generation systems, particularly Retrieval-Augmented Generation (RAG), where ensuring that paraphrased queries yield consistent (i.e., stable in informational content) answers is essential for reliability, trust, and compliance in high-stakes settings. Con-GRPO instantiates Group Relative Policy Optimization (GRPO), where advantage estimation and optimization occur not for individual samples, but in a groupwise context with rewards reflecting cross-sample agreement, thereby enabling direct control of consistency properties (Hamman et al., 5 Oct 2025).

1. Formal Setting, Definitions, and Notation

Con-GRPO is typically instantiated in RAG architectures, comprising:

  • A retriever $R$ that maps a query $q$ to a document set $R(q)=\mathcal{D}(q)\subset\mathcal{D}$.
  • A generator (parametric policy) $\pi_\theta$, which yields a distribution over output sequences $y$ conditional on the query and retrieved documents:

$$\pi_\theta(y\mid q) = \pi_\theta(y\mid q,\; R(q))$$

  • A paraphrase set (group) $G = \{q_1, \dots, q_n\} = \mathcal{P}(q_0)$, representing $n$ semantically equivalent variants of the canonical query $q_0$.

For each paraphrase $q_i$, $g$ rollouts are sampled, generating $\{o_{i,1}, \ldots, o_{i,g}\}$ with $o_{i,j}\sim\pi_\theta(\cdot\mid q_i)$ for $i=1,\ldots,n$ and $j=1,\ldots,g$.

The principal consistency goal is for all outputs corresponding to any paraphrase in GG to convey the same core information, regardless of input phrasing or retriever variability.
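This sampling setup can be sketched in a few lines; `toy_policy`, its template outputs, and the query strings below are illustrative stand-ins, since a real system would decode an LLM conditioned on the query and its retrieved documents:

```python
import random

def toy_policy(query, rng):
    # Stand-in for the generator pi_theta(. | q, R(q)); returns a token list.
    templates = [["paris"], ["the", "capital", "is", "paris"], ["paris", "france"]]
    return rng.choice(templates)

def sample_group_rollouts(paraphrases, g, rng):
    # rollouts[i][j] = o_{i,j}: the j-th sampled output for paraphrase q_i.
    return [[toy_policy(q, rng) for _ in range(g)] for q in paraphrases]

rng = random.Random(0)
group = ["capital of France?", "France's capital city?", "which city is France's capital?"]
rollouts = sample_group_rollouts(group, g=4, rng=rng)
assert len(rollouts) == 3 and all(len(row) == 4 for row in rollouts)
```

The nested list indexed by (paraphrase, rollout) is the structure every later step (reward, normalization, loss) operates on.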

2. Group Similarity Rewards and Computation

Central to Con-GRPO is the group similarity reward, which incentivizes statistical agreement among the outputs for all members of the paraphrase set. Pairwise similarity between outputs is measured using a token-level function, most commonly BLEU-$k$ (unigram for short-form, bigram for long-form QA):

$$r_{i,j}^{\mathrm{cons}} = \frac{1}{(n-1)\,g} \sum_{u\neq i}^{n} \sum_{m=1}^{g} \operatorname{sim}\bigl(o_{i,j},\, o_{u,m}\bigr)$$

Each rollout's consistency reward is its average similarity to all rollouts from the other paraphrases in the group. For short-form QA where a reference $y^\star$ exists, an accuracy term $\mathrm{Acc}(o_{i,j}, y^\star)$ (e.g., token F1 or exact match) is added, forming the final reward:

$$r_{i,j}^{\mathrm{final}} = \alpha\, r_{i,j}^{\mathrm{cons}} + \gamma\,\mathrm{Acc}(o_{i,j},\, y^\star)$$

where hyperparameters $\alpha, \gamma \ge 0$ (typically $\alpha=\gamma=1$ for short-form, $\gamma=0$ for long-form/open-ended QA).
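A direct transcription of these rewards might look as follows; the clipped unigram-precision `sim` is a minimal BLEU-1-style proxy (no brevity penalty), and all function names are illustrative rather than from the paper's code:

```python
from collections import Counter

def sim(cand, ref):
    # Clipped unigram precision of cand against ref: a BLEU-1-like proxy.
    if not cand:
        return 0.0
    c, r = Counter(cand), Counter(ref)
    return sum(min(k, r[tok]) for tok, k in c.items()) / len(cand)

def group_consistency_rewards(rollouts):
    # r[i][j]: average similarity of o_{i,j} to every rollout of the *other*
    # paraphrases in the group, per the all-pairs formula above.
    n, g = len(rollouts), len(rollouts[0])
    r = [[0.0] * g for _ in range(n)]
    for i in range(n):
        for j in range(g):
            total = sum(sim(rollouts[i][j], rollouts[u][m])
                        for u in range(n) if u != i for m in range(g))
            r[i][j] = total / ((n - 1) * g)
    return r

def final_rewards(rollouts, acc, alpha=1.0, gamma=1.0):
    # Adds the accuracy term for short-form QA; acc[i][j] would be
    # token F1 or exact match of o_{i,j} against the reference y*.
    cons = group_consistency_rewards(rollouts)
    return [[alpha * cons[i][j] + gamma * acc[i][j]
             for j in range(len(cons[i]))] for i in range(len(cons))]

# All paraphrases answering identically yields the maximal consistency reward:
same = [[["paris"]] * 2 for _ in range(3)]
assert group_consistency_rewards(same) == [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]]
```

Note that a rollout is never compared with outputs of its own paraphrase, so the reward measures cross-paraphrase agreement rather than self-similarity.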

To address the $O(n^2 g^2)$ cost of all-pairs group reward computation, a scalable unbiased approximation samples only a subset of the other paraphrases ($\kappa \ll n-1$) and of their rollouts ($s \ll g$):

$$\tilde{r}_{i,j} = \frac{1}{\kappa\, s} \sum_{u\in K} \sum_{m\in S_u} \operatorname{sim}(o_{i,j},\, o_{u,m})$$

with sampled index sets $K\subset\{1,\ldots,n\}\setminus\{i\}$, $|K|=\kappa$, and $S_u\subset\{1,\ldots,g\}$, $|S_u|=s$.

In practice, $n=6$, $g=4$, $\kappa=3$, $s=1$ suffice, resulting in linear computational scaling.
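The subsampled estimator can be sketched as below; `approx_reward` and the exact-match `sim` are illustrative stand-ins (any pairwise similarity slots in), and only the estimator's expectation, not any single draw, matches the full all-pairs reward:

```python
import random

def sim(a, b):
    # Toy similarity for this sketch: exact-match indicator on token lists.
    return 1.0 if a == b else 0.0

def approx_reward(rollouts, i, j, kappa, s, rng):
    # Unbiased subsample of the all-pairs reward: compare o_{i,j} with s
    # rollouts from each of kappa other paraphrases, not all (n-1)*g pairs.
    others = [u for u in range(len(rollouts)) if u != i]
    K = rng.sample(others, kappa)
    total = 0.0
    for u in K:
        for m in rng.sample(range(len(rollouts[u])), s):
            total += sim(rollouts[i][j], rollouts[u][m])
    return total / (kappa * s)

# With identical outputs everywhere, every subsample agrees with the full reward:
rng = random.Random(0)
same = [[["x"]] * 4 for _ in range(6)]
assert approx_reward(same, 0, 0, kappa=3, s=1, rng=rng) == 1.0
```

With the paper's $n=6$, $g=4$, $\kappa=3$, $s=1$, each rollout needs 3 comparisons instead of 20.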

3. Policy Optimization and GRPO Objective

Rather than estimate a state-value critic, Con-GRPO employs a within-group normalization for advantage estimation. For each paraphrase $q_i$:

  • Compute the reward mean $\mu_i$ and standard deviation $\sigma_i$:

$$\mu_i = \frac{1}{g}\sum_{j=1}^{g} r_{i,j}, \qquad \sigma_i = \sqrt{\frac{1}{g}\sum_{j=1}^{g} (r_{i,j} - \mu_i)^2}$$

  • The normalized advantage is

$$\hat{A}_{i,j} = \frac{r_{i,j} - \mu_i}{\sigma_i + \epsilon}$$

where a small constant $\epsilon > 0$ ensures numerical stability.
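The normalization is a few lines of arithmetic; this sketch uses the population standard deviation and an assumed default $\epsilon$:

```python
import math

def group_advantages(rewards, eps=1e-4):
    # Within-group normalization: A_hat[i][j] = (r[i][j] - mu_i) / (sigma_i + eps),
    # computed independently over each paraphrase's g rollout rewards.
    A = []
    for row in rewards:
        g = len(row)
        mu = sum(row) / g
        sigma = math.sqrt(sum((r - mu) ** 2 for r in row) / g)
        A.append([(r - mu) / (sigma + eps) for r in row])
    return A
```

Because each row is centered on its own mean, a group whose rollouts all agree (constant rewards) yields zero advantage everywhere, so no gradient pressure is applied to already-consistent paraphrases.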

The GRPO surrogate objective (PPO-style, maximized during training) is:

$$\mathcal{L}_{\mathrm{GRPO}}(\theta) = \frac{1}{ng} \sum_{i=1}^{n} \sum_{j=1}^{g} \left[\, \sum_{t=1}^{|o_{i,j}|} \min\Bigl(\rho_{i,j,t}\, \hat{A}_{i,j},\ \operatorname{clip}(\rho_{i,j,t},\, 1-\epsilon,\, 1+\epsilon)\, \hat{A}_{i,j}\Bigr) - \beta\,\mathrm{KL}\bigl(\pi_\theta(\cdot\mid q_i)\,\Vert\,\pi_{\mathrm{ref}}(\cdot\mid q_i)\bigr) \right]$$

where $\rho_{i,j,t}$ is the (possibly token-level) importance weight, $\epsilon$ the clipping parameter, and $\beta$ a KL-regularization coefficient to avoid policy drift.

In effect, this structure:

  • Increases the probability of outputs whose normalized agreement within their paraphrase group is high.
  • Penalizes significant divergence from the reference (pre-trained) policy.
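A dependency-free sketch of the clipped surrogate follows. It is an assumption-laden illustration, not the paper's implementation: the per-token KL is estimated with the naive log-ratio $\log\pi_\theta - \log\pi_{\mathrm{ref}}$ (real implementations often use other estimators), and real training would compute gradients via autodiff over a model's log-probabilities:

```python
import math

def grpo_objective(logp_new, logp_old, logp_ref, advantages, eps=0.2, beta=0.05):
    # logp_*: per-rollout lists of per-token log-probabilities of the sampled
    # tokens under the current, behavior, and reference policies.
    # advantages: one normalized A_hat per rollout, shared across its tokens.
    # Returns the scalar surrogate to maximize.
    total, n_rollouts = 0.0, 0
    for lp_n, lp_o, lp_r, A in zip(logp_new, logp_old, logp_ref, advantages):
        for t in range(len(lp_n)):
            rho = math.exp(lp_n[t] - lp_o[t])          # importance weight
            clipped = max(1 - eps, min(rho, 1 + eps))  # clip(rho, 1-eps, 1+eps)
            kl = lp_n[t] - lp_r[t]                     # naive per-token KL estimate
            total += min(rho * A, clipped * A) - beta * kl
        n_rollouts += 1
    return total / n_rollouts

# Sanity check: on-policy (rho = 1) with beta = 0 reduces to summed advantages.
lp = [[-1.0, -2.0]]
assert abs(grpo_objective(lp, lp, lp, [0.5], beta=0.0) - 1.0) < 1e-9
```

The `min` with the clipped term makes the update pessimistic: a rollout cannot gain extra objective by pushing its importance ratio far beyond the trust region.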

4. Training Procedure, Implementation, and Hyperparameters

The Con-GRPO procedure is organized as follows:

  1. For each batch, sample a set of canonical queries.
  2. For each canonical query, generate nn paraphrases.
  3. Retrieve documents and sample gg rollouts per paraphrase.
  4. Compute approximate group similarity rewards per rollout.
  5. Normalize rewards within paraphrase groups to compute advantages.
  6. Accumulate the clipped GRPO loss and update parameters via backpropagation.
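The six steps can be tied together in a minimal orchestration sketch; every helper passed in below is an illustrative stand-in (a real run decodes an LLM, queries a retriever, and updates $\pi_\theta$ by backpropagation on the clipped loss):

```python
def train_step(canonical_queries, paraphrase_fn, rollout_fn,
               reward_fn, advantage_fn, update_fn, n=6, g=4):
    for q0 in canonical_queries:                                       # step 1
        group = paraphrase_fn(q0, n)                                   # step 2
        rollouts = [[rollout_fn(q) for _ in range(g)] for q in group]  # step 3
        rewards = reward_fn(rollouts)                                  # step 4
        advs = advantage_fn(rewards)                                   # step 5
        update_fn(rollouts, advs)                                      # step 6

# Toy wiring to show the call shapes:
log = []
train_step(
    ["q0"],
    paraphrase_fn=lambda q, n: [f"{q}-p{i}" for i in range(n)],
    rollout_fn=lambda q: [q],
    reward_fn=lambda rs: [[1.0] * len(row) for row in rs],
    advantage_fn=lambda rw: [[0.0] * len(row) for row in rw],
    update_fn=lambda ro, ad: log.append((len(ro), len(ad))),
    n=2, g=3,
)
assert log == [(2, 2)]
```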

Key hyperparameters:

  • Paraphrase set size: $n=6$
  • Rollouts per paraphrase: $g=4$
  • Similarity metric: BLEU-$1$ for short-form, BLEU-$2$ for long-form
  • Reward weights: $\alpha=1$, $\gamma=1$ (short-form), $\gamma=0$ (open-ended)
  • KL penalty: $\beta=0.0$ (supervised), $\beta=0.05$ (open-ended)
  • Optimizer: AdamW, learning rate $1\mathrm{e}{-6}$
  • Decoding temperature at training matches inference (commonly $T=0.0$ for deterministic rollouts)
  • Batch size: 1–2 canonical queries per GPU (dependent on model/memory)

5. Empirical Outcomes and Benchmarks

Con-GRPO, instantiated as "Information Consistent RAG" (Con-RAG), demonstrates pronounced consistency and accuracy gains relative to standard RAG and strong RL baselines (e.g., DRAG, CoT-RAG, SFT):

Short-form QA (LLaMA-3.1-8B, TriviaQA):

Metric                                   RAG Baseline   Con-RAG
End-to-end lexical consistency (%)           53.0         87.3
End-to-end LLM-judge consistency (%)         77.8         91.3
Generator lexical consistency (%)            67.3         91.2
Generator LLM-judge consistency (%)          88.5         93.0
Exact Match accuracy (%)                     56.0         77.0
Token F1 (%)                                 66.1         81.0

Long-form QA (ELI5, LLaMA-3.1-8B):

Metric                                   RAG Baseline   Con-RAG
End-to-end lexical consistency (%)            8.6         14.6
End-to-end LLM-judge consistency (%)         62.8         72.7
Generator lexical consistency (%)            15.1         21.7
Generator LLM-judge consistency (%)          74.2         80.8
ROUGE accuracy                               21.9         24.2
LLM-judge accuracy (%)                       74.0         78.0

Con-GRPO continues to outperform supervised fine-tuning (SFT) even when ground truth is unavailable, confirming the efficacy of group similarity rewards for open-ended or reference-free tasks (Hamman et al., 5 Oct 2025).

6. Relationship to Broader GRPO Paradigm and Extensions

Con-GRPO is a specialization of GRPO for information consistency, leveraging group reward computation and normalization within paraphrase sets. Related GRPO variants extend the paradigm to other objectives:

  • Constrained GRPO imposes explicit behavioral constraints via Lagrangian relaxation and scalarized advantage construction, outperforming naive scalarization approaches in both theory and practice (Girgis et al., 5 Feb 2026).
  • Consensus GRPO distills Minimum Bayes Risk decoding into a policy optimized using only groupwise consensus (e.g., BLEURT) as utility, eliminating reliance on gold references or preference labels (Ichihara et al., 3 Feb 2026).
  • Continuous Control Con-GRPO applies group-based normalization to trajectory clusters and state-aware credit assignment for high-dimensional, continuous-action tasks (Khanda et al., 25 Jul 2025).

These extensions maintain the core principle of intra-group normalized advantage estimation, with variations in reward construction and applicability.

7. Limitations and Future Directions

While Con-GRPO has demonstrated clear empirical gains in consistency and accuracy across QA and RAG tasks, limitations and open research directions include:

  • Reward functions are dependent on lexical similarity (e.g., BLEU), which may not fully capture semantic agreement.
  • Approximate reward computation, while efficient, introduces stochastic variance not present in exhaustive all-pairs evaluation.
  • Extension to more diverse and challenging paraphrase sets, dialogue, non-English queries, and non-retrieval settings remains to be explored.
  • Integration of richer semantic similarity metrics or human-aligned utility functions to minimize the risk of overfitting to lexical form rather than true informational content.

The framework provides a reproducible and theoretically grounded approach for aligning large-scale neural generation systems with consistency-centric deployment constraints (Hamman et al., 5 Oct 2025).
