
Consistency Group Relative Policy Optimization (Con-GRPO)

Updated 8 February 2026
  • Con-GRPO is a reinforcement learning algorithm that optimizes consistency across semantically equivalent queries using group-based reward computation.
  • It employs within-group normalized advantage estimation and a clipped PPO surrogate loss to enforce output agreement in retrieval-augmented generation systems.
  • Empirical benchmarks show significant gains in lexical and LLM-judge consistency as well as accuracy on both short- and long-form QA tasks.

Consistency Group Relative Policy Optimization (Con-GRPO) is a reinforcement learning (RL) algorithmic framework developed to directly optimize consistency objectives across groups of semantically equivalent inputs. The primary context of its application is language generation systems, particularly Retrieval-Augmented Generation (RAG), where ensuring that paraphrased queries yield consistent (i.e., stable in informational content) answers is essential for reliability, trust, and compliance in high-stakes settings. Con-GRPO instantiates Group Relative Policy Optimization (GRPO), where advantage estimation and optimization occur not for individual samples, but in a groupwise context with rewards reflecting cross-sample agreement, thereby enabling direct control of consistency properties (Hamman et al., 5 Oct 2025).

1. Formal Setting, Definitions, and Notation

Con-GRPO is typically instantiated in RAG architectures, comprising:

  • A retriever $R$ that maps a query $q$ to a document set $R(q)=\mathcal{D}(q)\subset\mathcal{D}$.
  • A generator (parametric policy) $\pi_\theta$, which yields a distribution over output sequences $y$ conditional on the query and retrieved documents:

$$\pi_\theta(y\mid q) = \pi_\theta(y\mid q,\; R(q))$$

  • A paraphrase set (group) $G = \{q_1, \dots, q_n\} = \mathcal{P}(q_0)$, representing $n$ semantically equivalent variants of the canonical query $q_0$.

For each paraphrase $q_i$, $g$ rollouts are sampled, generating $\{o_{i,1}, \ldots, o_{i,g}\}$ with $o_{i,j}\sim\pi_\theta(\cdot\mid q_i)$ for $i=1,\ldots,n$ and $j=1,\ldots,g$.

The principal consistency goal is for all outputs corresponding to any paraphrase in GG to convey the same core information, regardless of input phrasing or retriever variability.
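This sampling setup can be sketched in a few lines; `toy_policy`, its template outputs, and the query strings below are illustrative stand-ins, since a real system would decode an LLM conditioned on the query and its retrieved documents:

```python
import random

def toy_policy(query, rng):
    # Stand-in for the generator pi_theta(. | q, R(q)); returns a token list.
    templates = [["paris"], ["the", "capital", "is", "paris"], ["paris", "france"]]
    return rng.choice(templates)

def sample_group_rollouts(paraphrases, g, rng):
    # rollouts[i][j] = o_{i,j}: the j-th sampled output for paraphrase q_i.
    return [[toy_policy(q, rng) for _ in range(g)] for q in paraphrases]

rng = random.Random(0)
group = ["capital of France?", "France's capital city?", "which city is France's capital?"]
rollouts = sample_group_rollouts(group, g=4, rng=rng)
assert len(rollouts) == 3 and all(len(row) == 4 for row in rollouts)
```

The nested list indexed by (paraphrase, rollout) is the structure every later step (reward, normalization, loss) operates on.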

2. Group Similarity Rewards and Computation

Central to Con-GRPO is the group similarity reward, which incentivizes statistical agreement among the outputs for all members of the paraphrase set. Pairwise similarity between outputs is measured using a token-level function, most commonly BLEU-$k$ (unigram for short-form, bigram for long-form QA):

$$r_{i,j}^{\mathrm{cons}} = \frac{1}{(n-1)\,g} \sum_{u\neq i}^{n} \sum_{m=1}^{g} \operatorname{sim}\bigl(o_{i,j},\, o_{u,m}\bigr)$$

Each rollout's consistency reward is its average similarity to all rollouts from the other paraphrases in the group. For short-form QA where a reference $y^\star$ exists, an accuracy term $\mathrm{Acc}(o_{i,j}, y^\star)$ (e.g., token F1 or exact match) is added, forming the final reward:

$$r_{i,j}^{\mathrm{final}} = \alpha\, r_{i,j}^{\mathrm{cons}} + \gamma\,\mathrm{Acc}(o_{i,j},\, y^\star)$$

where hyperparameters $\alpha, \gamma \ge 0$ (typically $\alpha=\gamma=1$ for short-form, $\gamma=0$ for long-form/open-ended QA).
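A direct transcription of these rewards might look as follows; the clipped unigram-precision `sim` is a minimal BLEU-1-style proxy (no brevity penalty), and all function names are illustrative rather than from the paper's code:

```python
from collections import Counter

def sim(cand, ref):
    # Clipped unigram precision of cand against ref: a BLEU-1-like proxy.
    if not cand:
        return 0.0
    c, r = Counter(cand), Counter(ref)
    return sum(min(k, r[tok]) for tok, k in c.items()) / len(cand)

def group_consistency_rewards(rollouts):
    # r[i][j]: average similarity of o_{i,j} to every rollout of the *other*
    # paraphrases in the group, per the all-pairs formula above.
    n, g = len(rollouts), len(rollouts[0])
    r = [[0.0] * g for _ in range(n)]
    for i in range(n):
        for j in range(g):
            total = sum(sim(rollouts[i][j], rollouts[u][m])
                        for u in range(n) if u != i for m in range(g))
            r[i][j] = total / ((n - 1) * g)
    return r

def final_rewards(rollouts, acc, alpha=1.0, gamma=1.0):
    # Adds the accuracy term for short-form QA; acc[i][j] would be
    # token F1 or exact match of o_{i,j} against the reference y*.
    cons = group_consistency_rewards(rollouts)
    return [[alpha * cons[i][j] + gamma * acc[i][j]
             for j in range(len(cons[i]))] for i in range(len(cons))]

# All paraphrases answering identically yields the maximal consistency reward:
same = [[["paris"]] * 2 for _ in range(3)]
assert group_consistency_rewards(same) == [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]]
```

Note that a rollout is never compared with outputs of its own paraphrase, so the reward measures cross-paraphrase agreement rather than self-similarity.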

To address the $O(n^2 g^2)$ cost of all-pairs group reward computation, a scalable unbiased approximation samples only a subset of the other paraphrases ($\kappa \ll n-1$) and of their rollouts ($s \ll g$):

$$\tilde{r}_{i,j} = \frac{1}{\kappa\, s} \sum_{u\in K} \sum_{m\in S_u} \operatorname{sim}(o_{i,j},\, o_{u,m})$$

with sampled index sets $K\subset\{1,\ldots,n\}\setminus\{i\}$, $|K|=\kappa$, and $S_u\subset\{1,\ldots,g\}$, $|S_u|=s$.

In practice, $n=6$, $g=4$, $\kappa=3$, $s=1$ suffice, resulting in linear computational scaling.
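The subsampled estimator can be sketched as below; `approx_reward` and the exact-match `sim` are illustrative stand-ins (any pairwise similarity slots in), and only the estimator's expectation, not any single draw, matches the full all-pairs reward:

```python
import random

def sim(a, b):
    # Toy similarity for this sketch: exact-match indicator on token lists.
    return 1.0 if a == b else 0.0

def approx_reward(rollouts, i, j, kappa, s, rng):
    # Unbiased subsample of the all-pairs reward: compare o_{i,j} with s
    # rollouts from each of kappa other paraphrases, not all (n-1)*g pairs.
    others = [u for u in range(len(rollouts)) if u != i]
    K = rng.sample(others, kappa)
    total = 0.0
    for u in K:
        for m in rng.sample(range(len(rollouts[u])), s):
            total += sim(rollouts[i][j], rollouts[u][m])
    return total / (kappa * s)

# With identical outputs everywhere, every subsample agrees with the full reward:
rng = random.Random(0)
same = [[["x"]] * 4 for _ in range(6)]
assert approx_reward(same, 0, 0, kappa=3, s=1, rng=rng) == 1.0
```

With the paper's $n=6$, $g=4$, $\kappa=3$, $s=1$, each rollout needs 3 comparisons instead of 20.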

3. Policy Optimization and GRPO Objective

Rather than estimate a state-value critic, Con-GRPO employs a within-group normalization for advantage estimation. For each paraphrase $q_i$:

  • Compute the reward mean $\mu_i$ and standard deviation $\sigma_i$:

$$\mu_i = \frac{1}{g}\sum_{j=1}^{g} r_{i,j}, \qquad \sigma_i = \sqrt{\frac{1}{g}\sum_{j=1}^{g} (r_{i,j} - \mu_i)^2}$$

  • The normalized advantage is

$$\hat{A}_{i,j} = \frac{r_{i,j} - \mu_i}{\sigma_i + \epsilon}$$

where a small constant $\epsilon > 0$ ensures numerical stability.
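The normalization is a few lines of arithmetic; this sketch uses the population standard deviation and an assumed default $\epsilon$:

```python
import math

def group_advantages(rewards, eps=1e-4):
    # Within-group normalization: A_hat[i][j] = (r[i][j] - mu_i) / (sigma_i + eps),
    # computed independently over each paraphrase's g rollout rewards.
    A = []
    for row in rewards:
        g = len(row)
        mu = sum(row) / g
        sigma = math.sqrt(sum((r - mu) ** 2 for r in row) / g)
        A.append([(r - mu) / (sigma + eps) for r in row])
    return A
```

Because each row is centered on its own mean, a group whose rollouts all agree (constant rewards) yields zero advantage everywhere, so no gradient pressure is applied to already-consistent paraphrases.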

The GRPO surrogate objective (PPO-style, maximized during training) is:

$$\mathcal{L}_{\mathrm{GRPO}}(\theta) = \frac{1}{ng} \sum_{i=1}^{n} \sum_{j=1}^{g} \left[\, \sum_{t=1}^{|o_{i,j}|} \min\Bigl(\rho_{i,j,t}\, \hat{A}_{i,j},\ \operatorname{clip}(\rho_{i,j,t},\, 1-\epsilon,\, 1+\epsilon)\, \hat{A}_{i,j}\Bigr) - \beta\,\mathrm{KL}\bigl(\pi_\theta(\cdot\mid q_i)\,\Vert\,\pi_{\mathrm{ref}}(\cdot\mid q_i)\bigr) \right]$$

where $\rho_{i,j,t}$ is the (possibly token-level) importance weight, $\epsilon$ the clipping parameter, and $\beta$ a KL-regularization coefficient to avoid policy drift.

In effect, this structure:

  • Increases the probability of outputs whose normalized agreement within their paraphrase group is high.
  • Penalizes significant divergence from the reference (pre-trained) policy.
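A dependency-free sketch of the clipped surrogate follows. It is an assumption-laden illustration, not the paper's implementation: the per-token KL is estimated with the naive log-ratio $\log\pi_\theta - \log\pi_{\mathrm{ref}}$ (real implementations often use other estimators), and real training would compute gradients via autodiff over a model's log-probabilities:

```python
import math

def grpo_objective(logp_new, logp_old, logp_ref, advantages, eps=0.2, beta=0.05):
    # logp_*: per-rollout lists of per-token log-probabilities of the sampled
    # tokens under the current, behavior, and reference policies.
    # advantages: one normalized A_hat per rollout, shared across its tokens.
    # Returns the scalar surrogate to maximize.
    total, n_rollouts = 0.0, 0
    for lp_n, lp_o, lp_r, A in zip(logp_new, logp_old, logp_ref, advantages):
        for t in range(len(lp_n)):
            rho = math.exp(lp_n[t] - lp_o[t])          # importance weight
            clipped = max(1 - eps, min(rho, 1 + eps))  # clip(rho, 1-eps, 1+eps)
            kl = lp_n[t] - lp_r[t]                     # naive per-token KL estimate
            total += min(rho * A, clipped * A) - beta * kl
        n_rollouts += 1
    return total / n_rollouts

# Sanity check: on-policy (rho = 1) with beta = 0 reduces to summed advantages.
lp = [[-1.0, -2.0]]
assert abs(grpo_objective(lp, lp, lp, [0.5], beta=0.0) - 1.0) < 1e-9
```

The `min` with the clipped term makes the update pessimistic: a rollout cannot gain extra objective by pushing its importance ratio far beyond the trust region.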

4. Training Procedure, Implementation, and Hyperparameters

The Con-GRPO procedure is organized as follows:

  1. For each batch, sample a set of canonical queries.
  2. For each canonical query, generate nn paraphrases.
  3. Retrieve documents and sample gg rollouts per paraphrase.
  4. Compute approximate group similarity rewards per rollout.
  5. Normalize rewards within paraphrase groups to compute advantages.
  6. Accumulate the clipped GRPO loss and update parameters via backpropagation.
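The six steps can be tied together in a minimal orchestration sketch; every helper passed in below is an illustrative stand-in (a real run decodes an LLM, queries a retriever, and updates $\pi_\theta$ by backpropagation on the clipped loss):

```python
def train_step(canonical_queries, paraphrase_fn, rollout_fn,
               reward_fn, advantage_fn, update_fn, n=6, g=4):
    for q0 in canonical_queries:                                       # step 1
        group = paraphrase_fn(q0, n)                                   # step 2
        rollouts = [[rollout_fn(q) for _ in range(g)] for q in group]  # step 3
        rewards = reward_fn(rollouts)                                  # step 4
        advs = advantage_fn(rewards)                                   # step 5
        update_fn(rollouts, advs)                                      # step 6

# Toy wiring to show the call shapes:
log = []
train_step(
    ["q0"],
    paraphrase_fn=lambda q, n: [f"{q}-p{i}" for i in range(n)],
    rollout_fn=lambda q: [q],
    reward_fn=lambda rs: [[1.0] * len(row) for row in rs],
    advantage_fn=lambda rw: [[0.0] * len(row) for row in rw],
    update_fn=lambda ro, ad: log.append((len(ro), len(ad))),
    n=2, g=3,
)
assert log == [(2, 2)]
```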

Key hyperparameters:

  • Paraphrase set size: $n=6$
  • Rollouts per paraphrase: $g=4$
  • Similarity metric: BLEU-$1$ for short-form, BLEU-$2$ for long-form
  • Reward weights: $\alpha=1$, $\gamma=1$ (short-form), $\gamma=0$ (open-ended)
  • KL penalty: $\beta=0.0$ (supervised), $\beta=0.05$ (open-ended)
  • Optimizer: AdamW, learning rate $1\mathrm{e}{-6}$
  • Decoding temperature at training matches inference (commonly $T=0.0$ for deterministic rollouts)
  • Batch size: 1–2 canonical queries per GPU (dependent on model/memory)

5. Empirical Outcomes and Benchmarks

Con-GRPO, instantiated as "Information Consistent RAG" (Con-RAG), demonstrates pronounced consistency and accuracy gains relative to standard RAG and strong RL baselines (e.g., DRAG, CoT-RAG, SFT):

Short-form QA (LLaMA-3.1-8B, TriviaQA):

Metric                                   RAG Baseline   Con-RAG
End-to-end lexical consistency (%)           53.0         87.3
End-to-end LLM-judge consistency (%)         77.8         91.3
Generator lexical consistency (%)            67.3         91.2
Generator LLM-judge consistency (%)          88.5         93.0
Exact Match accuracy (%)                     56.0         77.0
Token F1 (%)                                 66.1         81.0

Long-form QA (ELI5, LLaMA-3.1-8B):

Metric                                   RAG Baseline   Con-RAG
End-to-end lexical consistency (%)            8.6         14.6
End-to-end LLM-judge consistency (%)         62.8         72.7
Generator lexical consistency (%)            15.1         21.7
Generator LLM-judge consistency (%)          74.2         80.8
ROUGE accuracy                               21.9         24.2
LLM-judge accuracy (%)                       74.0         78.0

Con-GRPO continues to outperform supervised fine-tuning (SFT) even when ground truth is unavailable, confirming the efficacy of group similarity rewards for open-ended or reference-free tasks (Hamman et al., 5 Oct 2025).

6. Relationship to Broader GRPO Paradigm and Extensions

Con-GRPO is a specialization of GRPO for information consistency, leveraging group reward computation and normalization within paraphrase sets. Related GRPO variants extend the paradigm to other objectives:

  • Constrained GRPO imposes explicit behavioral constraints via Lagrangian relaxation and scalarized advantage construction, outperforming naive scalarization approaches in both theory and practice (Girgis et al., 5 Feb 2026).
  • Consensus GRPO distills Minimum Bayes Risk decoding into a policy optimized using only groupwise consensus (e.g., BLEURT) as utility, eliminating reliance on gold references or preference labels (Ichihara et al., 3 Feb 2026).
  • Continuous Control Con-GRPO applies group-based normalization to trajectory clusters and state-aware credit assignment for high-dimensional, continuous-action tasks (Khanda et al., 25 Jul 2025).

These extensions maintain the core principle of intra-group normalized advantage estimation, with variations in reward construction and applicability.

7. Limitations and Future Directions

While Con-GRPO has demonstrated clear empirical gains in consistency and accuracy across QA and RAG tasks, limitations and open research directions include:

  • Reward functions are dependent on lexical similarity (e.g., BLEU), which may not fully capture semantic agreement.
  • Approximate reward computation, while efficient, introduces stochastic variance not present in exhaustive all-pairs evaluation.
  • Extension to more diverse and challenging paraphrase sets, dialogue, non-English queries, and non-retrieval settings remains to be explored.
  • Integration of richer semantic similarity metrics or human-aligned utility functions to minimize the risk of overfitting to lexical form rather than true informational content.

The framework provides a reproducible and theoretically grounded approach for aligning large-scale neural generation systems with consistency-centric deployment constraints (Hamman et al., 5 Oct 2025).
