
Paraphrase Consistency (PC@k)

Updated 18 January 2026
  • Paraphrase Consistency (PC@k) is a metric that measures how uniformly a model responds to multiple meaning-preserving paraphrases.
  • It evaluates consistency in tasks such as classification, open-ended generation, factual extraction, and retrieval using both strict and soft agreement methods.
  • Recent studies demonstrate that techniques like paraphrase-aware fine-tuning and auxiliary data integration can significantly enhance PC@k performance.

Paraphrase Consistency (PC@k) quantifies the stability of a model’s predictions or outputs under meaning-preserving paraphrastic reformulations of the same input. The metric operationalizes the ideal that language understanding systems should be invariant under surface-form alternations that do not change semantics. Recent research has formalized, evaluated, and benchmarked PC@k across tasks including classification, open-ended generation, factual knowledge extraction, and retrieval-augmented systems.

1. Formal Definitions and Core Variants

Paraphrase Consistency at k (PC@k) measures the rate or likelihood that a model yields consistent outputs across k paraphrases of a single input. For a deterministic classification model, this is typically:

$$\text{PC@}k = \frac{1}{n} \sum_{i=1}^{n} \left\llbracket\, r_i^{(0)} = r_i^{(1)} = \cdots = r_i^{(k)} \,\right\rrbracket$$

where $r_i^{(j)}$ is the model’s prediction on the $j$th paraphrase (with $j=0$ for the original) of the $i$th item, and $\llbracket \cdot \rrbracket$ is the indicator function. PC@1 denotes consistency over a single paraphrased variant per example.

A “soft” variant counts the average pairwise or per-paraphrase agreement; a “strict” variant requires identical predictions across all $k+1$ variants (Verma et al., 2023, Carranza, 8 Oct 2025).
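For classification outputs, both variants reduce to a few lines of code. A minimal sketch (function names are hypothetical; `preds` holds each item's prediction on the original plus its k paraphrases):

```python
from itertools import combinations

def strict_pc_at_k(preds):
    """Strict PC@k: fraction of items whose predictions are identical
    across the original and all k paraphrases."""
    return sum(len(set(p)) == 1 for p in preds) / len(preds)

def soft_pc_at_k(preds):
    """Soft PC@k: mean pairwise agreement within each item's k+1 predictions."""
    total = 0.0
    for p in preds:
        pairs = list(combinations(p, 2))
        total += sum(a == b for a, b in pairs) / len(pairs)
    return total / len(preds)

# Each inner list: [original, paraphrase 1, ..., paraphrase k]
preds = [
    ["yes", "yes", "yes"],  # fully consistent
    ["yes", "no", "yes"],   # one flip
    ["no", "no", "no"],     # fully consistent
]
strict = strict_pc_at_k(preds)  # 2/3
soft = soft_pc_at_k(preds)      # (1 + 1/3 + 1) / 3
```

Note how the soft score still credits the partially consistent second item, which the strict score discards entirely.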

For tasks with multiple paraphrases per input (“bucketed” paraphrase sets), PC@k generalizes as the expected probability that predictions are invariant across all paraphrases, optionally formulated in terms of correctness indicators or output overlap (Srikanth et al., 2024, Elazar et al., 2021).

For open-ended or generative tasks, PC@k is measured via average pairwise similarity (e.g., BLEU, ROUGE, BERTScore) among outputs derived from meaning-preserving paraphrases (Hamman et al., 5 Oct 2025).
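For generative outputs, the same averaging can be sketched with any pairwise similarity; the token-level Jaccard below is a simple stand-in for the BLEU/ROUGE/BERTScore metrics used in practice (names hypothetical):

```python
from itertools import combinations

def jaccard(a, b):
    """Token-level Jaccard similarity (stand-in for BLEU/ROUGE/BERTScore)."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 1.0

def generative_consistency(outputs):
    """Average pairwise similarity among outputs produced for one input's
    meaning-preserving paraphrases."""
    pairs = list(combinations(outputs, 2))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)

outs = ["the capital of France is Paris",
        "Paris is the capital of France",
        "the answer is Paris"]
score = generative_consistency(outs)  # (1 + 3/7 + 3/7) / 3
```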

2. Methodologies for Measuring Paraphrase Consistency

The empirical evaluation of PC@k requires:

  • Construction or curation of paraphrase clusters: each original item is paired with $k$ or more validated, semantically equivalent reformulations, obtained from LLMs, human crowdsourcing, or expert rewriting (e.g., PaRTE (Verma et al., 2023), ParaNlu (Srikanth et al., 2024), ParaRel (Elazar et al., 2021), RoParQ (Choi, 26 Nov 2025)).
  • Recording of model predictions for all paraphrase variants.
  • Assessment of consistency by direct agreement (classification, exact-match generation) or continuous similarity (open-ended generation).

Different levels of abstraction are possible, including:

  • Binary correctness agreement: For binary tasks, PC@k reflects the probability that all paraphrases yield the same correct/incorrect status (Srikanth et al., 2024).
  • Top-k answer overlap: For knowledge extraction, PC@k is the macro-averaged rate at which the gold answer (or a predicted answer) appears in the model’s top-k outputs across all paraphrases (Elazar et al., 2021).
  • Output similarity measures: In RAG systems, PC@k may average pairwise output similarity between all paraphrased queries, decomposed into retriever-level, generator-level, and end-to-end consistency (Hamman et al., 5 Oct 2025).
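The top-k overlap abstraction described above can be sketched directly (function names hypothetical; real implementations macro-average over relations or items):

```python
def topk_consistency(topk_lists, gold):
    """Fraction of paraphrases whose top-k output list contains the gold
    answer; macro-average this over items (or relations) for a corpus score."""
    return sum(gold in topk for topk in topk_lists) / len(topk_lists)

# Top-2 predictions for three paraphrases of a fact-cloze prompt
topk = [["Paris", "Lyon"], ["Paris", "Marseille"], ["Lyon", "Nice"]]
score = topk_consistency(topk, "Paris")  # 2/3
```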

Table: Representative PC@k Definitions

| Paper | Task | PC@1/PC@k Definition |
|---|---|---|
| (Elazar et al., 2021) | Fact cloze | Top-k overlap across patterns |
| (Verma et al., 2023) | RTE | Strict agreement on label |
| (Srikanth et al., 2024) | Reasoning | Probability of correctness match |
| (Carranza, 8 Oct 2025) | MCQ QA | Strict answer identity over variants |
| (Hamman et al., 5 Oct 2025) | RAG | Pairwise output similarity |

3. Empirical Findings and Task-Specific Patterns

State-of-the-art models show substantial variability in PC@k, with the following quantitative trends across domains:

  • Textual entailment & reasoning: Modern transformer models change predictions in 8–16% of paraphrase pairs on RTE tasks; PC@k increases with model size and pretraining, but plateaus below human ceilings (Verma et al., 2023, Srikanth et al., 2024).
  • Knowledge probing: PLMs like BERT-base and RoBERTa-large achieve PC@1 around 50–60% (with high relation-dependent variance), far below perfect invariance (Elazar et al., 2021).
  • MCQ QA: Accuracy drops 6–10 percentage points between original and paraphrased questions for strong LLMs (Mistral-7B, Qwen2.5-7B), reflecting PC@1 ≈ 0.90 in best cases (Carranza, 8 Oct 2025).
  • Retrieval-augmented generation: Output-level paraphrase consistency is further decomposed into retriever and generator contributions, with overall end-to-end PC@k reflecting both sources of brittleness (Hamman et al., 5 Oct 2025).
  • Supervised fine-tuning: Explicit SFT protocols incorporating paraphrase-aware supervision can significantly raise paraphrase robustness, sometimes matching the invariance of much larger models (Choi, 26 Nov 2025).

4. Metric Generalizations and Theoretical Properties

PC@k admits natural generalizations and theoretical decompositions:

  • Soft vs. strict PC@k: Softer variants measure fractional agreement or similarity, while strict PC@k only counts examples with perfect invariance. The former is sensitive to partial robustness; the latter is more brittle.
  • Link to variance decomposition: For correctness-based PC@k, the metric corresponds to $1 - 2\,\mathbb{E}_b[\theta_b(1-\theta_b)]$, where $\theta_b$ is per-bucket accuracy, thereby connecting paraphrase consistency to within-problem variance attributable to phrasing (Srikanth et al., 2024).
  • Relation to overall accuracy: PC@k is lower-bounded by $1 - 2A(1-A)$, where $A$ is task accuracy; this bound approaches 1 as accuracy approaches 0 or 1. Thus, near-perfect consistency is trivial for a model that is almost always right or almost always wrong.
  • Continuous metrics: XParaCon measures the negative log₂ mean standard deviation of accuracy across paraphrase variants, encoding the magnitude of instability rather than just counting flips. This continuous scoring enables finer discrimination where PC@k is insensitive (Choi, 26 Nov 2025).
  • Similarity-based/Generative evaluation: In open-ended generation and RAG, output consistency is measured by BLEU, ROUGE, or LLM-judge similarity, averaged over all paraphrase pairs (Hamman et al., 5 Oct 2025).
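The variance-decomposition identity and the XParaCon score described above can be checked numerically. The sketch below (bucket accuracies and the correctness matrix are hypothetical) simulates two correctness draws per bucket and compares the agreement rate against $1 - 2\,\mathbb{E}_b[\theta_b(1-\theta_b)]$:

```python
import random
from math import log2
from statistics import pstdev

random.seed(0)
thetas = [0.9, 0.6, 0.5, 0.8]  # hypothetical per-bucket accuracies

# Analytic PC@1 from the decomposition: 1 - 2 * E_b[theta_b * (1 - theta_b)]
analytic = 1 - 2 * sum(t * (1 - t) for t in thetas) / len(thetas)

# Monte Carlo: do two independent correctness draws from a bucket agree?
trials = 200_000
agree = sum(
    (random.random() < t) == (random.random() < t)
    for t in (random.choice(thetas) for _ in range(trials))
)
mc = agree / trials  # converges to `analytic` as trials grows

# XParaCon-style score (as described above): negative log2 of the mean
# per-item standard deviation of accuracy across paraphrase variants.
correct = [[1, 1, 0, 1], [1, 1, 1, 1], [0, 1, 0, 0]]  # toy correctness matrix
mean_std = sum(pstdev(row) for row in correct) / len(correct)
xparacon = -log2(mean_std)  # larger = more stable across paraphrases
```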

5. Protocols and Interventions for Improving PC@k

Several strategies systematically improve paraphrase consistency:

  • Multi-task and auxiliary data: Adding paraphrase identification or STS data into training (e.g., QQP, MRPC) boosts REVERSE-type consistency by 13% on NLI tasks (Jang et al., 2021).
  • Consistency regularization: Incorporating explicit loss terms to penalize divergent output distributions for paraphrased inputs raises PC@1 on unseen relations and improves “Consistent-Acc” by a statistically significant margin (Elazar et al., 2021).
  • Paraphrase-aware SFT and RL: Supervising the model to restate the question and explicitly align outputs across paraphrases during fine-tuning raises both XParaCon and empirical PC@k metrics, with lightweight models sometimes matching the robustness of much larger ones (Choi, 26 Nov 2025).
  • RL with group-similarity rewards: In RAG, Paraphrased Set Group Relative Policy Optimization (PS-GRPO) assigns rewards to LLM samples based on cross-paraphrase group similarity, directly maximizing end-to-end PC@k (Hamman et al., 5 Oct 2025).
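A highly simplified sketch of the group-similarity reward idea (not the actual PS-GRPO implementation; function and variable names are hypothetical): each sampled output is rewarded by its mean similarity to samples generated for the *other* paraphrases, and advantages are centered on the group mean, the "group-relative" step.

```python
def group_similarity_rewards(samples, sim):
    """Sketch of a group-similarity reward for paraphrase sets.

    samples: list of (paraphrase_id, output) pairs drawn for one input's
             paraphrase group; sim: pairwise similarity in [0, 1].
    Returns per-sample rewards and group-relative advantages."""
    rewards = []
    for pi, out in samples:
        # Reward: mean similarity to outputs sampled for other paraphrases.
        others = [o for pj, o in samples if pj != pi]
        rewards.append(sum(sim(out, o) for o in others) / len(others))
    mean_r = sum(rewards) / len(rewards)
    advantages = [r - mean_r for r in rewards]  # group-relative baseline
    return rewards, advantages

# Toy example: exact-match similarity over two paraphrases, two samples each
sim = lambda a, b: 1.0 if a == b else 0.0
samples = [(0, "Paris"), (0, "Paris"), (1, "Paris"), (1, "Lyon")]
r, adv = group_similarity_rewards(samples, sim)  # "Lyon" gets the only negative advantage
```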

6. Caveats, Limitations, and Future Directions

While PC@k robustly characterizes paraphrastic sensitivity, several limitations persist:

  • Limited by paraphrase quality: Human-elicited paraphrases expose more brittleness than templated LLM or automatic rewrites, and nontrivial effort is required to guarantee semantic invariance (Srikanth et al., 2024).
  • PC@k does not capture calibration: Two paraphrases may both be answered correctly, but with divergent model confidence; PC@k is agnostic to this distinction (Srikanth et al., 2024).
  • Task specificity: The most stringent PC@k forms (identical outputs) are readily definable for classification and MCQ, but adaptation is needed for free-form generation.
  • Continuous robustness: Discrete PC@k may miss incremental improvement; metrics like XParaCon provide complementary, magnitude-aware scoring (Choi, 26 Nov 2025).
  • Unexplained variance: Even after controlling for artifact-prone data and model capacity, significant variance in output correctness is still attributable to paraphrastic variability.
  • Practical computation: For RAG, naive group-similarity reward computation scales poorly with $n$ (number of paraphrases) and $g$ (generation samples); subsampling or approximation is necessary (Hamman et al., 5 Oct 2025).
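One simple workaround, assuming the quadratic number of pairs is the bottleneck, is an unbiased Monte Carlo estimate over sampled pairs (a sketch with hypothetical names, not a method from the cited work):

```python
import random

def subsampled_consistency(outputs, sim, n_pairs=32, seed=0):
    """Estimate mean pairwise similarity by sampling pairs uniformly
    instead of enumerating all ~(n*g choose 2) of them."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_pairs):
        a, b = rng.sample(outputs, 2)  # one pair of distinct positions
        total += sim(a, b)
    return total / n_pairs

# With identical outputs, any pair has similarity 1, so the estimate is exact.
outs = ["Paris"] * 6
est = subsampled_consistency(outs, lambda a, b: 1.0 if a == b else 0.0)
```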

Open directions include optimizing for paraphrase invariance across more diverse paraphrastic phenomena (negation, compositionality), integrating semantic equivalence judgments from advanced LLMs, and developing more efficient yet rigorous group-level training objectives.


Key references: “Accurate, yet inconsistent? Consistency Analysis on Language Understanding Models” (Jang et al., 2021), “Evaluating Paraphrastic Robustness in Textual Entailment Models” (Verma et al., 2023), “Measuring and Improving Consistency in Pretrained LLMs” (Elazar et al., 2021), “How often are errors in natural language reasoning due to paraphrastic variability?” (Srikanth et al., 2024), “LLMs Show Surface-Form Brittleness Under Paraphrase Stress Tests” (Carranza, 8 Oct 2025), “Improving Consistency in Retrieval-Augmented Systems with Group Similarity Rewards” (Hamman et al., 5 Oct 2025), “RoParQ: Paraphrase-Aware Alignment of LLMs Towards Robustness to Paraphrased Questions” (Choi, 26 Nov 2025).
