Quality-Adjusted Consistency (QAC)
- Quality-Adjusted Consistency (QAC) is a metric that combines consistency measures with task quality to prevent misleading evaluations when systems perform uniformly poorly.
- It is applied in multilingual NLP, robot learning from demonstration, expert estimate aggregation, and contrast set testing to ensure robust and meaningful performance metrics.
- Implementations use domain-specific calibration and simulation techniques, balancing consistent behavior with real-world task effectiveness.
Quality-Adjusted Consistency (QAC) is a class of evaluation metrics and thresholding strategies that combine measures of consistency with explicit or implicit factors of task or model quality. QAC has been formulated independently in multilingual natural language processing, robot learning from demonstration, group decision-making with expert estimates, and contrast set robustness testing. Its overarching objective is to provide a single or thresholded scalar that penalizes high consistency in the presence of low task utility, thereby safeguarding against degenerate cases where systems are consistently but uniformly erroneous.
1. Formal Definitions Across Domains
1.1. NLP Multilingual Model Evaluation
In the context of bilingual evaluation for LLMs on politically sensitive prompts, QAC is defined for a model as follows (Ko, 6 Feb 2026):
- $Q_\ell$: fraction of benchmark prompts passed in language $\ell$ (e.g., English or Chinese).
- Consistency $C$: fraction of prompts with identical binary pass/fail outcomes across languages.
- Quality-Adjusted Consistency:

$$\mathrm{QAC} = C \cdot \min(Q_{\mathrm{EN}}, Q_{\mathrm{ZH}})$$
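The bilingual definition can be sketched in a few lines; the function name `qac_bilingual` and the toy pass/fail vectors are illustrative, not from the paper:

```python
# Illustrative sketch of bilingual QAC: cross-language agreement scaled by
# the weaker per-language pass rate. Names and data are hypothetical.

def qac_bilingual(pass_en, pass_zh):
    """QAC = C * min(Q_EN, Q_ZH) over binary pass/fail vectors."""
    n = len(pass_en)
    assert n == len(pass_zh) and n > 0
    q_en = sum(pass_en) / n                      # pass rate in English
    q_zh = sum(pass_zh) / n                      # pass rate in Chinese
    c = sum(a == b for a, b in zip(pass_en, pass_zh)) / n  # agreement C
    return c * min(q_en, q_zh)

# A model failing every prompt in both languages is perfectly consistent
# (C = 1) yet scores QAC = 0:
print(qac_bilingual([0, 0, 0, 0], [0, 0, 0, 0]))  # -> 0.0
print(qac_bilingual([1, 1, 1, 0], [1, 1, 0, 0]))  # -> 0.375
```

The degenerate all-fail case shows the metric's core safeguard: perfect consistency contributes nothing when the weaker quality score is zero.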
1.2. Robot Learning from Demonstration (LfD)
In LfD, QAC is operationalized as an aggregate measure over multiple consistency-based data-quality metrics (Sakr et al., 2024). Let $m_i$ ($i = 1, \dots, N$) denote each of $N$ normalized motion quality metrics (e.g., jerk, path length, manipulability):
- For each metric, compute the per-user range $R_i = \max_k m_i^{(k)} - \min_k m_i^{(k)}$ over demonstrations $k$, normalized to $[0, 1]$ using min-max statistics from a calibration set.
- Per-metric consistency: $C_i = 1 - R_i$.
- Combined QAC (for $N$ metrics):

$$\mathrm{QAC} = \frac{1}{N} \sum_{i=1}^{N} C_i$$
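Assuming QAC aggregates per-metric consistencies as their mean, a minimal sketch (function name, data, and calibration statistics are all hypothetical):

```python
# Sketch of the LfD aggregation: per-metric consistency is one minus the
# min-max-normalized range across a user's demonstrations.
import numpy as np

def qac_lfd(demos, cal_min, cal_max):
    """demos: (n_demos, n_metrics) raw metric values per demonstration.
    cal_min / cal_max: per-metric statistics from a calibration set."""
    demos = np.asarray(demos, dtype=float)
    norm = (demos - cal_min) / (np.asarray(cal_max) - np.asarray(cal_min))
    r = norm.max(axis=0) - norm.min(axis=0)   # per-user range per metric
    c = 1.0 - np.clip(r, 0.0, 1.0)            # per-metric consistency C_i
    return c.mean()                           # combined QAC

# Three demonstrations, two metrics (e.g., jerk, path length):
print(qac_lfd([[0.2, 1.0], [0.3, 1.2], [0.25, 1.1]],
              cal_min=[0.0, 0.5], cal_max=[1.0, 2.5]))  # -> ~0.9
```

Tight ranges across demonstrations (here 0.1 in normalized units per metric) yield QAC close to 1; erratic demonstrations pull it toward 0.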
1.3. Expert Estimate Aggregation
For pairwise comparison matrices (PCMs), QAC defines a theoretically justified consistency threshold tied to a permissible maximum deviation $\delta$ in aggregated weights (Tsyganok et al., 2024):
- For a matrix $A$, a “spectrum-based” consistency index $\mathrm{CI}(A)$ is computed.
- For a perturbation level $\varepsilon$, simulate the worst-case deviation $\Delta(\varepsilon)$ in the aggregated weights and its associated minimal consistency $\mathrm{CI}_{\min}(\varepsilon)$.
- The QAC threshold $\mathrm{CI}^{*}$ is defined as $\mathrm{CI}^{*} = \mathrm{CI}_{\min}(\varepsilon^{*})$, where $\varepsilon^{*} = \max\{\varepsilon : \Delta(\varepsilon) \le \delta\}$.
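A Monte-Carlo sketch of the threshold construction, under simplifying assumptions that are mine, not the paper's: uniform multiplicative perturbations of a consistent base matrix, a Saaty-style consistency index, and principal-eigenvector weights (the paper's actual search uses a genetic algorithm):

```python
# Illustrative Monte-Carlo threshold search for pairwise comparison matrices.
import numpy as np

def ci(pcm):
    """Saaty-style consistency index (lambda_max - n) / (n - 1)."""
    n = pcm.shape[0]
    lam_max = np.max(np.linalg.eigvals(pcm).real)
    return (lam_max - n) / (n - 1)

def weights(pcm):
    """Priority weights from the principal eigenvector."""
    vals, vecs = np.linalg.eig(pcm)
    w = np.abs(vecs[:, np.argmax(vals.real)].real)
    return w / w.sum()

def qac_threshold(n=4, delta=0.05, trials=2000, max_eps=1.0, seed=0):
    """Smallest CI observed among perturbed PCMs whose weights deviate from
    the base weights by more than delta; CIs below this value empirically
    respect the deviation bound."""
    rng = np.random.default_rng(seed)
    base_w = np.ones(n) / n                  # unperturbed (uniform) priorities
    threshold = np.inf
    for _ in range(trials):
        eps = rng.uniform(0.0, max_eps)      # sampled perturbation level
        pcm = np.ones((n, n))                # consistent base: all-ones matrix
        for i in range(n):
            for j in range(i + 1, n):
                pcm[i, j] = rng.uniform(1.0 / (1.0 + eps), 1.0 + eps)
                pcm[j, i] = 1.0 / pcm[i, j]  # keep the matrix reciprocal
        if np.abs(weights(pcm) - base_w).max() > delta:
            threshold = min(threshold, ci(pcm))
    return threshold

print(qac_threshold(n=4, delta=0.05, max_eps=2.0, seed=1))
```

The design choice mirrors the section's logic: the acceptability of a consistency value is derived from the worst weight deviation it can hide, rather than fixed a priori.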
1.4. Accuracy-Normalized Consistency in Contrast Sets
In contrast set evaluation, “relative consistency” is the probability that a model of identical accuracy would exhibit a consistency less than or equal to the observed value (Johnson et al., 2023):
- Given model accuracy $a$ and raw consistency $c$, let $P_a$ be the distribution of possible bundle-level consistencies for random selections at that accuracy.
- Quality-adjusted/relative consistency $\mathrm{RC}$:

$$\mathrm{RC} = \Pr_{C' \sim P_a}\!\left[ C' \le c \right]$$
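The probability can be approximated by sampling; the null model below (size-2 bundles, correct predictions placed uniformly at random at the given accuracy) is an illustrative assumption, not necessarily the paper's exact enumeration:

```python
# Sampling sketch of relative consistency for contrast set bundles.
import random

def relative_consistency(acc, obs_consistency, n_bundles, bundle_size=2,
                         samples=10000, seed=0):
    """P(consistency of a random model at this accuracy <= observed)."""
    rng = random.Random(seed)
    n_examples = n_bundles * bundle_size
    n_correct = round(acc * n_examples)
    le = 0
    for _ in range(samples):
        # Randomly place the correct predictions across all examples.
        correct = [True] * n_correct + [False] * (n_examples - n_correct)
        rng.shuffle(correct)
        # A bundle counts as consistent only if every member is correct.
        cons = sum(all(correct[b * bundle_size:(b + 1) * bundle_size])
                   for b in range(n_bundles)) / n_bundles
        le += cons <= obs_consistency
    return le / samples

# Same raw consistency, different accuracy -> very different RC:
print(relative_consistency(0.9, 0.5, n_bundles=50))  # RC near 0
print(relative_consistency(0.6, 0.5, n_bundles=50))  # RC near 1
```

At 90% accuracy, random placement already yields bundle consistency around 0.8, so an observed 0.5 is unimpressive; at 60% accuracy the same 0.5 is well above chance.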
2. Theoretical Motivation and Intuition
QAC metrics address a common shortcoming of vanilla consistency: high agreement is not always desirable if agreement is reached on low-utility or incorrect outcomes.
- In multilingual LLM evaluation, a model that passes in neither language is perfectly consistent ($C = 1$) but wholly unsatisfactory. QAC mitigates this by scaling consistency with the weakest per-language quality score, so only models that are both consistent and competent rate highly (Ko, 6 Feb 2026).
- In robot imitation learning, low variability (high consistency) across all physical trajectory metrics is required, but consistency in, e.g., highly inefficient motions, is penalized by combining it with normalized quality ranges (Sakr et al., 2024).
- In pairwise comparison aggregation, QAC sets a threshold on permissible consistency relative to a bound on estimation quality. This ensures that expert group judgments are not deemed “consistent enough” unless they will, in the worst-case, yield group priorities within a prescribed error (Tsyganok et al., 2024).
- For contrast sets, relative consistency interprets observed consistency in the context of accuracy, flagging models that achieve consistency only by sacrificing accuracy or exploiting label imbalance (Johnson et al., 2023).
3. Computational Procedures
| Domain | Step 1 | Step 2 | Step 3 |
|---|---|---|---|
| Multilingual LLM | Measure per-language pass rates $Q_\ell$ | Compute prompt-level agreement $C$ | Compute $\mathrm{QAC} = C \cdot \min_\ell Q_\ell$ |
| LfD Robotics | Extract 9-10 trajectory metrics | Compute per-metric range/consistency | Aggregate normalized consistency scores to form QAC |
| Expert Aggregation | Simulate perturbed PCMs | Search for worst-case deviation $\Delta(\varepsilon)$ | Set QAC threshold as $\mathrm{CI}^{*} = \mathrm{CI}_{\min}(\varepsilon^{*})$ |
| Contrast Sets | Measure accuracy and bundle-level correctness | Enumerate or sample the consistency distribution $P_a$ | Compute relative consistency $\mathrm{RC}$ |
Each computational implementation is domain-specific, but all rely on combining a measure of internal consistency with a measure of output quality or calibration against a known threshold.
4. Empirical Results and Illustrative Examples
Empirical application demonstrates QAC’s value in surfacing fail cases that would be invisible to raw consistency metrics.
- In “Bilingual Bias in LLMs,” GPT-4o Mini alone attains the maximum $\mathrm{QAC} = 1$, signifying perfect, high-quality consistency. Models that are consistently propagandistic or censored (e.g., Qwen3 Max) score near-zero QAC despite high raw consistency $C$ (Ko, 6 Feb 2026).
- In robot demonstration, QAC predicts task and generalization success with high accuracy across two experimental settings. Key predictors include variability in path length and jerk; more variability lowers QAC and is tied to lower learning reliability (Sakr et al., 2024).
- In expert aggregation, only those PCMs whose consistency meets the derived threshold $\mathrm{CI}^{*}$ guarantee, even in the worst case, an output within the user’s quality specification; empirically, each deviation bound $\delta$ maps to a specific threshold value for matrices of a given order $n$ (Tsyganok et al., 2024).
- For contrast set robustness, two models with the same consistency but different accuracy will have sharply differing relative consistency $\mathrm{RC}$. Raw consistency scores can mislead; relative consistency/QAC reveals when observed consistency represents actual task robustness rather than “cheap” improvement through accuracy sacrifice (Johnson et al., 2023).
5. Interpretation, Domain Significance, and Implications
QAC provides a principled solution to the problem of “empty consistency”—cases where a method or policy appears stable and reliable only until the level of substantive correctness or task relevance is inspected.
- NLP/Bias Auditing: QAC exposes language-dependent bias and penalizes uniformly incorrect or propagandistic model responses, directly informing development priorities for alignment and cross-lingual policy (Ko, 6 Feb 2026).
- Robotics LfD: Screened by QAC, demonstration data can be triaged pre-training, reducing wasted computation on noisy or inconsistent examples, and focusing user feedback on problematic motion dimensions (Sakr et al., 2024).
- Expert Judgement Fusion: QAC-based thresholds create a transparent, simulation-backed link between the mathematical notion of consistency and user-required decision reliability (Tsyganok et al., 2024).
- Contrast Set Testing: Practitioners can differentiate models that “earn” their consistency from those that manipulate it by accuracy trade-offs, guarding against misleading comparisons in robustness studies (Johnson et al., 2023).
A plausible implication is that QAC adoption in evaluation pipelines will curtail deployment of superficially “robust” but functionally inadequate systems across domains characterized by group decision, imitation, or language transfer.
6. Limitations and Best Practices
QAC is not a universal proxy for utility:
- It depends on the specification of “quality” and its operationalization—bilingual correctness, motion optimality, relative error, or accuracy.
- Domain-specific implementation details (e.g., calibration statistics, red-flag definitions, perturbation models) impact QAC’s sensitivity and interpretation.
- Some QAC variants require combinatorial or population-based simulations (e.g., genetic algorithm for PCM thresholds, sampling for the null distribution $P_a$), entailing nontrivial computational cost.
- Inaccuracy in “quality” side metrics (e.g., if calibration set is unrepresentative or labels are noisy) propagates into QAC mis-estimation.
Best practices include explicit reporting of all QAC computation steps and raw scores, transparent statement of task-specific thresholds, and, where computationally demanding, use of empirically justified approximations (e.g., linear fit for PCM thresholds (Tsyganok et al., 2024)).
7. Connections and Distinctions with Related Metrics
QAC is conceptually adjacent to, but stronger than, raw consistency, and complements directional bias scores (e.g., LBS in LLM evaluation (Ko, 6 Feb 2026)) and contrast set exceedance probabilities.
- LBS detects whether one condition (language, domain) outperforms another, but does not consider cross-condition reliability or absolute utility.
- Standard consistency and accuracy are decoupled; QAC explicitly ties reliability to actual performance level.
- In contrast set evaluation, relative consistency $\mathrm{RC}$ moves beyond marginal error rates to joint robustness, preventing models from being over-credited for superficial consistency.
The QAC principle—penalize “bad but consistent” performance—has emerged independently in multiple research communities, indicating its importance as a core evaluation and thresholding concept in both machine learning and human-computation contexts.