- The paper introduces CAPA, a novel metric measuring functional similarity between language models by using full probability distributions and adjusting for chance agreement.
- Experiments show a positive correlation between model similarity (using CAPA) and scores assigned by LLM judges, indicating an affinity bias where judges favor similar models.
- Lower similarity between a weak supervisor and a strong student correlates with greater weak-to-strong training gains, and as overall model capability improves, models' errors become increasingly correlated.
The paper presents a thorough investigation into the implications of functional similarity between LMs on both evaluation and training procedures in large-scale AI oversight. It introduces a novel similarity metric—CAPA (Chance Adjusted Probabilistic Agreement)—designed to overcome limitations of prior error consistency metrics by incorporating both the full probability distribution of model outputs and adjustments for chance agreement driven by model accuracy.
The summary below highlights the central contributions and experimental findings:
Novel Metric for Model Similarity (CAPA)
The CAPA metric was developed to measure functional similarity between LMs in a probabilistic manner. Unlike standard error consistency techniques (which count only discrete prediction matches), CAPA leverages the complete predicted probability distribution. This allows the metric to account for cases when two models provide non-identical but closely aligned probability vectors (e.g., outputs such as [0.49, 0.51] vs. [0.51, 0.49]).
CAPA defines the observed agreement as
$$c_{\text{obs}} = \frac{1}{|D|} \sum_{x \in D} \sum_{o \in O(x)} p_1(o) \cdot p_2(o),$$
where $p_1(o)$ and $p_2(o)$ are the probabilities the two models assign to option $o \in O(x)$ on a data point $x \in D$.
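As a minimal sketch, the observed agreement is just the mean inner product of the two models' option-probability vectors (the function name `observed_agreement` is an assumption, not from the paper):

```python
import numpy as np

def observed_agreement(p1, p2):
    """Mean inner product of the two models' option-probability vectors.

    p1, p2: arrays of shape (n_examples, n_options); rows sum to 1.
    """
    return float(np.mean(np.sum(p1 * p2, axis=1)))

# Two models that put nearly identical mass on both options agree strongly
# under this measure even though their argmax predictions differ.
a = np.array([[0.49, 0.51]])
b = np.array([[0.51, 0.49]])
print(observed_agreement(a, b))  # 0.4998
```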
To correct for agreement expected by chance, $c_{\text{exp}}$ is computed by assuming that independently calibrated models each assign a fixed probability $p$ to the correct option and distribute the residual probability mass uniformly among the remaining options. The final CAPA score is then
$$\kappa_p = \frac{c_{\text{obs}} - c_{\text{exp}}}{1 - c_{\text{exp}}},$$
where $c_{\text{exp}}$ is the expected agreement under statistical independence. This normalization ensures that $\kappa_p$ lies between $-1$ and $1$, with higher values indicating greater overlap in mistakes.
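The two quantities can be combined as in the following sketch. This is not the paper's reference implementation: the function name `capa` and the choice of each model's mean correct-option probability as the fixed chance probability are assumptions.

```python
import numpy as np

def capa(p1, p2, correct):
    """Chance Adjusted Probabilistic Agreement for MCQ predictions (sketch).

    p1, p2 : (n, k) arrays of option probabilities (rows sum to 1).
    correct: (n,) index of the correct option for each question.
    """
    n, k = p1.shape
    # Observed agreement: mean inner product of probability vectors.
    c_obs = np.mean(np.sum(p1 * p2, axis=1))
    # Chance agreement: each model assigns a fixed probability (here, its
    # mean probability on the correct option) to the correct answer and
    # spreads the rest uniformly over the remaining k - 1 options.
    q1 = p1[np.arange(n), correct].mean()
    q2 = p2[np.arange(n), correct].mean()
    c_exp = q1 * q2 + (1 - q1) * (1 - q2) / (k - 1)
    return float((c_obs - c_exp) / (1 - c_exp))

# Two identical, mildly confident binary classifiers score just above chance.
s = capa(np.array([[0.7, 0.3], [0.6, 0.4]]),
         np.array([[0.7, 0.3], [0.6, 0.4]]),
         np.array([0, 0]))
```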
- Desiderata:
- Accuracy Adjustment: Corrects for similarity that is inflated simply because both models are highly accurate.
- Mistake Specificity: Differentiates between correct overlapping predictions and coincidental agreement when models err.
- Probability Incorporation: Uses model output probabilities rather than only hard labels, providing a finer-grained measure.
Implications for AI Oversight
The study investigates two primary areas where model similarity plays a critical role:
- LLM-as-a-Judge Bias (Affinity Bias):
A series of experiments were conducted on the MMLU-Pro benchmark. Multiple judge models evaluated free-form responses provided by candidate models, with the process paralleling standard multiple-choice (MCQ) evaluation. Similarity scores between judge and candidate models were computed using CAPA.
- Findings:
- A significant positive correlation was observed between the CAPA similarity score and the judgment score assigned by an LM acting as a judge.
- Statistical analyses, including partial correlations and multiple regression (controlling for actual model accuracy), consistently indicated that judges assign higher scores to models which are functionally similar to themselves.
- This affinity bias is analogous to the well-known human tendency where evaluators favor candidates exhibiting similar cognitive or stylistic traits.
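The control for accuracy can be illustrated with a partial correlation: residualize both judge score and similarity on accuracy, then correlate the residuals. The data below is synthetic and hypothetical, constructed only to show the mechanics of the analysis.

```python
import numpy as np

def partial_corr(x, y, z):
    """Correlation between x and y after regressing out z from both."""
    A = np.column_stack([np.ones_like(z), z])  # design matrix [1, z]
    def resid(v):
        coef, *_ = np.linalg.lstsq(A, v, rcond=None)
        return v - A @ coef
    return float(np.corrcoef(resid(x), resid(y))[0, 1])

# Hypothetical synthetic data: judge scores depend on both true accuracy
# and similarity to the judge, so the raw score-similarity correlation is
# confounded by accuracy; the partial correlation removes that confound.
rng = np.random.default_rng(0)
acc = rng.uniform(0.4, 0.9, 500)                          # candidate accuracy
sim = 0.5 * acc + rng.normal(0, 0.05, 500)                # CAPA similarity to judge
score = 0.6 * acc + 0.3 * sim + rng.normal(0, 0.05, 500)  # judge score
r = partial_corr(score, sim, acc)  # stays positive after controlling for acc
```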
- Weak-to-Strong Generalization via LM Annotations:
In a weak-to-strong training paradigm, a smaller “weak” supervisor model annotates data that is then used to fine-tune a larger “strong” student model. The study examines how the functional similarity (or the complementary differences) between the supervising and student models affects performance gains.
- Findings:
- An inverse correlation is found between CAPA similarity and performance gain: lower similarity (indicating higher complementary knowledge) between the weak supervisor and the strong student correlates with greater improvements in the student’s accuracy.
- Through a decomposition analysis of test set performance, the paper distinguishes gains attributable purely to elicitation (recovering latent capabilities) from those attributable to the transfer of complementary knowledge.
- These results suggest that leveraging complementary error patterns can push the performance envelope beyond previously established ceilings.
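One simple way to operationalize such a decomposition is to split the student's accuracy gain by whether the weak supervisor answered each test question correctly; this is a sketch of the idea, and the paper's exact formulation may differ.

```python
import numpy as np

def decompose_gain(weak_correct, student_before, student_after):
    """Split the student's accuracy gain by supervisor correctness.

    All arguments are boolean arrays over the same test questions.
    """
    weak_correct = np.asarray(weak_correct, dtype=bool)
    gain = (np.asarray(student_after, dtype=float)
            - np.asarray(student_before, dtype=float))
    return {
        # Gain where the supervisor was right: elicitation-like gains.
        "elicitation": float(gain[weak_correct].mean()),
        # Gain where the supervisor was wrong: complementary knowledge.
        "complementary": float(gain[~weak_correct].mean()),
    }

d = decompose_gain([1, 1, 0, 0], [0, 1, 0, 0], [1, 1, 1, 0])
```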
Trends in LM Error Correlation with Increasing Capabilities
An important additional finding is that as LM capabilities improve, the errors they commit become more correlated. When models are binned by accuracy, higher performing groups exhibit significantly greater CAPA scores, implying that the diversity in model failures narrows with capability gains.
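The binning analysis can be sketched as follows: group models into accuracy quantiles and average pairwise CAPA within each bin (function name and quantile-based binning are assumptions; the paper's binning scheme may differ).

```python
import numpy as np

def mean_capa_by_accuracy_bin(accs, capa_matrix, n_bins=3):
    """Bin models by accuracy quantile; average pairwise CAPA per bin.

    accs        : (m,) accuracy per model.
    capa_matrix : (m, m) symmetric matrix of pairwise CAPA scores.
    """
    edges = np.quantile(accs, np.linspace(0, 1, n_bins + 1)[1:-1])
    bins = np.digitize(accs, edges)
    out = {}
    for b in range(n_bins):
        members = np.where(bins == b)[0]
        if len(members) < 2:
            continue  # need at least one pair to average
        pairs = [capa_matrix[i, j] for i in members for j in members if i < j]
        out[b] = float(np.mean(pairs))
    return out

# Synthetic example: high-accuracy models share more of their errors.
accs = np.array([0.3, 0.35, 0.4, 0.8, 0.85, 0.9])
M = np.full((6, 6), 0.2)
M[:3, :3] = 0.1   # low-accuracy pairs: low CAPA
M[3:, 3:] = 0.5   # high-accuracy pairs: high CAPA
out = mean_capa_by_accuracy_bin(accs, M, n_bins=2)
```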
- Implications:
- Increased error overlap raises concerns for AI oversight, since diversity among models is critical both for unbiased evaluation (as exemplified in the LLM-as-a-judge experiments) and for effective knowledge distillation in training.
- The trend for correlated errors may induce risks related to common blind spots across models, potentially leading to systemic failures when relying solely on automated oversight.
Additional Theoretical and Empirical Insights
The paper presents a detailed comparison between CAPA and other inter-rater and divergence metrics (e.g., Cohen’s kappa, Scott’s pi, KL divergence). Simulations illustrate that while traditional metrics may misestimate similarity—especially when model outputs are highly calibrated—CAPA consistently captures the nuanced variations in functional agreement.
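A toy illustration (synthetic probabilities, not data from the paper) of why hard-label metrics can misestimate similarity for near-uniform outputs:

```python
import numpy as np

# Two near-uniform models whose argmax answers disagree on every question.
p1 = np.array([[0.49, 0.51], [0.52, 0.48], [0.48, 0.52]])
p2 = np.array([[0.51, 0.49], [0.48, 0.52], [0.52, 0.48]])

# Hard-label agreement (what discrete metrics such as Cohen's kappa see):
hard = np.mean(p1.argmax(axis=1) == p2.argmax(axis=1))  # 0.0
# Probabilistic agreement (what CAPA's observed-agreement term sees):
soft = np.mean(np.sum(p1 * p2, axis=1))        # 0.4994
self_soft = np.mean(np.sum(p1 * p1, axis=1))   # 0.5006, model vs. itself
# The two models are nearly as probabilistically similar as a model is to
# itself, yet hard labels report zero agreement.
```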
- Extensions and Applicability:
Extensions of CAPA to multi-model comparisons, as well as adaptations for classification and exact match settings, are provided. These derivations underline the flexibility of the metric beyond MCQ-based evaluations.
Rigorous statistical testing (including partial correlation, multiple regression analyses, diagnostic tests for normality and homoscedasticity) reaffirms the robustness of the reported effects of model similarity on AI oversight mechanisms, both in judging and training contexts.
Overall, the paper emphasizes that as LMs continue to scale, considering functional similarity, and its implications for both model evaluation and knowledge transfer methodologies, is increasingly critical. The CAPA metric offers a comprehensive framework for analyzing these effects, highlighting that correlated errors could undermine the independence and effectiveness of AI oversight, especially in scenarios where diverse approaches are necessary to mitigate shared systematic failures.