- The paper introduces CAPA, a novel metric measuring functional similarity between language models by using full probability distributions and adjusting for chance agreement.
- Experiments show a positive correlation between model similarity (using CAPA) and scores assigned by LLM judges, indicating an affinity bias where judges favor similar models.
- Lower similarity between a weak supervisor and a strong student correlates with greater weak-to-strong training gains, and as overall model capability improves, models' errors become increasingly correlated.
The paper presents a thorough investigation into the implications of functional similarity between LMs on both evaluation and training procedures in large-scale AI oversight. It introduces a novel similarity metric—CAPA (Chance Adjusted Probabilistic Agreement)—designed to overcome limitations of prior error consistency metrics by incorporating both the full probability distribution of model outputs and adjustments for chance agreement driven by model accuracy.
The summary below highlights the central contributions and experimental findings:
Novel Metric for Model Similarity (CAPA)
The CAPA metric was developed to measure functional similarity between LMs in a probabilistic manner. Unlike standard error consistency techniques (which count only discrete prediction matches), CAPA leverages the complete predicted probability distribution. This allows the metric to account for cases when two models provide non-identical but closely aligned probability vectors (e.g., outputs such as [0.49, 0.51] vs. [0.51, 0.49]).
CAPA defines the observed agreement as
$$c_{\text{obs}} = \frac{1}{|D|} \sum_{x \in D} \sum_{o \in O(x)} p_1(o) \cdot p_2(o),$$
where $p_1(o)$ and $p_2(o)$ are the probabilities the two models assign to option $o \in O(x)$ on a data point $x \in D$.
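As a minimal sketch, the observed agreement is just the mean inner product of the two models' option-probability vectors (the function name `observed_agreement` is an assumption, not from the paper):

```python
import numpy as np

def observed_agreement(p1, p2):
    """Mean inner product of the two models' option-probability vectors.

    p1, p2: arrays of shape (n_examples, n_options); rows sum to 1.
    """
    return float(np.mean(np.sum(p1 * p2, axis=1)))

# Two models that put nearly identical mass on both options agree strongly
# under this measure even though their argmax predictions differ.
a = np.array([[0.49, 0.51]])
b = np.array([[0.51, 0.49]])
print(observed_agreement(a, b))  # 0.4998
```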
To correct for agreement expected by chance, $c_{\text{exp}}$ is computed by assuming that independently calibrated models each assign a fixed probability $p$ to the correct option and distribute the residual probability mass uniformly among the remaining options. The final CAPA score is then
$$\kappa_p = \frac{c_{\text{obs}} - c_{\text{exp}}}{1 - c_{\text{exp}}},$$
where $c_{\text{exp}}$ is the expected agreement under statistical independence. This normalization ensures that $\kappa_p$ lies between $-1$ and $1$, with higher values indicating greater overlap in mistakes.
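The two quantities can be combined as in the following sketch. This is not the paper's reference implementation: the function name `capa` and the choice of each model's mean correct-option probability as the fixed chance probability are assumptions.

```python
import numpy as np

def capa(p1, p2, correct):
    """Chance Adjusted Probabilistic Agreement for MCQ predictions (sketch).

    p1, p2 : (n, k) arrays of option probabilities (rows sum to 1).
    correct: (n,) index of the correct option for each question.
    """
    n, k = p1.shape
    # Observed agreement: mean inner product of probability vectors.
    c_obs = np.mean(np.sum(p1 * p2, axis=1))
    # Chance agreement: each model assigns a fixed probability (here, its
    # mean probability on the correct option) to the correct answer and
    # spreads the rest uniformly over the remaining k - 1 options.
    q1 = p1[np.arange(n), correct].mean()
    q2 = p2[np.arange(n), correct].mean()
    c_exp = q1 * q2 + (1 - q1) * (1 - q2) / (k - 1)
    return float((c_obs - c_exp) / (1 - c_exp))

# Two identical, mildly confident binary classifiers score just above chance.
s = capa(np.array([[0.7, 0.3], [0.6, 0.4]]),
         np.array([[0.7, 0.3], [0.6, 0.4]]),
         np.array([0, 0]))
```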
- Desiderata:
- Accuracy Adjustment: Corrects for similarity that is inflated simply because both models are highly accurate.
- Mistake Specificity: Differentiates between correct overlapping predictions and coincidental agreement when models err.
- Probability Incorporation: Uses model output probabilities rather than only hard labels, providing a finer-grained measure.
Implications for AI Oversight
The study investigates two primary areas where model similarity plays a critical role:
- LLM-as-a-Judge Bias (Affinity Bias):
A series of experiments were conducted on the MMLU-Pro benchmark. Multiple judge models evaluated free-form responses provided by candidate models, with the process paralleling standard multiple-choice (MCQ) evaluation. Similarity scores between judge and candidate models were computed using CAPA.
- Findings:
- A significant positive correlation was observed between the CAPA similarity score and the judgment score assigned by an LM acting as a judge.
- Statistical analyses, including partial correlations and multiple regression (controlling for actual model accuracy), consistently indicated that judges assign higher scores to models which are functionally similar to themselves.
- This affinity bias is analogous to the well-known human tendency where evaluators favor candidates exhibiting similar cognitive or stylistic traits.
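The control for accuracy can be illustrated with a partial correlation: residualize both judge score and similarity on accuracy, then correlate the residuals. The data below is synthetic and hypothetical, constructed only to show the mechanics of the analysis.

```python
import numpy as np

def partial_corr(x, y, z):
    """Correlation between x and y after regressing out z from both."""
    A = np.column_stack([np.ones_like(z), z])  # design matrix [1, z]
    def resid(v):
        coef, *_ = np.linalg.lstsq(A, v, rcond=None)
        return v - A @ coef
    return float(np.corrcoef(resid(x), resid(y))[0, 1])

# Hypothetical synthetic data: judge scores depend on both true accuracy
# and similarity to the judge, so the raw score-similarity correlation is
# confounded by accuracy; the partial correlation removes that confound.
rng = np.random.default_rng(0)
acc = rng.uniform(0.4, 0.9, 500)                          # candidate accuracy
sim = 0.5 * acc + rng.normal(0, 0.05, 500)                # CAPA similarity to judge
score = 0.6 * acc + 0.3 * sim + rng.normal(0, 0.05, 500)  # judge score
r = partial_corr(score, sim, acc)  # stays positive after controlling for acc
```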
- Weak-to-Strong Generalization via LM Annotations:
In a weak-to-strong training paradigm, a smaller “weak” supervisor model annotates data that is then used to fine-tune a larger “strong” student model. The study examines how the functional similarity (or the complementary differences) between the supervising and student models affects performance gains.
- Findings:
- An inverse correlation is found between CAPA similarity and performance gain: lower similarity (indicating higher complementary knowledge) between the weak supervisor and the strong student correlates with greater improvements in the student’s accuracy.
- Through a decomposition analysis of test set performance, the paper distinguishes gains attributable purely to elicitation (recovering latent capabilities) from those attributable to the transfer of complementary knowledge.
- These results suggest that leveraging complementary error patterns can push the performance envelope beyond previously established ceilings.
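One simple way to operationalize such a decomposition is to split the student's accuracy gain by whether the weak supervisor answered each test question correctly; this is a sketch of the idea, and the paper's exact formulation may differ.

```python
import numpy as np

def decompose_gain(weak_correct, student_before, student_after):
    """Split the student's accuracy gain by supervisor correctness.

    All arguments are boolean arrays over the same test questions.
    """
    weak_correct = np.asarray(weak_correct, dtype=bool)
    gain = (np.asarray(student_after, dtype=float)
            - np.asarray(student_before, dtype=float))
    return {
        # Gain where the supervisor was right: elicitation-like gains.
        "elicitation": float(gain[weak_correct].mean()),
        # Gain where the supervisor was wrong: complementary knowledge.
        "complementary": float(gain[~weak_correct].mean()),
    }

d = decompose_gain([1, 1, 0, 0], [0, 1, 0, 0], [1, 1, 1, 0])
```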
Trends in LM Error Correlation with Increasing Capabilities
An important additional finding is that as LM capabilities improve, the errors they commit become more correlated. When models are binned by accuracy, higher performing groups exhibit significantly greater CAPA scores, implying that the diversity in model failures narrows with capability gains.
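The binning analysis can be sketched as follows: group models into accuracy quantiles and average pairwise CAPA within each bin (function name and quantile-based binning are assumptions; the paper's binning scheme may differ).

```python
import numpy as np

def mean_capa_by_accuracy_bin(accs, capa_matrix, n_bins=3):
    """Bin models by accuracy quantile; average pairwise CAPA per bin.

    accs        : (m,) accuracy per model.
    capa_matrix : (m, m) symmetric matrix of pairwise CAPA scores.
    """
    edges = np.quantile(accs, np.linspace(0, 1, n_bins + 1)[1:-1])
    bins = np.digitize(accs, edges)
    out = {}
    for b in range(n_bins):
        members = np.where(bins == b)[0]
        if len(members) < 2:
            continue  # need at least one pair to average
        pairs = [capa_matrix[i, j] for i in members for j in members if i < j]
        out[b] = float(np.mean(pairs))
    return out

# Synthetic example: high-accuracy models share more of their errors.
accs = np.array([0.3, 0.35, 0.4, 0.8, 0.85, 0.9])
M = np.full((6, 6), 0.2)
M[:3, :3] = 0.1   # low-accuracy pairs: low CAPA
M[3:, 3:] = 0.5   # high-accuracy pairs: high CAPA
out = mean_capa_by_accuracy_bin(accs, M, n_bins=2)
```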
- Implications:
- Increased error overlap raises concerns for AI oversight, since diversity among models is critical both for unbiased evaluation (as exemplified in the LLM-as-a-judge experiments) and for effective knowledge distillation in training.
- The trend for correlated errors may induce risks related to common blind spots across models, potentially leading to systemic failures when relying solely on automated oversight.
Additional Theoretical and Empirical Insights
The paper presents a detailed comparison between CAPA and other inter-rater and divergence metrics (e.g., Cohen’s kappa, Scott’s pi, KL divergence). Simulations illustrate that while traditional metrics may misestimate similarity—especially when model outputs are highly calibrated—CAPA consistently captures the nuanced variations in functional agreement.
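A toy illustration (synthetic probabilities, not data from the paper) of why hard-label metrics can misestimate similarity for near-uniform outputs:

```python
import numpy as np

# Two near-uniform models whose argmax answers disagree on every question.
p1 = np.array([[0.49, 0.51], [0.52, 0.48], [0.48, 0.52]])
p2 = np.array([[0.51, 0.49], [0.48, 0.52], [0.52, 0.48]])

# Hard-label agreement (what discrete metrics such as Cohen's kappa see):
hard = np.mean(p1.argmax(axis=1) == p2.argmax(axis=1))  # 0.0
# Probabilistic agreement (what CAPA's observed-agreement term sees):
soft = np.mean(np.sum(p1 * p2, axis=1))        # 0.4994
self_soft = np.mean(np.sum(p1 * p1, axis=1))   # 0.5006, model vs. itself
# The two models are nearly as probabilistically similar as a model is to
# itself, yet hard labels report zero agreement.
```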
- Extensions and Applicability:
Extensions of CAPA to multi-model comparisons, as well as adaptations for classification and exact match settings, are provided. These derivations underline the flexibility of the metric beyond MCQ-based evaluations.
Rigorous statistical testing (including partial correlation, multiple regression analyses, diagnostic tests for normality and homoscedasticity) reaffirms the robustness of the reported effects of model similarity on AI oversight mechanisms, both in judging and training contexts.
Overall, the paper emphasizes that as LMs continue to scale, considering functional similarity, and its implications for both model evaluation and knowledge transfer methodologies, is increasingly critical. The CAPA metric offers a comprehensive framework for analyzing these effects, highlighting that correlated errors could undermine the independence and effectiveness of AI oversight, especially in scenarios where diverse approaches are necessary to mitigate shared systematic failures.