Probably Approximately Correct Labels

Published 12 Jun 2025 in stat.ML and cs.LG | (2506.10908v1)

Abstract: Obtaining high-quality labeled datasets is often costly, requiring either extensive human annotation or expensive experiments. We propose a method that supplements such "expert" labels with AI predictions from pre-trained models to construct labeled datasets more cost-effectively. Our approach results in probably approximately correct labels: with high probability, the overall labeling error is small. This solution enables rigorous yet efficient dataset curation using modern AI models. We demonstrate the benefits of the methodology through text annotation with LLMs, image labeling with pre-trained vision models, and protein folding analysis with AlphaFold.

Summary

  • The paper presents a PAC labeling approach that minimizes expert label costs by intelligently combining costly expert labels with inexpensive AI predictions.
  • It leverages uncertainty scores and upper confidence bounds to set a threshold ensuring that the average labeling error remains below a user-specified ε with high probability.
  • The methodology is extended to multi-model routing and uncertainty calibration, enabling effective, practical trade-offs between accuracy and labeling budget.

The paper "Probably Approximately Correct Labels" (2506.10908) introduces a methodology for cost-effectively creating labeled datasets by combining a small number of expensive expert labels with a large number of cheap, potentially noisy AI-generated labels. The core idea is to produce a dataset where, with high probability (at least $1-\alpha$), the overall labeling error is below a user-specified threshold $\epsilon$. This approach is called Probably Approximately Correct (PAC) labeling.

Core Method: PAC Labeling

The fundamental problem is to label a dataset $X_1, \dots, X_n$ with labels $\tilde Y_1, \dots, \tilde Y_n$ such that the average loss $\frac{1}{n} \sum_{i=1}^n \ell(Y_i, \tilde Y_i)$ is at most $\epsilon$ with probability $1-\alpha$, where the $Y_i$ are the true (unknown) expert labels and $\ell$ is a loss function (e.g., 0-1 loss for classification, squared error for regression).

The method leverages an AI model $f$ that provides predictions $\hat Y_i = f(X_i)$ and associated uncertainty scores $U_i$ (typically $U_i \in [0,1]$, with higher values indicating more uncertainty).

  1. Objective: Minimize expert labeling cost while satisfying the PAC guarantee.
  2. Strategy: Identify an uncertainty threshold $\hat u$. For data points with $U_i \geq \hat u$, expert labels $Y_i$ are collected; for points with $U_i < \hat u$, AI predictions $\hat Y_i$ are used. Thus $\tilde Y_i = Y_i \mathbf{1}\{U_i \geq \hat u\} + \hat Y_i \mathbf{1}\{U_i < \hat u\}$.
  3. Determining the Threshold $\hat u$:
    • Let $L^u = \frac{1}{n} \sum_{i=1}^n \ell(Y_i, \hat Y_i) \mathbf{1}\{U_i \leq u\}$ be the error incurred if AI labels are used for all points with uncertainty $U_i \le u$.
    • The ideal (oracle) threshold is $u^* = \min\{U_i : L^{U_i} > \epsilon\}$; using $u^*$ would satisfy the error criterion.
    • Since the $Y_i$ (and thus $L^u$) are unknown, $L^u$ must be estimated. An upper confidence bound (UCB) $\hat L^u(\alpha)$ is computed such that $P(L^u \leq \hat L^u(\alpha)) \geq 1-\alpha$ for any given $u$.
    • The empirical threshold is then $\hat u = \min\{U_i : \hat L^{U_i}(\alpha) > \epsilon\}$.
    • Theorem 1 of the paper proves that labels generated using this $\hat u$ are PAC labels. The proof relies on the monotonicity of $L^u$ in $u$, which avoids multiple-comparison issues across the different $U_i$.
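The threshold rule above can be sketched in a few lines. This sketch uses a Hoeffding-style bound as a simple stand-in for the meanUB subroutine (the paper's betting-based or CLT-based bounds would slot into the same place); it assumes losses lie in $[0,1]$ and, for simplicity, estimates $L^u$ on an initial expert-labeled sample:

```python
import numpy as np

def pac_threshold(losses, uncertainties, epsilon, alpha):
    """Smallest uncertainty u_hat whose upper confidence bound on L^u exceeds epsilon.

    losses[i]        -- loss(Y_i, Y_hat_i) on an initial expert-labeled sample
    uncertainties[i] -- AI uncertainty score U_i for the same sample
    Assumes losses lie in [0, 1]; uses a Hoeffding upper confidence bound
    as a simple stand-in for the paper's meanUB subroutine.
    """
    m = len(losses)
    slack = np.sqrt(np.log(1.0 / alpha) / (2.0 * m))  # Hoeffding 1-alpha slack
    for u in np.sort(np.unique(uncertainties)):
        # Empirical L^u: average loss counting only points with U_i <= u.
        L_u = np.mean(losses * (uncertainties <= u))
        if L_u + slack > epsilon:  # UCB exceeds the tolerance at this level
            return u               # expert-label every point with U_i >= u
    return np.inf                  # bound never exceeds epsilon: AI labels are safe everywhere
```

Because $L^u$ is monotone in $u$, scanning the sorted uncertainty values and stopping at the first violation recovers $\hat u = \min\{U_i : \hat L^{U_i}(\alpha) > \epsilon\}$; tightening $\epsilon$ shrinks $\hat u$ and pulls more points into the expert-labeled set.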

Algorithm 1: Probably Approximately Correct Labeling outlines the practical steps:

  • Input: Unlabeled data $X_1, \dots, X_n$; AI predictions $\hat Y_1, \dots, \hat Y_n$; uncertainties $U_1, \dots, U_n$; target error $\epsilon$; failure probability $\alpha$; initial sample size $m$; sampling weights $\pi_1, \dots, \pi_n$.
  • Procedure:
    • 1. Initial Expert Labeling: Draw an initial sample of size $m$ according to the weights $\pi_i$ (with inclusion indicators $\xi_i$) and collect expert labels $Y_i$ for the sampled points.
    • 2. Compute UCBs: Estimate $\hat L^u(\alpha)$ with a subroutine meanUB that provides a $1-\alpha$ upper confidence bound for a mean. Options include:
      • Non-asymptotic methods: e.g., betting-based confidence intervals (Waudby-Smith and Ramdas, 2020), which are valid for any sample size $m$ but generally more conservative.
      • Asymptotic methods: based on the Central Limit Theorem (CLT), e.g., $\hat\mu_Z + z_{1-\alpha} \frac{\hat\sigma_Z}{\sqrt{m}}$. Simpler, and they can lead to larger budget savings, but the realized error may slightly exceed the nominal $\epsilon$ if $m$ is too small.
    • 3. Determine Threshold: $\hat u = \min\{U_i : \hat L^{U_i}(\alpha) > \epsilon\}$.
    • 4. Final Expert Labeling: Collect true labels $Y_i$ for all data points with $U_i \geq \hat u$.
    • 5. Assign Final Labels: For all $i$, set $\tilde Y_i \leftarrow Y_i \mathbf{1}\{U_i \geq \hat u\} + \hat Y_i \mathbf{1}\{U_i < \hat u\}$.
    • 6. Optionally, for points $i_j$ from the initial sample with $\xi_{i_j} = 1$, update $\tilde Y_{i_j} \leftarrow Y_{i_j}$ if they are not already covered by $U_{i_j} \ge \hat u$.
  • Output: Labeled dataset $(X_1, \tilde Y_1), \dots, (X_n, \tilde Y_n)$.

import numpy as np
from scipy.stats import norm

def pac_labeling(X, AI_predictions, AI_uncertainties, epsilon, alpha, m, loss_fn, expert_label_fn):
    n = len(X)
    # Step 1: Initial expert labeling (simplified: uniform sampling without replacement)
    initial_indices = np.random.choice(n, size=m, replace=False)
    initial_expert_data = []  # stores (loss_if_AI_used, uncertainty_score, expert_label)
    for idx in initial_indices:
        Y_expert = expert_label_fn(X[idx])
        loss_val = loss_fn(Y_expert, AI_predictions[idx])
        initial_expert_data.append({'loss': loss_val, 'U': AI_uncertainties[idx], 'Y_expert': Y_expert})

    # Steps 2 & 3: Compute UCBs and determine the threshold u_hat
    unique_uncertainties = sorted(set(AI_uncertainties))
    u_hat = float('inf')  # if no candidate violates the bound, AI labels are used everywhere

    for u_candidate in unique_uncertainties:
        # Estimate L^{u_candidate}: mean of loss * 1{U <= u_candidate} over the initial sample
        terms_for_meanUB = [item['loss'] * (1 if item['U'] <= u_candidate else 0)
                            for item in initial_expert_data]

        # Simplified CLT-based upper confidence bound (a betting-based
        # bound would give a non-asymptotic guarantee instead)
        if len(terms_for_meanUB) > 1:
            mean_val = np.mean(terms_for_meanUB)
            std_dev = np.std(terms_for_meanUB, ddof=1)  # sample standard deviation
            z_alpha = norm.ppf(1 - alpha)  # inverse CDF of the standard normal
            meanUB_value = mean_val + z_alpha * std_dev / np.sqrt(len(terms_for_meanUB))
        else:  # too few samples for a meaningful UCB
            meanUB_value = float('inf')

        if meanUB_value > epsilon:
            u_hat = u_candidate  # smallest uncertainty whose UCB exceeds epsilon
            break

    # Steps 4 & 5: Final labeling
    final_labels = [None] * n
    num_expert_labels_collected = 0
    for i in range(n):
        if AI_uncertainties[i] >= u_hat:
            final_labels[i] = expert_label_fn(X[i])  # collect expert label
            num_expert_labels_collected += 1
        else:
            final_labels[i] = AI_predictions[i]  # use AI label

    # Step 6: reuse the expert labels already collected for the initial sample
    for idx, item in zip(initial_indices, initial_expert_data):
        final_labels[idx] = item['Y_expert']

    return final_labels, num_expert_labels_collected

Uncertainty Calibration

The quality of the uncertainty scores $U_i$ is crucial. If an AI model is miscalibrated (e.g., consistently overconfident in a specific data region), PAC labeling might be inefficient. Uncertainty calibration aims to improve these scores.

  • Method:
    • Input: Uncertainties $U_1, \dots, U_m$, expert labels $Y_1, \dots, Y_m$, predicted labels $\hat Y_1, \dots, \hat Y_m$ (from the calibration set), clusters $\mathcal{C}$, number of uncertainty bins $B$, tolerance $\tau$.
    • Procedure:
      • Discretize the uncertainties into $B$ bins $b_j$.
      • Iteratively, for each cluster $C \in \mathcal{C}$ and each bin $b_j$:
        • Let $\mathcal{I}^{C,j} = \{i \in C : U_i \in b_j\}$ (indices of points in cluster $C$ and bin $j$).
        • If $|\mathcal{I}^{C,j}| > 0$, compute the correction term $\Delta_{C,j} = \frac{1}{|\mathcal{I}^{C,j}|} \sum_{i \in \mathcal{I}^{C,j}} (\mathbf{1}\{Y_i \neq \hat Y_i\} - U_i)$, the average difference between the empirical error and the stated uncertainty.
        • If $|\Delta_{C,j}| > \tau$, update $U_i \leftarrow U_i + \Delta_{C,j}$ for all $i \in \mathcal{I}^{C,j}$.
      • Repeat until no update exceeds $\tau$.
    • Output: Calibrated uncertainties $U_1, \dots, U_m$. These calibrated uncertainties (or the learned calibration function) are then applied to the full dataset before PAC labeling.
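A minimal sketch of this calibration loop, assuming binary correctness indicators; the equal-width bins, integer cluster ids, and iteration cap are illustrative simplifications:

```python
import numpy as np

def calibrate_uncertainties(U, errors, clusters, num_bins=10, tol=0.01, max_iters=100):
    """Iteratively shift uncertainty scores toward empirical error rates.

    U[i]        -- uncertainty score in [0, 1]
    errors[i]   -- 1 if the AI prediction was wrong on calibration point i, else 0
    clusters[i] -- cluster id of point i (e.g., media source)
    Repeats until every (cluster, bin) correction is within tol.
    """
    U = U.astype(float).copy()
    bins = np.linspace(0.0, 1.0, num_bins + 1)
    for _ in range(max_iters):
        updated = False
        for c in np.unique(clusters):
            # np.digitize with the interior edges maps each score to one of num_bins bins.
            bin_ids = np.digitize(U, bins[1:-1])
            for b in range(num_bins):
                idx = (clusters == c) & (bin_ids == b)
                if not idx.any():
                    continue
                # Average gap between empirical error and stated uncertainty in this cell.
                delta = np.mean(errors[idx] - U[idx])
                if abs(delta) > tol:
                    U[idx] += delta
                    updated = True
        if not updated:
            break
    return U
```

Each update replaces a cell's uncertainty level with its empirical error rate, so a region where the model is systematically overconfident (high error, low $U_i$) has its scores pushed upward before the PAC threshold is computed.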

Multi-Model Labeling via the PAC Router

When multiple AI models $f_1, \dots, f_k$ are available, each providing predictions $\hat Y_i^j$ and uncertainties $U_i^j$, a PAC router can select the best model for each data point to minimize overall expert labeling.

  1. Two-Step Approach:
    • Learn a routing model $w_\theta(X_i)$ that outputs a probability distribution over the $k$ sources for data point $X_i$. Use it to select the best source $j_i^*$ for each point, yielding $\hat Y_i = \hat Y_i^{j_i^*}$ and $U_i = U_i^{j_i^*}$.
    • Apply the standard PAC labeling procedure (Section 2 of the paper) to these routed labels and uncertainties.
  2. Learning the Routing Model $w_\theta$:
    • A small, fully labeled routing dataset $(X_i, Y_i, \{\hat Y_i^j, U_i^j\}_{j=1}^k)_{i=1}^m$ is used for training $w_\theta$.
    • Simply maximizing the accuracy of $w_\theta$ is suboptimal, as it ignores the model uncertainties and the tolerance $\epsilon$.
    • The goal is to minimize the expected number of expert labels $\sum_{i=1}^m \sum_{j=1}^k w_{\theta,j}(X_i) \mathbf{1}\{U_i^j \geq \hat u\}$, where $\hat u$ is the PAC labeling threshold (which itself depends on $w_\theta$).
    • To make this differentiable:
      • Replace the indicator $\mathbf{1}\{U_i^j \geq \hat u\}$ with a sigmoid. (The paper uses $\sigma(\tilde u - U_i^j)$ for the error term below and $\sigma(U_i^j - \tilde u)$ for the cost term.)
      • Approximate $\hat u$ with a "smooth threshold" $\tilde u(\theta)$ found by solving $\mathbb{E}_{X_i, Y_i}\left[\sum_{j=1}^k w_{\theta,j}(X_i)\,\ell(Y_i, \hat Y_i^j)\,\sigma(\tilde u(\theta) - U_i^j)\right] = \epsilon$. This equation defines $\tilde u$ implicitly as a function of $\theta$.
      • The gradient $\nabla_\theta \tilde u(\theta)$ can be computed via the implicit function theorem.
      • The routing parameters $\theta$ are then optimized by minimizing the smoothed expert labeling cost with gradient descent, incorporating $\nabla_\theta \tilde u(\theta)$.
  3. Recalibrating Uncertainties with the Router:
    • An additional uncertainty model $u_\gamma(X_i)$ can be learned jointly with the router $w_\theta(X_i)$.
    • The objective becomes minimizing $\sum_{i=1}^m \sum_{j=1}^k w_{\theta,j}(X_i)\,\ell(Y_i, \hat Y_i^j)\,\sigma(\tilde u(\theta, \gamma) - u_\gamma(X_i))$, where $\tilde u$ now depends on both $\theta$ and $\gamma$.
    • The gradients $\nabla_\theta \tilde u(\theta, \gamma)$ and $\nabla_\gamma \tilde u(\theta, \gamma)$ are derived similarly.
  4. Cost-Sensitive PAC Router:
    • If the AI models have different costs $c_j$ and expert labels cost $c_{\mathrm{expert}}$, the objective changes to minimizing the total monetary cost $\sum_{i=1}^m \mathbb{E}_{j \sim w_\theta(X_i)}\left[c_j \mathbf{1}\{U_i^j < \hat u\} + c_{\mathrm{expert}} \mathbf{1}\{U_i^j \geq \hat u\}\right]$.
    • This is again made differentiable using sigmoids and implicit differentiation for $\tilde u$.
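For fixed routing weights, the smooth threshold $\tilde u$ can be found with a one-dimensional root solve, since the smoothed error is monotone increasing in $\tilde u$. A bisection sketch (the routing-dataset arrays are hypothetical placeholders, and the paper's backpropagation through $\tilde u$ via the implicit function theorem is omitted here):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def smooth_threshold(w, losses, U, epsilon, lo=-50.0, hi=50.0, iters=100):
    """Solve for u_tilde such that the smoothed routed error equals epsilon.

    w[i, j]      -- routing weight of source j on point i (rows sum to 1)
    losses[i, j] -- loss(Y_i, Y_hat_i^j) on the routing dataset
    U[i, j]      -- uncertainty of source j on point i
    The smoothed error, averaged over i, is increasing in u_tilde,
    so bisection on [lo, hi] converges to the root.
    """
    def smoothed_error(u):
        return np.mean(np.sum(w * losses * sigmoid(u - U), axis=1))
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if smoothed_error(mid) < epsilon:
            lo = mid  # error still below tolerance: raise the threshold
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

In the full method this solve sits inside the training loop: each gradient step on $\theta$ changes the weights $w_{\theta,j}(X_i)$, which moves $\tilde u(\theta)$, and the implicit function theorem supplies $\nabla_\theta \tilde u(\theta)$ without differentiating through the root-finding itself.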

Experiments and Applications

The paper demonstrates PAC labeling across various tasks:

  • Single-Model PAC Labeling:
    • Setup: $\alpha = 0.05$. Performance is measured by "budget save" (the percentage of points not expert-labeled) and the empirical error (which should be $\le \epsilon$).
    • Baselines:
      • "Naive": expert-label if $U_i \ge$ a fixed threshold (e.g., 0.1 or 0.05).
      • "AI only": use AI labels for all points.
    • Discrete Labels (0-1 loss):
      • Text annotation (GPT-4o): misinformation detection (binary), media headline stance on global warming (multi-class), and political bias of media articles (multi-class). Uncertainty: GPT-4o's verbalized confidence. meanUB: betting algorithm.
      • Image labeling (ResNet-152): ImageNet and ImageNet v2. Uncertainty: $1 - p_{\max}(X_i)$, where $p_{\max}$ is the maximum softmax probability.
      • Results: PAC labeling consistently met the error criterion $\epsilon$ while achieving budget saves of 14-60%. "Naive" baselines were often either too conservative (low error, low save) or violated $\epsilon$; "AI only" had high error.
    • Continuous Labels (squared error / MSD):
      • Sentiment analysis (GPT-4o): predict a sentiment score in $[0,1]$. Uncertainty: length of the predicted interval. Loss: squared error.
      • Protein structure prediction (AlphaFold): predict protein structures. Uncertainty: pLDDT. Loss: Mean Squared Deviation (MSD).
      • meanUB: CLT-based.
      • Results: PAC labeling controlled error effectively (e.g., MSD around 0.36-1.0) with budget saves of 16-50%. "AI only" had much larger errors.
    • Uncertainty Calibration:
      • Tested on the media bias dataset with GPT-4o, with articles clustered by source (e.g., CNN, Fox News).
      • Simple calibration (Algorithm 2) improved the budget save (e.g., from 13.7% to 16.7% at $\epsilon = 0.05$).
  • Multi-Model PAC Labeling (PAC Router):
    • Task: media bias annotation (GPT-4o vs. Claude 3 Sonnet). Router trained jointly with an uncertainty model. meanUB: betting algorithm.
    • Costless predictions: the PAC router achieved a 41.6% budget save, compared to ~14% for GPT-4o alone and ~8% for Claude alone (at $\epsilon = 0.05$). The router's combined $L^u$ curve dominated those of the individual models.
    • Cost-sensitive predictions: with relative costs $c_{\mathrm{GPT}} = 0.25$, $c_{\mathrm{Claude}} = 0.075$, $c_{\mathrm{expert}} = 1$, the cost-sensitive router significantly increased monetary savings compared to the individual models.

Implementation Considerations

  • Choice of meanUB Subroutine:
    • Non-asymptotic methods (like betting) offer stronger guarantees, especially for small $m$, but can be more conservative (requiring more expert labels).
    • Asymptotic (CLT-based) methods are simpler to implement and can yield higher budget saves, but may slightly violate the $\epsilon$ guarantee if $m$ is not large enough for the asymptotics to hold. The appendix illustrates this trade-off: larger saves, but errors sometimes slightly above nominal.
  • Quality of Uncertainty Scores: The method's effectiveness hinges on the AI model providing meaningful uncertainty scores (lower uncertainty correlating with higher accuracy). Poor uncertainties reduce the potential savings.
  • Calibration Overhead: Uncertainty calibration requires an initial set of diverse expert labels and adds a preprocessing step. Its complexity depends on the number of clusters and bins.
  • PAC Router Training: Learning the router $w_\theta$ (and $u_\gamma$) involves optimization on a labeled routing dataset, which adds computational cost. Inference is cheap (one forward pass through $w_\theta$).
  • Computational Requirements: Once the initial $m$ labels are collected and $\hat u$ is found, the core PAC labeling algorithm is efficient; the main cost is querying the expert.
  • Selection of $m$: The size $m$ of the initial sample used to estimate $\hat L^u(\alpha)$ is a hyperparameter. A larger $m$ gives tighter UCBs, potentially leading to a better $\hat u$ and more savings, but costs more in initial expert labels.
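The meanUB trade-off can be made concrete by comparing bound widths on the same sample. This sketch uses a Hoeffding bound as a stand-in for the non-asymptotic family (the paper's betting-based intervals are typically tighter than Hoeffding while still valid at any $m$); the hard-coded normal quantiles are a convenience to avoid a scipy dependency:

```python
import math

def clt_ucb(values, alpha):
    """Asymptotic 1-alpha upper confidence bound for the mean (CLT-based)."""
    m = len(values)
    mean = sum(values) / m
    var = sum((v - mean) ** 2 for v in values) / (m - 1)  # sample variance
    z = {0.05: 1.645, 0.025: 1.960}[alpha]  # standard normal quantiles, hard-coded
    return mean + z * math.sqrt(var / m)

def hoeffding_ucb(values, alpha):
    """Non-asymptotic 1-alpha upper bound for the mean of values in [0, 1]."""
    m = len(values)
    mean = sum(values) / m
    return mean + math.sqrt(math.log(1.0 / alpha) / (2.0 * m))
```

On a small sample of rare binary losses, the Hoeffding slack is several times the CLT slack, which illustrates why the non-asymptotic option tends to produce a smaller $\hat u$ and hence more expert labels at the same $\epsilon$.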

In summary, PAC labeling provides a theoretically grounded and practically demonstrated framework to reduce the cost of dataset annotation by intelligently combining AI predictions with expert review, all while maintaining a user-defined level of quality. The extensions for uncertainty calibration and multi-model routing further enhance its applicability and efficiency in real-world scenarios.
