Probably Approximately Correct Labels

Published 12 Jun 2025 in stat.ML and cs.LG | (2506.10908v1)

Abstract: Obtaining high-quality labeled datasets is often costly, requiring either extensive human annotation or expensive experiments. We propose a method that supplements such "expert" labels with AI predictions from pre-trained models to construct labeled datasets more cost-effectively. Our approach results in probably approximately correct labels: with high probability, the overall labeling error is small. This solution enables rigorous yet efficient dataset curation using modern AI models. We demonstrate the benefits of the methodology through text annotation with LLMs, image labeling with pre-trained vision models, and protein folding analysis with AlphaFold.

Summary

  • The paper presents a PAC labeling approach that minimizes expert label costs by intelligently combining costly expert labels with inexpensive AI predictions.
  • It leverages uncertainty scores and upper confidence bounds to set a threshold ensuring that the average labeling error remains below a user-specified ε with high probability.
  • The methodology is extended to multi-model routing and uncertainty calibration, enabling effective, practical trade-offs between accuracy and labeling budget.

The paper "Probably Approximately Correct Labels" (2506.10908) introduces a methodology for cost-effectively creating labeled datasets by combining a small number of expensive expert labels with a large number of cheap, potentially noisy AI-generated labels. The core idea is to produce a dataset where, with high probability (at least $1-\alpha$), the overall labeling error is below a user-specified threshold $\epsilon$. This approach is called Probably Approximately Correct (PAC) labeling.

Core Method: PAC Labeling

The fundamental problem is to label a dataset $X_1, \dots, X_n$ with labels $\tilde Y_1, \dots, \tilde Y_n$ such that the average loss $\frac{1}{n} \sum_{i=1}^n \ell(Y_i, \tilde Y_i)$ is at most $\epsilon$ with probability $1-\alpha$, where the $Y_i$ are the true (unknown) expert labels and $\ell$ is a loss function (e.g., 0-1 loss for classification, squared error for regression).

The method leverages an AI model $f$ that provides predictions $\hat Y_i = f(X_i)$ and associated uncertainty scores $U_i$ (typically $U_i \in [0,1]$, with higher values indicating more uncertainty).

  1. Objective: Minimize expert labeling cost while satisfying the PAC guarantee.
  2. Strategy: Identify an uncertainty threshold $\hat u$. For data points with $U_i \geq \hat u$, expert labels $Y_i$ are collected; for points with $U_i < \hat u$, AI predictions $\hat Y_i$ are used. Thus $\tilde Y_i = Y_i \mathbf{1}\{U_i \geq \hat u\} + \hat Y_i \mathbf{1}\{U_i < \hat u\}$.
  3. Determining the Threshold $\hat u$:
    • Let $L^u = \frac{1}{n} \sum_{i=1}^n \ell(Y_i, \hat Y_i) \mathbf{1}\{U_i \leq u\}$ be the error incurred if AI labels are used for all points with uncertainty $U_i \le u$.
    • The ideal (oracle) threshold is $u^* = \min\{U_i : L^{U_i} > \epsilon\}$; using $u^*$ would satisfy the error criterion.
    • Since the $Y_i$ (and thus $L^u$) are unknown, $L^u$ must be estimated. An upper confidence bound (UCB) $\hat L^u(\alpha)$ is computed such that $P(L^u \leq \hat L^u(\alpha)) \geq 1-\alpha$ for any given $u$.
    • The empirical threshold is then $\hat u = \min\{U_i : \hat L^{U_i}(\alpha) > \epsilon\}$.
    • Theorem 1 of the paper proves that labels generated using this $\hat u$ are PAC labels. The proof relies on the monotonicity of $L^u$ in $u$, which avoids multiple-comparison issues across the different $U_i$.
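The threshold rule above can be sketched in a few lines. This sketch uses a Hoeffding-style bound as a simple stand-in for the meanUB subroutine (the paper's betting-based or CLT-based bounds would slot into the same place); it assumes losses lie in $[0,1]$ and, for simplicity, estimates $L^u$ on an initial expert-labeled sample:

```python
import numpy as np

def pac_threshold(losses, uncertainties, epsilon, alpha):
    """Smallest uncertainty u_hat whose upper confidence bound on L^u exceeds epsilon.

    losses[i]        -- loss(Y_i, Y_hat_i) on an initial expert-labeled sample
    uncertainties[i] -- AI uncertainty score U_i for the same sample
    Assumes losses lie in [0, 1]; uses a Hoeffding upper confidence bound
    as a simple stand-in for the paper's meanUB subroutine.
    """
    m = len(losses)
    slack = np.sqrt(np.log(1.0 / alpha) / (2.0 * m))  # Hoeffding 1-alpha slack
    for u in np.sort(np.unique(uncertainties)):
        # Empirical L^u: average loss counting only points with U_i <= u.
        L_u = np.mean(losses * (uncertainties <= u))
        if L_u + slack > epsilon:  # UCB exceeds the tolerance at this level
            return u               # expert-label every point with U_i >= u
    return np.inf                  # bound never exceeds epsilon: AI labels are safe everywhere
```

Because $L^u$ is monotone in $u$, scanning the sorted uncertainty values and stopping at the first violation recovers $\hat u = \min\{U_i : \hat L^{U_i}(\alpha) > \epsilon\}$; tightening $\epsilon$ shrinks $\hat u$ and pulls more points into the expert-labeled set.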

Algorithm 1: Probably Approximately Correct Labeling outlines the practical steps:

  • Input: Unlabeled data $X_1, \dots, X_n$; AI predictions $\hat Y_1, \dots, \hat Y_n$; uncertainties $U_1, \dots, U_n$; target error $\epsilon$; failure probability $\alpha$; initial sample size $m$; sampling weights $\pi_1, \dots, \pi_n$.
  • Procedure:
    • 1. Initial Expert Labeling: Draw an initial sample of size $m$ according to the weights $\pi_i$ (with inclusion indicators $\xi_i$) and collect expert labels $Y_i$ for the sampled points.
    • 2. Compute UCBs: Estimate $\hat L^u(\alpha)$ with a subroutine meanUB that provides a $1-\alpha$ upper confidence bound for a mean. Options include:
      • Non-asymptotic methods: e.g., betting-based confidence intervals (Waudby-Smith and Ramdas, 2020), which are valid for any sample size $m$ but generally more conservative.
      • Asymptotic methods: based on the Central Limit Theorem (CLT), e.g., $\hat\mu_Z + z_{1-\alpha} \frac{\hat\sigma_Z}{\sqrt{m}}$. Simpler, and they can lead to larger budget savings, but the realized error may slightly exceed the nominal $\epsilon$ if $m$ is too small.
    • 3. Determine Threshold: $\hat u = \min\{U_i : \hat L^{U_i}(\alpha) > \epsilon\}$.
    • 4. Final Expert Labeling: Collect true labels $Y_i$ for all data points with $U_i \geq \hat u$.
    • 5. Assign Final Labels: For all $i$, set $\tilde Y_i \leftarrow Y_i \mathbf{1}\{U_i \geq \hat u\} + \hat Y_i \mathbf{1}\{U_i < \hat u\}$.
    • 6. Optionally, for points $i_j$ from the initial sample with $\xi_{i_j} = 1$, update $\tilde Y_{i_j} \leftarrow Y_{i_j}$ if they are not already covered by $U_{i_j} \ge \hat u$.
  • Output: Labeled dataset $(X_1, \tilde Y_1), \dots, (X_n, \tilde Y_n)$.

import numpy as np
from scipy.stats import norm

def pac_labeling(X, AI_predictions, AI_uncertainties, epsilon, alpha, m, loss_fn, expert_label_fn):
    n = len(X)
    # Step 1: Initial expert labeling (simplified: uniform sampling without replacement)
    initial_indices = np.random.choice(n, size=m, replace=False)
    initial_expert_data = []  # stores (loss_if_AI_used, uncertainty_score, expert_label)
    for idx in initial_indices:
        Y_expert = expert_label_fn(X[idx])
        loss_val = loss_fn(Y_expert, AI_predictions[idx])
        initial_expert_data.append({'loss': loss_val, 'U': AI_uncertainties[idx], 'Y_expert': Y_expert})

    # Steps 2 & 3: Compute UCBs and determine the threshold u_hat
    unique_uncertainties = sorted(set(AI_uncertainties))
    u_hat = float('inf')  # if no candidate violates the bound, AI labels are used everywhere

    for u_candidate in unique_uncertainties:
        # Estimate L^{u_candidate}: mean of loss * 1{U <= u_candidate} over the initial sample
        terms_for_meanUB = [item['loss'] * (1 if item['U'] <= u_candidate else 0)
                            for item in initial_expert_data]

        # Simplified CLT-based upper confidence bound (a betting-based
        # bound would give a non-asymptotic guarantee instead)
        if len(terms_for_meanUB) > 1:
            mean_val = np.mean(terms_for_meanUB)
            std_dev = np.std(terms_for_meanUB, ddof=1)  # sample standard deviation
            z_alpha = norm.ppf(1 - alpha)  # inverse CDF of the standard normal
            meanUB_value = mean_val + z_alpha * std_dev / np.sqrt(len(terms_for_meanUB))
        else:  # too few samples for a meaningful UCB
            meanUB_value = float('inf')

        if meanUB_value > epsilon:
            u_hat = u_candidate  # smallest uncertainty whose UCB exceeds epsilon
            break

    # Steps 4 & 5: Final labeling
    final_labels = [None] * n
    num_expert_labels_collected = 0
    for i in range(n):
        if AI_uncertainties[i] >= u_hat:
            final_labels[i] = expert_label_fn(X[i])  # collect expert label
            num_expert_labels_collected += 1
        else:
            final_labels[i] = AI_predictions[i]  # use AI label

    # Step 6: reuse the expert labels already collected for the initial sample
    for idx, item in zip(initial_indices, initial_expert_data):
        final_labels[idx] = item['Y_expert']

    return final_labels, num_expert_labels_collected

Uncertainty Calibration

The quality of the uncertainty scores $U_i$ is crucial. If an AI model is miscalibrated (e.g., consistently overconfident in a specific data region), PAC labeling might be inefficient. Uncertainty calibration aims to improve these scores.

  • Method:
    • Input: Uncertainties $U_1, \dots, U_m$, expert labels $Y_1, \dots, Y_m$, predicted labels $\hat Y_1, \dots, \hat Y_m$ (from the calibration set), clusters $\mathcal{C}$, number of uncertainty bins $B$, tolerance $\tau$.
    • Procedure:
      • Discretize the uncertainties into $B$ bins $b_j$.
      • Iteratively, for each cluster $C \in \mathcal{C}$ and each bin $b_j$:
        • Let $\mathcal{I}^{C,j} = \{i \in C : U_i \in b_j\}$ (indices of points in cluster $C$ and bin $j$).
        • If $|\mathcal{I}^{C,j}| > 0$, compute the correction term $\Delta_{C,j} = \frac{1}{|\mathcal{I}^{C,j}|} \sum_{i \in \mathcal{I}^{C,j}} (\mathbf{1}\{Y_i \neq \hat Y_i\} - U_i)$, the average difference between the empirical error and the stated uncertainty.
        • If $|\Delta_{C,j}| > \tau$, update $U_i \leftarrow U_i + \Delta_{C,j}$ for all $i \in \mathcal{I}^{C,j}$.
      • Repeat until no update exceeds $\tau$.
    • Output: Calibrated uncertainties $U_1, \dots, U_m$. These calibrated uncertainties (or the learned calibration function) are then applied to the full dataset before PAC labeling.
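A minimal sketch of this calibration loop, assuming binary correctness indicators; the equal-width bins, integer cluster ids, and iteration cap are illustrative simplifications:

```python
import numpy as np

def calibrate_uncertainties(U, errors, clusters, num_bins=10, tol=0.01, max_iters=100):
    """Iteratively shift uncertainty scores toward empirical error rates.

    U[i]        -- uncertainty score in [0, 1]
    errors[i]   -- 1 if the AI prediction was wrong on calibration point i, else 0
    clusters[i] -- cluster id of point i (e.g., media source)
    Repeats until every (cluster, bin) correction is within tol.
    """
    U = U.astype(float).copy()
    bins = np.linspace(0.0, 1.0, num_bins + 1)
    for _ in range(max_iters):
        updated = False
        for c in np.unique(clusters):
            # np.digitize with the interior edges maps each score to one of num_bins bins.
            bin_ids = np.digitize(U, bins[1:-1])
            for b in range(num_bins):
                idx = (clusters == c) & (bin_ids == b)
                if not idx.any():
                    continue
                # Average gap between empirical error and stated uncertainty in this cell.
                delta = np.mean(errors[idx] - U[idx])
                if abs(delta) > tol:
                    U[idx] += delta
                    updated = True
        if not updated:
            break
    return U
```

Each update replaces a cell's uncertainty level with its empirical error rate, so a region where the model is systematically overconfident (high error, low $U_i$) has its scores pushed upward before the PAC threshold is computed.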

Multi-Model Labeling via the PAC Router

When multiple AI models $f_1, \dots, f_k$ are available, each providing predictions $\hat Y_i^j$ and uncertainties $U_i^j$, a PAC router can select the best model for each data point to minimize overall expert labeling.

  1. Two-Step Approach:
    • Learn a routing model $w_\theta(X_i)$ that outputs a probability distribution over the $k$ sources for data point $X_i$. Use it to select the best source $j_i^*$ for each point, yielding $\hat Y_i = \hat Y_i^{j_i^*}$ and $U_i = U_i^{j_i^*}$.
    • Apply the standard PAC labeling procedure (Section 2 of the paper) to these routed labels and uncertainties.
  2. Learning the Routing Model $w_\theta$:
    • A small, fully labeled routing dataset $(X_i, Y_i, \{\hat Y_i^j, U_i^j\}_{j=1}^k)_{i=1}^m$ is used for training $w_\theta$.
    • Simply maximizing the accuracy of $w_\theta$ is suboptimal, as it ignores the model uncertainties and the tolerance $\epsilon$.
    • The goal is to minimize the expected number of expert labels $\sum_{i=1}^m \sum_{j=1}^k w_{\theta,j}(X_i) \mathbf{1}\{U_i^j \geq \hat u\}$, where $\hat u$ is the PAC labeling threshold (which itself depends on $w_\theta$).
    • To make this differentiable:
      • Replace the indicator $\mathbf{1}\{U_i^j \geq \hat u\}$ with a sigmoid. (The paper uses $\sigma(\tilde u - U_i^j)$ for the error term below and $\sigma(U_i^j - \tilde u)$ for the cost term.)
      • Approximate $\hat u$ with a "smooth threshold" $\tilde u(\theta)$ found by solving $\mathbb{E}_{X_i, Y_i}\left[\sum_{j=1}^k w_{\theta,j}(X_i)\,\ell(Y_i, \hat Y_i^j)\,\sigma(\tilde u(\theta) - U_i^j)\right] = \epsilon$. This equation defines $\tilde u$ implicitly as a function of $\theta$.
      • The gradient $\nabla_\theta \tilde u(\theta)$ can be computed via the implicit function theorem.
      • The routing parameters $\theta$ are then optimized by minimizing the smoothed expert labeling cost with gradient descent, incorporating $\nabla_\theta \tilde u(\theta)$.
  3. Recalibrating Uncertainties with the Router:
    • An additional uncertainty model $u_\gamma(X_i)$ can be learned jointly with the router $w_\theta(X_i)$.
    • The objective becomes minimizing $\sum_{i=1}^m \sum_{j=1}^k w_{\theta,j}(X_i)\,\ell(Y_i, \hat Y_i^j)\,\sigma(\tilde u(\theta, \gamma) - u_\gamma(X_i))$, where $\tilde u$ now depends on both $\theta$ and $\gamma$.
    • The gradients $\nabla_\theta \tilde u(\theta, \gamma)$ and $\nabla_\gamma \tilde u(\theta, \gamma)$ are derived similarly.
  4. Cost-Sensitive PAC Router:
    • If the AI models have different costs $c_j$ and expert labels cost $c_{\mathrm{expert}}$, the objective changes to minimizing the total monetary cost $\sum_{i=1}^m \mathbb{E}_{j \sim w_\theta(X_i)}\left[c_j \mathbf{1}\{U_i^j < \hat u\} + c_{\mathrm{expert}} \mathbf{1}\{U_i^j \geq \hat u\}\right]$.
    • This is again made differentiable using sigmoids and implicit differentiation for $\tilde u$.
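For fixed routing weights, the smooth threshold $\tilde u$ can be found with a one-dimensional root solve, since the smoothed error is monotone increasing in $\tilde u$. A bisection sketch (the routing-dataset arrays are hypothetical placeholders, and the paper's backpropagation through $\tilde u$ via the implicit function theorem is omitted here):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def smooth_threshold(w, losses, U, epsilon, lo=-50.0, hi=50.0, iters=100):
    """Solve for u_tilde such that the smoothed routed error equals epsilon.

    w[i, j]      -- routing weight of source j on point i (rows sum to 1)
    losses[i, j] -- loss(Y_i, Y_hat_i^j) on the routing dataset
    U[i, j]      -- uncertainty of source j on point i
    The smoothed error, averaged over i, is increasing in u_tilde,
    so bisection on [lo, hi] converges to the root.
    """
    def smoothed_error(u):
        return np.mean(np.sum(w * losses * sigmoid(u - U), axis=1))
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if smoothed_error(mid) < epsilon:
            lo = mid  # error still below tolerance: raise the threshold
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

In the full method this solve sits inside the training loop: each gradient step on $\theta$ changes the weights $w_{\theta,j}(X_i)$, which moves $\tilde u(\theta)$, and the implicit function theorem supplies $\nabla_\theta \tilde u(\theta)$ without differentiating through the root-finding itself.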

Experiments and Applications

The paper demonstrates PAC labeling across various tasks:

  • Single-Model PAC Labeling:
    • Setup: $\alpha = 0.05$. Performance is measured by "budget save" (the percentage of points not expert-labeled) and the empirical error (which should be $\le \epsilon$).
    • Baselines:
      • "Naive": expert-label if $U_i \ge$ a fixed threshold (e.g., 0.1 or 0.05).
      • "AI only": use AI labels for all points.
    • Discrete Labels (0-1 loss):
      • Text annotation (GPT-4o): misinformation detection (binary), media headline stance on global warming (multi-class), and political bias of media articles (multi-class). Uncertainty: GPT-4o's verbalized confidence. meanUB: betting algorithm.
      • Image labeling (ResNet-152): ImageNet and ImageNet v2. Uncertainty: $1 - p_{\max}(X_i)$, where $p_{\max}$ is the maximum softmax probability.
      • Results: PAC labeling consistently met the error criterion $\epsilon$ while achieving budget saves of 14-60%. "Naive" baselines were often either too conservative (low error, low save) or violated $\epsilon$; "AI only" had high error.
    • Continuous Labels (squared error / MSD):
      • Sentiment analysis (GPT-4o): predict a sentiment score in $[0,1]$. Uncertainty: length of the predicted interval. Loss: squared error.
      • Protein structure prediction (AlphaFold): predict protein structures. Uncertainty: pLDDT. Loss: Mean Squared Deviation (MSD).
      • meanUB: CLT-based.
      • Results: PAC labeling controlled error effectively (e.g., MSD around 0.36-1.0) with budget saves of 16-50%. "AI only" had much larger errors.
    • Uncertainty Calibration:
      • Tested on the media bias dataset with GPT-4o, with articles clustered by source (e.g., CNN, Fox News).
      • Simple calibration (Algorithm 2) improved the budget save (e.g., from 13.7% to 16.7% at $\epsilon = 0.05$).
  • Multi-Model PAC Labeling (PAC Router):
    • Task: media bias annotation (GPT-4o vs. Claude 3 Sonnet). Router trained jointly with an uncertainty model. meanUB: betting algorithm.
    • Costless predictions: the PAC router achieved a 41.6% budget save, compared to ~14% for GPT-4o alone and ~8% for Claude alone (at $\epsilon = 0.05$). The router's combined $L^u$ curve dominated those of the individual models.
    • Cost-sensitive predictions: with relative costs $c_{\mathrm{GPT}} = 0.25$, $c_{\mathrm{Claude}} = 0.075$, $c_{\mathrm{expert}} = 1$, the cost-sensitive router significantly increased monetary savings compared to the individual models.

Implementation Considerations

  • Choice of meanUB Subroutine:
    • Non-asymptotic methods (like betting) offer stronger guarantees, especially for small $m$, but can be more conservative (requiring more expert labels).
    • Asymptotic (CLT-based) methods are simpler to implement and can yield higher budget saves, but may slightly violate the $\epsilon$ guarantee if $m$ is not large enough for the asymptotics to hold. The appendix illustrates this trade-off: larger saves, but errors sometimes slightly above nominal.
  • Quality of Uncertainty Scores: The method's effectiveness hinges on the AI model providing meaningful uncertainty scores (lower uncertainty correlating with higher accuracy). Poor uncertainties reduce the potential savings.
  • Calibration Overhead: Uncertainty calibration requires an initial set of diverse expert labels and adds a preprocessing step. Its complexity depends on the number of clusters and bins.
  • PAC Router Training: Learning the router $w_\theta$ (and $u_\gamma$) involves optimization on a labeled routing dataset, which adds computational cost. Inference is cheap (one forward pass through $w_\theta$).
  • Computational Requirements: Once the initial $m$ labels are collected and $\hat u$ is found, the core PAC labeling algorithm is efficient; the main cost is querying the expert.
  • Selection of $m$: The size $m$ of the initial sample used to estimate $\hat L^u(\alpha)$ is a hyperparameter. A larger $m$ gives tighter UCBs, potentially leading to a better $\hat u$ and more savings, but costs more in initial expert labels.
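The meanUB trade-off can be made concrete by comparing bound widths on the same sample. This sketch uses a Hoeffding bound as a stand-in for the non-asymptotic family (the paper's betting-based intervals are typically tighter than Hoeffding while still valid at any $m$); the hard-coded normal quantiles are a convenience to avoid a scipy dependency:

```python
import math

def clt_ucb(values, alpha):
    """Asymptotic 1-alpha upper confidence bound for the mean (CLT-based)."""
    m = len(values)
    mean = sum(values) / m
    var = sum((v - mean) ** 2 for v in values) / (m - 1)  # sample variance
    z = {0.05: 1.645, 0.025: 1.960}[alpha]  # standard normal quantiles, hard-coded
    return mean + z * math.sqrt(var / m)

def hoeffding_ucb(values, alpha):
    """Non-asymptotic 1-alpha upper bound for the mean of values in [0, 1]."""
    m = len(values)
    mean = sum(values) / m
    return mean + math.sqrt(math.log(1.0 / alpha) / (2.0 * m))
```

On a small sample of rare binary losses, the Hoeffding slack is several times the CLT slack, which illustrates why the non-asymptotic option tends to produce a smaller $\hat u$ and hence more expert labels at the same $\epsilon$.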

In summary, PAC labeling provides a theoretically grounded and practically demonstrated framework to reduce the cost of dataset annotation by intelligently combining AI predictions with expert review, all while maintaining a user-defined level of quality. The extensions for uncertainty calibration and multi-model routing further enhance its applicability and efficiency in real-world scenarios.
