Probably Approximately Correct Labels
Abstract: Obtaining high-quality labeled datasets is often costly, requiring either extensive human annotation or expensive experiments. We propose a method that supplements such "expert" labels with AI predictions from pre-trained models to construct labeled datasets more cost-effectively. Our approach results in probably approximately correct labels: with high probability, the overall labeling error is small. This solution enables rigorous yet efficient dataset curation using modern AI models. We demonstrate the benefits of the methodology through text annotation with LLMs, image labeling with pre-trained vision models, and protein folding analysis with AlphaFold.
Summary
- The paper presents a PAC labeling approach that minimizes expert label costs by intelligently combining costly expert labels with inexpensive AI predictions.
- It leverages uncertainty scores and upper confidence bounds to set a threshold ensuring that the average labeling error remains below a user-specified ε with high probability.
- The methodology is extended to multi-model routing and uncertainty calibration, enabling effective, practical trade-offs between accuracy and labeling budget.
The paper "Probably Approximately Correct Labels" (2506.10908) introduces a methodology for cost-effectively creating labeled datasets by combining a small number of expensive expert labels with a large number of cheap, potentially noisy AI-generated labels. The core idea is to produce a dataset where, with high probability (at least 1−α), the overall labeling error is below a user-specified threshold (ϵ). This approach is called Probably Approximately Correct (PAC) labeling.
Core Method: PAC Labeling
The fundamental problem is to label a dataset X1,…,Xn to get labels Y~1,…,Y~n such that the average loss (1/n) ∑_{i=1}^n ℓ(Yi, Y~i) is at most ϵ with probability at least 1−α, where Yi are the true (unknown) expert labels and ℓ is a loss function (e.g., 0-1 loss for classification, squared error for regression).
The method leverages an AI model f that provides predictions Y^i=f(Xi) and associated uncertainty scores Ui (typically Ui∈[0,1], with higher values indicating more uncertainty).
- Objective: Minimize expert labeling cost while satisfying the PAC guarantee.
- Strategy: Identify an uncertainty threshold u^. For data points where Ui≥u^, expert labels Yi are collected. For points where Ui<u^, AI predictions Y^i are used. So, Y~i=Yi1{Ui≥u^}+Y^i1{Ui<u^}.
- Determining the Threshold u^:
- Let Lu = (1/n) ∑_{i=1}^n ℓ(Yi, Y^i)·1{Ui ≤ u} be the error incurred if AI labels are used for all points with uncertainty Ui ≤ u.
- The ideal (oracle) threshold would be u∗=min{Ui:LUi>ϵ}. Using this u∗ would satisfy the error criterion.
- Since Yi (and thus Lu) are unknown, Lu is estimated. An upper confidence bound (UCB) L^u(α) is computed such that P(Lu≤L^u(α))≥1−α for any given u.
- The empirical threshold is then u^=min{Ui:L^Ui(α)>ϵ}.
- Theorem 1 in the paper proves that labels generated using this u^ are PAC labels. The proof relies on the monotonicity of Lu with respect to u, which avoids issues with multiple comparisons across different Ui.
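As an illustration (not from the paper), the oracle threshold u∗ = min{Ui : LUi > ϵ} can be computed directly when the true losses are known; `oracle_threshold` below is a hypothetical helper for exactly that:

```python
import numpy as np

def oracle_threshold(losses, uncertainties, epsilon):
    """Oracle threshold u* = min{U_i : L_{U_i} > epsilon}, where
    L_u = (1/n) * sum_i loss_i * 1{U_i <= u} and the true losses
    ell(Y_i, Yhat_i) are assumed known (illustrative only)."""
    n = len(losses)
    order = np.argsort(uncertainties)
    # Sorting by uncertainty makes L_{U_(1)}, L_{U_(2)}, ... a running mean.
    cum_loss = np.cumsum(np.asarray(losses, dtype=float)[order]) / n
    exceed = np.nonzero(cum_loss > epsilon)[0]
    if len(exceed) == 0:
        return np.inf  # AI labels are acceptable everywhere
    return np.asarray(uncertainties)[order][exceed[0]]
```

With losses (0, 0, 0.5, 0.5) at uncertainties (0.1, 0.2, 0.3, 0.4) and ϵ = 0.1, the running error first exceeds ϵ at u = 0.3, so expert labels would be collected for all points with Ui ≥ 0.3. In practice the losses are unknown, which is precisely why the paper replaces Lu with the upper confidence bound L^u(α).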
Algorithm 1: Probably Approximately Correct Labeling outlines the practical steps:
- Input: Unlabeled data X1,…,Xn; AI predictions Y^1,…,Y^n; uncertainties U1,…,Un; target error ϵ; failure probability α; initial sample size m; sampling weights π1,…,πn.
- Procedure:
- 1. Initial Expert Labeling: Sample an initial set of m points (with indicators ξi drawn according to the weights πi) and collect their expert labels.
- 2. Compute UCBs: For each candidate threshold, compute L^u(α) via a meanUB subroutine, which provides a 1−α upper confidence bound for a mean. Options include:
- Non-asymptotic methods: e.g., betting-based confidence intervals (Waudby-Smith and Ramdas, 2020), which provide valid guarantees for any sample size m. These are generally more conservative.
- Asymptotic methods: based on the Central Limit Theorem (CLT), e.g., μ^Z + z1−α·σ^Z/√m. Simpler and can lead to larger budget savings, but may slightly exceed the nominal error ϵ if m is too small.
- 3. Determine Threshold: u^=min{Ui:L^Ui(α)>ϵ}.
- 4. Final Expert Labeling: Collect true labels Yi for all data points where Ui≥u^.
- 5. Assign Final Labels: For all i, Y~i←Yi1{Ui≥u^}+Y^i1{Ui<u^}.
- 6. Optionally, for points ij from the initial sample where ξij=1, update Y~ij←Yij if they weren't already covered by Uij≥u^.
- Output: Labeled dataset (X1,Y~1),…,(Xn,Y~n).
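Both flavors of meanUB can be sketched as follows; the helper names are illustrative, and the Hoeffding bound is shown as a simple non-asymptotic stand-in for the (tighter) betting-based intervals, assuming losses bounded in [0, 1]:

```python
import numpy as np
from scipy.stats import norm

def mean_ub_clt(z, alpha):
    """Asymptotic 1 - alpha upper confidence bound for the mean of z (CLT)."""
    z = np.asarray(z, dtype=float)
    return z.mean() + norm.ppf(1 - alpha) * z.std(ddof=1) / np.sqrt(len(z))

def mean_ub_hoeffding(z, alpha):
    """Non-asymptotic 1 - alpha upper bound for the mean of z in [0, 1].

    Simpler (and typically looser) than betting-based intervals,
    but valid at any sample size."""
    z = np.asarray(z, dtype=float)
    return z.mean() + np.sqrt(np.log(1 / alpha) / (2 * len(z)))
```

For m = 100 observations at α = 0.05, the Hoeffding bound adds a fixed √(ln 20 / 200) ≈ 0.12 to the sample mean regardless of spread, while the CLT bound's width scales with the empirical standard deviation; this is the conservatism-versus-savings trade-off described above.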
```python
import numpy as np
from scipy.stats import norm


def pac_labeling(X, AI_predictions, AI_uncertainties, epsilon, alpha, m,
                 loss_fn, expert_label_fn):
    n = len(X)

    # Step 1: initial expert labeling (simplified: uniform sampling, pi_i = 1)
    initial_indices = np.random.choice(n, size=m, replace=False)
    initial_expert_data = []  # loss-if-AI-used and uncertainty per sampled point
    for idx in initial_indices:
        Y_expert = expert_label_fn(X[idx])
        loss_val = loss_fn(Y_expert, AI_predictions[idx])
        initial_expert_data.append(
            {'loss': loss_val, 'U': AI_uncertainties[idx],
             'Y_expert': Y_expert, 'idx': idx})

    # Steps 2 & 3: compute UCBs and determine the threshold
    # u_hat = min{U_i : L_hat_{U_i}(alpha) > epsilon}; if no UCB exceeds
    # epsilon, AI labels are acceptable everywhere (u_hat = infinity).
    u_hat = float('inf')
    for u_candidate in sorted(set(AI_uncertainties)):
        # Estimate L_u: mean of loss * 1{U <= u_candidate} over the initial sample
        terms = [item['loss'] * (1.0 if item['U'] <= u_candidate else 0.0)
                 for item in initial_expert_data]
        # CLT-based upper confidence bound; a betting-based meanUB could be swapped in
        if len(terms) > 1:
            meanUB_value = (np.mean(terms)
                            + norm.ppf(1 - alpha) * np.std(terms, ddof=1)
                            / np.sqrt(len(terms)))
        else:
            meanUB_value = float('inf')  # too few samples for a meaningful bound
        if meanUB_value > epsilon:
            u_hat = u_candidate
            break

    # Steps 4 & 5: expert labels above the threshold, AI labels below
    final_labels = [None] * n
    num_expert_labels_collected = 0
    for i in range(n):
        if AI_uncertainties[i] >= u_hat:
            final_labels[i] = expert_label_fn(X[i])
            num_expert_labels_collected += 1
        else:
            final_labels[i] = AI_predictions[i]

    # Step 6: reuse the expert labels already collected in the initial sample
    for item in initial_expert_data:
        final_labels[item['idx']] = item['Y_expert']

    return final_labels, num_expert_labels_collected
```
Uncertainty Calibration
The quality of uncertainty scores Ui is crucial. If an AI model is miscalibrated (e.g., consistently overconfident in a specific data region), PAC labeling might be inefficient. Uncertainty calibration aims to improve these scores.
- Method:
- Input: Uncertainties U1,…,Um, expert labels Y1,…,Ym, predicted labels Y^1,…,Y^m (from the calibration set), clusters C, number of uncertainty bins B, tolerance τ.
- Procedure:
- Discretize uncertainties into B bins bj.
- Iteratively, for each cluster C∈C and each bin bj:
- Let IC,j={i∈C:Ui∈bj} (indices of points in cluster C and bin j).
- If ∣IC,j∣>0, compute the correction term:
- ΔC,j = (1/∣IC,j∣) ∑_{i∈IC,j} (1{Yi ≠ Y^i} − Ui). This is the average difference between the empirical error and the uncertainty.
- If ∣ΔC,j∣>τ, update Ui←Ui+ΔC,j for all i∈IC,j.
- Repeat until no updates greater than τ are made.
- Output: Calibrated uncertainties U1,…,Um. These calibrated uncertainties (or the learned calibration function) are then applied to the full dataset before PAC labeling.
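A minimal sketch of this calibration loop, assuming 0-1 loss; the function and variable names are illustrative, not from the paper:

```python
import numpy as np

def calibrate_uncertainties(U, errors, clusters, n_bins=10, tol=0.01, max_iter=100):
    """Shift uncertainties within each (cluster, bin) cell toward the
    empirical error rate, until all corrections fall below tol.

    U        : uncertainty scores in [0, 1] on the calibration set
    errors   : 0/1 indicators 1{Y_i != Yhat_i} on the calibration set
    clusters : cluster id per point (e.g., media source)
    """
    U = np.asarray(U, dtype=float).copy()
    errors = np.asarray(errors, dtype=float)
    clusters = np.asarray(clusters)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    for _ in range(max_iter):
        updated = False
        bin_ids = np.clip(np.digitize(U, bins) - 1, 0, n_bins - 1)
        for c in np.unique(clusters):
            for j in range(n_bins):
                idx = np.where((clusters == c) & (bin_ids == j))[0]
                if len(idx) == 0:
                    continue
                # Average gap between empirical error and stated uncertainty
                delta = np.mean(errors[idx] - U[idx])
                if abs(delta) > tol:
                    U[idx] += delta
                    updated = True
        if not updated:
            break
    return U
```

For example, if a cluster's points all carry uncertainty 0.1 but half of them are mislabeled, the loop shifts their uncertainties up to 0.5, matching the observed error rate.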
Multi-Model Labeling via the PAC Router
When multiple AI models (f1,…,fk) are available, each providing predictions Y^ij and uncertainties Uij, a PAC router can select the best model for each data point to minimize overall expert labeling.
- Two-Step Approach:
- Learn a routing model wθ(Xi) that outputs a probability distribution over the k sources for data point Xi. Use this to select the "best" source ji∗ for each point, yielding Y^i = Y^i,ji∗ and Ui = Ui,ji∗.
- Apply the standard PAC labeling procedure (Section 2 of the paper) using these routed labels and uncertainties.
- Learning the Routing Model wθ:
- A small, fully labeled routing dataset (Xi,Yi,{Y^ij,Uij}j=1k)i=1m is used for training wθ.
- Simply maximizing accuracy of wθ is suboptimal, as it ignores model uncertainties and the ϵ tolerance.
- The goal is to minimize the expected number of expert labels: ∑_{i=1}^m ∑_{j=1}^k wθ,j(Xi)·1{Uij ≥ u^}, where u^ is the PAC labeling threshold (which itself depends on wθ).
- To make this differentiable:
- Replace the indicator 1{Uij≥u^} with a sigmoid σ(Uij−u^). (The paper uses σ(u~−Uij) for the error term below, and σ(Uij−u~) for the cost term.)
- Approximate u^ with a "smooth threshold" u~(θ) found by solving: E_{Xi,Yi}[∑_{j=1}^k wθ,j(Xi)·ℓ(Yi,Y^ij)·σ(u~(θ)−Uij)] = ϵ. This equation defines u~ implicitly as a function of θ.
- The gradient ∇θu~(θ) can be computed using the implicit function theorem.
- The routing model parameters θ are then optimized by minimizing the smoothed version of the expert labeling cost using gradient descent, incorporating ∇θu~(θ).
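For fixed routing weights, the implicit equation defining u~ can be solved numerically, since its left-hand side is nondecreasing in u~; a bisection sketch (illustrative, not the paper's implementation, with an assumed sigmoid temperature):

```python
import numpy as np

def sigmoid(x, temp=0.1):
    # Temperature-scaled sigmoid; temp is an assumed smoothing parameter.
    return 1.0 / (1.0 + np.exp(-x / temp))

def smooth_threshold(weights, losses, uncertainties, epsilon,
                     lo=-1.0, hi=2.0, iters=60):
    """Solve (1/m) * sum_{i,j} w[i,j] * loss[i,j] * sigmoid(u - U[i,j]) = epsilon
    for u by bisection; the left-hand side is nondecreasing in u.

    weights, losses, uncertainties : (m, k) arrays over points and models.
    """
    def smoothed_error(u):
        return np.mean(np.sum(weights * losses * sigmoid(u - uncertainties), axis=1))
    if smoothed_error(hi) <= epsilon:
        return hi  # AI labels acceptable over the whole search range
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if smoothed_error(mid) > epsilon:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)
```

During router training, this solve is repeated at each gradient step, with the gradient of u~ with respect to θ supplied by the implicit function theorem rather than by differentiating through the bisection loop.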
- Recalibrating Uncertainties with the Router:
- An additional uncertainty model uγ(Xi) can be learned simultaneously with the router wθ(Xi).
- The objective becomes minimizing ∑_{i=1}^m ∑_{j=1}^k wθ,j(Xi)·ℓ(Yi,Y^ij)·σ(u~(θ,γ)−uγ(Xi)), where u~ now depends on both θ and γ.
- Gradients ∇θu~(θ,γ) and ∇γu~(θ,γ) are derived similarly.
- Cost-Sensitive PAC Router:
- If AI models have different costs cj and expert labels cost cexpert, the objective changes to minimizing total monetary cost: ∑_{i=1}^m E_{j∼wθ(Xi)}[cj·1{Uij < u^} + cexpert·1{Uij ≥ u^}].
- This is again made differentiable using sigmoids and implicit differentiation for u~.
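Given a threshold u^, the expected monetary cost above can be evaluated directly; `expected_cost` is a hypothetical helper with illustrative relative costs:

```python
import numpy as np

def expected_cost(weights, uncertainties, u_hat, model_costs, expert_cost):
    """Total expected cost sum_i E_{j ~ w(X_i)}[c_j * 1{U_ij < u_hat}
    + c_expert * 1{U_ij >= u_hat}].

    weights, uncertainties : (m, k) arrays over points and models.
    model_costs            : (k,) per-query cost of each AI model.
    """
    model_costs = np.asarray(model_costs)
    use_ai = uncertainties < u_hat  # points cheap enough to leave to the AI model
    per_point = weights * np.where(use_ai, model_costs, expert_cost)
    return per_point.sum()
```

With two points routed deterministically to two models, uncertainties below u^ = 0.5 for the chosen models, and costs (0.25, 0.075) against an expert cost of 1, the total expected cost is 0.25 + 0.075 = 0.325 rather than 2 expert queries.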
Experiments and Applications
The paper demonstrates PAC labeling across various tasks:
- Single-Model PAC Labeling:
- Setup: α=0.05. Performance is measured by "budget save" (percentage of points not expert-labeled) and the empirical error (which should be ≤ϵ).
- Baselines:
- "Naive": Expert label if Ui≥fixed threshold (e.g., 0.1 or 0.05).
- "AI only": Use AI labels for all points.
- Discrete Labels (0-1 loss):
- Text Annotation (GPT-4o):
- Misinformation detection (binary).
- Media headline stance on global warming (multi-class).
- Political bias of media articles (multi-class).
- Uncertainty: GPT-4o's verbalized confidence.
- meanUB: Betting algorithm.
- Image Labeling (ResNet-152):
- ImageNet, ImageNet v2.
- Uncertainty: 1−pmax(Xi) (max softmax output).
- Results: PAC labeling consistently met the error criterion ϵ while achieving budget saves of 14-60%. "Naive" baselines were often either too conservative (low error, low save) or violated ϵ. "AI only" had high error.
- Continuous Labels (Squared Error / MSD):
- Sentiment Analysis (GPT-4o): Predict sentiment score [0,1]. Uncertainty from length of predicted interval. Loss: squared error.
- Protein Structure Prediction (AlphaFold): Predict protein structures. Uncertainty from pLDDT. Loss: Mean Squared Deviation (MSD).
- meanUB: CLT-based.
- Results: PAC labeling controlled error effectively (e.g., MSD around 0.36-1.0) with budget saves of 16-50%. "AI only" had much larger errors.
- Uncertainty Calibration:
- Tested on media bias dataset with GPT-4o. Articles clustered by source (e.g., CNN, Fox News).
- Simple calibration (Algorithm 2) improved budget save (e.g., from 13.7% to 16.7% for ϵ=0.05).
- Multi-Model PAC Labeling (PAC Router):
- Task: Media bias annotation (GPT-4o vs. Claude 3 Sonnet).
- Router trained with uncertainty model.
- meanUB: Betting algorithm.
- Costless Predictions: The PAC router achieved a 41.6% budget save, compared to ~14% for GPT-4o alone and ~8% for Claude alone (at ϵ=0.05). The router's combined Lu curve dominated those of the individual models.
- Cost-Sensitive Predictions: With cGPT=0.25, cClaude=0.075, cexpert=1 (relative costs), the cost-sensitive router significantly increased monetary savings compared to individual models.
Implementation Considerations
- Choice of meanUB Subroutine:
- Non-asymptotic methods (like betting) offer stronger guarantees, especially for small m, but might be more conservative (requiring more expert labels).
- Asymptotic methods (CLT-based) are simpler to implement and can yield higher budget saves but might slightly violate the ϵ guarantee if m isn't large enough for asymptotics to hold. The appendix shows this trade-off: larger saves but errors sometimes slightly above nominal.
- Quality of Uncertainty Scores: The method's effectiveness hinges on the AI model providing meaningful uncertainty scores (lower uncertainty correlating with higher accuracy). Poor uncertainties reduce potential savings.
- Calibration Overhead: Uncertainty calibration requires an initial set of diverse expert labels and adds a preprocessing step. The complexity of calibration depends on the number of clusters and bins.
- PAC Router Training: Learning the router wθ (and uγ) involves optimization on a labeled routing dataset, which adds computational cost. The inference is cheap (one forward pass through wθ).
- Computational Requirements: The core PAC labeling algorithm, once the initial m labels are collected and u^ is found, is efficient. The main cost is querying the expert.
- Selection of m: The size of the initial sample m for estimating L^u(α) is a hyperparameter. Larger m gives tighter UCBs, potentially leading to a better u^ and more savings, but costs more in initial expert labels.
In summary, PAC labeling provides a theoretically grounded and practically demonstrated framework to reduce the cost of dataset annotation by intelligently combining AI predictions with expert review, all while maintaining a user-defined level of quality. The extensions for uncertainty calibration and multi-model routing further enhance its applicability and efficiency in real-world scenarios.