Can Agentic AI Match the Performance of Human Data Scientists?

Published 24 Dec 2025 in cs.LG, cs.AI, and stat.ME | (2512.20959v1)

Abstract: Data science plays a critical role in transforming complex data into actionable insights across numerous domains. Recent developments in LLMs have significantly automated data science workflows, but a fundamental question persists: Can these agentic AI systems truly match the performance of human data scientists who routinely leverage domain-specific knowledge? We explore this question by designing a prediction task where a crucial latent variable is hidden in relevant image data instead of tabular features. As a result, agentic AI that generates generic codes for modeling tabular data cannot perform well, while human experts could identify the important hidden variable using domain knowledge. We demonstrate this idea with a synthetic dataset for property insurance. Our experiments show that agentic AI that relies on generic analytics workflow falls short of methods that use domain-specific insights. This highlights a key limitation of the current agentic AI for data science and underscores the need for future research to develop agentic AI systems that can better recognize and incorporate domain knowledge.

Summary

The paper demonstrates that human data scientists using multimodal inputs achieve significantly higher predictive performance than agentic AI relying solely on tabular data.
The methodology employs a synthetic insurance dataset with a hidden latent variable, Roof Health, encoded through detailed aerial roof images to test cross-modal inference.
The findings reveal that current agentic AI pipelines are bottlenecked by their inability to extract embedded domain knowledge, underscoring the need for advanced multimodal reasoning.

Evaluation of Agentic AI Versus Human Data Scientists: Predictive Modeling with Hidden Latent Variables

Introduction

This paper addresses whether current agentic AI systems, exemplified by LLM-driven analytics agents, can match the predictive performance of expert human data scientists—especially in environments where crucial domain knowledge is embedded in modalities outside the primary tabular data. The authors construct a synthetic property insurance dataset in which a critical latent variable (Roof Health) is omitted from tabular features and only visually encoded within associated roof images. This design exposes the limitations of agentic AI strategies that rely on generic analytics code ignoring such domain-specific cues, and quantifies the empirical performance gap relative to human workflows that exploit cross-modal knowledge integration.

Synthetic Dataset Design and Data Generative Process

The authors generate data instances in three stages: (1) structured policy features are sampled using standard stochastic models (log-normal for house value, Beta for house age and risk, Bernoulli for wall type, empirical distribution for credit score); (2) a key latent variable, Roof Health, is deterministically computed from a nonlinear combination of released features, then categorized as Good, Fair, or Bad at adaptive quantile thresholds; (3) for each policy, a synthetic 1024×1024 aerial roof image is created via detailed text-to-image prompts to encode the designated Roof Health state through style, color, surface, edge, and debris descriptors.
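
The first two stages can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the authors' generator: the LogNormal(12.9, 0.45) and Beta(2, 5) parameters are the ones quoted in the glossary, and the 55%/80% cut points are the ones mentioned in the knowledge-gaps list, but the house-age shape, the wall-type probability, the credit-score stand-in, the `roof_score` formula, and the mapping of cut points to Good/Fair/Bad are all assumptions made for the example. Stage 3 (image generation) is omitted.

```python
import random

def sample_policy(rng):
    """Draw one policy's tabular features (several parameters are illustrative)."""
    return {
        "house_value": rng.lognormvariate(12.9, 0.45),   # parameters quoted in the glossary
        "area_risk": rng.betavariate(2, 5),              # parameters quoted in the glossary
        "house_age": 100 * rng.betavariate(2, 3),        # ASSUMED shape and 100-year scale
        "wall_type": "Wood" if rng.random() < 0.5 else "Brick",  # ASSUMED p = 0.5
        # ASSUMED stand-in for the empirical FICO distribution, clipped to 300-850
        "credit_score": min(850, max(300, round(rng.gauss(715, 85)))),
    }

def roof_score(policy):
    """HYPOTHETICAL nonlinear score; a higher value means a worse roof."""
    age = policy["house_age"] / 100
    credit = (policy["credit_score"] - 300) / 550
    wood = 1.0 if policy["wall_type"] == "Wood" else 0.0
    return age ** 2 + 0.5 * policy["area_risk"] + 0.2 * wood - 0.3 * credit

def assign_roof_health(scores, cuts=(0.55, 0.80)):
    """Label scores Good/Fair/Bad using empirical quantile thresholds."""
    ordered = sorted(scores)
    n = len(ordered)
    fair_from = ordered[int(cuts[0] * n)]   # scores at or above this are at least Fair
    bad_from = ordered[int(cuts[1] * n)]    # scores at or above this are Bad
    return ["Good" if s < fair_from else "Fair" if s < bad_from else "Bad"
            for s in scores]

rng = random.Random(0)
policies = [sample_policy(rng) for _ in range(2000)]
labels = assign_roof_health([roof_score(p) for p in policies])
```

Because the thresholds are taken from the sample itself, the class shares stay fixed at roughly 55% / 25% / 20% regardless of how the score is scaled.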

Figure 1: Synthetic roof images faithfully encode the latent variable RoofHealth (Good, Fair, Bad), which is inaccessible from tabular data but recoverable from domain-specific visual cues.

The outcome target, next-year total loss $Y_p$, is simulated using a compound model with a negative binomial for frequency conditioned on both structured features and the latent roof health, and a gamma model for severity, again modulated by roof state. Importantly, the only viable pathway for recovering the true risk determinant (Roof Health) is via image modality reasoning, a scenario where human practitioners systematically outperform naive analytical pipelines.
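
A compound frequency-severity draw of this kind can be sketched as follows. The negative binomial with $r=10$ and a given mean, and the gamma severity with shape 2 and scale $\exp(\mu_p)/2$, follow the formulas quoted in the glossary. Everything that links roof health to $\lambda_p$ and $\mu_p$ (the `FREQ_MULT` and `SEV_SHIFT` tables, `base_rate`, `base_log_severity`) is a hypothetical placeholder, since the summary does not give those equations.

```python
import math
import random

# HYPOTHETICAL effect sizes; the summary only says roof state modulates both parts.
FREQ_MULT = {"Good": 1.0, "Fair": 1.5, "Bad": 2.5}
SEV_SHIFT = {"Good": 0.0, "Fair": 0.2, "Bad": 0.4}

def sample_poisson(rng, rate):
    """Knuth's multiplication method; adequate for the small rates used here."""
    limit = math.exp(-rate)
    count, product = 0, rng.random()
    while product > limit:
        count += 1
        product *= rng.random()
    return count

def sample_neg_binomial(rng, mean, r=10):
    """NegBinom(r, mean) as a gamma-Poisson mixture: variance = mean + mean**2 / r."""
    if mean <= 0:
        return 0
    rate = rng.gammavariate(r, mean / r)   # shape r, scale mean / r, so E[rate] = mean
    return sample_poisson(rng, rate)

def simulate_loss(rng, roof_health, base_rate=0.1, base_log_severity=math.log(5000)):
    """Total next-year loss for one policy: sum of N gamma-distributed claim sizes."""
    lam = base_rate * FREQ_MULT[roof_health]            # claim frequency mean (assumed form)
    mu = base_log_severity + SEV_SHIFT[roof_health]     # log mean severity (assumed form)
    n_claims = sample_neg_binomial(rng, lam)
    # Gamma(k=2, theta=exp(mu)/2) has mean exp(mu)
    return sum(rng.gammavariate(2, math.exp(mu) / 2) for _ in range(n_claims))
```

With low claim frequencies most simulated policies have a loss of exactly zero, which is one reason a rank-based metric is a natural choice for evaluation.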

Figure 2: Data generation schema for property insurance, where critical latent variables (dotted boxes) are omitted from features and modulate both claim frequency and severity.

Empirical Evaluation: Agentic AI Pipelines Versus Human Data Scientists

To operationalize the core comparison, the authors establish three experimental tiers: (1) a standard agentic AI workflow in which models are trained solely on tabular data via generic code generation (simulating current LLM agentic pipelines as in (Li et al., 2024) and (Jiang et al., 18 Feb 2025)); (2) human data scientist approaches that leverage multimodal data, including image-embedding extraction (CLIP vectors used directly or after clustering), VLM-based Roof Health annotation (using gpt-4o-mini), and perfect image-based labeling (oracle-level human expertise); (3) a true oracle with access to the underlying generative equations and latent state.
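
To make the "clustered usage" variant of the second tier concrete, the sketch below turns precomputed image embeddings into a single categorical feature and appends it to each tabular row. It assumes the embeddings have already been extracted (for example by a CLIP image encoder) and uses plain k-means as the clustering step; the summary does not say which clustering algorithm the authors used, and the toy vectors and rows are made up for illustration.

```python
import random

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans_labels(vectors, k, iterations=20, seed=0):
    """Plain k-means; returns one cluster id per input vector."""
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(vectors, k)]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: nearest centroid for every vector
        labels = [min(range(k), key=lambda c: squared_distance(v, centroids[c]))
                  for v in vectors]
        # Update step: move each centroid to the mean of its members
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:  # keep the old centroid if a cluster went empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Toy stand-ins: 2-D "embeddings" and two tabular columns per policy
embeddings = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
tabular_rows = [[250000, 12], [310000, 40], [180000, 7],
                [420000, 55], [275000, 63], [390000, 48]]

cluster_ids = kmeans_labels(embeddings, k=2)
# Fused rows: tabular features plus the image-derived cluster id
fused_rows = [row + [cid] for row, cid in zip(tabular_rows, cluster_ids)]
```

The "full embeddings" variant would instead append the whole vector (`row + embedding`), which keeps more of the visual signal at the cost of many more input columns.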

Performance is quantified using the normalized Gini coefficient ($G_{\mathrm{norm}}$), a robust, distribution-agnostic ranking metric standard in insurance analytics.
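
One common way to compute a normalized Gini coefficient is shown below: sort policies by predicted loss, accumulate the actual losses in that order, and divide the resulting Gini by the Gini of a perfect ranking. This is a widely used formulation and is offered as a sketch; the summary does not state the exact variant the authors implemented, and the index-based tie-breaking is a choice made here.

```python
def gini(actual, pred):
    """Unnormalized Gini of `actual` when ranked by `pred` (highest first)."""
    n = len(actual)
    total = sum(actual)
    # Ties in the prediction are broken by original position
    order = sorted(range(n), key=lambda i: (-pred[i], i))
    cumulative, area = 0.0, 0.0
    for i in order:
        cumulative += actual[i]
        area += cumulative / total      # running share of total loss captured
    return (area - (n + 1) / 2) / n     # subtract the random-ordering baseline

def normalized_gini(actual, pred):
    """Gini of the model's ranking divided by the Gini of a perfect ranking."""
    return gini(actual, pred) / gini(actual, actual)

losses = [0.0, 0.0, 1200.0, 0.0, 8500.0, 300.0]
scores = [0.1, 0.2, 0.7, 0.05, 0.9, 0.4]   # hypothetical model output
g_norm = normalized_gini(losses, scores)
```

Only the ordering of the predictions matters, so any monotone rescaling of `scores` leaves the result unchanged. A perfect ranking gives 1, a random one scores near 0, and a fully reversed ranking gives -1.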

Figure 3: Comparison of data scientist and agentic AI approaches—a 2.2× normalized Gini performance gap is observable when domain knowledge is requisite for loss prediction.

The results show that:

Agentic AI (tabular-only): $G_{\mathrm{norm}} = 0.3823$ (minimal signal capture, fails to exploit essential information in images);
Data Scientist Approaches:
- CLIP clustering: $G_{\mathrm{norm}} = 0.5042$ (modest gain; coarse clusters map only loosely onto roof condition),
- Full CLIP embeddings: $G_{\mathrm{norm}} = 0.7719$ (substantially improved),
- Vision-language-model (gpt-4o-mini) roof health extraction: $G_{\mathrm{norm}} = 0.7271$ (high correlation, efficient cross-modal reasoning),
- Perfect roof health labeling: $G_{\mathrm{norm}} = 0.8310$ (near theoretical optimum);
Oracle: $G_{\mathrm{norm}} = 0.8379$ (Bayes-optimal, upper bound).

The results underscore that achievable predictive accuracy depends strongly on the model’s ability to recover non-tabular, latent semantic variables, a capability tied to human domain interpretation that generic agentic AI code pipelines do not currently replicate.

Implications for Agentic AI Design and the Future of Automated Data Science

These findings empirically demonstrate a crucial limitation of extant agentic AI systems for end-to-end data science automation: they are inherently bottlenecked by their inability to reason over, identify, and incorporate hidden or latent factors accessible through domain knowledge or cross-modal inference. The gap between agentic AI and optimal human approaches, exceeding 0.44 in normalized Gini, is significant in practical insurance and risk assessment settings.
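
The gap figures can be checked directly from the reported scores. The calculation below uses only the values in the results list (tabular-only 0.3823, full CLIP embeddings 0.7719, perfect labels 0.8310, oracle 0.8379); the $\Delta$ subscripts are labels introduced here for readability.

```latex
\begin{align*}
\Delta_{\text{perfect labels}} &= 0.8310 - 0.3823 = 0.4487, \\
\Delta_{\text{oracle}}         &= 0.8379 - 0.3823 = 0.4556, \\
\Delta_{\text{full CLIP}}      &= 0.7719 - 0.3823 = 0.3896, \\
\frac{0.8310}{0.3823} &\approx 2.17, \qquad \frac{0.8379}{0.3823} \approx 2.19 .
\end{align*}
```

Both ratios round to the 2.2× quoted in the Figure 3 caption, and both differences exceed 0.44.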

Notably, the gains from multimodal embeddings and recent VLMs narrow the performance gap without closing it. The optimal (oracle) limit is attainable only with complete access to both the generative mechanisms and the latent state, which is out of reach for simple code synthesis or typical LLM agents constrained to tabular workflows.

The results advocate for several avenues for advancing agentic AI in complex, real-world data workflows:

Multi-hop, multimodal reasoning components, effectively fusing structured and unstructured (visual/textual) information.
Active domain knowledge elicitation, hypothesis testing, and cross-modal verification steps within agentic pipelines.
Systematic benchmarks and datasets with hidden critical variables to robustly quantify gains from enhanced reasoning and domain cognition.

Conclusion

The study offers a rigorous, synthetic, but realistic analysis of where current agentic AI falls short compared to expert human data scientists. Strong numerical evidence supports the claim that reliance on generic code generation is inadequate in scenarios requiring identification and integration of latent, domain-encoded variables. This finding motivates architectural advances in agentic AI toward multimodal reasoning, active information seeking, and explicit domain knowledge incorporation as prerequisites for approaching human-level data science performance.

Explain it Like I'm 14

What is this paper about?

This paper asks a simple but important question: Can today’s AI “agents” that automatically write and run data science code do as well as human data scientists? The authors test this by creating a realistic challenge where a key clue is hidden in pictures, not in the spreadsheet of numbers. Humans can notice that clue using common sense and domain knowledge, but a generic AI pipeline that only looks at tables usually cannot.

What questions did the researchers ask?

They focused on two easy-to-understand questions:

If important information is only visible in images (not in the table), do generic AI data science tools miss it?
When humans add domain knowledge from the images, how much better can their predictions get?

How did they study it?

The team built a fake-but-realistic insurance dataset to carefully control the problem and make fair comparisons.

First, here’s the story behind the data:

Imagine an insurance company trying to predict how much money it will pay next year for house damage.
For each house, you get normal spreadsheet columns (like house value, age, wall type, local storm risk, and credit score).
You also get an overhead photo of the home’s roof.
There is a hidden variable called RoofHealth (Good, Fair, or Bad). This matters a lot for future damage, but it is NOT written anywhere in the spreadsheet. It can only be guessed by looking at the roof photo (e.g., missing shingles, damaged ridge lines).

What they compared:

Generic agentic AI approach: Use only the table (no images) with a standard model, like current AI agents often do.
Human-like approaches: Use both the table and the images in sensible ways, such as:
- Turning images into features with a vision model (CLIP) and feeding those features into a prediction model.
- Asking a vision-language model to label RoofHealth from the image, then using that label.
- Using perfect RoofHealth labels, as if an expert human had checked every image.
Oracle (best possible): Use the exact formulas that generated the data plus the true RoofHealth. This sets the upper limit of how well anyone could do.

How they judged success:

They used a ranking score called the normalized Gini coefficient. In principle it runs from -1 to 1, but useful models score between 0 and 1.
Think of it like grading how well your predictions order houses from most risky to least risky. A score near 1 means your ranking is almost perfect; 0 means you’re basically guessing.

What did they find and why is it important?

Key results show a clear pattern: using domain knowledge from images makes a big difference.

Generic AI pipeline (tables only): normalized Gini ≈ 0.38
- This is not great because it ignores the roof photos where the crucial clue lives.
Human-like use of images:
- Clustered image features into rough categories: ≈ 0.50 (some improvement)
- Full image features (CLIP) added to the model: ≈ 0.77 (big jump)
- RoofHealth auto-labeled by a vision-language model: ≈ 0.73 (also strong)
- True RoofHealth labels (as if a careful human labeled every photo): ≈ 0.83 (near the best possible)
Oracle (best achievable): ≈ 0.84

Why this matters:

The jump from ≈ 0.38 (tables only) to ≈ 0.73–0.83 (when full image features or RoofHealth labels are used) shows that domain knowledge and multimodal clues are critical. If the AI ignores the photos, it misses the most important signal.

Why this matters and what’s next?

This study shows that current “agentic” AI for data science often follows a generic, table-only recipe. That works for many tasks, but it falls short when vital information is hidden in other formats (like images). Human data scientists, who know what to look for and how to combine different kinds of data, can do much better.

What this means going forward:

AI tools need to get better at noticing when important clues might be in images, text, audio, or other sources—not just spreadsheets.
They should learn to bring in domain knowledge (like “roof condition affects storm damage risk”) and to extract the right signals from the right data type.
Future research should build AI agents that can ask for the missing pieces, read images intelligently, and reason like a domain expert, so they can close the gap with skilled humans.

Knowledge Gaps

Knowledge gaps, limitations, and open questions

The paper highlights an important performance gap but leaves several areas unresolved. Future research could address the following:

External validity: Does the observed gap persist on real property insurance data with UAV/satellite roof imagery (variable angles, occlusions, weathering, resolution, and noise) rather than prompt-generated images?
Human validation: To what extent can human experts reliably infer RoofHealth from real images (accuracy, inter-rater reliability), and at what time/cost? No user study or annotation experiment is reported.
Agentic AI capability scope: The “agentic AI (generic pipeline)” baseline omits multimodal tool use. How well do state-of-the-art agents that can call vision models, perform end-to-end multimodal learning, or retrieve domain knowledge compare to human experts?
Benchmark breadth: The study uses a single synthetic task/domain. Can a broader benchmark spanning multiple domains and modalities (text, audio, time series, geospatial) with hidden latent factors generalize the conclusions?
Robustness to design choices: How sensitive are results to the strength of the latent factor, class prevalence thresholds (55%/80%), noise level in RoofHealth, and the parameters in the claim frequency/severity processes?
Metric coverage: Normalized Gini is appropriate for ranking, but how do conclusions change under calibration metrics (e.g., calibration slope/ECE), tail-aware losses (quantile/Pinball), MAE/MAPE, or decision-centric metrics (profit/lift curves)?
Model diversity: Only random forests are evaluated. How do gradient boosting (XGBoost/LightGBM/CatBoost), GLMs/GBMs tailored for insurance, zero-inflated and compound models, deep tabular networks, and modern multimodal architectures perform?
End-to-end multimodal learning: Can joint image–tabular models trained end-to-end (e.g., CNN/ViT fusion with tabular MLP/GBM, late/early fusion) close the gap without explicit “domain knowledge” labeling?
Detection of missing modalities: How can an agent autonomously diagnose when tabular features are insufficient and decide to query/use additional modalities (images/text) or domain expertise?
Acquisition of domain knowledge: What strategies enable agents to recognize relevant domain concepts (e.g., RoofHealth) and operationalize them (RAG over actuarial literature, tool invocation for image analysis, structured reasoning/planning)?
Image generation validity: Do prompt-engineered descriptors introduce spurious cues or shortcuts that CLIP/VLMs exploit? Is roof style/color distribution balanced to avoid unintended leakage?
Cross-model variance: How consistent are results across different VLMs (e.g., CLIP variants, BLIP, SigLIP, GPT-4o, Gemini, Claude VLM) and across prompt formulations for label extraction?
Statistical rigor: Results are reported as point estimates. Are differences statistically significant across multiple random seeds, bootstrapped confidence intervals, and repeated train/test splits?
Sample size scaling: With only 2,000 policies, how does the performance gap evolve with larger datasets, different train/test ratios, and class imbalance variations?
Reproducibility and release: Are the dataset, generation prompts, model versions, random seeds, and code publicly released with licenses to enable replication and extension?
Oracle realism: The oracle assumes perfect knowledge of the generative process and RoofHealth. Can a learnable surrogate (e.g., correctly specified GLM for frequency/severity with latent inference) approximate this upper bound in practice?
Fairness and bias: RoofHealth is a function of features including CreditScore. How do models (human or agentic) handle potential socio-economic bias, and what fairness/civil rights implications arise in real applications?
Cost–benefit analysis: What is the economic impact of collecting, storing, and analyzing roof imagery versus gains in predictive performance (ROI, latency, compute cost)?
Distribution shift: How robust are methods to covariate/label shift (e.g., region-specific roofing materials, climate changes, new building codes) and to real-world image artifacts?
Generality of latent placement: Beyond images, how do results change when the crucial latent variable is embedded in other modalities (adjuster notes, sensor streams, text reports), or when multiple interacting latent variables exist?
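
The statistical-rigor question in the list above could be approached with a paired bootstrap over the test set: resample policies with replacement, recompute the normalized Gini of two models on each resample, and read a confidence interval off the distribution of the difference. The sketch below is one possible way to do this, not something reported in the paper; the function names, the percentile interval, and the toy data are all assumptions, and losses are taken to be non-negative.

```python
import random

def normalized_gini(actual, pred):
    """Rank-based normalized Gini; ties in `pred` broken by position."""
    def raw(a, p):
        n, total = len(a), sum(a)
        order = sorted(range(n), key=lambda i: (-p[i], i))
        cumulative, area = 0.0, 0.0
        for i in order:
            cumulative += a[i]
            area += cumulative / total
        return (area - (n + 1) / 2) / n
    return raw(actual, pred) / raw(actual, actual)

def bootstrap_gini_gap(actual, pred_a, pred_b, n_boot=1000, alpha=0.05, seed=0):
    """Percentile CI for G_norm(model A) - G_norm(model B) on paired resamples."""
    rng = random.Random(seed)
    n = len(actual)
    gaps = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        y = [actual[i] for i in idx]
        if len(set(y)) < 2:
            continue  # degenerate resample: the Gini of a constant target is undefined
        a = [pred_a[i] for i in idx]
        b = [pred_b[i] for i in idx]
        gaps.append(normalized_gini(y, a) - normalized_gini(y, b))
    gaps.sort()
    lower = gaps[int((alpha / 2) * len(gaps))]
    upper = gaps[int((1 - alpha / 2) * len(gaps)) - 1]
    return lower, upper

# Toy check: model A ranks perfectly, model B ranks in exactly the wrong order
y_true = [float(v) for v in range(1, 51)]
ci_low, ci_high = bootstrap_gini_gap(y_true, y_true, [-v for v in y_true], n_boot=200)
```

If the interval for the difference excludes zero across seeds and splits, the reported ordering of methods is unlikely to be an artifact of a single train/test partition.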

Practical Applications

Immediate Applications

Below is a concise list of applications that can be deployed now, mapped to relevant sectors and accompanied by key dependencies and assumptions.

Insurance (Property): Roof imagery–augmented underwriting and pricing
- Description: Add a derived “RoofHealth” feature to existing tabular models by extracting signals from aerial or street-view images (e.g., CLIP embeddings, VLM labels). Integrate into GLM/GBM/RF workflows to uplift ranking performance for loss prediction (e.g., normalized Gini).
- Tools/Workflows: Pretrained image encoders (CLIP), lightweight VLMs (e.g., gpt-4o-mini), feature fusion with tabular pipelines, model monitoring with normalized Gini.
- Dependencies/Assumptions: Access to recent high-resolution roof imagery; basic MLOps; legal permissions for image use; image quality variability and geographic coverage.
Insurance (Claims Operations): Event-driven claims triage using pre/post-event roof imagery
- Description: Prioritize inspections and reserves by scoring roof damage likelihood after storms, combining historical policy features with image-derived condition.
- Tools/Workflows: Batch scoring pipelines; damage-change detection; rules plus ML prioritization.
- Dependencies/Assumptions: Timely imagery acquisition; storm footprint maps; operational SLAs; threshold tuning for heavy-tailed losses.
Software/Data Science Enablement: Multimodal feature engineering library for tabular + image fusion
- Description: Package the paper’s approach as a reusable Python library/notebook templates to help DS teams incorporate image features into tabular models.
- Tools/Workflows: CLIP feature extraction, clustering/labeling, RF/GBM integration, normalized Gini evaluator.
- Dependencies/Assumptions: Maintained model weights; governance around image preprocessing; reproducible pipelines.
Human–AI Teaming: Domain-knowledge checklists and agent guardrails
- Description: Introduce prompt/system-level guardrails so DS agents must consider non-tabular modalities and domain latent variables before finalizing models.
- Tools/Workflows: Agent task templates (e.g., “search for latent variables outside tabular data”); review steps; escalation to human experts.
- Dependencies/Assumptions: Organizational buy-in; minimal agent customization; clear escalation paths.
Academia (Teaching/Training): Coursework and labs on multimodal domain-aware data science
- Description: Use the synthetic dataset to teach students how to identify latent variables and fuse image signals with tabular features.
- Tools/Workflows: Course modules; Kaggle-style labs; grading via normalized Gini; ablation exercises.
- Dependencies/Assumptions: Access to teaching compute; permissive licenses for pretrained models; student familiarity with ML basics.
Benchmarking & Evaluation (ML Platforms): Synthetic, controllable tasks to probe agent limitations
- Description: Extend DSBench/MLE-bench with multimodal tasks where key signals reside outside tabular features, benchmarking agents against human baselines.
- Tools/Workflows: Synthetic data generators; challenge suites; leaderboards using normalized Gini.
- Dependencies/Assumptions: Benchmark governance; reproducible seeds; consistent scoring and disclosure policies.
Procurement/Model Governance (Policy & Compliance within firms): Vendor requirements for multimodal capability
- Description: Update RFPs and model risk standards to require evidence of multimodal ingestion and domain-knowledge incorporation for high-stakes predictions.
- Tools/Workflows: Checklists; validation protocols; documentation of feature provenance.
- Dependencies/Assumptions: Internal policy updates; model risk functions; audit trails.
UAV/Imagery Vendors (Energy, Construction, Real Estate): “RoofHealth-as-a-Service” API
- Description: Offer an API that returns standardized roof condition scores from aerial imagery for insurers, property managers, and assessors.
- Tools/Workflows: Inference services; SLAs on latency/accuracy; calibration per geography.
- Dependencies/Assumptions: Reliable data pipelines; client integration; legal compliance (privacy, airspace).
Metrics & Monitoring (Cross-sector DS): Adopt normalized Gini for ranking heavy-tailed targets
- Description: Use normalized Gini alongside AUC and RMSE to monitor rank-quality in claims, fraud, energy outages, and other heavy-tailed tasks.
- Tools/Workflows: Metric modules; dashboards; sensitivity analyses.
- Dependencies/Assumptions: Stakeholder education on metric interpretation; baseline establishment.

Long-Term Applications

These applications require further research, scaling, or development before broad deployment.

Agentic AI (Software): Domain-aware, multimodal DS agents that discover and acquire missing signals
- Description: Agents that autonomously hypothesize latent variables (e.g., RoofHealth), request/collect appropriate modalities (images, text, sensors), and integrate them into models with explainable fusion.
- Potential Products: “Active sensing” DS agents; multimodal AutoML with ontology-guided retrieval.
- Dependencies/Assumptions: Robust multimodal reasoning; data acquisition orchestration; explainability and safety checks; integration with enterprise data lakes.
Cross-Sector Expansion (Healthcare, Agriculture, Energy, Finance): Latent-variable inference from non-tabular streams
- Description: Generalize the paper’s mechanism to radiology + EHR (health risk), crop imagery + weather (yield/loss), asset imagery + telemetry (outage risk), satellite photos + POS (retail footfall).
- Potential Products: Sector-specific condition scoring APIs (e.g., “Transformer Health Score”), multimodal risk platforms.
- Dependencies/Assumptions: Domain-specific ontologies; regulatory approvals (HIPAA, GDPR); high-quality multimodal data; bias mitigation.
Synthetic Data Frameworks (Academia/Industry Consortia): Realistic multimodal benchmarks with controllable latent factors
- Description: Build large, open synthetic suites that mirror hidden-variable challenges across sectors; standardize evaluation protocols.
- Potential Products: Public benchmark hubs; curriculum datasets; testbeds for agents.
- Dependencies/Assumptions: Community governance; dataset documentation; licensing; compute for generation.
Learning at Scale (Vision/ML): Self-supervised and weakly supervised roof-health models trained on real imagery
- Description: Train robust models that infer RoofHealth across geographies, materials, and conditions, reducing dependence on manual labels or synthetic prompts.
- Potential Products: Foundation models for built-environment assessment; fine-tuned VLMs for structural integrity.
- Dependencies/Assumptions: Large diverse datasets; annotation strategies; domain drift handling; continual learning.
Explainability & Auditability (Policy/RegTech): Multimodal explanations for underwriting and claims models
- Description: Develop methods that attribute predictions to image regions/features and tabular variables, meeting regulatory scrutiny for fairness and transparency.
- Potential Products: XAI toolkits for multimodal ranking models; audit dashboards.
- Dependencies/Assumptions: Accepted XAI standards; regulator guidance; human factors research on explanation usability.
Federated & Privacy-Preserving Learning (Insurance, Healthcare): Multimodal modeling without centralized data sharing
- Description: Train across institutions using federated/secure aggregation to protect imagery and PII while improving model robustness.
- Potential Products: Federated multimodal platforms; privacy-enhancing technologies (DP, secure enclaves).
- Dependencies/Assumptions: Partner alignment; network reliability; legal agreements; performance trade-offs.
Edge AI on Drones/IoT (Energy, Construction): Real-time structural condition scoring
- Description: Run compressed VLMs on drones to assess roof/asset health on-site, feeding scores directly into maintenance/underwriting systems.
- Potential Products: Drone-integrated inference kits; onsite triage workflows.
- Dependencies/Assumptions: Hardware acceleration; battery/compute constraints; safety protocols; model robustness in the wild.
Fairness & Policy Frameworks (Public Policy): Standards for multimodal risk scoring
- Description: Establish guidelines on equitable use of imagery in pricing and claims (e.g., avoiding socioeconomic proxies), disclosure, and contestability.
- Potential Products: Industry codes of practice; compliance templates; audit programs.
- Dependencies/Assumptions: Multi-stakeholder input; empirical bias assessments; regulator adoption.
Economic Impact Studies (Industry/Academia): Cost–benefit analysis of multimodal uplift
- Description: Quantify ROI of integrating image-derived features vs. data acquisition/storage costs and operational complexity.
- Potential Products: Decision-support models; procurement playbooks.
- Dependencies/Assumptions: Access to historical outcomes; robust causal analysis; scenario modeling.
Generalized Multimodal Benchmarks for Agentic AI (ML Ecosystem): Standard tasks testing “beyond-tabular” competence
- Description: Expand DSBench/MLE-bench with multimodal, hidden-variable tasks to drive research on agents that reason over domain cues.
- Potential Products: Annual challenges; leaderboard ecosystems; shared baselines.
- Dependencies/Assumptions: Community engagement; reproducibility; funding for hosting and maintenance.

Glossary

Actuarial practice: Professional methods and standards used to model insurance risk and losses. "The construction of our synthetic property insurance dataset is grounded in established actuarial practice and empirical research."
Agentic AI: Autonomous AI systems that plan and execute multi-step workflows, such as automated data science pipelines. "agentic AI that relies on generic analytics workflow falls short of methods that use domain-specific insights."
Bayes-optimal expected loss: The expected loss under the true generative model that minimizes expected error. "This gives the Bayes-optimal expected loss and any remaining error reflects only inherent randomness in claims."
Beta distribution: A continuous probability distribution on [0,1] often used to model proportions or probabilities. "AreaRisk: $X_{\mathrm{risk},p}\sim\mathrm{Beta}(2,5)$ "
Bernoulli distribution: A discrete distribution for a single binary outcome, typically parameterized by a success probability. "WallType: $X_{\mathrm{wall},p}\sim \textrm{Bernoulli}(\text{Wood},\text{Brick})$ "
CLIP: A pretrained model (Contrastive Language–Image Pretraining) that learns joint representations of images and text. "pretrained CLIP model"
CLIP embeddings: Vector representations produced by CLIP’s encoders that capture semantic information from images or text. "Using naive clustering of CLIP embeddings as categorical features provides some improvement"
Compound frequency–severity model: An insurance modeling framework that combines a distribution for claim counts (frequency) with a distribution for claim sizes (severity). "The target outcome, next-year loss, is generated using a compound frequency-severity model."
FICO distribution: The empirical distribution of US credit scores used as a proxy for financial risk. "CreditScore: $X_{\mathrm{cred},p}$ drawn from the US FICO distribution (300–850)."
Gamma distribution: A continuous distribution for positive-valued data, commonly used to model claim severities. "$Z_{p,j} \sim \Gamma\!\bigl(k=2,\;\theta=\exp(\mu_p)/2\bigr)\,.$"
Latent variable: An unobserved variable that influences the data-generating process and outcomes. "The key latent variable is RoofHealth"
Lognormal distribution: A distribution where the logarithm of the variable is normally distributed, often used for skewed monetary values. "HouseValue: $X_{\mathrm{val},p}\sim\mathrm{LogNormal}(12.9,\,0.45)$ "
Multimodal data: Data that combines multiple modalities such as images and tabular features. "improve agentic AI's ability to identify and use domain-specific knowledge from multimodal data sources."
Negative binomial distribution: A discrete distribution for count data that handles overdispersion relative to the Poisson. "$N_p \sim \mathrm{NegBinom}\bigl(r=10,\;\text{mean}=\lambda_p\bigr)\,.$"
Normalized Gini coefficient: A rank-based performance metric (scaled to [-1,1]) widely used in insurance to evaluate predictive models. "We measure predictive performance using normalized Gini coefficient"
Oracle model: A benchmark model that has access to the true generative process and latent variables, representing best achievable performance. "Oracle (Best achievable)"
RoofHealth: A domain-specific categorical variable indicating roof condition (Good, Fair, Bad) that impacts insurance losses. "RoofHealth (latent): compute"
Text-to-image model: A generative model that synthesizes images from natural language prompts. "we use a text-to-image model with engineered prompts"
Vision-language model: A model that jointly processes visual and textual inputs to perform tasks like classification or labeling from images. "vision-language model gpt-4o-mini"

Authors (12)
