DataCrossBench: Unified Cross-Modal Benchmark
- DataCrossBench is a unified benchmark that evaluates end-to-end analysis across heterogeneous structured data and unstructured visual artifacts.
- It employs a human-in-the-loop reverse synthesis pipeline to generate tasks with strict validation and reproducible reference insights.
- The benchmark spans domains like finance, healthcare, IT, and retail, using modular difficulty tiers to simulate realistic cross-modal challenges.
DataCrossBench is a unified benchmark for end-to-end analysis across heterogeneous, cross-modal data sources, explicitly designed to activate high-value information fragmented between directly queryable structured resources (such as SQL and CSV) and “zombie data” embedded within unstructured visual artifacts (e.g., scanned documents, invoice images). Its 200 precisely constructed tasks span domains including finance, healthcare, information technology, and commerce/retail. DataCrossBench combines a layered difficulty taxonomy, comprehensive evaluation metrics, and stringent ground-truth validation, establishing itself as a rigorous testbed for agents aspiring to deliver unified, insight-driven analytics in realistic multi-source data environments (Qi et al., 29 Jan 2026).
1. Construction Pipeline: Human-in-the-Loop Reverse Synthesis
The foundation of DataCrossBench is an iterative, human-in-the-loop “reverse-synthesis” pipeline that integrates LLM automation with expert curation to construct analysis tasks exhibiting realistic inter-source dependencies and verifiable answers. The core procedure (Algorithm 1 in (Qi et al., 29 Jan 2026)) is as follows:
- Insight Extraction: For each seed PDF report, an LLM (e.g., GPT-4o) proposes analytical goals and potential insights, which are then critically vetted by human experts to yield a definitive target insight set.
- Structured Data Synthesis: Conditioned on the vetted insight set, a code-generating LLM is prompted to output Python and SQL scripts that produce:
- CSV tables (each meeting a minimum row count)
- A normalized relational database schema
- JSON/NoSQL dumps and supplementary text notes.
- Enforced constraints include at least four distinct data sources, minimum unique-value and temporal-column criteria, and injected synthetic noise (typos, unit mismatches).
- Scripts are executed to guarantee they derive exactly the target insights.
- Visual Document Generation: The same generating scripts are used to render visual tables and charts as high-fidelity PNG/PDF artifacts, ensuring pixel/data correspondence.
- Iterative Quality Assurance: Automated toolchains check script executability and file integrity. Double-blind human review assesses cross-source necessity, logic chain uniqueness/soundness, and whether manual recomputation matches the reference answer. Tasks scoring below the quality threshold are regenerated.
Each resulting task encapsulates a complete heterogeneous dataset, a formalized analysis goal, and ground-truth insights with reproducible reference code.
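The loop described above can be sketched as a minimal control flow. Note that the stage names and callable interface below are hypothetical illustrations, not the benchmark's actual API; the authoritative procedure is Algorithm 1 in the paper.

```python
def reverse_synthesize(propose, synthesize, verify, review,
                       threshold=8, max_iters=10):
    """Sketch of the human-in-the-loop reverse-synthesis loop.

    Callers inject the LLM/expert stages as callables (hypothetical
    interface): propose() yields the vetted insight set, synthesize()
    generates the heterogeneous dataset, verify() checks the scripts
    reproduce the insights, review() returns a 0-10 quality score.
    """
    insights = propose()                      # LLM proposal + expert vetting
    for _ in range(max_iters):
        task = synthesize(insights)           # scripts -> CSV/SQL/JSON/TXT + images
        if not verify(task, insights):        # outputs must derive the insights
            continue                          # regenerate on mismatch
        if review(task) >= threshold:         # double-blind human review
            return task, insights
        # below threshold: regenerate
    raise RuntimeError("task regeneration did not converge")

# Toy usage with stub stages: first review scores 6 (regenerate), then 9 (accept)
scores = iter([6, 9])
task, insights = reverse_synthesize(
    propose=lambda: frozenset({"I1"}),
    synthesize=lambda ins: {"insights": ins},
    verify=lambda t, ins: t["insights"] == ins,
    review=lambda t: next(scores),
)
```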
2. Dataset Structure and Modality Distribution
DataCrossBench comprises 200 tasks, with meticulously stratified coverage across verticals and difficulty tiers. The distribution is as follows:
| Domain | Easy | Hard | Total |
|---|---|---|---|
| Finance | 25 | 25 | 50 |
| Healthcare | 25 | 25 | 50 |
| Information Technology | 25 | 25 | 50 |
| Commerce/Retail | 25 | 25 | 50 |
| Total | 100 | 100 | 200 |
- Modalities: All tasks include combinations of structured (CSV, SQL, JSON, TXT) and unstructured visual data (scanned tables, charts).
- Difficulty Tiers: “Easy” (100 tasks) restricts to structured/textual sources; “Hard” (100 tasks) requires at least one image-based table or chart (mandating visual table extraction).
Each task folder provides structured files, image data with bounding-box annotations, a goal definition, ground-truth insights, and reproducible reference scripts.
3. Task Taxonomy and Analytical Requirements
Tasks are engineered to test three pivotal cross-modal capabilities: visual table extraction, cross-modal alignment, and multi-step joint reasoning. Requirements by difficulty tier:
- Easy Tier:
- Only structured/text sources involved; no images.
- Demands complex cross-file joins and aggregations (e.g., navigating between CSV, SQL, JSON, and text).
- Multi-hop reasoning such as aggregating data then comparing with reference figures in text files.
- Example:
[sales.csv] → filter → group-by → compare with [budget.csv].
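A toy pandas rendition of this filter → group-by → compare pattern (the column names and figures are invented for illustration):

```python
import pandas as pd

# Miniature stand-ins for sales.csv and budget.csv
sales = pd.DataFrame({"region": ["N", "N", "S"], "amount": [100, 50, 70]})
budget = pd.DataFrame({"region": ["N", "S"], "budget": [120, 90]})

# Aggregate actuals per region, then compare against reference figures
actuals = sales.groupby("region", as_index=False)["amount"].sum()
report = actuals.merge(budget, on="region")
report["over_budget"] = report["amount"] > report["budget"]
```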
- Hard Tier:
- Mandatory inclusion of at least one visual table or chart.
- Visual table extraction: application of image-to-DataFrame methods (e.g., read_image()) to parse tabular content from images.
- Robust cross-modal alignment: normalization strategies to handle OCR artifacts, disambiguate units, harmonize column labels, and link records across modalities.
- Multi-step reasoning chains, exemplified by: extracting a table from an invoice image, matching “Inv#” to SQL keys, joining with payment databases, and producing root-cause analyses.
- Example:
[PDF invoice image] ––OCR––> table → normalize → join [payments.db] → analysis.
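The normalize-and-join step of this chain can be sketched with pandas. The OCR output is simulated here as a DataFrame with typical artifacts (stray whitespace, inconsistent casing, thousands separators); the column names and amounts are invented:

```python
import pandas as pd

# Simulated OCR output from an invoice image, with typical artifacts
ocr_table = pd.DataFrame({
    "Inv#": ["INV-001 ", "inv-002"],       # stray whitespace, mixed case
    "Total ($)": ["1,200", "350"],         # thousands separator, string dtype
})

def normalize(df):
    """Harmonize OCR'd headers/values so they can join against SQL keys."""
    out = df.rename(columns={"Inv#": "invoice_id", "Total ($)": "total_usd"})
    out["invoice_id"] = out["invoice_id"].str.strip().str.upper()
    out["total_usd"] = out["total_usd"].str.replace(",", "").astype(float)
    return out

# Stand-in for the payments database table
payments = pd.DataFrame({"invoice_id": ["INV-001", "INV-002"],
                         "paid_usd": [1200.0, 300.0]})

# Join across modalities and surface the discrepancy for root-cause analysis
joined = normalize(ocr_table).merge(payments, on="invoice_id")
joined["shortfall"] = joined["total_usd"] - joined["paid_usd"]
```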
4. Evaluation Metrics and Scoring Protocol
Evaluation in DataCrossBench uses a 4-dimensional weighted scoring framework, combining factuality, completeness, logical reasoning, and insightfulness:

S = w_F·S_F + w_C·S_C + w_L·S_L + w_I·S_I, where Σᵢ wᵢ = 1.
- Factuality (S_F): blends exact numerical consistency with LLM-based semantic ratings.
- Completeness (S_C): embedding-based coverage of the ground-truth insight space.
- Logic (S_L): LLM-based rating (attribution, inference, comparison) normalized to [0, 1].
- Insightfulness (S_I): assesses non-triviality and novelty via a G-Eval top-5 log-prob aggregation mechanism.
Each dimension is computed per task, with the final score averaged across all 200 tasks.
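The per-task aggregation amounts to a weighted sum over the four dimension scores. The weights below are purely illustrative; the benchmark's actual weight values are not reproduced here:

```python
def task_score(dims, weights):
    """Weighted aggregation over the four evaluation dimensions.

    dims and weights map dimension name -> value; weights must sum to 1,
    as required by the scoring framework.
    """
    assert abs(sum(weights.values()) - 1.0) < 1e-9, "weights must sum to 1"
    return sum(weights[k] * dims[k] for k in dims)

# Illustrative dimension scores and weights (not the benchmark's values)
dims = {"factuality": 0.9, "completeness": 0.8, "logic": 0.7, "insight": 0.6}
weights = {"factuality": 0.4, "completeness": 0.2, "logic": 0.2, "insight": 0.2}
score = task_score(dims, weights)
```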
5. Ground-Truth Verification and Quality Control
The benchmark applies multi-layered ground-truth validation, integrating both computational and manual checks:
- Automated Sanity Checks:
- All reference SQL and Python scripts are executed within isolated sandboxes.
- Output is verified for fidelity to the ground-truth insights.
- File format integrity ensured.
- Double-Blind Human Review:
- Annotators are given access to all structured data and images, along with explicit reasoning chains.
- Scored against criteria: cross-source necessity, logic chain validity, and answer consistency.
- Any task that does not meet the overall quality threshold (8/10) is regenerated.
Interface layouts (Appendix Figs A1–A2) present synchronized data and image views with granular scoring capabilities, supporting transparent adjudication.
6. API Design and Practical Usage
DataCrossBench is distributed as a directory of task folders, each encapsulating all raw data, metadata, and reference solutions. Provided utilities include a Python package datacross that offers:
- TaskLoader: Abstracts file path and metadata loading.
- ImageTableExtractor: Combines Tesseract and Vision Transformer models for grid and cell localization in images.
- Column/unit normalization utilities: Eases OCR/visual–structured harmonization.
- Evaluation driver: Compares agent outputs to ground-truth via the protocol.
Example Pythonic workflow (abridged from the reference pseudocode):
```python
from datacross import TaskLoader, ImageTableExtractor
import pandas as pd
from sqlalchemy import create_engine

# Load a task and parse the table embedded in its first image
task = TaskLoader.load(task_id=7, root_dir="DataCrossBench/")
img_df = ImageTableExtractor.read_image(task.images[0])
img_df.columns = normalize_columns(img_df.columns)  # harmonize OCR'd headers

# Join the extracted visual table with a structured CSV source
sales = pd.read_csv(task.csv_paths['sales'])
merged = sales.merge(img_df, on="invoice_id", how="inner")

# Cross-check against the reference SQL answer
engine = create_engine("sqlite:///" + task.schema_sql.replace(".sql", ".db"))
df_sql = pd.read_sql(task.reference_sql, engine)

# Final outputs must match the ground-truth insights
assert set(extract_insights(merged)) == set(task.gt_insights)
```
Each task includes all supporting scripts (derive.py, derive.sql), annotation files, and an exhaustive metadata record, enabling reproducible, end-to-end evaluation.
7. Significance and Implications
By activating and correlating “zombie data” in realistic cross-modal contexts, DataCrossBench provides a rigorous foundation for benchmarking agents on tasks matching industrial complexity—exemplified across domains such as finance and healthcare. Its design and methodology address critical gaps in existing analytic agent evaluation, especially in visual table extraction, semantic normalization, and cross-source reasoning. The inclusion of calibrated difficulty tiers and verifiable ground-truth pipelines ensures meaningful assessment of agent performance and robustness at scale (Qi et al., 29 Jan 2026).