
TableEval: A Real-World Benchmark for Complex, Multilingual, and Multi-Structured Table Question Answering

Published 4 Jun 2025 in cs.CL (arXiv:2506.03949v3)

Abstract: LLMs have shown impressive progress in natural language processing. However, they still face significant challenges in TableQA, where real-world complexities such as diverse table structures, multilingual data, and domain-specific reasoning are crucial. Existing TableQA benchmarks are often limited by their focus on simple flat tables and suffer from data leakage. Furthermore, most benchmarks are monolingual and fail to capture the cross-lingual and cross-domain variability in practical applications. To address these limitations, we introduce TableEval, a new benchmark designed to evaluate LLMs on realistic TableQA tasks. Specifically, TableEval includes tables with various structures (such as concise, hierarchical, and nested tables) collected from four domains (including government, finance, academia, and industry reports). In addition, TableEval features cross-lingual scenarios with tables in Simplified Chinese, Traditional Chinese, and English. To minimize the risk of data leakage, we collect all data from recent real-world documents. Considering that existing TableQA metrics fail to capture semantic accuracy, we further propose SEAT, a new evaluation framework that assesses the alignment between model responses and reference answers at the sub-question level. Experimental results show that SEAT achieves high agreement with human judgment. Extensive experiments on TableEval reveal critical gaps in the ability of state-of-the-art LLMs to handle these complex, real-world TableQA tasks, offering insights for future improvements. We make our dataset available here: https://github.com/wenge-research/TableEval.

Summary

  • The paper presents TableEval, introducing a robust benchmark that integrates complex table structures, multiple languages, and domain-specific reasoning.
  • It uses SEAT, a structured evaluation framework comparing LLM outputs against human-verified, structured reference answers.
  • Experimental results reveal state-of-the-art models struggle with nested structures and multilingual challenges, highlighting the need for improved table understanding.


Introduction

"TableEval: A Real-World Benchmark for Complex, Multilingual, and Multi-Structured Table Question Answering" presents TableEval, a benchmark designed to address limitations in current Table Question Answering (TableQA) evaluation methods. Existing benchmarks are typically limited by their focus on simple table structures and lack multilingual and multi-domain context, rendering them unsuitable for real-world applications. TableEval aims to fill this gap by introducing a comprehensive evaluation framework that integrates diverse table structures, languages, and domain-specific reasoning. The benchmark includes tables with hierarchical and nested structures sourced from various domains and multilingual scenarios (Simplified Chinese, Traditional Chinese, and English).

Dataset Construction

TableEval's dataset is constructed through a meticulous process, designed to minimize data leakage and ensure relevance to current TableQA challenges. Data is collected from financial reports, industry research, academic papers, and governmental data published in 2024, ensuring the novelty and applicability of the benchmark. The collection process involves:

  1. Tabular Data Collection: Tables are extracted from source documents and reviewed for accuracy, ensuring alignment with the original PDFs. These tables are then categorized into types such as vertical, horizontal, matrix, hierarchical, concise, and nested structures.
  2. Question Generation: Multiple strategies, including Template-Prompted and Role-Prompted questions, are employed to generate diverse question types associated with the tables. To ensure question variety and relevance, K-means clustering is used for sampling and deduplication.
  3. QA Acquisition and Human Annotation: Human annotators verify each QA pair to ensure the accuracy and relevance of questions to table content. Structured answer extraction is employed to improve the precision of evaluations (Figure 1).

    Figure 1: Overview of data collection. (1) Tabular Data Collection, collecting tables from financial reports, industry research, academic papers, and governmental data; (2) Question Generation, using Template-Prompted and Role-Prompted strategies to generate TableQA questions, filtered through clustering and deduplication; (3) QA Acquisition and Human Annotation, iteratively refining answers through LLM consistency checks, human reviews, and structured answer extraction, ensuring accuracy, completeness, and alignment with the original tabular data.
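The K-means sampling and deduplication step described above can be sketched as follows. This is a minimal pure-Python toy that clusters bag-of-words vectors with Euclidean distance and keeps one representative question per cluster; the paper's actual feature representation, distance metric, and clustering configuration are not specified here, and all function names are illustrative.

```python
import math
import random
from collections import Counter

def bow_vector(text, vocab):
    """Bag-of-words count vector over a fixed vocabulary."""
    counts = Counter(text.lower().split())
    return [counts.get(w, 0) for w in vocab]

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def kmeans(vectors, k, iters=20, seed=0):
    """Tiny Lloyd's-algorithm K-means; returns cluster memberships and centroids."""
    rng = random.Random(seed)
    centers = [list(v) for v in rng.sample(vectors, k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for i, v in enumerate(vectors):
            j = min(range(k), key=lambda c: dist(v, centers[c]))
            clusters[j].append(i)
        for j, members in enumerate(clusters):
            if members:  # recompute centroid as the member mean
                centers[j] = [sum(vectors[i][d] for i in members) / len(members)
                              for d in range(len(centers[j]))]
    return clusters, centers

def deduplicate(questions, k):
    """Cluster candidate questions and keep the one closest to each centroid."""
    vocab = sorted({w for q in questions for w in q.lower().split()})
    vectors = [bow_vector(q, vocab) for q in questions]
    clusters, centers = kmeans(vectors, k)
    reps = []
    for j, members in enumerate(clusters):
        if members:
            best = min(members, key=lambda i: dist(vectors[i], centers[j]))
            reps.append(questions[best])
    return reps
```

With `k` set below the number of candidates, near-duplicate phrasings of the same question tend to land in one cluster, so only one survives.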

Evaluation Framework - SEAT

To evaluate LLMs on TableQA tasks, SEAT (Structured Evaluation for Answers in TableQA) is proposed as a framework that uses LLMs to compare generated responses against structured reference answers. SEAT's evaluation involves:

  • Key Answer Extraction: Key answers are extracted from model responses at the sub-question level and compared with structured reference answers to assess semantic alignment.
  • Structured Presentation: Outputs are presented in a JSON format, encapsulating questions, responses, and evaluation records for clarity and verification (Figure 2).

    Figure 2: Overview of our SEAT evaluation method.
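SEAT's sub-question-level comparison might be sketched roughly as below. Note that the real framework relies on an LLM to extract key answers and judge semantic equivalence; this toy version substitutes normalized string matching, and the field names and data are purely illustrative.

```python
import re

def normalize(ans):
    """Lowercase, drop thousands separators, collapse stray punctuation."""
    s = str(ans).lower().strip().replace(",", "")
    return re.sub(r"[^\w.%-]+", " ", s).strip()

def seat_score(reference, extracted):
    """Compare extracted key answers with the structured reference,
    sub-question by sub-question; return per-item records and accuracy."""
    records = []
    for sub_q, ref_ans in reference.items():
        model_ans = extracted.get(sub_q)
        correct = (model_ans is not None
                   and normalize(model_ans) == normalize(ref_ans))
        records.append({"sub_question": sub_q,
                        "reference": ref_ans,
                        "response": model_ans,
                        "correct": correct})
    accuracy = sum(r["correct"] for r in records) / max(len(records), 1)
    return {"accuracy": accuracy, "records": records}

# Hypothetical example: one sub-answer matches after normalization, one does not.
reference = {"revenue_2024": "1,250 million", "growth_rate": "8.4%"}
extracted = {"revenue_2024": "1250 million", "growth_rate": "7.9%"}
result = seat_score(reference, extracted)  # accuracy: 0.5
```

Scoring each sub-question independently, rather than the full answer string, is what lets this style of evaluation give partial credit for partially correct multi-part answers.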

SEAT outperforms traditional metrics by focusing on semantic correctness rather than surface-level string matching, achieving higher agreement with human evaluations.
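Agreement between an automatic judge such as SEAT and human annotators is often quantified with a chance-corrected statistic such as Cohen's kappa. The summary does not specify which agreement metric the paper uses, so the sketch below, with made-up binary verdicts, only illustrates how such a comparison could be computed.

```python
def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labeling the same items."""
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    categories = set(labels_a) | set(labels_b)
    # Observed agreement: fraction of items where the raters coincide.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement by chance, from each rater's marginal label rates.
    expected = sum((labels_a.count(c) / n) * (labels_b.count(c) / n)
                   for c in categories)
    if expected == 1.0:  # degenerate case: both raters use a single label
        return 1.0 if observed == 1.0 else 0.0
    return (observed - expected) / (1 - expected)

# Illustrative correct/incorrect verdicts (1 = answer judged correct),
# not taken from the paper.
seat_verdicts  = [1, 1, 0, 1, 0, 1, 1, 0]
human_verdicts = [1, 1, 0, 1, 1, 1, 1, 0]
kappa = cohens_kappa(seat_verdicts, human_verdicts)
```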

Experimental Results

Tests on TableEval demonstrate significant gaps in the ability of state-of-the-art LLMs to manage complex TableQA tasks:

  • Performance Across Models: Closed-source models, such as Claude 3.5 and GPT-4o, generally outperform open-source ones. However, large-scale open-source models show promise with appropriate scaling.
  • Challenge Areas: LLMs struggle with nested/hierarchical table structures, leading to notable performance drops. Models display difficulties in domain-specific reasoning and handling multilingual data, especially for non-English scenarios (Figure 3).

    Figure 3: Performance of LLMs across table structures.

Implications and Future Directions

TableEval and SEAT highlight critical areas for further research in TableQA:

  • Structure-Aware Representations: Improving models' understanding of complex table structures can enhance performance.
  • Enhanced Multilingual Capability: Developing methods to improve cross-lingual generalization is necessary for broader applicability.
  • Domain Adaptation: Domain-specific fine-tuning or pretraining could bridge existing performance gaps.

Conclusion

TableEval advances the evaluation of LLMs by introducing diverse table structures, multilingual data, and realistic benchmarking conditions. While current models show potential, especially when scaled, significant challenges remain in understanding nested structures and reasoning across diverse domains and languages. SEAT provides a robust evaluation mechanism that aligns well with human judgment, particularly for complex QA tasks. This advancement in benchmarking promises to guide future improvements in LLM capabilities for TableQA (Figure 4).

Figure 4: Task distribution of our TableEval.

Through TableEval and SEAT, the research community is equipped with tools to explore innovative approaches in handling complex, real-world TableQA scenarios.
