FRTR-Bench: Multimodal Spreadsheet Evaluation
- FRTR-Bench is a benchmark for evaluating multimodal reasoning in enterprise spreadsheets, integrating text data with embedded images such as charts and dashboards.
- It systematically tests cross-sheet retrieval, formula synthesis, and visual extraction tasks through a hybrid approach combining BM25 and cosine similarity.
- Empirical evaluations demonstrate significant accuracy improvements over prior benchmarks, emphasizing the importance of multimodal fusion in spreadsheet intelligence.
FRTR-Bench is a large-scale benchmark specifically constructed for evaluating multimodal reasoning in enterprise-scale spreadsheets. It enables rigorous, comparative assessment of LLMs and multimodal architectures on real-world, complex Excel workbooks containing both tabular data and embedded visual content. Unlike prior text-only or single-sheet benchmarks, FRTR-Bench emphasizes cross-modal retrieval, cross-sheet reasoning, and visual information extraction, serving as an authoritative resource for studies in spreadsheet intelligence and multimodal NLP (Gulati et al., 13 Jan 2026).
1. Dataset Composition and Structure
FRTR-Bench comprises 30 enterprise-grade Excel workbooks, incorporating a total of 155 worksheets (approximately 5.2 sheets per workbook), 656,457 rows, and 3,928,934 cells. Each workbook averages approximately 21,882 rows and 130,964 cells. Visual content consists of 53 embedded images—including charts, receipts, scanned tables, and dashboards—stored as standalone PNG files with captions; each image is embedded via a vision encoder and unified in a cross-modal vector index alongside table-text units. There are 30 cross-sheet formulas and a total of 157 curated queries.
| Metric | Value |
|---|---|
| Workbooks | 30 |
| Worksheets | 155 |
| Rows | 656,457 |
| Cells | 3,928,934 |
| Embedded Images | 53 |
| Cross-Sheet Formulas | 30 |
| Total Questions | 157 |
The benchmark assigns each workbook a difficulty tier based on row count: Easy (<5,000 rows), Medium (5,000–20,000 rows), and Hard (20,000–210,000 rows). Each workbook includes 5.2 questions on average.
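The tier assignment can be expressed as a simple threshold function. This is an illustrative sketch; the function name and the handling of the 20,000-row boundary (which the published Medium and Hard ranges share) are assumptions:

```python
def difficulty_tier(row_count: int) -> str:
    """Map a workbook's row count to a FRTR-Bench difficulty tier.

    Easy: < 5,000 rows; Medium: 5,000-20,000 rows; Hard: above 20,000
    (up to 210,000 in the benchmark). Treating exactly 20,000 as Medium
    is an assumption, since the published ranges share that boundary.
    """
    if row_count < 5_000:
        return "Easy"
    if row_count <= 20_000:
        return "Medium"
    return "Hard"
```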
2. Task Taxonomy
FRTR-Bench queries are systematically designed across five primary reasoning categories:
- Lookup & Retrieval: Direct extraction of values or references from specified cells, e.g., “What is the value in Sheet2!C13?”
- Numeric Aggregation & Formula Synthesis: Computation over data regions, typically requesting synthesized Excel formulas or aggregate results, e.g., “Calculate total revenue for Q4 in 2023.”
- Cross-Sheet References: Integration of data from multiple sheets, requiring composition of references or functions across sheet boundaries, e.g., obtaining consolidated operating income via formulaic aggregation of columns and cells across sheets.
- Visual Value Extraction: Numeric extraction from embedded images such as receipts, e.g., extracting a tax amount visible in a scanned receipt image.
- Visual Pattern & Trend Description: Natural language summarization of data trends found within charts or plot images, e.g., “Describe the trend shown in Chart_007.”
This taxonomy enables evaluation of both strictly textual and multimodal (text-vision) reasoning—an advancement over existing single-modal spreadsheet QA benchmarks.
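To make the answer format concrete, a formula-synthesis response to a cross-sheet query might look like the sketch below. The sheet names, ranges, and formula are hypothetical; only the two-key JSON schema ("reasoning", "answer") comes from the benchmark's prompting protocol:

```python
import json

# Hypothetical model output for a cross-sheet aggregation query. The
# "reasoning"/"answer" keys follow the benchmark's JSON output schema,
# while the sheet names and cell ranges are invented for illustration.
response = {
    "reasoning": "Consolidated operating income is Q4 revenue on the "
                 "'Revenue' sheet minus Q4 costs on the 'Costs' sheet.",
    "answer": "=SUM(Revenue!D2:D100)-SUM(Costs!D2:D100)",
}
print(json.dumps(response, indent=2))
```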
3. Evaluation Protocol and Metrics
The benchmark mandates standardized evaluation protocols, including:
- Answer Accuracy: Unweighted per-question correctness over the full query set.
- Latency (seconds): Wall-clock time from user query to model response, excluding retrieval time.
- Mean Tokens: Average number of input tokens after retrieval.
- Optional metrics for sub-components (e.g., evidence retrieval):
  - Precision: Fraction of retrieved evidence chunks that are relevant.
  - Recall: Fraction of relevant evidence chunks that are retrieved.
  - F1 Score: Harmonic mean of precision and recall.
Evaluation aggregates unweighted per-question accuracy across the set, and researchers are advised to report error analysis by task category (numeric vs. visual).
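Under these definitions, the core metrics can be computed in a few lines of Python. The function names are illustrative, and exact-match scoring for accuracy is an assumption (the benchmark does not specify its matching rule here):

```python
def answer_accuracy(predictions, gold):
    """Unweighted per-question accuracy: fraction of matching answers.
    Exact string match is an assumption; the benchmark's matching rule
    may be more lenient (e.g., numeric tolerance or judged equivalence)."""
    correct = sum(p == g for p, g in zip(predictions, gold))
    return correct / len(gold)

def retrieval_prf(retrieved, relevant):
    """Precision, recall, and F1 over retrieved vs. gold evidence chunk IDs."""
    retrieved, relevant = set(retrieved), set(relevant)
    tp = len(retrieved & relevant)
    precision = tp / len(retrieved) if retrieved else 0.0
    recall = tp / len(relevant) if relevant else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return precision, recall, f1
```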
4. Benchmark Statistics and Comparison to Prior Work
FRTR-Bench advances over prior spreadsheet benchmarks by a substantial margin in both scope and multimodality. For comparison:
- SpreadsheetLLM: Single-sheet, text-only, average context ≈6,000 tokens, 0 embedded images.
- FRTR-Bench: Multi-sheet, multimodal (text + images), average retrieved context ≈7,800 tokens.
Table: Select Model Performance on FRTR-Bench
| Method | Model | Answer Accuracy | Mean Tokens | Latency (s) |
|---|---|---|---|---|
| SpreadsheetLLM | Claude Sonnet 4.5 | 0.13 | 12,744.6 | 8.80 |
| FRTR | Claude Sonnet 4.5 | 0.74 | 7,690.9 | 11.71 |
| FRTR | GPT-5 | 0.73 | 7,690.9 | 15.50 |
- Prior SOTA (compression-based) peaked at ≈24% across models; FRTR approaches reach 74% with Claude Sonnet 4.5.
- Difficulty breakdown (top models): Easy (0.86–0.93), Medium (0.62–0.65), Hard (0.66–0.72).
- On SpreadsheetLLM, FRTR achieves up to 87% answer accuracy (GPT-5), with token usage halved compared to context-compression methods.
This performance gap highlights the limitations of previous full-context and compression methods and demonstrates the importance of hybrid retrieval and multimodal fusion for complex spreadsheet QA.
5. Technical Workflow and Best Practices
Researchers utilizing FRTR-Bench are recommended to adhere to the following protocol:
- Data Splits: Evaluate on the full set of 157 queries; when tuning, reserve 20% of workbooks for development and 10% for held-out testing, ensuring representation from all difficulty tiers.
- Preprocessing: Decompose each workbook into row, column, and √N×√N sliding window units. Extract PNG images with captions. Serialize units to preserve headers and indices.
- Retrieval Setup: Deploy a hybrid approach using BM25 for lexical retrieval and cosine similarity for dense retrieval. Merge rankings with Reciprocal Rank Fusion (RRF); retrieve the top-20 candidates per modality and select the top-10 fused evidence chunks.
- Embedding Model: Utilize Amazon Titan Multimodal or a comparable joint text-vision encoder for unified embedding of text and images.
- Prompting: Maintain JSON output schema (keys: “reasoning”, “answer”); include image attachments for visual chunks and descriptive text for trend tasks.
- Reporting: Report answer accuracy, mean tokens, latency, and provide error analysis by query category.
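The unit decomposition and RRF fusion steps above can be sketched as follows. Taking N as the sheet's row count, using a non-overlapping stride, and setting k=60 (the common RRF default, not a value specified by the benchmark) are all assumptions:

```python
import math

def window_units(n_rows: int, n_cols: int):
    """Yield (row_start, row_end, col_start, col_end) sliding-window units
    of side sqrt(N) over a sheet, complementing per-row and per-column
    units. Taking N = n_rows and a non-overlapping stride are assumptions.
    """
    side = max(1, math.isqrt(n_rows))
    for r in range(0, n_rows, side):
        for c in range(0, n_cols, side):
            yield (r, min(r + side, n_rows), c, min(c + side, n_cols))

def rrf_fuse(rankings, k=60, top_n=10):
    """Reciprocal Rank Fusion: score(d) = sum over rankings of 1/(k + rank).

    `rankings` holds ranked chunk-ID lists (e.g., BM25 top-20 and dense
    cosine top-20); returns the top_n fused chunk IDs. k=60 is the usual
    RRF default, not a value specified by the benchmark.
    """
    scores = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]
```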
Faithful adherence to this framework ensures methodological consistency and reproducibility for comparative studies on spreadsheet reasoning models.
6. Analysis of Task Performance and Limitations
Empirical results demonstrate the following performance characteristics:
- Numeric lookup and aggregation tasks achieve high accuracy (>90% on the Easy tier).
- Cross-sheet formula reasoning maintains robust accuracy (~70%).
- Visual trend description and image value extraction yield accuracies in the 65–75% range, indicating that multimodal alignment, particularly vision-language interaction, remains an open challenge.
A plausible implication is that enhancements in joint text-vision embedding architectures or context-aware retrieval strategies may further improve visual reasoning scores.
7. Significance and Research Applications
FRTR-Bench establishes a new paradigm for benchmarking spreadsheet intelligence, enabling evaluation of scalable, retrieval-augmented, and multimodal LLM frameworks. Its real-world complexity, multimodal evidence, and rigorous protocol render it foundational for future research in spreadsheet QA, financial automation, and enterprise document understanding. By systematizing both multimodal decomposition and unified retrieval, FRTR-Bench provides a reproducible platform for measuring progress on tasks that bridge structured, semi-structured, and unstructured enterprise data (Gulati et al., 13 Jan 2026).