From Rows to Reasoning (FRTR)
- FRTR is a scalable framework that decomposes large spreadsheets into retrievable row, column, block, and image units, enabling efficient multimodal reasoning.
- The approach employs retrieval-augmented and iterative reasoning pipelines to integrate visual, numeric, and textual data, resulting in substantial accuracy and robustness improvements.
- It leverages reinforcement learning and schema-linking techniques to overcome token limitations and preserve spatial and cross-sheet dependencies in enterprise datasets.
From Rows to Reasoning (FRTR) denotes a family of architectures, methodologies, and benchmarks that enable scalable, interpretable, and auditable reasoning over complex spreadsheets and structured tables—particularly those containing vast numerical data, multi-sheet dependencies, and multimodal content (e.g., embedded images, charts). Across state-of-the-art systems, FRTR reframes what was previously a context-compression or naive serialization problem as an instance of retrieval-augmented, structured, often iterative reasoning over granular table units. The approach supports multimodal input, hybrid retrieval pipelines, and iterative thought schemas, with demonstrated advances in accuracy, efficiency, and robustness over large enterprise-grade data.
1. Motivations and Historical Context
The limitations of prior spreadsheet and table reasoning systems stem primarily from two factors: token window restrictions in transformer-based LLMs and the inability of naïve context compression methods to preserve the structural, spatial, and visual relationships inherent in real-world datasets. Enterprise workbooks regularly exceed 200,000 rows, span multiple sheets with linked formulas, and contain dozens of images (e.g., FRTR-Bench workbooks contain up to 3.93 million cells and 53 embedded charts) (Gulati et al., 13 Jan 2026).
Previous approaches either serialized entire sheets/workbooks (full-context serialization, leading to excessive token counts >13k and severe “lost-middle” effects), or compressed single sheets using simple text encoders (e.g., SheetCompressor), thereby losing cross-sheet and spatial dependencies and failing on tasks demanding visual evidence or global state aggregation (Gulati et al., 13 Jan 2026). These limitations motivated the development of retrieval-first, multimodal, and row-centric architectures capable of decomposing large tables into granular, computationally tractable units.
2. Retrieval-Augmented Multimodal Architecture
A prototypical FRTR pipeline, as implemented in "From Rows to Reasoning: A Retrieval-Augmented Multimodal Framework for Spreadsheet Understanding" (Gulati et al., 13 Jan 2026), decomposes every workbook or table into four retrievable unit types:
- Row Units: Each row serialized with column headers to form minimal evidence slices.
- Column Units: Each column paired with row indices, capturing vertical semantics.
- Block Windows: Fixed-size sliding submatrices that preserve the local spatial context needed to infer range-dependent and localized statistical patterns.
- Image Units: Embedded charts, receipts, or scanned tables provided as fixed-resolution renderings to vision encoders.
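The decomposition of a sheet into row, column, and block units can be sketched as follows. This is an illustrative reconstruction, not the paper's implementation: the unit granularities follow the FRTR description, but the serialization format, field names, and default block size are assumptions.

```python
import pandas as pd

def decompose_sheet(df: pd.DataFrame, sheet: str, block: int = 3):
    """Split one sheet into retrievable row, column, and block units,
    each annotated with provenance metadata (sheet, type, indices)."""
    units = []
    # Row units: each row serialized together with its column headers
    for i, row in df.iterrows():
        text = "; ".join(f"{c}={row[c]}" for c in df.columns)
        units.append({"type": "row", "sheet": sheet, "index": i, "text": text})
    # Column units: each column paired with its row indices
    for c in df.columns:
        text = "; ".join(f"{i}={v}" for i, v in df[c].items())
        units.append({"type": "column", "sheet": sheet, "index": c, "text": text})
    # Block windows: sliding submatrices preserving local spatial context
    for r0 in range(0, len(df), block):
        for c0 in range(0, len(df.columns), block):
            sub = df.iloc[r0:r0 + block, c0:c0 + block]
            units.append({"type": "block", "sheet": sheet,
                          "index": (r0, c0), "text": sub.to_csv()})
    return units
```

Image units would be produced separately by rendering embedded charts at a fixed resolution for the vision encoder; they are omitted here because they carry no tabular structure to decompose.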
Each unit is indexed by a multimodal encoder (Titan Multimodal Embeddings G1), producing a unified embedding vector in a latent space that serves both text and image branches. Consequently, textual queries ("Q4 revenue trend") may retrieve numeric time-series rows or chart images seamlessly.
Query-time retrieval comprises:
- Dense Similarity Search: top-k units ranked by cosine similarity between the query embedding and each unit embedding.
- Lexical BM25 Search: top-k units ranked by term-based lexical matching.
- Reciprocal Rank Fusion (RRF): combines the two rank lists via RRF(u) = Σᵢ 1/(k + rankᵢ(u)), where k is a smoothing constant, favoring stable, interpretable fusion without score calibration.
The top-K fused context units form the evidence set, annotated with provenance metadata (sheet, unit type, indices). Structurally, multimodal integration leverages the shared embedding space to allow direct cross-modal retrieval, encoding both numeric and visual context for downstream generation.
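The rank-fusion step is standard Reciprocal Rank Fusion and can be sketched directly. The smoothing constant k=60 below is the common default from the RRF literature, not necessarily the value used in the FRTR paper.

```python
def rrf_fuse(dense_rank: list, lexical_rank: list, k: int = 60, top_k: int = 5):
    """Fuse a dense-retrieval ranking and a BM25 ranking of unit ids.
    Each unit scores sum(1 / (k + position)) over the lists it appears in;
    no score calibration between the two retrievers is needed."""
    scores = {}
    for ranking in (dense_rank, lexical_rank):
        for pos, uid in enumerate(ranking):
            scores[uid] = scores.get(uid, 0.0) + 1.0 / (k + pos + 1)
    fused = sorted(scores, key=scores.get, reverse=True)
    return fused[:top_k]
```

A unit ranked moderately well by both retrievers typically beats a unit ranked first by only one, which is why RRF is robust to a single noisy retriever.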
3. Iterative Structured Reasoning over Tabular Data
FRTR influences both programmatic and cognitive frameworks for incremental reasoning. The "Table as Thought" methodology (Sun et al., 4 Jan 2025) formalizes reasoning as iterative row-by-row population of a structured thought table, where columns encode context, constraints, and intermediate artifacts, and rows correspond to sequential thought steps or derived sub-states.
Given a query, the schema is designed to expose the necessary informational facets (e.g., Premise, Subgoal, Operation, Result). At each iteration, the LLM reflects on the current table state to propose new rows, with termination determined by completeness and logical constraint satisfaction.
- Self-Verification: Each population step enforces (i) non-nullity for all schema columns, and (ii) satisfaction of hard logical constraints over the populated cells, checked by a verification scoring function.
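The self-verification check can be sketched as a predicate over the table state. This is a minimal illustration, assuming constraints are supplied as callables over the full table; the paper's actual scoring function and helper names are not specified here.

```python
def verify_table(rows: list, schema: list, constraints: list) -> bool:
    """Table-as-Thought style self-verification: (i) every schema column
    must be non-null in every populated row, and (ii) every hard logical
    constraint (a predicate over the whole table state) must hold."""
    for row in rows:
        if any(row.get(col) in (None, "") for col in schema):
            return False  # non-nullity violated
    return all(check(rows) for check in constraints)
```

In the iterative loop, a failed verification would trigger the LLM to revise or re-propose the offending rows before the next population step.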
Iterative approaches such as Row-of-Thought (RoT) (Zhang et al., 21 May 2025) further decompose reasoning into explicit row-wise passes, where each step aligns the model’s attention to the current row:
- Traversal: Across repeated traversals, each reasoning state aggregates the one-pass results and reflection steps, reducing hallucination through explicit scanning of all units.
- Reflection-Based Refinement: After each traversal, the model generates a meta-step "Reflection: is the answer complete?" and updates the next state accordingly.
Ablation studies confirm that iterative and row-wise traversal confers 3–15% accuracy improvements over (i) global chain-of-thought and (ii) cell-level granularities.
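The traversal-plus-reflection loop can be sketched as follows. The `llm_step` and `llm_reflect` callables stand in for model calls and are assumptions for illustration; the real system prompts an LLM at each step.

```python
def row_of_thought(rows, llm_step, llm_reflect, max_passes: int = 3):
    """Row-of-Thought style reasoning: scan every row in order,
    aggregating per-row findings into a running state, then run a
    reflection step ("is the answer complete?") to decide whether
    another full traversal is needed."""
    state = []
    for _ in range(max_passes):
        for row in rows:
            finding = llm_step(row, state)  # per-row reasoning step
            if finding is not None:
                state.append(finding)
        if llm_reflect(state):  # reflection-based refinement gate
            break
    return state
```

Bounding the number of passes (`max_passes`) matters in practice, since each extra traversal consumes context window, which relates to the error-propagation limitation discussed in Section 7.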
4. Schema Linking, Program-of-Thought Generation, and Execution
Table-centric FRTR variants incorporate schema-focused refinement pipelines, as in TableReasoner (Xiong et al., 10 Jul 2025). Here, the raw table is abstracted to a schema summarizing column metadata, types, statistics, semantics, and sampled example rows. Multi-step schema linking then narrows the full schema to a focused sub-schema using:
- Sub-query Parsing: an LLM parses the user query into a sequence of ordered sub-queries.
- Entity Alignment: named entities are mapped to table-cell values via longest common subsequence-based string similarity and LLM selection.
- Column Pruning: the focused schema retains only the columns relevant to the sub-queries and aligned entities.
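The longest-common-subsequence similarity used for entity alignment can be sketched directly; the normalization below (LCS length over the longer string length) is one reasonable choice, not necessarily the paper's exact formula.

```python
def lcs_len(a: str, b: str) -> int:
    """Length of the longest common subsequence, via dynamic programming."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, ca in enumerate(a, 1):
        for j, cb in enumerate(b, 1):
            dp[i][j] = dp[i-1][j-1] + 1 if ca == cb else max(dp[i-1][j], dp[i][j-1])
    return dp[len(a)][len(b)]

def align_entity(entity: str, cell_values: list) -> str:
    """Map a query entity to the most similar table-cell value by
    normalized LCS; in TableReasoner an LLM then confirms the choice."""
    return max(cell_values,
               key=lambda v: lcs_len(entity.lower(), v.lower())
                             / max(len(entity), len(v), 1))
```

Because LCS tolerates insertions (e.g., punctuation or suffixes inside the cell value), it handles partial matches that exact string equality would miss, which is why the final selection is still delegated to an LLM.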
Subsequently, the system generates explicit, executable programs via Chain-of-Thought prompting (Program-of-Thought, PoT), producing Python/pandas code whose output forms the answer. The reasoning workflow is embedded in a ReAct-style loop, iterating "Thought", "Action", and "Observation" steps until the answer is verified or termination criteria are met.
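A toy example of the Program-of-Thought pattern: the planner emits a small pandas program whose executed output becomes the answer. The table, query, and program below are hypothetical stand-ins; in the real system the program is generated by an LLM inside the ReAct loop.

```python
import pandas as pd

# Hypothetical generated program for the query
# "What is total Q4 revenue for region East?"
def generated_program(df: pd.DataFrame):
    q4 = df[(df["quarter"] == "Q4") & (df["region"] == "East")]
    return q4["revenue"].sum()

df = pd.DataFrame({
    "quarter": ["Q3", "Q4", "Q4"],
    "region":  ["East", "East", "West"],
    "revenue": [100, 250, 300],
})
answer = generated_program(df)  # the executed result, not LLM text, is the answer
```

Executing code rather than asking the model to compute in-context sidesteps arithmetic hallucination, and a failed execution produces an "Observation" that the ReAct loop can use to revise the program.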
5. Reinforcement Learning for Table Reasoning
The Reasoning-Table framework (Lei et al., 2 Jun 2025) introduces RL optimization to table reasoning, improving generalization and robustness beyond SFT. After serializing tables and chain-of-thought traces, the RL pipeline leverages:
- Difficulty-controlled sampling: Rollouts are stratified by pass@8 success rate to focus training on "challenging" instances.
- Position Evidence Annotation: The intersection of reasoned cell sets from rollouts forms a robust evidence set, enforced in reward structure.
The final reward combines answer correctness, format compliance (presence of the required reasoning and answer tags), and position-evidence overlap.
RL is applied via Group Relative Policy Optimization (GRPO), optimizing group-relative advantages while penalizing KL divergence from a reference policy. Empirically, the RL-based approach outperforms SFT baselines by 17.36% on TableQA benchmarks and remains robust to table-format and row/column perturbations.
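A reward in this spirit can be sketched as a weighted sum of the three terms. The tag format, the `(row=...)` citation convention, and the weights below are illustrative assumptions, not the values used in Reasoning-Table.

```python
import re

def table_rl_reward(response: str, gold_answer: str, gold_evidence: set,
                    w_fmt: float = 0.1, w_pos: float = 0.2) -> float:
    """Toy reward combining (i) exact-match answer correctness,
    (ii) format compliance (answer tag present), and (iii) overlap
    between cited row positions and the annotated evidence set."""
    m = re.search(r"<answer>(.*?)</answer>", response, re.S)
    fmt_ok = m is not None
    answer = m.group(1).strip() if m else ""
    correct = float(answer == gold_answer)
    cited = set(re.findall(r"\(row=(\d+)\)", response))  # hypothetical citation syntax
    overlap = len(cited & gold_evidence) / max(len(gold_evidence), 1)
    return correct + w_fmt * float(fmt_ok) + w_pos * overlap
```

Under GRPO, this scalar would be computed per rollout and converted into a group-relative advantage across the sampled rollouts for each instance.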
6. Performance Benchmarks and Comparative Analysis
A selection of FRTR frameworks demonstrates substantial scalability and accuracy improvements over prior benchmarks:
| Model/Framework | Benchmark | Accuracy (EM or %) | Token Usage | Notable Features |
|---|---|---|---|---|
| FRTR (Claude 4.5) | FRTR-Bench | 74 | 7.7k (vs 13.1k) | Multimodal retrieval |
| FRTR (GPT-5) | FRTR-Bench | 73 | Comparable | Multimodal retrieval |
| FRTR (GPT-5) | SpreadsheetLLM | 87 | 6.9k (50% reduction) | Token efficiency, all sheets |
| Table as Thought | Calendar Scheduling (GPT-4o) | 74.8 | N/A | Structured table schema design |
| Row-of-Thought | WikiTableQuestions | 78.7 (SOTA) | ~220 | Iterative traversal, reflection |
| Reasoning-Table RL | Unified TableQA | 62.62 | N/A | Robust RL reward function |

FRTR-Bench (Gulati et al., 13 Jan 2026) stresses scalability with tiers exceeding 20k rows; FRTR maintains >0.66 accuracy on hard workbooks, while prior approaches collapse below 0.10. Ablation analyses confirm that accuracy plateaus beyond a moderate retrieval budget, and that iterative verification and multi-row schemas yield clear gains on planning and mathematical benchmarks (Sun et al., 4 Jan 2025). RL-augmented systems maintain generalization on out-of-domain datasets with EM up to 91.33 (Lei et al., 2 Jun 2025).
7. Limitations and Future Research
FRTR pipelines, while scalable and interpretable, face limitations:
- Fixed Retrieval Budgets: Current implementations use a static top-K, block size, and fusion parameters; adaptive retrieval policies are not yet learned (Gulati et al., 13 Jan 2026).
- Fusion Head Absence: Multimodal alignment relies entirely on off-the-shelf embeddings; learned fusion heads or bespoke cross-modal re-rankers may improve chart-series alignment (Gulati et al., 13 Jan 2026).
- Black-box LLM Reasoning: Formula execution and numerical verification are not handled natively in current FRTR pipelines. Integrating lightweight spreadsheet engines or symbolic verifiers is an open research path.
- Schema Complexity vs. Model Capacity: Overly fine-grained schemas may overfit or overwhelm smaller models, requiring a trade-off between expressivity and generalization (Sun et al., 4 Jan 2025).
- Error Propagation in Iteration: Multiple traversals (Row-of-Thought, ReAct loops) improve attention but may incur performance decay for multi-hop/hard questions due to context window exhaustion (Zhang et al., 21 May 2025).
Ongoing directions include dynamic traversal unit selection, schema expressivity scaling, symbolic constraint integration, and extension to multimodal and multi-turn dialogue systems.
FRTR represents a core advance in table and spreadsheet reasoning, combining scalable retrieval, structured iterative reasoning, and robust generalization through multimodal and reinforcement-learning methods, and is benchmarked across real-world, enterprise-scale, and research datasets (Gulati et al., 13 Jan 2026, Sun et al., 4 Jan 2025, Zhang et al., 21 May 2025, Xiong et al., 10 Jul 2025, Lei et al., 2 Jun 2025).