CoReTab Framework for Text-to-SQL
- CoReTab is a scalable framework that leverages LLM-generated purpose metadata and compatibility caching to select join-coherent table subsets for SQL queries.
- It employs a three-stage pipeline—dense retrieval, LLM subset selection, and additive join restoration—to address challenges in multi-table text-to-SQL tasks.
- Empirical evaluations on Bird, Spider, and MMQA benchmarks demonstrate significant improvements in retrieval F1, multi-table execution accuracy, and efficiency over prior methods.
CoReTab (COherent REtrieval of Tables for Text-to-SQL) is a scalable, training-free framework designed to address the bottlenecks in multi-table text-to-SQL retrieval, especially in open-book settings where table collections are large, pooled across sources, and devoid of explicit scoping signals or gold foreign-key annotations. The methodology enriches tables with LLM-generated purpose metadata, precomputes lightweight table compatibility scores, and employs a hybrid pipeline—dense retrieval, LLM-based subset selection, and additive joinability restoration—to yield join-coherent, high-precision table sets for downstream SQL generation. CoReTab achieves marked improvements in retrieval F1, multi-table execution accuracy, and efficiency over prior state-of-the-art systems, as demonstrated on Bird, Spider, and MMQA benchmarks (Soliman et al., 19 Jan 2026).
1. Problem Scope and Formalization
CoReTab addresses the multi-table text-to-SQL task defined as follows: Given a query q and a pooled set of tables T, the objective is to select a subset T* ⊆ T that:
- Contains all essential (“gold”) tables necessary to answer q,
- Excludes irrelevant tables to maximize precision,
- Forms a join-coherent schema supporting feasible SQL generation.
The retrieval objective is to select the subset T* that maximizes coverage of the gold tables while minimizing extraneous ones, subject to the structural constraint that T* admits a connected join graph under an approximate joinability criterion.
This setup relaxes the assumptions typical in closed-domain benchmarks (e.g., gold database IDs, explicit foreign-key metadata), favoring realistic data integration scenarios.
2. Table Representation via Purpose Metadata
To mitigate the ambiguity inherent to large table pools, CoReTab introduces LLM-generated “purpose” metadata:
- Each table is serialized into Markdown (header, five data rows) and provided as prompt input to an LLM.
- The LLM produces a succinct paragraph describing the table’s functional purpose or labels it “None” if semantically vacuous.
- The concatenation of the Markdown serialization and the purpose text is embedded into a dense vector using a dense encoder E.
All embeddings are indexed using FAISS, enabling a high-recall dense retrieval step.
This approach systematically enriches raw tabular data with interpretable, semantically discriminative features for retrieval.
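The serialization described above can be sketched as follows. The exact Markdown layout, separator text, and function names are assumptions for illustration; only the ingredients (header, five sample rows, appended purpose paragraph) come from the description.

```python
# Sketch of CoReTab-style table serialization: Markdown header plus up to
# five sample rows, concatenated with the LLM-generated purpose paragraph
# before dense embedding. Names and formatting details are illustrative.

def serialize_table(name, columns, rows, max_rows=5):
    """Render a table as Markdown: title, header row, separator, sample rows."""
    header = "| " + " | ".join(columns) + " |"
    sep = "| " + " | ".join("---" for _ in columns) + " |"
    body = ["| " + " | ".join(str(v) for v in row) + " |" for row in rows[:max_rows]]
    return "\n".join([f"### {name}", header, sep] + body)

def build_retrieval_text(markdown, purpose):
    """Concatenate Markdown and purpose text; this string is what gets embedded."""
    return markdown + "\n\nPurpose: " + purpose

table_md = serialize_table(
    "orders",
    ["order_id", "customer_id", "total"],
    [(1, 7, 19.99), (2, 3, 5.00)],
)
doc = build_retrieval_text(table_md, "Records customer purchase orders.")
print(doc.splitlines()[0])  # → "### orders"
```

A real pipeline would feed `doc` to the dense encoder and index the resulting vector in FAISS.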
3. Table Compatibility Scoring and Caching
Gold foreign-key signals are atypical in federated table pools, so CoReTab constructs a lightweight compatibility cache leveraging column-level heuristics:
- For each table pair (Ti, Tj), define a compatibility score compat(Ti, Tj) reflecting the likelihood of joinability.
- Column-level primitives (for columns c ∈ Ti, c′ ∈ Tj) include:
  - Key-likeness key(c),
  - Subset indicator sub(c′, c), marking whether the values of c′ are contained in those of c,
  - Jaccard similarity J(c, c′),
  - Header similarity h(c, c′): 1 for an exact name match, otherwise the cosine similarity of the column-name embeddings.
- Validity: only unique-column–to–subset matches are considered.
- For valid pairs, a column-pair score s(c, c′) combines the Jaccard and header similarities.
- Table-level compatibility is the maximum column-pair score over all valid pairs: compat(Ti, Tj) = max over (c, c′) of s(c, c′).
All non-trivial (nonzero) compatibilities are cached offline, including argmax join columns, supporting rapid retrieval at inference.
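The heuristics above can be sketched as follows. The primitives (key-likeness, subset validity, Jaccard, header similarity, table-level max with argmax join columns) follow the text; the specific weighting inside the column-pair score is an assumption, and the embedding fallback for header similarity is omitted.

```python
# Illustrative sketch of the column-level compatibility heuristics. Tables are
# lists of (column_name, values) pairs. The 0.5/0.5 weighting is an assumption.

def key_likeness(values):
    """Fraction of distinct values; 1.0 means the column is fully unique."""
    return len(set(values)) / max(len(values), 1)

def is_subset(a, b):
    return set(a) <= set(b)

def jaccard(a, b):
    sa, sb = set(a), set(b)
    return len(sa & sb) / max(len(sa | sb), 1)

def header_sim(name_a, name_b):
    # 1.0 on exact name match; a real system would fall back to cosine
    # similarity of column-name embeddings (omitted in this sketch).
    return 1.0 if name_a == name_b else 0.0

def column_pair_score(col_a, col_b):
    """Score a (unique column <- subset column) candidate join; 0 if invalid."""
    name_a, vals_a = col_a
    name_b, vals_b = col_b
    # Validity: only unique-column-to-subset matches are considered.
    if key_likeness(vals_a) < 1.0 or not is_subset(vals_b, vals_a):
        return 0.0
    return 0.5 * jaccard(vals_a, vals_b) + 0.5 * header_sim(name_a, name_b)

def table_compat(table_a, table_b):
    """Table-level compatibility: best column-pair score plus argmax columns."""
    best, best_pair = 0.0, None
    for ca in table_a:
        for cb in table_b:
            s = column_pair_score(ca, cb)
            if s > best:
                best, best_pair = s, (ca[0], cb[0])
    return best, best_pair

customers = [("customer_id", [1, 2, 3, 4])]
orders = [("customer_id", [1, 1, 3]), ("total", [9.5, 3.0, 7.2])]
score, join_cols = table_compat(customers, orders)
```

Caching `(score, join_cols)` for every nonzero pair offline is what makes the inference-time lookup cheap.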
4. Retrieval-Inference Pipeline
The CoReTab inference pipeline consists of three stages:
| Stage | Input | Output |
|---|---|---|
| Dense Retrieval | Query embedding | Top-K candidate tables |
| LLM Subset Selection | Top-K tables + cached compatibilities | Pruned, join-coherent subset |
| Additive Joinability Restore | Pruned subset + excluded tables | Final set T* |
4.1 Dense Retrieval:
Compute the similarity between the query embedding and each table embedding, and retrieve the top-K tables by descending score. This delivers high recall but low precision.
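A minimal sketch of this stage, using cosine similarity over an in-memory embedding matrix (a real deployment would use the FAISS index; the random embeddings here are stand-ins):

```python
# Dense retrieval sketch: cosine similarity between the query embedding and
# every table embedding, keeping the top-K. Embeddings are random stand-ins.
import numpy as np

def top_k_tables(query_vec, table_vecs, k):
    """Return indices of the k tables with highest cosine similarity."""
    q = query_vec / np.linalg.norm(query_vec)
    t = table_vecs / np.linalg.norm(table_vecs, axis=1, keepdims=True)
    scores = t @ q
    return np.argsort(-scores)[:k]

rng = np.random.default_rng(0)
tables = rng.normal(size=(100, 16))              # 100 table embeddings, d=16
query = tables[42] + 0.01 * rng.normal(size=16)  # query close to table 42
hits = top_k_tables(query, tables, k=10)
print(42 in hits)  # → True
```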
4.2 LLM Subset Selection:
The set is processed using a single instruction-tuned LLM call (e.g., Llama-3-8B-Instruct) with a structured prompt detailing:
- The user query,
- K candidate tables (with names, Markdown, and purpose),
- Precomputed compatibilities for candidate pairs with nonzero cached scores.
The LLM, emulating an SQL schema analyst, follows a five-step policy (understand, judge relevance, judge compatibility, form groups, select group), outputs group indices in JSON form:
```json
{ "group_selection": { "selected_group_index": 0 } }
```
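Mapping such a response back to tables is a one-line JSON parse; the group contents and response string below are illustrative:

```python
# Parse the selector LLM's JSON output and map the chosen group index back to
# table names. Groups and the response string are illustrative examples.
import json

groups = [["orders", "customers"], ["products"]]  # candidate join-coherent groups
llm_response = '{"group_selection": {"selected_group_index": 0}}'

idx = json.loads(llm_response)["group_selection"]["selected_group_index"]
selected = groups[idx]
print(selected)  # → ['orders', 'customers']
```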
4.3 Additive Adjustment:
To safeguard recall, the pipeline performs an additive restoration: for each table in the pruned subset, identify excluded tables with maximal cached compatibility, thresholded by τ. Any such high-compatibility table is added back, producing the final set T*.
This step is purely restorative, not subtractive, mitigating risks from overzealous LLM pruning.
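The restorative step can be sketched as below. The cache format and the value of τ are assumptions; the key property from the text is that the step only adds tables, never removes them.

```python
# Sketch of additive joinability restoration: re-add any excluded candidate
# whose cached compatibility with a selected table meets threshold tau.
# The cache layout (pair-keyed dict) and tau=0.7 are illustrative assumptions.

def additive_restore(selected, candidates, compat_cache, tau=0.7):
    """Union of LLM-selected tables and high-compatibility excluded tables."""
    restored = set(selected)
    excluded = [t for t in candidates if t not in restored]
    for s in selected:
        for t in excluded:
            score = compat_cache.get((s, t), compat_cache.get((t, s), 0.0))
            if score >= tau:
                restored.add(t)  # purely additive: never drops a selected table
    return restored

cache = {("orders", "order_items"): 0.9, ("orders", "logs"): 0.2}
final = additive_restore(
    ["orders", "customers"],
    ["orders", "customers", "order_items", "logs"],
    cache,
)
```

Here `order_items` is restored as a likely bridge table, while the low-compatibility `logs` stays excluded.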
5. Empirical Evaluation and Analysis
Experiments span Bird, Spider, and MMQA, comprising diverse multi-table settings with pooled tables (no db_id). Baselines include DR@K, cross-encoder reranking (DRR@K), agentic multi-step LLM pipelines (ReAct), MIP-based join-aware selection (JAR), and LLM-guided alignment/voting (ARM).
Key metrics:
- Precision, Recall, and F1 on table subset selection against the gold table set,
- Execution EM (end-to-end SQL result match), broken out for single-table queries (EM-1), multi-table queries (EM-multi), and all queries.
- Efficiency: token counts for selection, SQL cost estimates.
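The subset-selection metrics admit a short worked example (table names are illustrative):

```python
# Precision, recall, and F1 of a predicted table set against the gold set.

def prf1(predicted, gold):
    predicted, gold = set(predicted), set(gold)
    tp = len(predicted & gold)
    p = tp / len(predicted) if predicted else 0.0
    r = tp / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

# One extra table ("logs") costs precision but not recall.
p, r, f1 = prf1({"orders", "customers", "logs"}, {"orders", "customers"})
print(round(p, 2), round(r, 2), round(f1, 2))  # → 0.67 1.0 0.8
```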
Notable results:
- F1 gains: Bird +11.4 pts (61.5 vs 50.1 for JAR), Spider +9.6 pts (53.8 vs 44.2), MMQA +9.5 pts (49.9 vs 40.4 for ReAct).
- On Bird, average tables retrieved is reduced by 16–21% versus fixed-K or voting pipelines.
- Multi-table execution accuracy (EM-multi) on Bird is improved by +2.0 pts (38.4% vs ARM’s 36.4%) and +5.0 pts over DR@5; similar patterns hold on MMQA and Spider.
- Token consumption for selection is reduced by 4–5× versus ReAct/ARM, due to the single-call LLM approach.
Qualitative assessment confirms that purpose metadata enhances table disambiguation, compatibility edges foster join-coherent schema formation, and the additive step reliably restores essential “bridge” tables.
6. Ablation, Limitations, and Efficiency
Ablation shows that dense retrieval alone (DR@10) underperforms CoReTab, especially for small selection LLMs on Spider, where CoReTab gains +8.2 pts EM. The pipeline remains robust to the size of the selection LLM.
A table summarizing token usage demonstrates marked efficiency gains:
| Method | Bird Input (M) | Bird Output (M) | Input Relative to CoReTab |
|---|---|---|---|
| ReAct | 43.5 | 1.07 | 4.0× |
| ARM | 51.7 | 0.73 | 4.8× |
| CoReTab | 10.8 | 1.71 | 1.0× (baseline) |
This suggests the approach is cost-effective for large-scale deployments.
7. Context, Impact, and Concluding Remarks
CoReTab’s principal contributions are threefold: systematic purpose enrichment of tables for retrieval, pragmatic compatibility caching to enforce join coherence, and a resource-efficient LLM-guided selection/augmentation protocol. These innovations yield substantial improvements in both retrieval accuracy and downstream SQL execution, closing approximately 25% of the headroom relative to a perfect-recall oracle. The architecture is inherently low-overhead, requiring only a single LLM selection call and lightweight additive repair, distinguishing it from multi-pass, resource-intensive alternatives.
A plausible implication is that further extensions of this methodology (e.g., richer metadata, adaptive compatibility measures) could generalize to broader schema-integration or even heterogeneous knowledge-graph construction settings.
CoReTab reflects a paradigm shift toward pragmatic, hybrid retrieval–reasoning approaches that are robust in open, unlabeled, multi-source table environments (Soliman et al., 19 Jan 2026).