
CoReTab Framework for Text-to-SQL

Updated 3 February 2026
  • CoReTab is a scalable framework that leverages LLM-generated purpose metadata and compatibility caching to select join-coherent table subsets for SQL queries.
  • It employs a three-stage pipeline—dense retrieval, LLM subset selection, and additive join restoration—to address challenges in multi-table text-to-SQL tasks.
  • Empirical evaluations on Bird, Spider, and MMQA benchmarks demonstrate significant improvements in retrieval F1, multi-table execution accuracy, and efficiency over prior methods.

CoReTab (COherent REtrieval of Tables for Text-to-SQL) is a scalable, training-free framework designed to address the bottlenecks in multi-table text-to-SQL retrieval, especially in open-book settings where table collections are large, pooled across sources, and devoid of explicit scoping signals or gold foreign-key annotations. The methodology enriches tables with LLM-generated purpose metadata, precomputes lightweight table compatibility scores, and employs a hybrid pipeline—dense retrieval, LLM-based subset selection, and additive joinability restoration—to yield join-coherent, high-precision table sets for downstream SQL generation. CoReTab achieves marked improvements in retrieval F1, multi-table execution accuracy, and efficiency over prior state-of-the-art systems, as demonstrated on Bird, Spider, and MMQA benchmarks (Soliman et al., 19 Jan 2026).

1. Problem Scope and Formalization

CoReTab addresses the multi-table text-to-SQL task defined as follows: given a query $q$ and a pooled set of tables $\mathcal{T} = \{t_1, \ldots, t_N\}$, the objective is to select a subset $S(q) \subseteq \mathcal{T}$ that:

  • Contains all essential (“gold”) tables $G(q)$ necessary to answer $q$,
  • Excludes irrelevant tables to maximize precision,
  • Forms a join-coherent schema supporting feasible SQL generation.

The retrieval objective is formalized as
$$S(q) = \arg\max_{S \subseteq \mathcal{T}} \Bigl[\lambda \cdot \mathrm{Recall}(S, G(q)) - (1-\lambda) \cdot \mathrm{Irrelevance}(S, G(q))\Bigr]$$
with the structural constraint that $S(q)$ admits a connected join graph under an approximate joinability criterion.

This setup relaxes the assumptions typical in closed-domain benchmarks (e.g., gold database IDs, explicit foreign-key metadata), favoring realistic data integration scenarios.
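The trade-off the objective encodes can be sketched with illustrative set-based definitions (the paper's exact irrelevance penalty and $\lambda$ value are not specified here; both are assumptions):

```python
def recall(S, G):
    """Fraction of gold tables G covered by the selected set S."""
    return len(S & G) / len(G) if G else 0.0

def irrelevance(S, G):
    """Fraction of selected tables that are not gold (illustrative penalty)."""
    return len(S - G) / len(S) if S else 0.0

def objective(S, G, lam=0.7):
    # lam (lambda) trades recall against the irrelevant-table penalty;
    # 0.7 is an arbitrary illustrative choice.
    return lam * recall(S, G) - (1 - lam) * irrelevance(S, G)

gold = {"orders", "customers"}
print(objective({"orders", "customers"}, gold))          # perfect subset
print(objective({"orders", "customers", "logs"}, gold))  # extra table penalized
```

Adding an irrelevant table lowers the score even though recall stays perfect, which is exactly the precision pressure the formalization is after.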

2. Table Representation via Purpose Metadata

To mitigate the ambiguity inherent to large table pools, CoReTab introduces LLM-generated “purpose” metadata:

  • Each table is serialized into Markdown (header plus five data rows) and provided as prompt input to an LLM.
  • The LLM produces a succinct paragraph describing the table’s functional purpose, or labels it “None” if semantically vacuous.
  • The concatenation of the Markdown and purpose text is embedded into a vector $e_t$ by a dense encoder $f_{\mathrm{tbl}}$.

All embeddings $\{e_t\}$ are indexed with FAISS, enabling a high-recall dense retrieval step.

This approach systematically enriches raw tabular data with interpretable, semantically discriminative features for retrieval.
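A minimal sketch of the serialization step, assuming a plain Markdown layout (the paper's exact prompt format is not shown; in practice the resulting string would be embedded by the dense encoder and FAISS-indexed):

```python
def serialize_table(name, columns, rows, max_rows=5):
    """Serialize a table to Markdown: header plus up to five data rows."""
    header = "| " + " | ".join(columns) + " |"
    sep = "| " + " | ".join("---" for _ in columns) + " |"
    body = ["| " + " | ".join(str(v) for v in r) + " |" for r in rows[:max_rows]]
    return "\n".join([f"### {name}", header, sep] + body)

def table_document(name, columns, rows, purpose):
    # Concatenation of the Markdown and the LLM-generated purpose text;
    # this combined string is what gets embedded into e_t.
    return serialize_table(name, columns, rows) + "\n\nPurpose: " + purpose

doc = table_document(
    "orders", ["id", "customer_id", "total"],
    [(1, 7, 19.99), (2, 8, 5.00)],
    "Records each customer order with its monetary total.",
)
print(doc)
```

The table name, columns, and purpose string above are invented examples; only the header-plus-five-rows Markdown shape follows the paper.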

3. Table Compatibility Scoring and Caching

Gold foreign-key signals are atypical in federated table pools, so CoReTab constructs a lightweight compatibility cache leveraging column-level heuristics:

  • For each table pair $(t_i, t_j)$, define a compatibility score $\mathrm{CS}(t_i, t_j) \in [0, 1]$ reflecting the likelihood of joinability.
  • Column-level primitives (for columns $c$, $c'$) include:
    • Key-likeness $u(c)$,
    • Subset indicator $\mathrm{sub}(c, c')$,
    • Jaccard similarity $\mathrm{jac}(c, c')$,
    • Header similarity $\mathrm{name}(c, c') = 0.5\,\mathrm{ex}(c, c') + 0.5\,\mathrm{sem}(c, c')$, with $\mathrm{ex}$ an exact name match and $\mathrm{sem}$ the cosine similarity of column-name embeddings.
  • Validity: only unique-column–to–subset matches are considered.
  • For valid pairs: $s(c, c') = \mathbb{I}[\mathrm{valid}(c, c')] \cdot 0.5\,(\mathrm{jac}(c, c') + \mathrm{name}(c, c'))$.
  • Table-level compatibility: $\mathrm{CS}(t_i, t_j) = \max_{c, c'} s(c, c')$.

All nonzero compatibilities are cached offline, together with the corresponding argmax join columns, supporting rapid lookup at inference.
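The column-level heuristics above can be sketched as follows. The $\mathrm{sem}$ component of header similarity and the key-likeness threshold are stand-ins, and only the $t_1 \to t_2$ direction is scored, all simplifications not taken from the paper:

```python
def key_likeness(vals):
    """u(c): fraction of distinct values; near 1.0 for key-like columns."""
    return len(set(vals)) / len(vals) if vals else 0.0

def jaccard(a, b):
    """jac(c, c'): Jaccard similarity of the two columns' value sets."""
    A, B = set(a), set(b)
    return len(A & B) / len(A | B) if A | B else 0.0

def name_sim(n1, n2):
    # name(c, c') = 0.5*ex + 0.5*sem in the paper; the sem (embedding
    # cosine) component is omitted here, leaving only the exact-match part.
    return 1.0 if n1 == n2 else 0.0

def pair_score(c_vals, c_name, cp_vals, cp_name, key_thresh=0.99):
    # Validity: c must be key-like and c''s values a subset of c's
    # (the 0.99 threshold is illustrative, not from the paper).
    valid = key_likeness(c_vals) >= key_thresh and set(cp_vals) <= set(c_vals)
    if not valid:
        return 0.0
    return 0.5 * (jaccard(c_vals, cp_vals) + name_sim(c_name, cp_name))

def compatibility(t1, t2):
    """CS(t_i, t_j): max over column pairs of s(c, c')."""
    return max(
        (pair_score(v1, n1, v2, n2)
         for n1, v1 in t1.items() for n2, v2 in t2.items()),
        default=0.0,
    )

customers = {"id": [1, 2, 3], "name": ["a", "b", "c"]}
orders = {"customer_id": [1, 1, 2], "total": [5, 7, 9]}
print(compatibility(customers, orders))  # best pair: id <-> customer_id
```

Here the `customers.id` column is unique, `orders.customer_id` is a value subset of it, and the pair's Jaccard overlap drives the table-level score.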

4. Retrieval-Inference Pipeline

The CoReTab inference pipeline consists of three stages:

| Stage | Input | Output |
| --- | --- | --- |
| Dense retrieval | Query embedding $e_q$ | Top-$K$ tables $T_K(q)$ |
| LLM subset selection | $T_K(q)$ + compatibilities | Pruned, coherent set $T_{K'}(q)$ |
| Additive joinability restore | $T_{K'}(q)$, excluded tables | Final set $S(q)$ |

4.1 Dense Retrieval:

Compute $\mathrm{RS}(q, t) = \cos(e_q, e_t)$ for all tables and retrieve the top-$K$ by descending score. This delivers high recall but low precision.
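A brute-force sketch of this scoring step (in CoReTab the nearest-neighbor search is served from the FAISS index rather than a linear scan; the embedding values below are toy inputs):

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def dense_retrieve(e_q, table_embs, k):
    # RS(q, t) = cos(e_q, e_t); keep the top-K tables by descending score.
    scored = sorted(((cosine(e_q, e), t) for t, e in table_embs.items()),
                    reverse=True)
    return [t for _, t in scored[:k]]

embs = {"orders": [1.0, 0.0], "customers": [0.9, 0.4], "logs": [0.0, 1.0]}
print(dense_retrieve([1.0, 0.1], embs, 2))  # -> ['orders', 'customers']
```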

4.2 LLM Subset Selection:

The set $T_K(q)$ is processed with a single call to an instruction-tuned LLM (e.g., Llama-3-8B-Instruct) using a structured prompt detailing:

  • The user query,
  • The $K$ candidate tables (with names, Markdown, and purpose),
  • Precomputed compatibilities for pairs with $\mathrm{CS} > 0$.

The LLM, emulating an SQL schema analyst, follows a five-step policy (understand, judge relevance, judge compatibility, form groups, select group) and outputs the selected group index in JSON form:

```json
"group_selection": { "selected_group_index": 0 }
```

The resulting set $T_{K'}(q)$ is typically much smaller and more coherent.
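Consuming this output might look like the following sketch, where the grouping structure, wrapper object, and helper name are illustrative rather than the paper's exact response schema:

```python
import json

def select_group(llm_output, groups):
    """Map the LLM's selected group index back to candidate table names."""
    payload = json.loads(llm_output)
    idx = payload["group_selection"]["selected_group_index"]
    return groups[idx]

# Hypothetical candidate groups formed during the "form groups" step.
groups = [["orders", "customers"], ["logs"]]
raw = '{"group_selection": {"selected_group_index": 0}}'
print(select_group(raw, groups))  # -> ['orders', 'customers']
```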

4.3 Additive Adjustment:

To safeguard recall, the pipeline performs an additive restoration: for each $t \in T_{K'}(q)$, identify excluded tables $t^*$ with maximal $\mathrm{CS}(t, t^*)$, thresholded by $\tau_{\mathrm{comp}}$. Any such high-compatibility table is added, producing

$$S(q) = T_{K'}(q) \cup T_{\mathrm{comp}}(q)$$

This step is purely restorative, not subtractive, mitigating risks from overzealous LLM pruning.
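A sketch of the restoration step; the threshold value and the pair-keyed cache layout are assumptions for illustration, not taken from the paper:

```python
def additive_restore(selected, excluded, cs_cache, tau=0.8):
    """Restore excluded tables whose compatibility with any kept table
    clears tau_comp. Purely additive: nothing in `selected` is dropped.
    `cs_cache` maps unordered table pairs to precomputed CS scores;
    tau=0.8 is an illustrative threshold."""
    restored = set(selected)
    for t in selected:
        for t_star in excluded:
            if cs_cache.get(frozenset((t, t_star)), 0.0) >= tau:
                restored.add(t_star)
    return restored

cs_cache = {frozenset(("orders", "order_items")): 0.9,
            frozenset(("orders", "logs")): 0.2}
result = additive_restore({"orders"}, {"order_items", "logs"}, cs_cache)
print(sorted(result))  # -> ['order_items', 'orders']
```

Because lookups hit the offline cache, this repair adds no LLM calls at inference time.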

5. Empirical Evaluation and Analysis

Experiments span Bird, Spider, and MMQA, comprising diverse multi-table settings with pooled tables (no db_id). Baselines include DR@K, cross-encoder reranking (DRR@K), agentic multi-step LLM pipelines (ReAct), MIP-based join-aware selection (JAR), and LLM-guided alignment/voting (ARM).

Key metrics:

  • Precision, recall, and F1 on table subset selection:
$$P = \frac{|S(q) \cap G(q)|}{|S(q)|}, \quad R = \frac{|S(q) \cap G(q)|}{|G(q)|}, \quad F1 = \frac{2PR}{P+R}$$
  • Execution EM (end-to-end SQL result match), broken out for single-table ($\mathrm{EM}_{=1T}$), multi-table ($\mathrm{EM}_{\geq 2T}$), and all queries.
  • Efficiency: token counts for selection, SQL cost estimates.
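The set-level metrics can be computed directly from the selected and gold table sets:

```python
def table_prf(S, G):
    """Precision, recall, and F1 of a selected table set S against gold G."""
    tp = len(S & G)
    p = tp / len(S) if S else 0.0
    r = tp / len(G) if G else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

# One spurious table among three selected: recall stays perfect, precision drops.
p, r, f1 = table_prf({"orders", "customers", "logs"}, {"orders", "customers"})
print(p, r, f1)
```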

Notable results:

  • F1 gains: Bird +11.4 pts (61.5 vs 50.1 for JAR), Spider +9.6 pts (53.8 vs 44.2), MMQA +9.5 pts (49.9 vs 40.4 for ReAct).
  • On Bird, average tables retrieved is reduced by 16–21% versus fixed-K or voting pipelines.
  • $\mathrm{EM}_{\geq 2T}$ on Bird improves by +2.0 pts (38.4% vs ARM’s 36.4%) and +5.0 pts over DR@5; similar patterns hold on MMQA and Spider.
  • Token consumption for selection is reduced by 4–5× versus ReAct/ARM, due to the single-call LLM approach.

Qualitative assessment confirms that purpose metadata enhances table disambiguation, compatibility edges foster join-coherent schema formation, and the additive step reliably restores essential “bridge” tables.

6. Ablation, Limitations, and Efficiency

Ablations show that dense retrieval alone (DR@10) underperforms the full CoReTab pipeline; for small LLMs on Spider, CoReTab adds +8.2 EM points. The pipeline remains robust to the size of the selection LLM.

A table summarizing token usage demonstrates marked efficiency gains:

| Method | Bird input (M tokens) | Bird output (M tokens) | Input relative to CoReTab |
| --- | --- | --- | --- |
| ReAct | 43.5 | 1.07 | 4.0× |
| ARM | 51.7 | 0.73 | 4.8× |
| CoReTab | 10.8 | 1.71 | baseline |

This suggests the approach is cost-effective for large-scale deployments.

7. Context, Impact, and Concluding Remarks

CoReTab’s principal contributions are threefold: systematic purpose enrichment of tables for retrieval, pragmatic compatibility caching to enforce join coherence, and a resource-efficient LLM-guided selection/augmentation protocol. These innovations yield substantial improvements in both retrieval accuracy and downstream SQL execution, closing approximately 25% of the headroom relative to a perfect-recall oracle. The architecture is inherently low-overhead, requiring only a single LLM selection call and lightweight additive repair, distinguishing it from multi-pass, resource-intensive alternatives.

A plausible implication is that extensions of this methodology (e.g., richer metadata, adaptive compatibility measures) could generalize to broader schema-integration or even heterogeneous knowledge-graph construction settings.

CoReTab reflects a paradigm shift toward pragmatic, hybrid retrieval–reasoning approaches that are robust in open, unlabeled, multi-source table environments (Soliman et al., 19 Jan 2026).
