CoReTab Framework for Text-to-SQL
- CoReTab is a scalable framework that leverages LLM-generated purpose metadata and compatibility caching to select join-coherent table subsets for SQL queries.
- It employs a three-stage pipeline—dense retrieval, LLM subset selection, and additive join restoration—to address challenges in multi-table text-to-SQL tasks.
- Empirical evaluations on Bird, Spider, and MMQA benchmarks demonstrate significant improvements in retrieval F1, multi-table execution accuracy, and efficiency over prior methods.
CoReTab (COherent REtrieval of Tables for Text-to-SQL) is a scalable, training-free framework designed to address the bottlenecks in multi-table text-to-SQL retrieval, especially in open-book settings where table collections are large, pooled across sources, and devoid of explicit scoping signals or gold foreign-key annotations. The methodology enriches tables with LLM-generated purpose metadata, precomputes lightweight table compatibility scores, and employs a hybrid pipeline—dense retrieval, LLM-based subset selection, and additive joinability restoration—to yield join-coherent, high-precision table sets for downstream SQL generation. CoReTab achieves marked improvements in retrieval F1, multi-table execution accuracy, and efficiency over prior state-of-the-art systems, as demonstrated on Bird, Spider, and MMQA benchmarks (Soliman et al., 19 Jan 2026).
1. Problem Scope and Formalization
CoReTab addresses the multi-table text-to-SQL task defined as follows: Given a query q and a pooled set of tables T, the objective is to select a subset T* ⊆ T that:
- Contains all essential (“gold”) tables necessary to answer q,
- Excludes irrelevant tables to maximize precision,
- Forms a join-coherent schema supporting feasible SQL generation.
The retrieval objective is to select the subset T* that maximizes coverage of the gold tables while minimizing extraneous ones, subject to the structural constraint that T* admits a connected join graph under an approximate joinability criterion.
This setup relaxes the assumptions typical in closed-domain benchmarks (e.g., gold database IDs, explicit foreign-key metadata), favoring realistic data integration scenarios.
2. Table Representation via Purpose Metadata
To mitigate the ambiguity inherent to large table pools, CoReTab introduces LLM-generated “purpose” metadata:
- Each table is serialized into Markdown (header, five data rows) and provided as prompt input to an LLM.
- The LLM produces a succinct paragraph describing the table’s functional purpose or labels it “None” if semantically vacuous.
- The concatenation of the Markdown serialization and the purpose text is embedded into a dense vector using a dense encoder E.
All embeddings are indexed using FAISS, enabling a high-recall dense retrieval step.
This approach systematically enriches raw tabular data with interpretable, semantically discriminative features for retrieval.
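The serialization described above can be sketched as follows. The exact Markdown layout, separator text, and function names are assumptions for illustration; only the ingredients (header, five sample rows, appended purpose paragraph) come from the description.

```python
# Sketch of CoReTab-style table serialization: Markdown header plus up to
# five sample rows, concatenated with the LLM-generated purpose paragraph
# before dense embedding. Names and formatting details are illustrative.

def serialize_table(name, columns, rows, max_rows=5):
    """Render a table as Markdown: title, header row, separator, sample rows."""
    header = "| " + " | ".join(columns) + " |"
    sep = "| " + " | ".join("---" for _ in columns) + " |"
    body = ["| " + " | ".join(str(v) for v in row) + " |" for row in rows[:max_rows]]
    return "\n".join([f"### {name}", header, sep] + body)

def build_retrieval_text(markdown, purpose):
    """Concatenate Markdown and purpose text; this string is what gets embedded."""
    return markdown + "\n\nPurpose: " + purpose

table_md = serialize_table(
    "orders",
    ["order_id", "customer_id", "total"],
    [(1, 7, 19.99), (2, 3, 5.00)],
)
doc = build_retrieval_text(table_md, "Records customer purchase orders.")
print(doc.splitlines()[0])  # → "### orders"
```

A real pipeline would feed `doc` to the dense encoder and index the resulting vector in FAISS.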
3. Table Compatibility Scoring and Caching
Gold foreign-key signals are atypical in federated table pools, so CoReTab constructs a lightweight compatibility cache leveraging column-level heuristics:
- For each table pair (Ti, Tj), define a compatibility score compat(Ti, Tj) reflecting the likelihood of joinability.
- Column-level primitives (for columns c ∈ Ti, c′ ∈ Tj) include:
  - Key-likeness key(c),
  - Subset indicator sub(c′, c), marking whether the values of c′ are contained in those of c,
  - Jaccard similarity J(c, c′),
  - Header similarity h(c, c′): 1 for an exact name match, otherwise the cosine similarity of the column-name embeddings.
- Validity: only unique-column–to–subset matches are considered.
- For valid pairs, a column-pair score s(c, c′) combines the Jaccard and header similarities.
- Table-level compatibility is the maximum column-pair score over all valid pairs: compat(Ti, Tj) = max over (c, c′) of s(c, c′).
All non-trivial (nonzero) compatibilities are cached offline, including argmax join columns, supporting rapid retrieval at inference.
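The heuristics above can be sketched as follows. The primitives (key-likeness, subset validity, Jaccard, header similarity, table-level max with argmax join columns) follow the text; the specific weighting inside the column-pair score is an assumption, and the embedding fallback for header similarity is omitted.

```python
# Illustrative sketch of the column-level compatibility heuristics. Tables are
# lists of (column_name, values) pairs. The 0.5/0.5 weighting is an assumption.

def key_likeness(values):
    """Fraction of distinct values; 1.0 means the column is fully unique."""
    return len(set(values)) / max(len(values), 1)

def is_subset(a, b):
    return set(a) <= set(b)

def jaccard(a, b):
    sa, sb = set(a), set(b)
    return len(sa & sb) / max(len(sa | sb), 1)

def header_sim(name_a, name_b):
    # 1.0 on exact name match; a real system would fall back to cosine
    # similarity of column-name embeddings (omitted in this sketch).
    return 1.0 if name_a == name_b else 0.0

def column_pair_score(col_a, col_b):
    """Score a (unique column <- subset column) candidate join; 0 if invalid."""
    name_a, vals_a = col_a
    name_b, vals_b = col_b
    # Validity: only unique-column-to-subset matches are considered.
    if key_likeness(vals_a) < 1.0 or not is_subset(vals_b, vals_a):
        return 0.0
    return 0.5 * jaccard(vals_a, vals_b) + 0.5 * header_sim(name_a, name_b)

def table_compat(table_a, table_b):
    """Table-level compatibility: best column-pair score plus argmax columns."""
    best, best_pair = 0.0, None
    for ca in table_a:
        for cb in table_b:
            s = column_pair_score(ca, cb)
            if s > best:
                best, best_pair = s, (ca[0], cb[0])
    return best, best_pair

customers = [("customer_id", [1, 2, 3, 4])]
orders = [("customer_id", [1, 1, 3]), ("total", [9.5, 3.0, 7.2])]
score, join_cols = table_compat(customers, orders)
```

Caching `(score, join_cols)` for every nonzero pair offline is what makes the inference-time lookup cheap.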
4. Retrieval-Inference Pipeline
The CoReTab inference pipeline consists of three stages:
| Stage | Input | Output |
|---|---|---|
| Dense Retrieval | Query embedding | Top-K candidate tables |
| LLM Subset Selection | Top-K tables + cached compatibilities | Pruned, join-coherent subset |
| Additive Joinability Restore | Pruned subset + excluded tables | Final set T* |
4.1 Dense Retrieval:
Compute the similarity between the query embedding and each table embedding, and retrieve the top-K tables by descending score. This delivers high recall but low precision.
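A minimal sketch of this stage, using cosine similarity over an in-memory embedding matrix (a real deployment would use the FAISS index; the random embeddings here are stand-ins):

```python
# Dense retrieval sketch: cosine similarity between the query embedding and
# every table embedding, keeping the top-K. Embeddings are random stand-ins.
import numpy as np

def top_k_tables(query_vec, table_vecs, k):
    """Return indices of the k tables with highest cosine similarity."""
    q = query_vec / np.linalg.norm(query_vec)
    t = table_vecs / np.linalg.norm(table_vecs, axis=1, keepdims=True)
    scores = t @ q
    return np.argsort(-scores)[:k]

rng = np.random.default_rng(0)
tables = rng.normal(size=(100, 16))              # 100 table embeddings, d=16
query = tables[42] + 0.01 * rng.normal(size=16)  # query close to table 42
hits = top_k_tables(query, tables, k=10)
print(42 in hits)  # → True
```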
4.2 LLM Subset Selection:
The set is processed using a single instruction-tuned LLM call (e.g., Llama-3-8B-Instruct) with a structured prompt detailing:
- The user query,
- K candidate tables (with names, Markdown, and purpose),
- Precomputed compatibilities for candidate pairs with nonzero cached scores.
The LLM, emulating an SQL schema analyst, follows a five-step policy (understand, judge relevance, judge compatibility, form groups, select group), outputs group indices in JSON form:
```json
{ "group_selection": { "selected_group_index": 0 } }
```
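Mapping such a response back to tables is a one-line JSON parse; the group contents and response string below are illustrative:

```python
# Parse the selector LLM's JSON output and map the chosen group index back to
# table names. Groups and the response string are illustrative examples.
import json

groups = [["orders", "customers"], ["products"]]  # candidate join-coherent groups
llm_response = '{"group_selection": {"selected_group_index": 0}}'

idx = json.loads(llm_response)["group_selection"]["selected_group_index"]
selected = groups[idx]
print(selected)  # → ['orders', 'customers']
```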
4.3 Additive Adjustment:
To safeguard recall, the pipeline performs an additive restoration: for each table in the pruned subset, identify excluded tables with maximal cached compatibility, thresholded by τ. Any such high-compatibility table is added back, producing the final set T*.
This step is purely restorative, not subtractive, mitigating risks from overzealous LLM pruning.
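The restorative step can be sketched as below. The cache format and the value of τ are assumptions; the key property from the text is that the step only adds tables, never removes them.

```python
# Sketch of additive joinability restoration: re-add any excluded candidate
# whose cached compatibility with a selected table meets threshold tau.
# The cache layout (pair-keyed dict) and tau=0.7 are illustrative assumptions.

def additive_restore(selected, candidates, compat_cache, tau=0.7):
    """Union of LLM-selected tables and high-compatibility excluded tables."""
    restored = set(selected)
    excluded = [t for t in candidates if t not in restored]
    for s in selected:
        for t in excluded:
            score = compat_cache.get((s, t), compat_cache.get((t, s), 0.0))
            if score >= tau:
                restored.add(t)  # purely additive: never drops a selected table
    return restored

cache = {("orders", "order_items"): 0.9, ("orders", "logs"): 0.2}
final = additive_restore(
    ["orders", "customers"],
    ["orders", "customers", "order_items", "logs"],
    cache,
)
```

Here `order_items` is restored as a likely bridge table, while the low-compatibility `logs` stays excluded.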
5. Empirical Evaluation and Analysis
Experiments span Bird, Spider, and MMQA, comprising diverse multi-table settings with pooled tables (no db_id). Baselines include DR@K, cross-encoder reranking (DRR@K), agentic multi-step LLM pipelines (ReAct), MIP-based join-aware selection (JAR), and LLM-guided alignment/voting (ARM).
Key metrics:
- Precision, Recall, and F1 on table subset selection against the gold table set,
- Execution EM (end-to-end SQL result match), broken out for single-table queries (EM-1), multi-table queries (EM-multi), and all queries.
- Efficiency: token counts for selection, SQL cost estimates.
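The subset-selection metrics admit a short worked example (table names are illustrative):

```python
# Precision, recall, and F1 of a predicted table set against the gold set.

def prf1(predicted, gold):
    predicted, gold = set(predicted), set(gold)
    tp = len(predicted & gold)
    p = tp / len(predicted) if predicted else 0.0
    r = tp / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

# One extra table ("logs") costs precision but not recall.
p, r, f1 = prf1({"orders", "customers", "logs"}, {"orders", "customers"})
print(round(p, 2), round(r, 2), round(f1, 2))  # → 0.67 1.0 0.8
```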
Notable results:
- F1 gains: Bird +11.4 pts (61.5 vs 50.1 for JAR), Spider +9.6 pts (53.8 vs 44.2), MMQA +9.5 pts (49.9 vs 40.4 for ReAct).
- On Bird, average tables retrieved is reduced by 16–21% versus fixed-K or voting pipelines.
- Multi-table execution accuracy (EM-multi) on Bird is improved by +2.0 pts (38.4% vs ARM’s 36.4%) and +5.0 pts over DR@5; similar patterns hold on MMQA and Spider.
- Token consumption for selection is reduced by 4–5× versus ReAct/ARM, due to the single-call LLM approach.
Qualitative assessment confirms that purpose metadata enhances table disambiguation, compatibility edges foster join-coherent schema formation, and the additive step reliably restores essential “bridge” tables.
6. Ablation, Limitations, and Efficiency
Ablation shows that dense retrieval alone (DR@10) underperforms CoReTab, especially for small selection LLMs on Spider, where CoReTab gains +8.2 pts EM. The pipeline remains robust to the size of the selection LLM.
A table summarizing token usage demonstrates marked efficiency gains:
| Method | Bird Input (M) | Bird Output (M) | Input Relative to CoReTab |
|---|---|---|---|
| ReAct | 43.5 | 1.07 | 4.0× |
| ARM | 51.7 | 0.73 | 4.8× |
| CoReTab | 10.8 | 1.71 | 1.0× (baseline) |
This suggests the approach is cost-effective for large-scale deployments.
7. Context, Impact, and Concluding Remarks
CoReTab’s principal contributions are threefold: systematic purpose enrichment of tables for retrieval, pragmatic compatibility caching to enforce join coherence, and a resource-efficient LLM-guided selection/augmentation protocol. These innovations yield substantial improvements in both retrieval accuracy and downstream SQL execution, closing approximately 25% of the headroom relative to a perfect-recall oracle. The architecture is inherently low-overhead, requiring only a single LLM selection call and lightweight additive repair, distinguishing it from multi-pass, resource-intensive alternatives.
A plausible implication is that further extensions of this methodology (e.g., richer metadata, adaptive compatibility measures) could generalize to broader schema-integration or even heterogeneous knowledge-graph construction settings.
CoReTab reflects a paradigm shift toward pragmatic, hybrid retrieval–reasoning approaches that are robust in open, unlabeled, multi-source table environments (Soliman et al., 19 Jan 2026).