TiInsight: Automated LLM-Driven Data Analysis
- TiInsight is an automated, LLM-driven exploratory data analysis system that transforms natural language queries into SQL and visualizations without requiring coding expertise.
- It integrates hierarchical data context encoding with multi-phase pipelines to enhance schema comprehension and recommend accurate visualizations.
- The system achieves robust, industrial-grade performance across various domains such as finance, retail, and science, streamlining EDA workflows.
TiInsight is an automated, LLM-driven exploratory data analysis (EDA) system that enables users to interactively analyze and visualize structured data via natural language queries, without requiring SQL or charting expertise. Its core design integrates hierarchical data context encoding, multi-phase LLM pipelines, and heuristic-augmented chart recommendation, achieving robust cross-domain generalization and industrial-grade performance for production environments. TiInsight has been implemented and open-sourced by PingCAP, with public API specifications and a live deployment interface supporting datasets across finance, retail, and science (Zhu et al., 2024, Zhu et al., 14 Jan 2026).
1. Motivation and System Overview
Exploratory data analysis in real-world data platforms is constrained by four major bottlenecks: the technical barrier of crafting complex SQL queries, selecting effective visualizations, coping with large and ambiguous database schemas, and the inability of existing solutions to generalize across data domains without manual fine-tuning. TiInsight addresses these issues through a zero-touch, automated pipeline that transforms natural language queries into interpretable visualizations, functioning across arbitrary relational schemas and organizational verticals (Zhu et al., 2024, Zhu et al., 14 Jan 2026).
The primary system objective is to automate EDA workflows end-to-end:
- Explicate complex and terse SQL schemas to both LLMs and end users
- Clarify ambiguous or underspecified user queries, often multi-intent
- Generate accurate, executable SQL and refine in response to runtime feedback
- Automatically select and render visualizations appropriate for results
- Support seamless cross-domain deployment without per-dataset tuning
TiInsight’s design is modular but tightly integrated. The major pipeline modules comprise: (i) Hierarchical Data Context (HDC) generation; (ii) question clarification and decomposition; (iii) text-to-SQL generation (TiSQL); and (iv) data visualization recommendation and rendering (TiChart). Each module is exposed as a documented RESTful API and may be operated independently or via a conversational GUI (Zhu et al., 2024, Zhu et al., 14 Jan 2026).
2. Hierarchical Data Context (HDC) and Schema Representation
The HDC architecture is TiInsight’s principal innovation, converting large and heterogeneous SQL schemas into multi-level, natural-language summaries for robust context retrieval. This overcomes context-window and prompt-size constraints inherent to LLMs and facilitates semantic search and schema linking in arbitrarily large databases.
- Column Summaries: Columns are partitioned into fixed-size, vertically grouped blocks. Each group is summarized by prompting an LLM with DDL, column samples, and domain embeddings (where present), yielding succinct NL descriptors.
- Table Summaries: Table description is executed as a map-reduce paradigm—parallel LLM summarization of column groups (Map), then iterative consolidation into a unified table synopsis (Reduce). Table summaries document keys, attribute roles, table type (fact, dimension, bridge), and main entity.
- Table Relationships: Foreign-key and primary-key relationships are discovered via a two-stage coarse-to-fine process: candidate tables are first retrieved by vector similarity on their summaries, then LLMs establish referential integrity and cardinality in CoT prompts, reducing complexity from to for tables.
- Database and Entity Summaries: By extracting the most connected tables from the schema relationship graph, the LLM infers major entities, synthesizes entity-level themes and attributes, and delivers a business-facing database synopsis (Zhu et al., 2024, Zhu et al., 14 Jan 2026).
All HDC layers are persistently stored in a vector database (e.g., TiDB-Vector, Pinecone, or ChromaDB) and are instantly accessible by downstream modules for similarity-based retrieval and prompt construction.
3. EDA Pipeline: Clarification, Decomposition, SQL Synthesis, Visualization
The system pipeline is a multi-stage process summarized as follows:
3.1 HDC Generation
- Triggered asynchronously when a new database is connected
- Parallel generation of schema, table, relationship, and database summaries
- Storage in vector index for downstream retrieval
3.2 Question Clarification and Decomposition
- User question is first passed to an LLM augmented with HDC to:
- Identify ambiguous terms and missing parameters (e.g., metrics, time windows)
- Generate follow-up clarifications
- If is composite, decompose into using a library of standard operating procedures and few-shot retrieval
- Output is a sequence of clarified or atomic sub-questions together with a walkthrough narrative
3.3 Text-to-SQL Generation (TiSQL)
- Schema Filtering: Map-reduce style two-phase filtering—first, coarse retrieval of top-N relevant tables by vector similarity between user question and table HDC; then fine-grained LLM-aided pruning to retain only essential tables and columns.
- SQL Synthesis: Prompts include few-shot NL-to-SQL examples, minimal filtered schema summaries, and the clarified question. The LLM is tasked with keyword-to-column mapping, canonical name formatting, and SQL snippet generation.
- Self-Refinement Chain: Generated SQL is successively refined:
- EXPLAIN-refinement: SQL is parsed and errors are fed back to LLM for automatic correction.
- EXECUTION-refinement: If runtime errors are encountered, these are looped back for LLM repair. Iteration continues until either a correct SQL is produced or a maximum step count is reached.
Performance Metrics: Execution accuracy (EX), defined as the fraction of queries yielding correct result sets, and reward-based valid efficiency score (Bird dataset) are the principal metrics, evaluated on standard benchmarks such as Spider and Bird (Zhu et al., 2024, Zhu et al., 14 Jan 2026).
3.4 Data Visualization (TiChart)
- Heuristic Rule Mapping: Chart type inference is governed by explicit rules:
- Pie chart: single categorical distribution (label/value)
- Bar chart: grouped/aggregated categorical–numeric mapping
- Line chart: time series or continuous axis
- LLM Verification: An LLM is prompted with the rule-suggested chart and output data; it may override or suggest alternatives based on appropriateness.
- Rendering: Visualizations are rendered using D3JS or an internal engine, returning tables if no suitable chart is found.
4. Implementation, Deployment, and System Characteristics
TiInsight’s implementation employs microservice separation for major components (HDC, Clarification, TiSQL, TiChart), each exposing RESTful APIs. The system:
- Provides a web-based, asynchronous GUI at https://tiinsight.chat, supporting chat interaction, job management, and embedding of clarifications with live chart panes.
- Runs in production on Kubernetes with per-pod isolation. LLMs are accessed via external APIs (e.g., OpenAI, Claude), and embeddings are handled via text-embedding-ada-002. Vector databases include TiDB-Vector, Pinecone, or Chroma.
- Achieves typical latencies of 10–15 s (one-time HDC generation per database), 2–3 s per SQL query, and under 500 ms for chart rendering. Cost per NL→SQL→chart end-to-end run is $0.30–0.40 with GPT-4,$0.02 using lower-cost LLMs (Zhu et al., 2024, Zhu et al., 14 Jan 2026).
5. Evaluation and Comparative Performance
Extensive benchmarking and user studies demonstrate that TiInsight matches or approaches state-of-the-art baselines, often outperforming rule- or PLM-based methods on real enterprise schemas:
- Spider benchmark: TiSQL + GPT-4 achieves 86.3% execution accuracy, 0.3% below the top-performing LLM-based system (DAIL-SQL+GPT-4+self-consistency), but with superior efficiency (Zhu et al., 2024).
- Bird benchmark: TiSQL attains 60.98% execution accuracy, tying or beating expensive fine-tuned models.
- End-to-end success rates for public tasks exceed 82% across finance, retail, and gaming domains (Zhu et al., 2024).
- In controlled user studies, TiInsight closely matches human-expert scores in relevance and completeness, and outperforms zero-shot and metadata-prompted LLM baselines in both SQL generation and visualization appropriateness (Zhu et al., 2024).
| System | Exec. Accuracy (Spider) | Exec. Accuracy (Bird) | Notes |
|---|---|---|---|
| Duoquest (rule-based) | 63.5% | — | |
| RESDSQL-3B + NatSQL | 79.9% | — | PLM-based |
| DAIL-SQL+GPT-4 | 86.6% | — | LLM-based SOTA |
| TiSQL + GPT-4 | 86.3% | 60.98% | No fine-tuning; robust cross-domain performance |
TiInsight’s HDC architecture is identified as the key to robust out-of-the-box performance, with LLMs leveraging structured summaries to avoid schema overfitting (Zhu et al., 2024, Zhu et al., 14 Jan 2026).
6. Limitations and Prospective Developments
- Scalability: While HDC reduces complexity, generation over schemas with >10K tables remains non-trivial. Future work involves incremental updating and schema-change triggers.
- Cost: High-quality LLM calls are the dominant expense; research is ongoing into “small-LLM ↔ large-LLM” handoffs and on-prem deployments.
- Visualization Breadth: TiChart currently supports core chart types (pie, bar, line); extensions to scatter, heatmap, boxplot, and geographic representations are planned.
- Intent Understanding: Deeply nested or counterfactual queries challenge current decomposition heuristics; integration with causal reasoning modules and DSLs is under exploration (Zhu et al., 2024, Zhu et al., 14 Jan 2026).
7. Illustrative System Walkthroughs
TiInsight’s operational flow is exemplified by domain-spanning analyses, such as:
- “Impact of Federal Reserve interest rate hikes”:
- System clarifies timeframe, decomposes query into monthly data extraction and correlation, autogenerates respective SQL, and selects a dual-axis line chart visualizing lagged negative correlation between rates and oil prices.
- “Changes in oil prices?”:
- System requests clarification (trend vs. distribution), retrieves monthly averages for the past year, and proposes a line chart for visualization (Zhu et al., 14 Jan 2026).
These workflows underscore the coordinated use of LLM-driven schema abstraction (HDC), interactive clarification, robust SQL synthesis (TiSQL), and heuristic-augmented visualization (TiChart).
TiInsight establishes a practical paradigm for automated, cross-domain EDA, integrating deep LLMs with rigorous schema abstraction and workflow automation. All code, demos, and APIs are open-sourced to accelerate further research and adoption (Zhu et al., 2024, Zhu et al., 14 Jan 2026).