Structured & Unstructured Query Language (SUQL)
- Structured and Unstructured Query Language (SUQL) is a unified query language that combines traditional SQL/Cypher syntax with LLM-driven semantic operators for hybrid data access.
- SUQL extends classical database grammars by incorporating free-text functions, embedding-based filters, and multi-modal predicates to efficiently query unstructured assets.
- Hybrid SUQL systems employ plan–execute–synthesize pipelines, semantic cost models, and advanced indexing techniques to deliver scalable performance in diverse application domains.
A Structured and Unstructured Query Language (SUQL) enables unified, declarative access to both structured databases (e.g., relational tables, property graphs) and unstructured data (e.g., free text, images, video, vector embeddings). SUQL systems integrate the logic and algebraic rigor of classical query languages with the flexibility, retrieval, and reasoning capabilities of LLMs operating over unstructured assets. Recent work establishes formal grammars, algebraic semantics, query-planning toolchains, and cost models that endow SUQL with compositionality, extensibility, and strong empirical performance in conversational assistants, scientific query systems, and hybrid retrieval settings (Liu et al., 2023, Bertorello et al., 2023, Lee et al., 29 Aug 2025, Hassini, 20 Oct 2025, Vangara et al., 14 Jan 2026).
1. Formal Syntax, Semantics, and Algebraic Extensions
SUQL design extends foundational query grammars (SQL, Cypher, Relational Algebra) to admit unstructured data access and reasoning operators:
- Extended Grammar: SUQL queries augment SQL or Cypher with special operators or predicates, e.g., ANSWER/SUMMARY functions that invoke LLMs over a text field (Liu et al., 2023), SEM_WHERE/SEM_SELECT/SEM_JOIN for semantic processing (Lee et al., 29 Aug 2025), or sub-property extraction in property graphs (CypherPlus) (Zhao et al., 2021).
- Algebraic Semantics: The SABER algebra (Lee et al., 29 Aug 2025) generalizes selection (σ), projection (π), join (⋈), grouping (γ), intersection, and difference to semantic variants (σ_sem, π_sem, ⋈_sem), where selection is performed via embedding similarity or LLM-based inference over text or multimedia fields.
- BNF/Pseudocode Fragments: Canonical SUQL grammars include both structured operators (SELECT, WHERE, JOIN) and explicit unstructured/semantic operators, e.g.,
<query> ::= SELECT <proj-list> FROM <table>
[ WHERE <struct-cond> ]
[ SEM_WHERE(<NL-predicate>) ]
[ SEMANTIC_FILTER <sem-cond> ]
The semantic operators invoke LLMs or embedding-based retrieval for row-wise or join predicates, and the results are combined with structured query execution.
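As a rough illustration of how a semantic selection can be combined with a structured condition, the following minimal sketch applies a structured filter first and then a similarity-threshold filter. The `embed` function is a toy bag-of-words stand-in for a real encoder, and the names `sem_select`, `theta`, and the sample rows are assumptions made for this example, not part of any cited system.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding' (assumption: stands in for a real encoder)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def sem_select(rows, text_field, prompt, theta):
    """Semantic selection: keep rows whose text field is similar to the prompt."""
    query_vec = embed(prompt)
    return [r for r in rows if cosine(embed(r[text_field]), query_vec) >= theta]


rows = [
    {"id": 1, "price": 4, "description": "fresh organic apples from local farms"},
    {"id": 2, "price": 9, "description": "stainless steel kitchen knife"},
    {"id": 3, "price": 3, "description": "organic apples and pears"},
]

# Structured condition first (cheap), then the semantic condition.
cheap = [r for r in rows if r["price"] < 5]
hits = sem_select(cheap, "description", "organic apples", theta=0.5)
print([r["id"] for r in hits])
```

In a real deployment the threshold and the encoder would be chosen per workload; the point here is only that the semantic operator behaves like an ordinary row-wise predicate once similarity is reduced to a boolean.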
2. System Architectures and Execution Pipelines
Core system architectures reflect a planner–compiler–executor pipeline augmented for hybrid data access (Lee et al., 29 Aug 2025, Bertorello et al., 2023, Hassini, 20 Oct 2025):
- Parser and Planner: The query is parsed and decomposed into structured and unstructured fragments that are planned via combined schema and NL predicate analysis. In DynaQuery (Hassini, 20 Oct 2025) and dIR (Bertorello et al., 2023), a Schema Introspection and Linking Engine (SILE) and a text→columns LLM pipeline, respectively, organize schema discovery and mapping.
- Execution Engine: Structured sub-queries are handled by native DBMS or graph processing operators; unstructured fragments are executed via LLM calls, embedding-based ANN search, or semantic vector retrieval (e.g., HNSW, IVF_SQ8) (Wu et al., 2022, Ma et al., 9 Jan 2025, Zhao et al., 2021).
- Combiner/Join: Results are combined via user-specified or default fusion operators, e.g. outer joins, intersect, aggregate, or answer synthesis via an LLM (Tan, 2023, Vangara et al., 14 Jan 2026).
A common pattern is "plan–execute–synthesize": LLMs parse and plan, then DBMS and vector search engines retrieve, with final answer synthesis and grounding via LLMs (Vangara et al., 14 Jan 2026).
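The plan–execute–synthesize pattern can be sketched end to end with an in-memory SQLite database. Everything model-related here is stubbed: `plan`, `semantic_match`, and `synthesize` are hypothetical placeholders for LLM or vector-search calls, and the table and data are invented for illustration.

```python
import sqlite3


def plan(question: str) -> dict:
    """Stub planner. Assumption: a real system would have an LLM emit this plan."""
    return {
        "sql": "SELECT id, name, review FROM restaurants WHERE city = ? ORDER BY id",
        "params": ("Palo Alto",),
        "semantic": "kid friendly",
    }


def semantic_match(text: str, predicate: str) -> bool:
    """Stub semantic filter: keyword containment stands in for an LLM or ANN search."""
    lowered = text.lower()
    return all(word in lowered for word in predicate.lower().split())


def execute(conn: sqlite3.Connection, p: dict) -> list:
    """Run the structured fragment in the DBMS, then apply the semantic fragment."""
    rows = conn.execute(p["sql"], p["params"]).fetchall()
    return [r for r in rows if semantic_match(r[2], p["semantic"])]


def synthesize(question: str, rows: list) -> str:
    """Stub synthesis step: a real system would ground an LLM answer in the rows."""
    names = ", ".join(name for _, name, _ in rows) or "no matches"
    return f"{question} -> {names}"


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE restaurants (id INTEGER, name TEXT, city TEXT, review TEXT)")
conn.executemany(
    "INSERT INTO restaurants VALUES (?, ?, ?, ?)",
    [
        (1, "Luna", "Palo Alto", "Very kid friendly, with crayons on every table"),
        (2, "Basalt", "Palo Alto", "Quiet and formal tasting menu"),
        (3, "Harbor", "Oakland", "Kid friendly patio"),
    ],
)

question = "Kid-friendly places in Palo Alto?"
answer = synthesize(question, execute(conn, plan(question)))
print(answer)
```

Keeping the three stages as separate functions mirrors the separation of concerns described above: the structured fragment never leaves the DBMS, and only the rows that survive it reach the (expensive) semantic step.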
3. SUQL Operators, Semantic Primitives, and Illustrative Queries
Leading SUQL systems exhibit the following operator patterns:
- Free-Text Primitives: Functions such as ANSWER(text_col, question) and SUMMARY(text_col) invoke LLMs to extract information from a row’s text field, which can then be filtered, sorted, or projected as in SQL (Liu et al., 2023).
- Embedding-Based Semantic Filters: SEM_WHERE('text about X') evaluates to rows where the encoded text field is similar (above threshold θ) to the prompt embedding (Lee et al., 29 Aug 2025).
- Multi-Modal Predicates: Extensions for cross-modal joins and retrieval (e.g. associating a user's face in an image with their structured records) via SIMILARITY or sub-property extraction (Zhao et al., 2021).
- Hybrid Query Pattern Examples:
- Return orders where the product description mentions "organic apples":
SELECT o.order_id, o.customer_id
FROM customer_orders AS o
JOIN product_descriptions AS p
ON o.product_id = p.product_id
WHERE SEM_WHERE('text is about organic apples')
- Hybrid conversational queries (natural-language to full SUQL program), as in Yelp restaurant search with constraints on cuisine, reviews, and location (Liu et al., 2023).
These operators are integrated into classical logical plans with selection pushdown, join-ordering, and cost-based optimization now extended with semantic cost terms (Lee et al., 29 Aug 2025, Ma et al., 9 Jan 2025).
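One way to see how free-text primitives compose with ordinary SQL is to register them as user-defined functions in SQLite. The sketch below does this with Python's standard `sqlite3` module; the `answer` and `summary` bodies are keyword-based stubs (an assumption standing in for LLM calls), and the schema and rows are invented for illustration rather than taken from the cited work.

```python
import sqlite3


def answer(text: str, question: str) -> str:
    """Stub for ANSWER(text_col, question). Assumption: a real system calls an LLM."""
    if "vegan" in question.lower():
        return "yes" if "vegan" in text.lower() else "no"
    return "unknown"


def summary(text: str) -> str:
    """Stub for SUMMARY(text_col): returns the first sentence only."""
    return text.split(".")[0].strip() + "."


conn = sqlite3.connect(":memory:")
# Register the stubs so they can be called from SQL like built-in functions.
conn.create_function("ANSWER", 2, answer)
conn.create_function("SUMMARY", 1, summary)

conn.execute("CREATE TABLE restaurants (name TEXT, rating REAL, reviews TEXT)")
conn.executemany(
    "INSERT INTO restaurants VALUES (?, ?, ?)",
    [
        ("Green Bowl", 4.6, "Great vegan options. Service was slow."),
        ("Steak Hall", 4.8, "Excellent ribeye. Loud on weekends."),
        ("Tofu House", 3.9, "Vegan friendly menu. Small portions."),
    ],
)

# Structured filter (rating) and free-text filter (ANSWER) in one WHERE clause.
rows = conn.execute(
    "SELECT name, SUMMARY(reviews) FROM restaurants "
    "WHERE rating >= 4.0 AND ANSWER(reviews, 'Does it serve vegan food?') = 'yes' "
    "ORDER BY rating DESC"
).fetchall()
print(rows)
```

Because the free-text functions return plain SQL values, their outputs can be filtered, sorted, or projected exactly like any other column expression.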
4. Optimization, Indexing, and Cost Models for Hybrid Queries
Efficient SUQL execution relies on multi-modal index design, semantic-aware cost models, and logical optimization:
- Indexing Strategies: Semantic vector indexes (e.g., HNSW, IVF), inverted text indexes, and BLOB metadata indexes support efficient retrieval over unstructured modalities (Wu et al., 2022, Zhao et al., 2021).
- Cost Models: Hybrid cost formulas incorporate:
- Embedding extraction costs
- Vector similarity computation costs (e.g. O(ef) for ef search width in HNSW (Wu et al., 2022))
- LLM invocation and token consumption (critical under per-token pricing and context window constraints) (Tan, 2023)
- Operator selectivity over unstructured subspaces and predicted speeds from semantic cache (Zhao et al., 2021)
- Plan Rewriting and Pushdown: Structured predicates and topology operators are pushed ahead of semantic filters where selectivity estimates justify it, minimizing expensive LLM or vector operations (Lee et al., 29 Aug 2025, Zhao et al., 2021, Ma et al., 9 Jan 2025).
CHASE leverages semantic-analysis passes, logical plan rewriting, vectorized code generation, and inlined runtime to achieve 13×–7500× speedups over non-native hybrid query engines (Ma et al., 9 Jan 2025).
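The effect of pushing cheap, selective predicates ahead of expensive semantic ones can be made concrete with a toy cost model. The sketch below uses the textbook rank heuristic for independent filters (order by cost per row divided by the fraction of rows eliminated); it is not the optimizer of any cited system, and the cost units and selectivities are made-up numbers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Predicate:
    name: str
    cost_per_row: float  # hypothetical cost units (e.g., microseconds or tokens)
    selectivity: float   # estimated fraction of rows that pass the predicate


def expected_cost(preds, n_rows: int) -> float:
    """Total cost of applying predicates in order, each on the surviving rows."""
    total, surviving = 0.0, float(n_rows)
    for p in preds:
        total += surviving * p.cost_per_row
        surviving *= p.selectivity
    return total


def rank(p: Predicate) -> float:
    """Cost paid per unit of rows eliminated; lower means 'run earlier'."""
    if p.selectivity >= 1.0:
        return float("inf")  # a predicate that filters nothing goes last
    return p.cost_per_row / (1.0 - p.selectivity)


def order_predicates(preds):
    """Rank-based ordering, assuming predicates are statistically independent."""
    return sorted(preds, key=rank)


naive = [
    Predicate("llm_filter", 500.0, 0.10),
    Predicate("vector_sim", 5.0, 0.20),
    Predicate("price_lt_5", 0.01, 0.30),
]
optimized = order_predicates(naive)

print([p.name for p in optimized])
print(expected_cost(naive, 1000), expected_cost(optimized, 1000))
```

Under these assumed numbers the structured predicate runs first and the LLM filter last, so the LLM only sees the small fraction of rows that survive the cheaper filters.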
5. Empirical Evaluation, Metrics, and Application Domains
SUQL approaches have been evaluated across conversational QA, scientific data access, and industry-scale workloads:
- Quantitative Performance:
- dIR achieves $0.85$ recall and $0.80$ precision on direct hybrid queries, outperforming dense IR and SQL-only baselines (Bertorello et al., 2023).
- SUQL (few-shot GPT-4) is evaluated by EM and F1 on HybridQA, coming within $7.1$ F1 of the SOTA (which is trained on $62$K examples), and by entity return accuracy on Yelp conversations (Liu et al., 2023).
- DynaQuery’s SILE pipeline suppresses SCHEMA_HALLUCINATION failures relative to a RAG baseline, and is evaluated by execution accuracy on Spider and BIRD (Hassini, 20 Oct 2025).
- SABER’s semantic algebra supports SQL-compatible queries and attaches an explicit cost to each semantic operator, since every semantic selection invokes an LLM or embedding service (Lee et al., 29 Aug 2025).
- Systems such as HQANN accelerate hybrid top-$k$ retrieval by leveraging fused distance metrics and attribute-aware graph navigation, reporting high recall@$10$ at latencies around $50$μs (Wu et al., 2022). PandaDB reports sustained query throughput, with mixed queries completing in tens to hundreds of ms (Zhao et al., 2021).
- SpectraQuery is evaluated on SQL correctness and reports $93$–$97\%$ groundedness in answer synthesis, with expert satisfaction scores of $4.1$–$4.6$ out of $5$ (Vangara et al., 14 Jan 2026).
- Application Areas:
- Open-domain and domain-specific QA, exploratory scientific search (e.g., battery science), recommendation engines, graph-based entity resolution, fraud detection, multi-hop dialogue (Vangara et al., 14 Jan 2026, Bertorello et al., 2023, Zhao et al., 2021).
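For readers less familiar with the retrieval metrics quoted in this section, the following minimal sketch computes precision, recall, and F1 over the set of rows a hybrid query returns versus a gold set. The function name and the example ID lists are illustrative assumptions, not data from any cited evaluation.

```python
def precision_recall_f1(returned, relevant):
    """Set-based precision, recall, and F1 for a query's returned row IDs."""
    returned, relevant = set(returned), set(relevant)
    true_positives = len(returned & relevant)
    precision = true_positives / len(returned) if returned else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical example: 5 rows returned, 8 rows in the gold set, 4 in common.
p, r, f1 = precision_recall_f1([1, 2, 3, 4, 5], [1, 2, 3, 4, 6, 7, 8, 9])
print(round(p, 3), round(r, 3), round(f1, 3))
```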
6. Limitations, Open Problems, and Future Directions
SUQL systems face several technical and operational frontiers:
- Scalability: LLM call costs, context window limits, and schema explosion (thousands of columns from discretization) constrain practical deployments (Bertorello et al., 2023).
- Explainability and Provenance: While structured traceability is well-understood, reliable provenance over LLM-driven retrieval and answer chains remains an open research challenge (Tan, 2023).
- Semantic Ambiguity: Zero-shot classification of large enumerations and value-to-column grounding remain brittle (Liu et al., 2023, Hassini, 20 Oct 2025).
- Optimization: Budget-aware plans—balancing DBMS and LLM costs, retrieval/filter cascades, lazy materialization—remain active research topics (Tan, 2023).
- Generality: Most SUQL systems require fine-grained schema awareness, and evaluation outside curated scientific or QA datasets is limited (Vangara et al., 14 Jan 2026).
- Planned Extensions: Advanced multimodal support (images, XRD, audio), interactive disambiguation, ontology-driven key consolidation, reinforcement learning for interaction policies, debug/verification modes, dynamic operator insertion, and streaming data support are all articulated as practical next steps (Bertorello et al., 2023, Vangara et al., 14 Jan 2026, Hassini, 20 Oct 2025).
7. Selected Comparative Table of SUQL System Properties
The following table summarizes representative SUQL systems and their principal contributions:
| System | Key SUQL Innovations | Empirical Highlights / Domain |
|---|---|---|
| dIR (Bertorello et al., 2023) | Text→columns discretization + few-shot text-to-SQL + ReAct conversational planner | 0.85 recall/0.80 precision; complex multi-hop QA |
| SUQL (Liu et al., 2023) | SQL extension with ANSWER/SUMMARY LLM operators | 90.3% accuracy on real-world conversations |
| SABER (Lee et al., 29 Aug 2025) | Extended Relational Algebra with semantic selection/join/group-by | Costed semantic plans; SQL-compatibility |
| DynaQuery (Hassini, 20 Oct 2025) | SILE schema-linking, per-row multimodal LLM filtering | 80.0% execution accuracy; robust to schema hallucination |
| HQANN (Wu et al., 2022) | Attribute-filtered HNSW for fused queries | 99% recall@10 in <100μs (hybrid ANN) |
| PandaDB (Zhao et al., 2021) | CypherPlus: subproperty operators for property graphs | 100–1000x faster than pipeline baselines |
| CHASE (Ma et al., 9 Jan 2025) | Native relational algebra extensions + compilation to MLIR | 13×–7500× speedup vs. plugin ANN SQL |
| SpectraQuery (Vangara et al., 14 Jan 2026) | Domain-adapted SUQL for scientific structured+literature search | 93–97% groundedness; >4/5 expert rating |
Cost-aware operator design and tight integration of DBMS execution with LLM-based unstructured reasoning are hallmarks of modern SUQL research. The field is moving toward greater formal rigor, operator modularity, and empirical validation, with reported progress on scalability and composability for hybrid data access in practical, high-value domains.