- The paper introduces ReQAP, which recursively decomposes complex questions into operator trees to integrate both structured and unstructured personal data.
- It employs a two-stage framework in which question understanding and decomposition (QUD) is followed by operator tree execution (OTX), ensuring efficient and interpretable data processing.
- The study presents the PerQA benchmark and demonstrates that small-model variants achieve competitive performance while adhering to privacy and on-device constraints.
Recursive Question Understanding for Complex Question Answering over Heterogeneous Personal Data
This paper introduces ReQAP, a new methodology for complex question answering (QA) over heterogeneous personal data, with a focus on privacy-preserving, on-device execution. The work advances over prior approaches by formalizing recursive question decomposition into executable operator trees that integrate both structured and unstructured user data. The study also presents PerQA, a large benchmark dataset reflecting realistic personal data and complex user queries, and provides an extensive experimental evaluation.
Motivation and Problem Setting
Modern personal information management increasingly involves answering complex user questions that span heterogeneous data sources: structured (e.g., workout logs), semi-structured (e.g., calendar entries), and unstructured (e.g., email, social media). Heterogeneity introduces several challenges:
- Expressiveness: User queries range from simple lookups to complex temporal, analytical, or aggregational questions.
- Scalability: Many queries require contextualizing and aggregating thousands of events.
- Privacy: Users demand local, on-device data processing to ensure privacy and data control, precluding most cloud-based solutions.
Two dominant paradigms in prior heterogeneous QA approaches are:
- Verbalization: Linearizing evidence across modalities and feeding all content through an LLM, frequently hitting context and reasoning bottlenecks when scaling to large or numerically complex datasets.
- Translation (Text2SQL/NL2SQL): Translating the question into code or a formal query (e.g., SQL), enabling direct execution on structured data, but failing when the relevant data is unstructured or lacks a well-defined schema.
ReQAP seeks to overcome these limitations by combining recursive decomposition with operator-based execution, accommodating both structured and unstructured data, and maintaining a low resource footprint suitable for personal devices.
ReQAP: Methodological Framework
ReQAP decomposes the QA task into two stages: question understanding and decomposition (QUD) and operator tree execution (OTX).
Data Model
All personal data is modeled as temporally ordered lists of events (key-value dictionaries). This uniform data abstraction enables agile integration of arbitrary new data sources (e.g., fitness data, emails, streaming history).
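To make the abstraction concrete, here is a minimal sketch of such a uniform event representation; the field names and sources are illustrative assumptions, not the paper's actual schema:

```python
from datetime import date

# Hypothetical events: every source is reduced to a temporally
# ordered list of key-value dictionaries.
events = [
    {"source": "workouts", "date": date(2024, 3, 1), "type": "run", "distance_km": 5.2},
    {"source": "calendar", "date": date(2024, 3, 2), "title": "Team sync", "duration_min": 30},
    {"source": "mail", "date": date(2024, 3, 3), "subject": "Trip itinerary", "body": "Flight details ..."},
]

# Temporal ordering makes range scans and temporal joins straightforward.
events.sort(key=lambda e: e["date"])
march_events = [e for e in events if e["date"].month == 3]
```

Because every source shares this shape, plugging in a new source (e.g., streaming history) only requires emitting dictionaries with a timestamp field.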
Stage 1: Question Understanding and Decomposition (QUD)
QUD creates an operator tree that recursively breaks down a user's natural language question into executable sub-questions, ultimately mapping each to predefined operators. This process leverages:
- In-Context Learning (ICL): LLMs are prompted with carefully crafted operator tree generation examples. Rather than generating a full tree in one pass (which is brittle), the approach recursively decomposes questions, calling the LLM repeatedly and iteratively producing partial trees for each sub-question.
- Operator Set: Operators include classical database functions (join, filter, group_by, aggregate) and two novel components:
- RETRIEVE: Retrieves all relevant events matching a sub-question, across full heterogeneity in source and field structure. Employs SPLADE for high-recall initial filtering and cross-encoder reranking for high-precision selection.
- EXTRACT: Performs semantic extraction of key-value pairs, including information extraction from long text fields using a small sequence-to-sequence model (e.g., BART).
- Model Distillation for On-Device Deployability: The operator tree generation process is distilled into small-parameter models (down to 1B), enabling inference on user-controlled hardware, while larger models (e.g., LLaMA 70B, GPT4o) are used for training data and reference.
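The decomposition-and-execution idea can be sketched as a small recursive interpreter over operator-tree nodes. This is a simplified illustration under assumed names (`OpNode`, toy `RETRIEVE`/`FILTER`/`AGGREGATE` lambdas), not ReQAP's actual implementation; the real RETRIEVE and EXTRACT operators invoke retrieval and extraction models:

```python
from dataclasses import dataclass, field

# Hypothetical operator-tree node: an operator name, its arguments,
# and child subtrees produced by recursive decomposition.
@dataclass
class OpNode:
    op: str
    args: dict = field(default_factory=dict)
    children: list = field(default_factory=list)

def execute(node, data, operators):
    """Evaluate the tree bottom-up: children first, then this node's operator."""
    child_results = [execute(c, data, operators) for c in node.children]
    return operators[node.op](data, node.args, child_results)

# Toy stand-ins for the operator set.
operators = {
    "RETRIEVE": lambda data, args, _: [e for e in data if e.get("source") == args["source"]],
    "FILTER": lambda data, args, kids: [e for e in kids[0] if e[args["key"]] > args["value"]],
    "AGGREGATE": lambda data, args, kids: sum(e[args["key"]] for e in kids[0]),
}

# "Total distance of my runs longer than 3 km" as an operator tree.
tree = OpNode("AGGREGATE", {"key": "distance_km"}, [
    OpNode("FILTER", {"key": "distance_km", "value": 3.0}, [
        OpNode("RETRIEVE", {"source": "workouts"}),
    ]),
])
data = [
    {"source": "workouts", "distance_km": 5.0},
    {"source": "workouts", "distance_km": 2.0},
    {"source": "calendar"},
]
total = execute(tree, data, operators)  # 5.0
```

The bottom-up evaluation order is what makes each intermediate result inspectable, which underpins the auditability claims discussed later.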
Stage 2: Operator Tree Execution (OTX)
Given an operator tree, OTX evaluates the plan bottom-up against the actual user data:
- RETRIEVE first filters all data sources with fast, sparse retrieval (SPLADE), then uses key-value pattern recognition and cross-encoder reranking for granularity and deduplication.
- EXTRACT applies key/attribute mapping and information extraction, reusing mappings derived during the QUD stage.
- Standard set and aggregation operations (JOIN, FILTER, GROUP_BY, MAP, APPLY, UNNEST, ARGMAX, etc.) are implemented as efficient Python primitives on the event objects, often leveraging hash-based grouping and efficient merge routines.
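A hash-based GROUP_BY over event dictionaries, of the kind described above, can be sketched as follows; the function names `group_by` and `apply_agg` are illustrative, not taken from the paper:

```python
from collections import defaultdict

# Hypothetical grouping primitive: one pass, hash-based bucketing by key.
def group_by(events, key):
    groups = defaultdict(list)
    for e in events:
        groups[e[key]].append(e)
    return groups

# Apply an aggregation function per group (e.g., sum, max, len).
def apply_agg(groups, agg, key):
    return {g: agg(e[key] for e in evs) for g, evs in groups.items()}

runs = [
    {"month": "2024-03", "distance_km": 5.0},
    {"month": "2024-03", "distance_km": 7.0},
    {"month": "2024-04", "distance_km": 4.0},
]
per_month = apply_agg(group_by(runs, "month"), sum, "distance_km")
# {'2024-03': 12.0, '2024-04': 4.0}
```

Hash-based grouping gives linear-time aggregation over the event lists, which matters when queries touch thousands of events.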
The integration of traceable, interpretable operator trees provides both auditability (users can inspect which events were used to generate the answer) and extensibility (user-defined functions can easily be incorporated).
PerQA: Benchmark Dataset
The PerQA benchmark is a large, realistic dataset for personal QA, constructed in three phases:
- Persona Collection: 20 fictional personas are generated via detailed questionnaires covering demographics, relationships, interests, and usage patterns.
- Canonicalized Event Generation: For each persona, roughly 40k structured events are synthesized, leveraging Wikidata, real-world product dumps, and public datasets. Behaviors are made naturalistic by modeling plausible time and activity patterns.
- Event Verbalization and Question Generation: Events are verbalized into naturalistic calendar entries, emails, and social media posts using LLMs (LLaMA3.1, GPT4o). 3,500 unique complex questions are generated and paired with SQL-derived ground-truth answers, spanning aggregation, temporal reasoning, joins, ordering, and grouping.
Extensive data curation ensures question relevance and removes spurious empty-answer queries.
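The SQL-derived ground-truth pairing can be illustrated with a small sketch; the schema and question are invented for illustration and do not reproduce PerQA's actual tables:

```python
import sqlite3

# Hypothetical setup: ground-truth answers obtained by running SQL
# over the canonical structured events of a persona.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE workouts (date TEXT, type TEXT, distance_km REAL)")
con.executemany("INSERT INTO workouts VALUES (?, ?, ?)", [
    ("2024-03-01", "run", 5.0),
    ("2024-03-15", "run", 7.0),
    ("2024-04-02", "cycle", 20.0),
])

# Question: "How far did I run in March 2024?"
(answer,) = con.execute(
    "SELECT SUM(distance_km) FROM workouts "
    "WHERE type = 'run' AND date LIKE '2024-03%'"
).fetchone()
# answer == 12.0
```

Because the same events are also verbalized into emails and posts, a QA system must recover from text what the SQL query reads directly from the canonical table.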
Empirical Evaluation
The study presents extensive quantitative and qualitative evaluation, contrasting ReQAP with strong baselines:
Key results on PerQA:
- Hit@1 (main metric, ±10% tolerance for numerics): ReQAP with SFT (1B model) achieves 0.380, competitive with large-model variants (GPT4o: 0.386). CodeGen achieves 0.315 (1B). RAG baseline lags significantly (≤0.029 for 1B).
- Performance across question types: ReQAP consistently outperforms the baselines (sometimes by >10% absolute) on ordering, grouping, temporal, aggregation, and join-heavy queries. Temporal and multi-source queries benefit particularly from ReQAP's recursive decomposition.
- Ablation studies: Disabling recursive QUD (single-shot), using simplistic/surface-only RETRIEVE or EXTRACT variants, or replacing recursive with flat decomposition yields significant performance degradation.
- Scalability: ReQAP delivers acceptable performance/latency tradeoffs across QUD/operator module sizes ranging from 135M to 3B parameters, making it widely deployable.
- User study: On real exported personal data from 20 users, 94% of user-generated queries could be mapped to structurally isomorphic PerQA questions; users rated ReQAP's answers correct or nearly correct in 69% of cases.
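The Hit@1 tolerance described above can be mirrored by a simple check; this is a plausible reading of the metric (exact match for strings, ±10% relative tolerance for numbers), not the paper's evaluation code:

```python
# Hypothetical Hit@1 check: numeric answers match within a relative
# tolerance; everything else falls back to normalized string equality.
def hit_at_1(predicted, gold, tol=0.10):
    try:
        p, g = float(predicted), float(gold)
        if g == 0:
            return p == 0
        return abs(p - g) <= tol * abs(g)
    except (TypeError, ValueError):
        return str(predicted).strip().lower() == str(gold).strip().lower()

hit_at_1("42", "40")      # True: within 10% of 40
hit_at_1("50", "40")      # False: 25% off
hit_at_1("Oslo", "oslo")  # True: normalized string match
```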
Analysis and Challenged Assumptions
The paper asserts that previously dominant retrieval-augmented generation pipelines fail for complex aggregation and analytical queries on heterogeneous user data due to scaling and context limitations of LLMs. Empirically, RAG achieves only 0.029 Hit@1 with the 1B model and remains noncompetitive even with large models. ReQAP's structured, recursive decomposition—by contrast—enables both high recall of relevant events and effective semantic aggregation.
Additionally, the small-model variants of ReQAP, with only a minor drop in absolute performance compared to >100B parameter baselines, challenge the assumption that LLM-centric QA must rely on massive models and cloud resources.
Practical and Theoretical Implications
The ReQAP framework provides several advantages for real-world deployment:
- On-device Execution: All models and execution logic can be confined to user-controlled devices, strongly adhering to privacy constraints.
- Extensibility: The operator-based architecture supports new data sources and task-specific extensions with minimal modification.
- Auditability: The operator tree intermediate representation, as well as the ability for users to inspect which events contributed to each answer, facilitates interpretability and trust.
Theoretically, ReQAP demonstrates that recursive question decomposition aligns well with human-usable models of intent and sub-tasking, echoing advances in chain-of-thought prompting and semantic parsing, but at practical scale and with direct integration of information extraction from unstructured text.
Limitations and Future Directions
The evaluation is currently limited to synthetic and small-scale, privacy-respecting user data. Real-world deployment would involve federated architectures accessing APIs rather than local data exports, and further advances are needed for accommodating more diverse or federated data models. Additionally, although the operator set is easily extensible, novel data modalities (such as images or non-English text) would require additional model engineering.
Future developments in AI may further automate the extraction and integration pipeline, enhance federated privacy-preserving architectures, and support conversational/multi-turn interaction over dynamic personal data.
Conclusion
ReQAP represents a significant advance in privacy-aware, expressive question answering over personal heterogeneous data, reconciling analytical depth, source heterogeneity, and practical deployability. The public release of both system and benchmark is poised to catalyze further research at the intersection of semantic parsing, information extraction, and personal data management.