High-Quality Query Correction Dataset
- High-quality query correction datasets are rigorously curated collections that pair real-world noisy queries with their corrected counterparts, complete with detailed annotations and error metadata.
- They integrate diverse construction methodologies including human-edit histories, specialist annotations, synthetic error injection, and ASR outputs to address domain-specific challenges.
- Structured splits and evaluation metrics such as F1, BLEU, and accuracy enable repeatable benchmarking and robust training for automatic spelling, query rewriting, and retrieval-augmented generation models.
A high-quality query correction dataset is a rigorously curated resource consisting of paired “noisy” user queries and their corrected forms, typically accompanied by detailed annotations and error-type metadata. Such datasets are foundational for evaluating and training models in automatic spelling correction, query rewriting, retrieval-augmented generation, and domain-specific correction pipelines. The following sections synthesize established datasets and methodologies from recent literature, encompassing construction protocols, statistical properties, domain-specific challenges, error typologies, evaluation metrics, benchmark results, and quality assurance techniques.
1. Construction Methodologies and Domain Coverage
High-quality query correction datasets derive from several distinct domain sources and annotation paradigms, illustrating a spectrum of design principles:
- Human-Edit Histories: The MQR dataset (Chu et al., 2019) is constructed from human question revision logs across 303 English-language Stack Exchange sites. Ill-formed queries are aligned with their post-edit, well-formed counterparts, filtered by syntactic templates and character set coverage.
- Specialist-Annotated Corrections: MCSCSet (Jiang et al., 2022) targets medical-domain Chinese spelling correction using real-world search queries from Tencent Yidian. Medical-knowledgeable annotators correct or inject errors at character level, validated through a multi-stage review process.
- Synthetic Error Injection: The Sandwich Reasoning dataset (Zhang et al., 7 Jan 2026) and QE-RAG (Zhang et al., 5 Apr 2025) introduce controlled error types (e.g., substitutions, deletions, token swaps, keyboard/visual/spelling errors) to clean corpora, ensuring systematic coverage and balance across domains (e.g., e-commerce, medical, open-domain QA).
- Automatic Generation from Speech Recognition: The AAM benchmark (Lu et al., 4 Sep 2025) uses ASR systems on public Mandarin speech corpora (Aidatatang, AISHELL-1, MagicData) to obtain realistic ASR-induced query transcriptions, pairing each with its human transcript.
- Semantic Parsing Error Correction: In text-to-SQL, synthetic “wrong parses” are derived from semantic parsers tested on Spider, with clause-level error representations rather than solely token-level edits (Chen et al., 2023).
Table 1: Representative Query Correction Datasets
| Dataset | Domain(s) | Data Source / Annotation | Size (pairs) |
|---|---|---|---|
| MQR | General (English questions) | Stack Exchange edit histories | 427,719 |
| MCSCSet | Medical Chinese | Tencent Yidian queries, expert annot. | ~200,000 |
| SandwichR | E-com, Video, Medical | Synthetic error injection | ~273,000 |
| QE-RAG | Open-domain QA | Synthetic errors on 6 QA datasets | Varies (e.g., NQ: 3,610) |
| AAM | Mandarin (speech queries) | ASR output vs. human transcript | 6,965 |
| Text-to-SQL Corr | SQL parsing (Spider) | Parser-generated errors, auto-aligned | 20k–47k training pairs per parser |
This diversity reflects the importance of aligning dataset construction to application-specific challenges, error modalities, and linguistic environments.
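The synthetic-error-injection methodology above can be sketched in a few lines. This is a minimal illustration rather than any paper's released pipeline: the confusion tables (`KEYBOARD_NEIGHBORS`, `VISUAL_CONFUSIONS`) are tiny hypothetical stand-ins for the full keyboard-layout and glyph-similarity resources such datasets actually use.

```python
import random

# Hypothetical miniature confusion tables; real construction pipelines derive
# these from full keyboard layouts and glyph-similarity resources.
KEYBOARD_NEIGHBORS = {"q": "wa", "w": "qes", "e": "wrd", "o": "ip0"}
VISUAL_CONFUSIONS = {"o": "0", "l": "1", "s": "5", "b": "8"}

def inject_error(query: str, error_type: str, rng: random.Random) -> str:
    """Inject one character-level error of the given type; return the query
    unchanged if no position supports that error type."""
    chars = list(query)
    if error_type == "keyboard":
        candidates = [i for i, c in enumerate(chars) if c in KEYBOARD_NEIGHBORS]
        if not candidates:
            return query
        i = rng.choice(candidates)
        chars[i] = rng.choice(KEYBOARD_NEIGHBORS[chars[i]])
    elif error_type == "visual":
        candidates = [i for i, c in enumerate(chars) if c in VISUAL_CONFUSIONS]
        if not candidates:
            return query
        i = rng.choice(candidates)
        chars[i] = VISUAL_CONFUSIONS[chars[i]]
    elif error_type == "deletion":
        if len(chars) < 2:
            return query
        del chars[rng.randrange(len(chars))]
    return "".join(chars)

# Pair a clean query with its synthetic noisy counterpart.
rng = random.Random(0)
pair = {"clean": "world cup schedule",
        "noisy": inject_error("world cup schedule", "keyboard", rng)}
```

Seeding the generator makes the corruption reproducible, which matters when the noisy/clean pairs are released as a fixed benchmark.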
2. Error Typologies and Annotation Protocols
Error representation and annotation are critical for dataset utility and difficulty:
- Types of Errors:
- Character-level errors: Phonological (soundalike), visual (shape similarity), order confusion, omission, and redundancies (MCSCSet (Jiang et al., 2022)).
- Word-level errors: Wrong word (visual or phonetic substitutions), missing word, disorder/swap of tokens (SandwichR (Zhang et al., 7 Jan 2026)).
- Entry errors in QA: Keyboard-proximity (e.g., ‘q’→‘w’), visual-similarity (e.g., ‘o’→‘0’), misspellings, at controlled injection rates (QE-RAG (Zhang et al., 5 Apr 2025)).
- ASR errors: Only token substitution (no insertions/deletions) to mirror SIGHAN conventions (AAM (Lu et al., 4 Sep 2025)).
- Clause-level edits: Edits at SQL clause granularity in semantic parsing (Text-to-SQL Correction (Chen et al., 2023)).
- Annotation Mechanisms:
- Specialist manual annotation and review, as in medical spelling correction (primary and secondary annotator, with expert adjudication (Jiang et al., 2022)).
- Automated error synthesis: strict protocols for single-error injection, balanced coverage, and deduplication (SandwichR (Zhang et al., 7 Jan 2026)).
- Automatic filtering and quality control: ensuring absence of personal data, identical noisy-clean pairs, or uninformative stopword changes.
- Quality validation: Inter-annotator agreement (Cohen’s κ), semantic equivalence checks, and multiple error-type labels (MQR (Chu et al., 2019), MCSCSet (Jiang et al., 2022)).
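Where the protocols above report inter-annotator agreement, Cohen's κ compares observed agreement with the agreement expected by chance. A self-contained sketch (the error-type label names below are invented for illustration):

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators assigning one label per item."""
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: both annotators independently pick the same label.
    expected = sum((freq_a[c] / n) * (freq_b[c] / n) for c in freq_a)
    return (observed - expected) / (1 - expected)

# Illustrative error-type labels from two annotators on five queries.
primary   = ["phono", "visual", "phono", "omission", "phono"]
secondary = ["phono", "visual", "visual", "omission", "phono"]
kappa = cohens_kappa(primary, secondary)  # 0.6875
```

κ = 1 indicates perfect agreement; values near 0 mean the annotators agree no more often than chance, a signal that the annotation guidelines need tightening.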
3. Dataset Structure, Splits, and Representations
Dataset organization and normalization enable repeatable benchmarking and model development:
- Splits and Representation:
- Typical three-way splits into train, dev, and test sets, with domain/category coverage (Chu et al., 2019, Jiang et al., 2022, Zhang et al., 7 Jan 2026).
- Average query lengths and error distributions reported for realism and comparability.
- Data schemas: JSON-lines format for text pairs, error type metadata, and, in the SQL-correction context, PyDict representations of queries and edits (Chen et al., 2023).
- Strict normalization and cleaning: removal of overly short/long samples, whitespace and punctuation normalization, tokenization (BERT/WordPiece in speech-recognition settings (Lu et al., 4 Sep 2025)).
- Controlled injection protocols:
- Uniform sampling over error types and domains (SandwichR).
- Probabilistic error introduction at the word/character level (QE-RAG): each word is corrupted with probability p_word, and each character of a selected word with probability p_char, i.e.

  P(corrupt word) = p_word = 0.3,  P(corrupt character | word selected) = p_char = 0.3,

  with the three error types drawn at a 3:1:1 ratio (Zhang et al., 5 Apr 2025).
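A sketch of this two-level injection protocol follows. The p_word, p_char, and 3:1:1 settings are the reported QE-RAG parameters; the assignment of the ratio to specific error types, the confusion tables, and the helper names are assumptions for illustration only.

```python
import random

# Assumed assignment of the reported 3:1:1 error-type ratio
# (keyboard-proximity weighted 3; visual-similarity and misspelling 1 each).
ERROR_TYPES = ["keyboard"] * 3 + ["visual", "spelling"]

def corrupt_char(c, error_type, rng):
    # Placeholder confusion tables; real pipelines use full keyboard/glyph maps.
    keyboard = {"a": "s", "e": "r", "o": "i", "n": "m"}
    visual = {"o": "0", "l": "1"}
    if error_type == "keyboard":
        return keyboard.get(c, c)
    if error_type == "visual":
        return visual.get(c, c)
    return rng.choice("abcdefghijklmnopqrstuvwxyz")  # crude misspelling

def corrupt_query(query, p_word=0.3, p_char=0.3, rng=None):
    """Corrupt each word with probability p_word; within a selected word,
    corrupt each character independently with probability p_char."""
    rng = rng or random.Random()
    words = []
    for word in query.split():
        if rng.random() < p_word:
            word = "".join(
                corrupt_char(c, rng.choice(ERROR_TYPES), rng)
                if rng.random() < p_char else c
                for c in word
            )
        words.append(word)
    return " ".join(words)
```

Because corruption is per-word and substitution-based, the token count of the query is preserved, which keeps noisy/clean pairs trivially alignable.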
4. Evaluation Metrics and Benchmark Results
Rigorous evaluation schemes and well-documented baseline results underpin the usefulness of these datasets:
Metrics:
- Token/character-level: standard Precision, Recall, and F, computed over edited positions:

  P = TP / (TP + FP),  R = TP / (TP + FN),  F1 = 2PR / (P + R).

- Sentence/query-level: a query counts as correct only if all tokens/errors are handled correctly; accuracy and F (favoring precision) are used for correction tasks (Zhang et al., 7 Jan 2026).
- Retrieval/QA: token-level F, Exact Match, Recall@k (QE-RAG (Zhang et al., 5 Apr 2025)).
- MT-style: BLEU, ROUGE, METEOR for question rewriting (MQR).
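The character-level metrics can be computed directly in the substitution-only, position-aligned setting that SIGHAN-style benchmarks assume. A minimal sketch, counting an edit as a true positive only when it restores the gold character:

```python
def correction_prf(noisy, gold, predicted):
    """Character-level correction P/R/F1 under a substitution-only setting,
    where noisy, gold, and model output are position-aligned (equal length)."""
    assert len(noisy) == len(gold) == len(predicted)
    tp = fp = fn = 0
    for n, g, p in zip(noisy, gold, predicted):
        if p != n:              # the model edited this position
            if p == g:
                tp += 1         # edit restored the gold character
            else:
                fp += 1         # spurious or wrong edit
        if g != n and p != g:
            fn += 1             # a genuine error remains uncorrected
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```

For example, `correction_prf("helo wprld", "helo world", "helo world")` yields (1.0, 1.0, 1.0), while leaving the noisy query unedited yields (0.0, 0.0, 0.0).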
Benchmarks:
| Model | Metric | Result (selected) | Dataset |
|---|---|---|---|
| BERT-Corrector | Correction F1 | 80.49% | MCSCSet |
| MedBERT-Corrector | Correction F1 | 80.61% | MCSCSet |
| Soft-Masked BERT | Correction F1 | 80.88% | MCSCSet |
| CTD (proposed) | Sentence Acc. | 57.8% | AAM |
| SandwichR SFT+RL | F/Accuracy | 0.221 / 0.213 | SandwichR/Ecom |
| RA-QCG | QA F1 (20% corruption) | 40.16% (vs. 38.04% for base RAG) | QE-RAG |
| Transformer (MQR) | BLEU-4 | 22.1 | MQR |
| CodeT5-corr | EM improvement | +6.5% over no-edit | Text-to-SQL Corr |
- Domain impact: Substantial degradation is observed when open-domain correctors are applied to domain-specific queries (e.g., BERT-based CSC models' correction F1 drops from ~80% on SIGHAN-15 to below 30% on MCSCSet (Jiang et al., 2022)). In QA retrieval, noisy queries hurt downstream F1, a loss mitigated by correction pipelines (Zhang et al., 5 Apr 2025).
5. Error Analysis, Use Cases, and Recommendations
Advanced datasets support fine-grained analyses and targeted improvements:
Error Analysis:
- Dominant error modalities differ by domain: phonological/visual (Chinese medical), substitution (ASR), word swaps and omissions (user queries), database grounding and structural errors (SQL).
- Semantic drift and partial correction remain challenges in question rewriting and sequence-to-sequence tasks (~30% semantic drift in MQR model outputs (Chu et al., 2019)).
Applications:
- Pre-processing for QA and semantic parsing systems (MQR).
- Benchmarks for correction in retrieval-augmented generation (QE-RAG).
- Robustness testing of query correction under latency constraints (SandwichR).
- Domain-specific error injection and correction modeling for specialized information retrieval.
Recommendations:
- Incorporate domain-specific confusion sets and error typologies for synthetic pre-training (Jiang et al., 2022).
- Leverage multi-modal/contextual signals (images, audio) in medical and spoken query correction.
- Release annotated data, and report inter-annotator agreement scores and semantic equivalence statistics wherever manual annotation is employed.
- Extend datasets to new sub-domains (e.g., specialty medical, mixed-language, code querying) for broader evaluation scope.
6. Accessibility and Extensibility
Prominent datasets are publicly released with open licenses, modular data schemas, and detailed construction recipes.
- Repositories: MQR (https://github.com/ZeweiChu/MQR) (Chu et al., 2019), Text-to-SQL Correction (https://github.com/OSU-NLP-Group/Auto-SQL-Correction) (Chen et al., 2023), QE-RAG (repository URL in paper) (Zhang et al., 5 Apr 2025).
- Package Support: qe_rag Python package provides loaders and corruption scripts for QA retrieval benchmark reproduction (Zhang et al., 5 Apr 2025).
- Best Practices: Adherence to original data normalization, transparent annotation/correction protocols, and explicit reporting of data statistics and splits are recommended for extension and comparative benchmarking.
High-quality query correction datasets play a central role in advancing neural correction models, robust retrieval-augmented generation pipelines, and domain-specific information access systems by providing precisely labeled, diverse, and scalable benchmarks tailored to real-world query entry error distributions.