
High-Quality Query Correction Dataset

Updated 14 January 2026
  • High-quality query correction datasets are rigorously curated collections that pair real-world noisy queries with their corrected counterparts, complete with detailed annotations and error metadata.
  • They integrate diverse construction methodologies including human-edit histories, specialist annotations, synthetic error injection, and ASR outputs to address domain-specific challenges.
  • Structured splits and evaluation metrics such as F1, BLEU, and accuracy enable repeatable benchmarking and robust training for automatic spelling correction, query rewriting, and retrieval-augmented generation models.

A high-quality query correction dataset is a rigorously curated resource consisting of paired “noisy” user queries and their corrected forms, typically accompanied by detailed annotations and error-type metadata. Such datasets are foundational for evaluating and training models in automatic spelling correction, query rewriting, retrieval-augmented generation, and domain-specific correction pipelines. The following sections synthesize established datasets and methodologies from recent literature, encompassing construction protocols, statistical properties, domain-specific challenges, error typologies, evaluation metrics, benchmark results, and quality assurance techniques.

1. Construction Methodologies and Domain Coverage

High-quality query correction datasets derive from several distinct domain sources and annotation paradigms, illustrating a spectrum of design principles:

  • Human-Edit Histories: The MQR dataset (Chu et al., 2019) is constructed from human question revision logs across 303 English-language Stack Exchange sites. Ill-formed queries are aligned with their post-edit, well-formed counterparts, filtered by syntactic templates and character set coverage.
  • Specialist-Annotated Corrections: MCSCSet (Jiang et al., 2022) targets medical-domain Chinese spelling correction using real-world search queries from Tencent Yidian. Medical-knowledgeable annotators correct or inject errors at character level, validated through a multi-stage review process.
  • Synthetic Error Injection: The Sandwich Reasoning dataset (Zhang et al., 7 Jan 2026) and QE-RAG (Zhang et al., 5 Apr 2025) introduce controlled error types (e.g., substitutions, deletions, token swaps, keyboard/visual/spelling errors) to clean corpora, ensuring systematic coverage and balance across domains (e.g., e-commerce, medical, open-domain QA).
  • Automatic Generation from Speech Recognition: The AAM benchmark (Lu et al., 4 Sep 2025) uses ASR systems on public Mandarin speech corpora (Aidatatang, AISHELL-1, MagicData) to obtain realistic ASR-induced query transcriptions, pairing each with its human transcript.
  • Semantic Parsing Error Correction: In text-to-SQL, synthetic “wrong parses” are derived from semantic parsers tested on Spider, with clause-level error representations rather than solely token-level edits (Chen et al., 2023).
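The synthetic-injection methodology above (keyboard-proximity, visual-similarity, deletion, and swap errors) can be sketched as follows. This is a minimal illustration, not the actual QE-RAG or SandwichR pipeline; the confusion sets here are hypothetical toy examples, whereas real datasets derive them from full keyboard layouts and glyph-similarity tables.

```python
import random

# Hypothetical toy confusion sets for illustration only; real datasets
# build these from keyboard layouts and character-shape similarity tables.
KEYBOARD_NEIGHBORS = {"q": "wa", "o": "ip0", "e": "wr"}
VISUAL_CONFUSIONS = {"o": "0", "l": "1", "s": "5"}

def inject_char_error(query: str, rng: random.Random) -> str:
    """Inject at most one character-level error: keyboard-proximity or
    visual-similarity substitution, deletion, or adjacent-character swap."""
    if len(query) < 2:
        return query
    i = rng.randrange(len(query))
    op = rng.choice(["keyboard", "visual", "delete", "swap"])
    c = query[i]
    if op == "keyboard" and c in KEYBOARD_NEIGHBORS:
        return query[:i] + rng.choice(KEYBOARD_NEIGHBORS[c]) + query[i+1:]
    if op == "visual" and c in VISUAL_CONFUSIONS:
        return query[:i] + VISUAL_CONFUSIONS[c] + query[i+1:]
    if op == "delete":
        return query[:i] + query[i+1:]
    if op == "swap" and i < len(query) - 1:
        return query[:i] + query[i+1] + query[i] + query[i+2:]
    return query  # chosen operation not applicable at this position
```

Production protocols additionally enforce single-error constraints, balanced sampling over error types, and deduplication of the resulting noisy-clean pairs.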

Table 1: Representative Query Correction Datasets

| Dataset | Domain(s) | Data Source / Annotation | Size (pairs) |
|---|---|---|---|
| MQR | General (English questions) | Stack Exchange edit histories | 427,719 |
| MCSCSet | Medical Chinese | Tencent Yidian queries, expert annot. | ~200,000 |
| SandwichR | E-com, Video, Medical | Synthetic error injection | ~273,000 |
| QE-RAG | Open-domain QA | Synthetic errors on 6 QA datasets | Varies (e.g., NQ: 3,610) |
| AAM | Mandarin (speech queries) | ASR output vs. human transcript | 6,965 |
| Text-to-SQL Corr | SQL parsing (Spider) | Parser-generated errors, auto-aligned | 20k–47k per model (train) |

This diversity reflects the importance of aligning dataset construction to application-specific challenges, error modalities, and linguistic environments.

2. Error Typologies and Annotation Protocols

Error representation and annotation are critical for dataset utility and difficulty:

  • Types of Errors:
    • Character-level errors: Phonological (soundalike), visual (shape similarity), order confusion, omission, and redundancies (MCSCSet (Jiang et al., 2022)).
    • Word-level errors: Wrong word (visual or phonetic substitutions), missing word, disorder/swap of tokens (SandwichR (Zhang et al., 7 Jan 2026)).
    • Entry errors in QA: Keyboard-proximity (e.g., ‘q’→‘w’), visual-similarity (e.g., ‘o’→‘0’), misspellings, at controlled injection rates (QE-RAG (Zhang et al., 5 Apr 2025)).
    • ASR errors: Only token substitution (no insertions/deletions) to mirror SIGHAN conventions (AAM (Lu et al., 4 Sep 2025)).
    • Clause-level edits: Edits at SQL clause granularity in semantic parsing (Text-to-SQL Correction (Chen et al., 2023)).
  • Annotation Mechanisms:
    • Specialist manual annotation and review, as in medical spelling correction (primary and secondary annotator, with expert adjudication (Jiang et al., 2022)).
    • Automated error synthesis: strict protocols for single-error injection, balanced coverage, and deduplication (SandwichR (Zhang et al., 7 Jan 2026)).
    • Automatic filtering and quality control: ensuring absence of personal data, identical noisy-clean pairs, or uninformative stopword changes.
    • Quality validation: Inter-annotator agreement (Cohen’s κ), semantic equivalence checks, and multiple error-type labels (MQR (Chu et al., 2019), MCSCSet (Jiang et al., 2022)).
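The Cohen's κ quality-validation step mentioned above can be computed directly from two annotators' error-type labels. This is a standard textbook formulation, not code from any of the cited datasets; the label names are illustrative.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators' categorical error-type labels:
    (observed agreement - chance agreement) / (1 - chance agreement)."""
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability both annotators pick the same class
    # independently, given their marginal label frequencies.
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)
```

For example, annotators who agree on 3 of 4 error-type labels (phonological vs. visual vs. omission) yield κ ≈ 0.64, well below the raw 75% agreement once chance agreement is discounted.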

3. Dataset Structure, Splits, and Representations

Dataset organization and normalization enable repeatable benchmarking and model development:

  • Splits and Representation:
    • Typical three-way splits into train, dev, and test sets, with domain/category coverage (Chu et al., 2019, Jiang et al., 2022, Zhang et al., 7 Jan 2026).
    • Average query lengths and error distributions reported for realism and comparability.
    • Data schemas: JSON-lines format for text pairs, error type metadata, and, in the SQL-correction context, PyDict representations of queries and edits (Chen et al., 2023).
    • Strict normalization and cleaning: removal of overly short/long samples, whitespace and punctuation normalization, tokenization (BERT/WordPiece in speech-recognition settings (Lu et al., 4 Sep 2025)).
  • Controlled injection protocols:
    • Uniform sampling over error types and domains (SandwichR).
    • Probabilistic error introduction at word/character levels (QE-RAG), using formulas such as:

    \text{average\_EditDist} = \frac{1}{N} \sum_{i=1}^{N} \mathrm{EditDist}(\text{noisy}_i, \text{clean}_i)

    and error selection probabilities (p_word = 0.3, p_char = 0.3, error-type ratio 3:1:1 for QE-RAG (Zhang et al., 5 Apr 2025)).
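A minimal sketch of such an injection protocol, producing JSON-lines records with the average-edit-distance statistic defined above, might look as follows. The word-drop corruption and default `p_word = 0.3` mirror the probabilistic word-level injection described for QE-RAG, but this is an illustrative simplification, not the published pipeline; `corrupt`, `build_records`, and `to_jsonl` are hypothetical helper names.

```python
import json
import random

def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via dynamic programming (two-row variant)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def corrupt(query: str, rng: random.Random, p_word: float = 0.3) -> str:
    """Drop each word independently with probability p_word (one of
    several possible word-level corruption operations)."""
    words = query.split()
    kept = [w for w in words if rng.random() >= p_word] or words[:1]
    return " ".join(kept)

def build_records(clean_queries, seed=0, p_word=0.3):
    """Pair each clean query with a corrupted form plus error metadata,
    and report the corpus-level average edit distance."""
    rng = random.Random(seed)
    records = []
    for q in clean_queries:
        noisy = corrupt(q, rng, p_word)
        records.append({"noisy": noisy, "clean": q,
                        "edit_dist": edit_distance(noisy, q)})
    avg = sum(r["edit_dist"] for r in records) / len(records)
    return records, avg

def to_jsonl(records):
    """Serialize records in the JSON-lines schema used by text-pair datasets."""
    return "\n".join(json.dumps(r) for r in records)
```

The two-row edit-distance variant keeps memory at O(|b|) rather than O(|a||b|), which matters when computing the statistic over hundreds of thousands of pairs.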

4. Evaluation Metrics and Benchmark Results

Rigorous evaluation schemes and well-documented baseline results underpin the usefulness of these datasets:

  • Metrics:

    • Token/character-level: Precision, Recall, and F1, as standard:

    \mathrm{P} = \frac{TP}{TP + FP},\quad \mathrm{R} = \frac{TP}{TP + FN},\quad F_1 = \frac{2PR}{P + R}

    • Sentence/query-level: A query counts as correct only if all tokens/errors are handled correctly; Accuracy and F_{0.5} (favoring precision) are reported for correction tasks (Zhang et al., 7 Jan 2026).
    • Retrieval/QA: Token-level F1, Exact Match, Recall@k (QE-RAG (Zhang et al., 5 Apr 2025)).
    • MT-style: BLEU, ROUGE, METEOR for question rewriting (MQR).

  • Benchmarks:

| Model | Metric | Result (selected) | Dataset |
|---|---|---|---|
| BERT-Corrector | Correction F1 | 80.49% | MCSCSet |
| MedBERT-Corrector | Correction F1 | 80.61% | MCSCSet |
| Soft-Masked BERT | Correction F1 | 80.88% | MCSCSet |
| CTD (proposed) | Sentence Acc. | 57.8% | AAM |
| SandwichR SFT+RL | F0.5 / Accuracy | 0.221 / 0.213 | SandwichR/Ecom |
| RA-QCG | QA F1 (20% corr.) | 40.16% (vs. 38.04% for base RAG) | QE-RAG |
| Transformer (MQR) | BLEU-4 | 22.1 | MQR |
| CodeT5-corr | EM improvement | +6.5% over no-edit | Text-to-SQL Corr |
  • Domain impact: Substantial degradation is observed when applying open-domain correctors to domain-specific queries (e.g., BERT-CSC models' F1 dropping from ~80% on SIGHAN-15 to <30% on MCSCSet (Jiang et al., 2022)). In QA retrieval, error-laden queries hurt downstream F1, a drop mitigated by correction pipelines (Zhang et al., 5 Apr 2025).
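The token-level P/R/F1 and sentence-level accuracy metrics above can be sketched as follows, assuming a simple position-aligned token scheme; the published benchmarks use dataset-specific alignment and scoring scripts, so this is a conceptual illustration rather than any official evaluator.

```python
def token_prf(noisy, pred, gold):
    """Token-level precision/recall/F1 over aligned token sequences.
    A true positive is a position the model changed to the gold token;
    a false positive is a wrong change; a false negative is a missed error."""
    tp = fp = fn = 0
    for n, p, g in zip(noisy, pred, gold):
        if p != n:            # model attempted a correction here
            if p == g:
                tp += 1
            else:
                fp += 1
        elif n != g:          # an error the model left uncorrected
            fn += 1
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return prec, rec, f1

def sentence_accuracy(preds, golds):
    """Sentence/query-level accuracy: a query counts only if every token
    matches the gold correction exactly."""
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)
```

The gap between token-level F1 and sentence-level accuracy in the benchmark table (e.g., ~80% F1 vs. 57.8% sentence accuracy) follows directly from the all-or-nothing sentence criterion.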

5. Error Analysis, Use Cases, and Recommendations

Advanced datasets support fine-grained analyses and targeted improvements:

  • Error Analysis:

    • Dominant error modalities differ by domain: phonological/visual (Chinese medical), substitution (ASR), word swaps and omissions (user queries), database grounding and structural errors (SQL).
    • Semantic drift and partial correction remain challenges in question rewriting and sequence-to-sequence tasks (~30% semantic drift in MQR model outputs (Chu et al., 2019)).
  • Applications:
    • Pre-processing for QA and semantic parsing systems (MQR).
    • Benchmarks for correction in retrieval-augmented generation (QE-RAG).
    • Robustness testing of query correction under latency constraints (SandwichR).
    • Domain-specific error injection and correction modeling for specialized information retrieval.
  • Recommendations:
    • Incorporate domain-specific confusion sets and error typologies for synthetic pre-training (Jiang et al., 2022).
    • Leverage multi-modal/contextual signals (images, audio) in medical and spoken query correction.
    • Release datasets publicly, and report inter-annotator agreement scores and semantic equivalence statistics wherever manual annotation is employed.
    • Extend datasets to new sub-domains (e.g., specialty medical, mixed-language, code querying) for broader evaluation scope.

6. Accessibility and Extensibility

Prominent datasets are publicly released with open licenses, modular data schemas, and detailed construction recipes.


High-quality query correction datasets play a central role in advancing neural correction models, robust retrieval-augmented generation pipelines, and domain-specific information access systems by providing precisely labeled, diverse, and scalable benchmarks tailored to real-world query entry error distributions.
