QADS Adversarial Dataset
- QADS is an adversarial MRC benchmark that systematically substitutes context-appropriate synonyms from SQuAD 2.0 to probe semantic comprehension.
- It employs a rigorous methodology combining automated synonym retrieval, enhanced Lesk word sense disambiguation, and crowdsourced validation to ensure contextual accuracy.
- Baseline evaluations reveal a sharp drop in model performance, highlighting the limitations of current transformer architectures in handling lexical variations.
The Question and Answer Dataset with common knowledge of Synonyms (QADS) is an adversarial machine reading comprehension (MRC) benchmark designed to probe the handling of commonsense synonymy by contemporary deep learning models. Derived from SQuAD 2.0 via systematic, context-sensitive synonym substitution, QADS preserves the original answer span and question intent while modifying only the surface form. This approach isolates the challenge of contextual lexical substitution and exposes the extent to which MRC models depend on lexical matching over genuine semantic comprehension. QADS specifically challenges models such as ELECTRA and BERT, revealing significant limitations in their capacity to integrate commonsense lexical knowledge, particularly regarding synonyms (Lin et al., 2020).
1. Motivation and Design Rationale
The central motivation behind QADS is to assess the ability of MRC systems to understand and reason about synonymy—a fundamental aspect of human language comprehension and a pervasive form of commonsense knowledge. While progress in MRC has been driven by datasets reliant on surface lexical cues, existing benchmarks such as SQuAD 2.0 primarily test answerability through hand-crafted distractors and do not directly evaluate external commonsense or synonym handling. QADS fills this evaluation gap by systematically substituting context-appropriate synonyms into the question text, thus requiring models to leverage semantic equivalence rather than surface string identity. The design principle is to preserve all elements of the original SQuAD instance—passage, answer, and intent—modifying only the lexical realization of the question. This single-dimension perturbation enables precise diagnosis of synonym-related weaknesses in neural reading comprehension systems (Lin et al., 2020).
2. Dataset Construction Methodology
The QADS construction process is anchored in a rigorous pipeline combining automated synonym substitution, contextual word sense disambiguation (WSD), and human validation. The steps are as follows:
- Base Extraction: Starting from the SQuAD 2.0 development set (~11,000 question-paragraph pairs), questions are tokenized via spaCy. Stop words, numerals, and named entities are excluded from substitution eligibility.
- Synonym Retrieval: For each eligible token, all synsets and semantic relations (hypernyms, hyponyms) are pulled from WordNet 3.0. Example sentences are also harvested for richer sense context.
- Enhanced Lesk WSD: To match each substitution candidate w to the appropriate WordNet synset within the full passage context C, the enhanced Lesk algorithm is employed. Each candidate synset s is scored by its overlap with the context:

  score(s) = |C ∩ signature(s)|,

  where signature(s) unites the glosses and examples of s and its semantic neighbors (hypernyms and hyponyms); the highest-scoring synset is selected (Lin et al., 2020).
- Question Generation: Two substitution strategies are used:
- Method A: Replace a single eligible token in a question with a contextually-validated synonym.
- Method B: Replace all eligible tokens with their synonyms. This yields an initial pool of approximately 6,000 questions.
- Crowdsourced Validation: Native-English annotators verify that synonym substitutions are both contextually and semantically valid, flagging ill-formed or out-of-distribution questions. Each 2,000-item block is rechecked for consistency (15–20% overlap), with blocks passing only if agreement is achieved.
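The disambiguation step at the heart of this pipeline can be sketched as follows. This is a minimal illustration of enhanced-Lesk scoring; it uses a tiny hand-built sense inventory in place of WordNet 3.0, and the function names (`signature`, `enhanced_lesk`) and toy glosses are illustrative, not taken from the paper.

```python
def signature(sense):
    # Extended signature: union of the sense's gloss, its example
    # sentences, and glosses of its semantic neighbors.
    words = set(sense["gloss"].lower().split())
    for ex in sense.get("examples", []):
        words |= set(ex.lower().split())
    for nb in sense.get("neighbors", []):
        words |= set(nb.lower().split())
    return words

def enhanced_lesk(context_tokens, senses):
    # Pick the sense whose signature overlaps the passage context most.
    ctx = set(t.lower() for t in context_tokens)
    return max(senses, key=lambda s: len(ctx & signature(s)))

# Toy inventory for "bank" (hypothetical glosses, not real WordNet entries).
senses = [
    {"name": "bank.n.01",
     "gloss": "sloping land beside a body of water",
     "examples": ["they pulled the canoe up on the bank"],
     "neighbors": ["slope incline ridge"]},
    {"name": "bank.n.02",
     "gloss": "a financial institution that accepts deposits",
     "examples": ["he cashed a check at the bank"],
     "neighbors": ["depository institution money"]},
]

context = "He deposited his money at the bank before noon".split()
best = enhanced_lesk(context, senses)
print(best["name"])  # the financial sense wins on context overlap
```

In the actual pipeline the inventory would be pulled from WordNet 3.0 and the context would be the full SQuAD passage; the scoring logic is the same set-overlap computation.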
3. Dataset Statistics and Composition
QADS currently comprises approximately 5,200 adversarial question-answer (QA) pairs, all drawn from the answerable segment of the SQuAD 2.0 dev set. The dataset is structured to match the original in several respects:
- Splits: Used as a held-out test set (~5,200 pairs) or randomly divided (80% train, 20% eval) for cross-validation and fine-tuning protocols.
- Answerability: All questions in QADS are guaranteed answerable; no “no answer” items are added.
- Positional and Length Parity: Because QADS is generated via token-level substitution, question lengths and POS tag distributions mirror those of SQuAD 2.0 (average ≈11 tokens per question, predominance of nouns and verbs).
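The 80/20 random split used in the cross-validation setting can be sketched as below; the function name `split_qads` and the fixed seed are illustrative, not specified in the paper.

```python
import random

def split_qads(pairs, train_frac=0.8, seed=13):
    # Randomly partition QA pairs into train/eval subsets.
    rng = random.Random(seed)
    idx = list(range(len(pairs)))
    rng.shuffle(idx)
    cut = int(train_frac * len(idx))
    train = [pairs[i] for i in idx[:cut]]
    evals = [pairs[i] for i in idx[cut:]]
    return train, evals

# ~5,200 pairs, as in QADS.
pairs = [{"id": i} for i in range(5200)]
train, evals = split_qads(pairs)
print(len(train), len(evals))  # 4160 1040
```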
4. Evaluation Protocols and Baseline Modeling
Performance on QADS is assessed using exact match accuracy (Acc), along with standard span-based precision (P), recall (R), and F1:

P = overlap / |pred|, R = overlap / |gold|, F1 = 2PR / (P + R),

where overlap is the number of tokens shared between the predicted and gold answer spans.
Three ELECTRA variants serve as baselines—Small (12-layer, 256-hidden, 14M parameters), Base (12-layer, 768-hidden, 110M), and Large (24-layer, 1024-hidden, 335M)—each fine-tuned on SQuAD 2.0. Two evaluation settings are reported:
- Direct transfer (“QADS only”): Model evaluated on QADS after SQuAD 2.0 fine-tuning.
- Cross-validation (“SQuAD 2.0 + QADS”): Additional fine-tuning on 80% of QADS, evaluation on 20% hold-out.
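The metrics above can be computed as in the following sketch, which follows standard SQuAD-style span evaluation; the normalization here is simplified to lowercasing and whitespace collapsing (the full evaluation script also strips punctuation and articles).

```python
import collections

def exact_match(pred, gold):
    # Exact string match after simple normalization.
    norm = lambda s: " ".join(s.lower().split())
    return int(norm(pred) == norm(gold))

def span_f1(pred, gold):
    # Token-level precision/recall/F1 over the predicted and gold spans.
    p_toks, g_toks = pred.lower().split(), gold.lower().split()
    common = collections.Counter(p_toks) & collections.Counter(g_toks)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0, 0.0, 0.0
    precision = overlap / len(p_toks)
    recall = overlap / len(g_toks)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

print(exact_match("the Eiffel Tower", "The Eiffel  Tower"))  # 1
print(span_f1("Eiffel Tower in Paris", "the Eiffel Tower"))
```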
Table 1: ELECTRA Performance (Accuracy%)
| Dataset / Model | ELECTRA-Small | ELECTRA-Base | ELECTRA-Large |
|---|---|---|---|
| SQuAD 2.0 | 70.10 | 83.27 | 87.85 |
| QADS (no extra fine-tune) | 20.30 | 22.15 | 22.25 |
| SQuAD 2.0 + QADS (CV) | 21.32 | 25.64 | 25.93 |
5. Empirical Results and Analysis
QADS assessment demonstrates a pronounced performance degradation—ELECTRA-Large, for example, plunges from 87.85% accuracy on SQuAD 2.0 to approximately 22.3% on QADS. Even with additional adversarial fine-tuning, gains are marginal (to 25.9% for ELECTRA-Large). The principal failure modes are:
- Inattention to substituted synonyms, with models defaulting to original spans or yielding “no answer.”
- Difficulties exacerbated by infrequent senses or domain-specific synonyms (e.g., “precipitate”→“accelerate” in chemistry).

These patterns indicate a structural deficiency in integrating external lexical knowledge—specifically, the type of synonymy that QADS targets—within even state-of-the-art transformer architectures (Lin et al., 2020).
6. Implications for MRC and Future Directions
QADS reveals a clear need for principled means of embedding lexical semantics and commonsense into neural MRC models. Prospective remedies include:
- Integrating explicit lexical modules, such as WordNet-derived graphs, inside the encoder architecture.
- Auxiliary training on WSD or synonym-selection to sensitize representations to sense distinctions.
- Extending adversarial evaluation to cover other semantic shifts, such as paraphrase, hypernymy/antonymy, or part-whole substitutions, and introducing a broader battery of “stress-test” benchmarks involving syntactic variation, coreference, and numerical reasoning.
- Automating the pipeline via deep-learning-based WSD to improve generation quality, and scaling adversarial augmentation across MRC corpora.
QADS stands as a targeted adversarial resource, instrumental in quantifying and remedying the latent weaknesses of current MRC paradigms in the domain of synonym-based commonsense reasoning (Lin et al., 2020).