CLARITY Dataset Overview
- The CLARITY Dataset is a hierarchically annotated collection of QA pairs from U.S. presidential interviews, built to evaluate clarity, ambiguity, and evasion in political discourse.
- It employs a hierarchical taxonomy with three clarity labels and nine evasion strategies, supported by detailed annotation guidelines and counterfactual question decomposition.
- The dataset benchmarks human annotators and LLMs using techniques such as chain-of-thought prompting and LoRA tuning, with metrics including macro F1 and accuracy.
The CLARITY Dataset is a hierarchically annotated resource for the evaluation and modeling of response clarity, ambiguity, and evasion in political question-answer (QA) exchanges. Developed for clarity detection and classification benchmarks, the dataset consists of thousands of QA pairs drawn from U.S. presidential interviews and is the basis of the SemEval-2026 shared task on identifying clarity and ambiguity in political discourse (Thomas et al., 2024, Prahallad et al., 13 Jan 2026). It provides fine-grained annotation of evasion strategies and has catalyzed research in prompt-based clarity evaluation for both human and LLM-based annotators.
1. Dataset Composition and Annotation Protocol
The CLARITY Dataset comprises QA pairs extracted from televised interviews, press conferences, and debate transcripts with recent U.S. presidents (Bush, Obama, Trump, Biden; coverage: 2006–2023). After de-duplication, the main release contains 2,047 evaluated QA instances and maintains the following splits:
| Stage | Count |
|---|---|
| Train (initial) | 3,448 |
| Test (initial) | 308 |
| Unique QA (after de-duplication) | 2,061 |
| Scored (final) | 2,047 |
Each example is single-barreled, with “multi-question” exchanges programmatically decomposed into singular QA pairs using gpt-3.5-turbo, supplemented by manual validation and counterfactuals. Three trained annotators (plus one expert resolver) annotate every instance for both clarity and evasion. Gold labels are assigned by majority vote. Reported inter-annotator agreement is substantial: κ ≈ 0.70–0.80 for clarity, κ ≈ 0.60 for evasion (Prahallad et al., 13 Jan 2026).
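The majority-vote gold-labeling with expert tie-breaking described above can be sketched as follows; `gold_label` and its `resolver_label` argument are hypothetical names for illustration, not the dataset's released tooling:

```python
from collections import Counter

def gold_label(annotations, resolver_label=None):
    """Majority vote over annotator labels; ties go to the expert resolver.

    `annotations` holds the three trained annotators' labels for one
    instance; `resolver_label` stands in for the expert resolver's
    adjudication (a hypothetical argument, for illustration only).
    """
    counts = Counter(annotations)
    top, n = counts.most_common(1)[0]
    # A strict majority among the annotators decides the gold label.
    if n > len(annotations) / 2:
        return top
    # Otherwise fall back to the expert resolver's decision.
    return resolver_label
```

With three annotators, a tie can only occur when all three disagree, which is exactly when the expert resolver is needed.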
2. Taxonomy and Label Distribution
CLARITY employs a hierarchical taxonomy:
Top-Level Clarity Labels:
- Clear Reply: A single, unequivocal answer to the question.
- Ambivalent Reply: Reply accommodates multiple plausible interpretations.
- Clear Non-Reply: Explicit rejection, ignorance, or request for clarification.
Fine-Grained Evasion Techniques (nine categories):
- Explicit
- Implicit
- Dodging
- Generalization
- Deflection
- Partial/Half-Answer
- Declining to Answer
- Claims Ignorance
- Clarification Request
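The two-level taxonomy can be encoded as plain label inventories for validating annotations; only the label strings come from the dataset, while the list layout and the `is_valid_pair` helper are illustrative choices:

```python
# Label strings follow the CLARITY taxonomy; the data layout is an
# illustrative encoding, not the dataset's official schema.
CLARITY_LABELS = ["Clear Reply", "Ambivalent Reply", "Clear Non-Reply"]

EVASION_TECHNIQUES = [
    "Explicit", "Implicit", "Dodging", "Generalization", "Deflection",
    "Partial/Half-Answer", "Declining to Answer", "Claims Ignorance",
    "Clarification Request",
]

def is_valid_pair(clarity: str, evasion: str) -> bool:
    """Check that a (clarity, evasion) annotation uses known labels."""
    return clarity in CLARITY_LABELS and evasion in EVASION_TECHNIQUES
```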
The clarity label distribution (n = 2,047):

| Clarity | Count | Percentage |
|---|---|---|
| Clear Reply | 1,000 | 48.8% |
| Ambivalent Reply | 800 | 39.1% |
| Clear Non-Reply | 247 | 12.1% |

The evasion technique distribution (n = 2,047):

| Evasion Technique | Count | Percentage |
|---|---|---|
| Explicit | 500 | 24.4% |
| Implicit | 200 | 9.8% |
| Dodging | 150 | 7.3% |
| Generalization | 220 | 10.7% |
| Deflection | 180 | 8.8% |
| Partial/Half-Answer | 160 | 7.8% |
| Declining to Answer | 250 | 12.2% |
| Claims Ignorance | 200 | 9.8% |
| Clarification Request | 187 | 9.1% |
CLARITY also annotates topic (14 predefined policy domains plus “Other”) for each QA instance (Prahallad et al., 13 Jan 2026).
3. Annotation Guidelines and Quality Metrics
Annotation is governed by a tutorial with worked examples for every leaf label, and multi-expert adjudication resolves disagreements. The QA decomposition protocol includes deliberate counterfactual (misleading) decompositions to keep annotators attentive to context. Agreement is quantified by Fleiss' κ:

κ = (P̄ − P̄_e) / (1 − P̄_e),

with P̄ the mean observed agreement and P̄_e the expected chance agreement (Thomas et al., 2024).
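A minimal, illustrative implementation of Fleiss' κ over per-item category counts (e.g., three annotators distributing their votes across the three clarity labels); the function name is a placeholder, not the dataset's released code:

```python
def fleiss_kappa(ratings):
    """Fleiss' kappa for `ratings`: one row per item, each row giving the
    category counts from a fixed-size rater panel (e.g., 3 annotators)."""
    N = len(ratings)           # number of items
    n = sum(ratings[0])        # raters per item (constant across items)
    k = len(ratings[0])        # number of categories
    # Mean observed per-item agreement, P-bar.
    P_bar = sum(
        (sum(c * c for c in row) - n) / (n * (n - 1)) for row in ratings
    ) / N
    # Expected chance agreement from the marginal category proportions.
    p_j = [sum(row[j] for row in ratings) / (N * n) for j in range(k)]
    P_e = sum(p * p for p in p_j)
    return (P_bar - P_e) / (1 - P_e)
```

Perfect agreement yields κ = 1; systematic disagreement pushes κ below zero.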
Performance of human and LLM classification is measured by accuracy; per-class precision, recall, and F; macro F; weighted F (weighted by class frequency); and Hierarchical Exact Match (HEM), i.e., joint correctness across both hierarchy levels (Prahallad et al., 13 Jan 2026).
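HEM as defined above reduces to exact-match over (clarity, evasion) pairs; a sketch under that reading, with `hierarchical_exact_match` as a hypothetical helper name:

```python
def hierarchical_exact_match(gold, pred):
    """Fraction of instances where BOTH hierarchy levels are correct,
    i.e., the predicted (clarity, evasion) pair matches the gold pair."""
    assert len(gold) == len(pred)
    hits = sum(g == p for g, p in zip(gold, pred))
    return hits / len(gold)
```

Getting the clarity level right but the evasion technique wrong scores zero for that instance, which makes HEM strictly harder than either level alone.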
4. Model Baselines and Evaluation
CLARITY provides strong benchmarks for both encoder-based and LLM architectures. Baselines span:
- Encoder models: DeBERTa-base/large, RoBERTa-base/large, XLNet-base/large, fine-tuned for clarity and evasion.
- LLMs: Llama 2 (7B–70B), Falcon (7B/40B), ChatGPT (gpt-3.5-turbo).
- Adaptation: zero-shot, few-shot (one-shot per leaf label), chain-of-thought (CoT) prompting, LoRA instruction tuning.
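A hypothetical reconstruction of a chain-of-thought clarity prompt in the spirit of the adaptation methods above; the shared task's exact prompt wording is not reproduced here, and the template text is an assumption:

```python
# Illustrative CoT prompt template; the wording is an invented example,
# not the benchmark's actual prompt.
COT_TEMPLATE = """You are labeling a political interview exchange.

Question: {question}
Answer: {answer}

First, reason step by step: does the answer commit to a single
interpretation, leave several interpretations open, or explicitly
give no answer?
Then output exactly one label:
Clear Reply, Ambivalent Reply, or Clear Non-Reply."""

def build_prompt(question: str, answer: str) -> str:
    """Fill the template with one QA pair before sending it to an LLM."""
    return COT_TEMPLATE.format(question=question, answer=answer)
```

Few-shot variants would prepend one worked demonstration per leaf label, matching the "one-shot per leaf label" setting above.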
Key results for 3-way clarity recognition (test set, macro F):
| Model/Method | Accuracy | F (macro) |
|---|---|---|
| ZS ChatGPT (direct) | 0.649 | 0.413 |
| CoT ChatGPT (evasion) | 0.688 | 0.510 |
| LoRA Llama-2-70B (direct) | 0.759 | 0.680 |
| LoRA Llama-2-70B (evasion) | 0.713 | 0.710 |
| XLNet-base (direct) | 0.694 | 0.518 |
| RoBERTa-base (direct) | 0.640 | 0.530 |
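The gap between accuracy and macro F in the table (e.g., 0.649 vs. 0.413 for zero-shot ChatGPT) reflects macro averaging's equal weighting of classes regardless of frequency; a minimal sketch of the metric:

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores.

    Under class imbalance this can sit well below accuracy, since rare
    classes count as much as frequent ones.
    """
    labels = sorted(set(y_true) | set(y_pred))
    f1s = []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)
```

A classifier that always predicts the majority class can score well on accuracy while its macro F1 collapses, which is why the benchmark reports both.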
Prompt design substantially boosts clarity accuracy (e.g., GPT-5.2 with CoT + 3-shot: 0.59 → 0.60 → 0.64). For fine-grained evasion, the highest accuracy is 0.34 (GPT-5.2 CoT). The explicit class is most reliably detected (precision 0.82, recall 0.42); other categories, such as “Partial/Half-Answer” and “Dodging,” remain error-prone.
5. Prompt Engineering and Automatic Model Evaluation
Research demonstrates that chain-of-thought and in-context demonstration prompts systematically improve LLM clarity classification. Explicit reasoning steps facilitate fine-grained distinctions—especially between Ambivalent and Clear Non-Reply categories. However, prompt engineering alone is insufficient to resolve confusions among nuanced evasion strategies (e.g., Implicit vs Dodging) (Prahallad et al., 13 Jan 2026).
Topic identification is also evaluated. Accuracy for topic assignment increases from 60% (simple prompt) to 74% (CoT), but overlapping categories (e.g., Economy vs. Healthcare) remain problematic.
6. Limitations, Observed Patterns, and Future Work
CLARITY is restricted to English and U.S. political discourse; non-verbal and paralinguistic signals, which often delineate evasion, are not captured. The reliance on LLM-driven preprocessing for QA decomposition can introduce errors. Annotator expertise is intentionally generalized, but future releases may explore multi-perspective or expert-only annotation protocols.
Patterns noted include higher explicit-reply rates for single-part QAs, and presidency-based variability (e.g., Trump providing the most explicit replies). Answer grounding with multi-part QAs disproportionately challenges LLMs—in some cases lowering F by up to 0.16—whereas human annotator agreement is less affected (Δκ ≈ 0.03). Use of person names increases model error, suggesting significant reliance on pre-encoded knowledge in LLMs (Thomas et al., 2024).
Directions for further study include cross-lingual extension, multimodal (audio/video) integration for richer evasion detection, and adversarial settings where models themselves generate ambiguous answers.
7. Applications and Usage
CLARITY enables fine-grained, large-scale analysis of evasive and ambiguous tactics in political interviews, robust AI fact-checking and monitoring pipelines, and research into QA model robustness. The taxonomy and self-contained evaluation framework provide a rigorous empirical foundation for measuring “clarity” and “evasion” across models, datasets, and languages, with all code, prompts, and data splits made publicly available (Thomas et al., 2024, Prahallad et al., 13 Jan 2026).