
CLARITY Dataset Overview

Updated 20 January 2026
  • The CLARITY Dataset is a hierarchically annotated collection of U.S. presidential QA pairs labeled for clarity, ambiguity, and evasion in political discourse.
  • It employs a hierarchical taxonomy with three clarity labels and nine evasion strategies, supported by detailed annotation guidelines and counterfactual decomposition.
  • The dataset benchmarks human annotators and LLMs using techniques such as chain-of-thought prompting and LoRA tuning, with metrics such as macro F1 and accuracy.

The CLARITY Dataset is a hierarchically annotated resource for the evaluation and modeling of response clarity, ambiguity, and evasion in political question-answer (QA) exchanges. Developed for clarity detection and classification benchmarks, the dataset consists of thousands of QA pairs drawn from U.S. presidential interviews and is the basis of the SemEval-2026 shared task on identifying clarity and ambiguity in political discourse (Thomas et al., 2024, Prahallad et al., 13 Jan 2026). It provides fine-grained annotation of evasion strategies and has catalyzed research in prompt-based clarity evaluation for both human and LLM-based annotators.

1. Dataset Composition and Annotation Protocol

The CLARITY Dataset comprises QA pairs extracted from televised interviews, press conferences, and debate transcripts with recent U.S. presidents (Bush, Obama, Trump, Biden; coverage: 2006–2023). After de-duplication, the main release contains 2,047 evaluated QA instances and maintains the following splits:

Split       Initial   Cleaned
Train       3,448     —
Test        308       —
Unique QA   —         2,061
Scored      —         2,047

Each example is single-barreled: “multi-question” exchanges are programmatically decomposed into single-question QA pairs using gpt-3.5-turbo, supplemented by manual validation and counterfactual checks. Three trained annotators (plus one expert resolver) annotate every instance for both clarity and evasion, and gold labels are assigned by majority vote. Reported inter-annotator agreement is substantial: κ ≈ 0.70–0.80 for clarity and κ ≈ 0.60 for evasion (Prahallad et al., 13 Jan 2026).
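
The majority-vote aggregation described above can be sketched in a few lines (a minimal illustration; the function name and tie-handling convention are assumptions, not part of the dataset release):

```python
from collections import Counter

def gold_label(votes):
    """Aggregate annotator votes into a gold label by majority vote.

    Returns the winning label, or None when there is no strict majority,
    signalling that the expert resolver must adjudicate. (Hypothetical
    helper; the official release may aggregate differently.)
    """
    counts = Counter(votes)
    label, top = counts.most_common(1)[0]
    # With three annotators, a strict majority needs at least two votes.
    if top > len(votes) / 2:
        return label
    return None  # escalate to the expert resolver

print(gold_label(["Clear Reply", "Clear Reply", "Ambivalent Reply"]))  # Clear Reply
print(gold_label(["Explicit", "Dodging", "Deflection"]))               # None
```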

2. Taxonomy and Label Distribution

CLARITY employs a hierarchical taxonomy:

Top-Level Clarity Labels:

  1. Clear Reply: A single, unequivocal answer to the question.
  2. Ambivalent Reply: A reply that accommodates multiple plausible interpretations.
  3. Clear Non-Reply: An explicit refusal, a claim of ignorance, or a request for clarification.

Fine-Grained Evasion Techniques (nine categories):

  • Explicit
  • Implicit
  • Dodging
  • Generalization
  • Deflection
  • Partial/Half-Answer
  • Declining to Answer
  • Claims Ignorance
  • Clarification Request
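
Each CLARITY instance pairs one top-level clarity label with one fine-grained evasion technique. A minimal schema sketch (field names are assumptions chosen for illustration; the released files may use different keys):

```python
from dataclasses import dataclass
from typing import Optional

CLARITY_LABELS = ("Clear Reply", "Ambivalent Reply", "Clear Non-Reply")
EVASION_LABELS = (
    "Explicit", "Implicit", "Dodging", "Generalization", "Deflection",
    "Partial/Half-Answer", "Declining to Answer", "Claims Ignorance",
    "Clarification Request",
)

@dataclass
class QAInstance:
    question: str
    answer: str
    clarity: str                  # one of CLARITY_LABELS (top level)
    evasion: str                  # one of EVASION_LABELS (leaf level)
    topic: Optional[str] = None   # one of 14 policy domains or "Other"

    def __post_init__(self):
        # Validate that both hierarchy levels use taxonomy labels.
        assert self.clarity in CLARITY_LABELS, self.clarity
        assert self.evasion in EVASION_LABELS, self.evasion
```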

The label distribution (n=2,047):

Clarity label        Count   %
Clear Reply          1,000   48.8
Ambivalent Reply     800     39.1
Clear Non-Reply      247     12.1

Evasion technique      Count   %
Explicit               500     24.4
Implicit               200     9.8
Dodging                150     7.3
Generalization         220     10.7
Deflection             180     8.8
Partial/Half-Answer    160     7.8
Declining to Answer    250     12.2
Claims Ignorance       200     9.8
Clarification Request  187     9.1

CLARITY also annotates topic (14 predefined policy domains plus “Other”) for each QA instance (Prahallad et al., 13 Jan 2026).

3. Annotation Guidelines and Quality Metrics

Annotation is governed by a tutorial with model examples for all leaf labels, and multi-expert adjudication resolves disagreements. The QA decomposition protocol includes deliberate counterfactual (misleading) decompositions to verify that annotators attend to the conversational context. Agreement is quantified by Fleiss’ κ:

\kappa = \frac{\bar{P} - \bar{P}_e}{1 - \bar{P}_e}

with \bar{P} the mean observed per-item agreement and \bar{P}_e the expected chance agreement (Thomas et al., 2024).
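
The formula translates directly into code, assuming the ratings arrive as an N × k matrix of per-item vote counts (an input convention chosen for illustration):

```python
def fleiss_kappa(ratings):
    """Fleiss' kappa for N items rated by n annotators into k categories.

    `ratings` is an N x k list of lists of vote counts per item;
    every row must sum to the same number of annotators n.
    """
    n = sum(ratings[0])   # annotators per item
    N = len(ratings)
    k = len(ratings[0])
    # Observed agreement P_i per item, averaged to P-bar.
    P_i = [(sum(c * c for c in row) - n) / (n * (n - 1)) for row in ratings]
    P_bar = sum(P_i) / N
    # Chance agreement P_e from the marginal category proportions.
    p_j = [sum(row[j] for row in ratings) / (N * n) for j in range(k)]
    P_e = sum(p * p for p in p_j)
    return (P_bar - P_e) / (1 - P_e)

# Perfect agreement on two items yields kappa = 1.
print(fleiss_kappa([[3, 0], [0, 3]]))  # 1.0
```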

Performance of human and LLM classification is measured by accuracy, per-class precision/recall/F1, macro F1, weighted F1 (weighted by class frequency), and Hierarchical Exact Match (HEM), i.e., joint correctness across both hierarchy levels (Prahallad et al., 13 Jan 2026).
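
HEM in particular is simple to state in code: a prediction counts only when both the clarity level and the evasion level match the gold pair (a minimal sketch, not the official scorer):

```python
def hierarchical_exact_match(preds, golds):
    """Fraction of instances where BOTH hierarchy levels are correct.

    Each prediction/gold is a (clarity, evasion) pair; a sketch of HEM
    as described in the text, not the official evaluation script.
    """
    hits = sum(p == g for p, g in zip(preds, golds))
    return hits / len(golds)

preds = [("Clear Reply", "Explicit"), ("Ambivalent Reply", "Dodging")]
golds = [("Clear Reply", "Explicit"), ("Ambivalent Reply", "Implicit")]
# The second pair misses at the evasion level, so only 1 of 2 counts.
print(hierarchical_exact_match(preds, golds))  # 0.5
```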

4. Model Baselines and Evaluation

CLARITY provides strong benchmarks for both encoder-based and LLM architectures. Baselines span:

  • Encoder models: DeBERTa-base/large, RoBERTa-base/large, XLNet-base/large, fine-tuned for clarity and evasion.
  • LLMs: Llama 2 (7B–70B), Falcon (7B/40B), ChatGPT (gpt-3.5-turbo).
  • Adaptation: zero-shot, few-shot (one-shot per leaf label), chain-of-thought (CoT) prompting, LoRA instruction tuning.
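
The direct and chain-of-thought prompting styles might look roughly as follows (hypothetical templates for illustration; the shared task's actual prompts are not reproduced in this overview):

```python
# Hypothetical zero-shot ("direct") and CoT prompt templates for the
# 3-way clarity task; wording and structure are illustrative only.
DIRECT_PROMPT = """You are analysing a political interview.
Question: {question}
Answer: {answer}
Classify the answer as exactly one of:
Clear Reply, Ambivalent Reply, Clear Non-Reply.
Label:"""

COT_PROMPT = """You are analysing a political interview.
Question: {question}
Answer: {answer}
First, identify which of the nine evasion techniques (if any) the answer
uses. Then, reasoning step by step, classify the answer as one of:
Clear Reply, Ambivalent Reply, Clear Non-Reply.
Reasoning:"""

prompt = DIRECT_PROMPT.format(
    question="Will you raise taxes?",
    answer="We will see what the numbers say.",
)
print(prompt)
```

The "evasion" prompting variants in the results table below correspond to the CoT style, which elicits the evasion technique before committing to a clarity label.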

Key results for 3-way clarity recognition (test set):

Model / Method               Accuracy   Macro F1
ZS ChatGPT (direct)          0.649      0.413
CoT ChatGPT (evasion)        0.688      0.510
LoRA Llama-2-70B (direct)    0.759      0.680
LoRA Llama-2-70B (evasion)   0.713      0.710
XLNet-base (direct)          0.694      0.518
RoBERTa-base (direct)        0.640      0.530

Prompt design substantially boosts clarity accuracy: for GPT-5.2, accuracy rises from 0.59 (simple prompt) to 0.60 (CoT) and 0.64 (CoT + 3-shot). For fine-grained evasion classification, the highest accuracy is 0.34 (GPT-5.2 with CoT). The Explicit class is detected most reliably (precision 0.82, recall 0.42); other categories, such as “Partial/Half-Answer” and “Dodging,” remain error-prone.

5. Prompt Engineering and Automatic Model Evaluation

Research demonstrates that chain-of-thought and in-context demonstration prompts systematically improve LLM clarity classification. Explicit reasoning steps facilitate fine-grained distinctions—especially between Ambivalent and Clear Non-Reply categories. However, prompt engineering alone is insufficient to resolve confusions among nuanced evasion strategies (e.g., Implicit vs Dodging) (Prahallad et al., 13 Jan 2026).

Topic identification is also evaluated. Accuracy for topic assignment increases from 60% (simple prompt) to 74% (CoT), but overlapping categories (e.g., Economy vs. Healthcare) remain problematic.

6. Limitations, Observed Patterns, and Future Work

CLARITY is restricted to English and U.S. political discourse; non-verbal and paralinguistic signals, which often delineate evasion, are not captured. The reliance on LLM-driven preprocessing for QA decomposition can introduce errors. Annotator expertise is intentionally generalized, but future releases may explore multi-perspective or expert-only annotation protocols.

Observed patterns include higher explicit-reply rates for single-part QAs and presidency-based variability (e.g., Trump provides the most explicit replies). Grounding answers in multi-part QAs disproportionately challenges LLMs, in some cases lowering F1 by up to 0.16, whereas human annotator agreement is less affected (Δκ ≈ 0.03). The presence of person names increases model error, suggesting heavy reliance on pre-encoded knowledge in LLMs (Thomas et al., 2024).

Directions for further study include cross-lingual extension, multimodal (audio/video) integration for richer evasion detection, and adversarial settings where models themselves generate ambiguous answers.

7. Applications and Usage

CLARITY enables fine-grained, large-scale analysis of evasive and ambiguous tactics in political interviews, robust AI fact-checking and monitoring pipelines, and research into QA model robustness. The taxonomy and self-contained evaluation framework provide a rigorous empirical foundation for measuring “clarity” and “evasion” across models, datasets, and languages, with all code, prompts, and data splits made publicly available (Thomas et al., 2024, Prahallad et al., 13 Jan 2026).
