CLARITY Dataset Overview
- The CLARITY Dataset is a hierarchically annotated collection of QA pairs from U.S. presidential interviews, built to evaluate clarity, ambiguity, and evasion in political discourse.
- It employs a hierarchical taxonomy with three clarity labels and nine evasion strategies, supported by detailed annotation guidelines and counterfactual question decomposition.
- The dataset benchmarks human annotators and LLMs using techniques such as chain-of-thought prompting and LoRA tuning, with metrics including macro F1 and accuracy.
The CLARITY Dataset is a hierarchically annotated resource for the evaluation and modeling of response clarity, ambiguity, and evasion in political question-answer (QA) exchanges. Developed for clarity detection and classification benchmarks, the dataset consists of thousands of QA pairs drawn from U.S. presidential interviews and is the basis of the SemEval-2026 shared task on identifying clarity and ambiguity in political discourse (Thomas et al., 2024, Prahallad et al., 13 Jan 2026). It provides fine-grained annotation of evasion strategies and has catalyzed research in prompt-based clarity evaluation for both human and LLM-based annotators.
1. Dataset Composition and Annotation Protocol
The CLARITY Dataset comprises QA pairs extracted from televised interviews, press conferences, and debate transcripts with recent U.S. presidents (Bush, Obama, Trump, Biden; coverage: 2006–2023). After de-duplication, the main release contains 2,047 evaluated QA instances and maintains the following splits:
| Stage | Count |
|---|---|
| Train (initial) | 3,448 |
| Test (initial) | 308 |
| Unique QA (after de-duplication) | 2,061 |
| Scored (final) | 2,047 |
Each example is single-barreled, with “multi-question” exchanges programmatically decomposed into singular QA pairs using gpt-3.5-turbo, supplemented by manual validation and counterfactuals. Three trained annotators (plus one expert resolver) annotate every instance for both clarity and evasion. Gold labels are assigned by majority vote. Reported inter-annotator agreement is substantial: κ ≈ 0.70–0.80 for clarity, κ ≈ 0.60 for evasion (Prahallad et al., 13 Jan 2026).
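The majority-vote gold-labeling with expert tie-breaking described above can be sketched as follows; `gold_label` and its `resolver_label` argument are hypothetical names for illustration, not the dataset's released tooling:

```python
from collections import Counter

def gold_label(annotations, resolver_label=None):
    """Majority vote over annotator labels; ties go to the expert resolver.

    `annotations` holds the three trained annotators' labels for one
    instance; `resolver_label` stands in for the expert resolver's
    adjudication (a hypothetical argument, for illustration only).
    """
    counts = Counter(annotations)
    top, n = counts.most_common(1)[0]
    # A strict majority among the annotators decides the gold label.
    if n > len(annotations) / 2:
        return top
    # Otherwise fall back to the expert resolver's decision.
    return resolver_label
```

With three annotators, a tie can only occur when all three disagree, which is exactly when the expert resolver is needed.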
2. Taxonomy and Label Distribution
CLARITY employs a hierarchical taxonomy:
Top-Level Clarity Labels:
- Clear Reply: A single, unequivocal answer to the question.
- Ambivalent Reply: Reply accommodates multiple plausible interpretations.
- Clear Non-Reply: Explicit rejection, ignorance, or request for clarification.
Fine-Grained Evasion Techniques (nine categories):
- Explicit
- Implicit
- Dodging
- Generalization
- Deflection
- Partial/Half-Answer
- Declining to Answer
- Claims Ignorance
- Clarification Request
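The two-level taxonomy can be encoded as plain label inventories for validating annotations; only the label strings come from the dataset, while the list layout and the `is_valid_pair` helper are illustrative choices:

```python
# Label strings follow the CLARITY taxonomy; the data layout is an
# illustrative encoding, not the dataset's official schema.
CLARITY_LABELS = ["Clear Reply", "Ambivalent Reply", "Clear Non-Reply"]

EVASION_TECHNIQUES = [
    "Explicit", "Implicit", "Dodging", "Generalization", "Deflection",
    "Partial/Half-Answer", "Declining to Answer", "Claims Ignorance",
    "Clarification Request",
]

def is_valid_pair(clarity: str, evasion: str) -> bool:
    """Check that a (clarity, evasion) annotation uses known labels."""
    return clarity in CLARITY_LABELS and evasion in EVASION_TECHNIQUES
```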
The clarity label distribution (n = 2,047):

| Clarity | Count | Percentage |
|---|---|---|
| Clear Reply | 1,000 | 48.8% |
| Ambivalent Reply | 800 | 39.1% |
| Clear Non-Reply | 247 | 12.1% |

The evasion technique distribution (n = 2,047):

| Evasion Technique | Count | Percentage |
|---|---|---|
| Explicit | 500 | 24.4% |
| Implicit | 200 | 9.8% |
| Dodging | 150 | 7.3% |
| Generalization | 220 | 10.7% |
| Deflection | 180 | 8.8% |
| Partial/Half-Answer | 160 | 7.8% |
| Declining to Answer | 250 | 12.2% |
| Claims Ignorance | 200 | 9.8% |
| Clarification Request | 187 | 9.1% |
CLARITY also annotates topic (14 predefined policy domains plus “Other”) for each QA instance (Prahallad et al., 13 Jan 2026).
3. Annotation Guidelines and Quality Metrics
Annotation is governed by a tutorial with worked examples for every leaf label, and multi-expert adjudication resolves disagreements. The QA decomposition protocol includes deliberate counterfactual (misleading) decompositions to keep annotators attentive to context. Agreement is quantified by Fleiss' κ:

κ = (P̄ − P̄_e) / (1 − P̄_e),

with P̄ the mean observed agreement and P̄_e the expected chance agreement (Thomas et al., 2024).
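A minimal, illustrative implementation of Fleiss' κ over per-item category counts (e.g., three annotators distributing their votes across the three clarity labels); the function name is a placeholder, not the dataset's released code:

```python
def fleiss_kappa(ratings):
    """Fleiss' kappa for `ratings`: one row per item, each row giving the
    category counts from a fixed-size rater panel (e.g., 3 annotators)."""
    N = len(ratings)           # number of items
    n = sum(ratings[0])        # raters per item (constant across items)
    k = len(ratings[0])        # number of categories
    # Mean observed per-item agreement, P-bar.
    P_bar = sum(
        (sum(c * c for c in row) - n) / (n * (n - 1)) for row in ratings
    ) / N
    # Expected chance agreement from the marginal category proportions.
    p_j = [sum(row[j] for row in ratings) / (N * n) for j in range(k)]
    P_e = sum(p * p for p in p_j)
    return (P_bar - P_e) / (1 - P_e)
```

Perfect agreement yields κ = 1; systematic disagreement pushes κ below zero.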
Performance of human and LLM classification is measured by accuracy; per-class precision, recall, and F; macro F; weighted F (weighted by class frequency); and Hierarchical Exact Match (HEM), i.e., joint correctness across both hierarchy levels (Prahallad et al., 13 Jan 2026).
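HEM as defined above reduces to exact-match over (clarity, evasion) pairs; a sketch under that reading, with `hierarchical_exact_match` as a hypothetical helper name:

```python
def hierarchical_exact_match(gold, pred):
    """Fraction of instances where BOTH hierarchy levels are correct,
    i.e., the predicted (clarity, evasion) pair matches the gold pair."""
    assert len(gold) == len(pred)
    hits = sum(g == p for g, p in zip(gold, pred))
    return hits / len(gold)
```

Getting the clarity level right but the evasion technique wrong scores zero for that instance, which makes HEM strictly harder than either level alone.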
4. Model Baselines and Evaluation
CLARITY provides strong benchmarks for both encoder-based and LLM architectures. Baselines span:
- Encoder models: DeBERTa-base/large, RoBERTa-base/large, XLNet-base/large, fine-tuned for clarity and evasion.
- LLMs: Llama 2 (7B–70B), Falcon (7B/40B), ChatGPT (gpt-3.5-turbo).
- Adaptation: zero-shot, few-shot (one-shot per leaf label), chain-of-thought (CoT) prompting, LoRA instruction tuning.
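A hypothetical reconstruction of a chain-of-thought clarity prompt in the spirit of the adaptation methods above; the shared task's exact prompt wording is not reproduced here, and the template text is an assumption:

```python
# Illustrative CoT prompt template; the wording is an invented example,
# not the benchmark's actual prompt.
COT_TEMPLATE = """You are labeling a political interview exchange.

Question: {question}
Answer: {answer}

First, reason step by step: does the answer commit to a single
interpretation, leave several interpretations open, or explicitly
give no answer?
Then output exactly one label:
Clear Reply, Ambivalent Reply, or Clear Non-Reply."""

def build_prompt(question: str, answer: str) -> str:
    """Fill the template with one QA pair before sending it to an LLM."""
    return COT_TEMPLATE.format(question=question, answer=answer)
```

Few-shot variants would prepend one worked demonstration per leaf label, matching the "one-shot per leaf label" setting above.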
Key results for 3-way clarity recognition (test set, macro F):
| Model/Method | Accuracy | F (macro) |
|---|---|---|
| ZS ChatGPT (direct) | 0.649 | 0.413 |
| CoT ChatGPT (evasion) | 0.688 | 0.510 |
| LoRA Llama-2-70B (direct) | 0.759 | 0.680 |
| LoRA Llama-2-70B (evasion) | 0.713 | 0.710 |
| XLNet-base (direct) | 0.694 | 0.518 |
| RoBERTa-base (direct) | 0.640 | 0.530 |
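The gap between accuracy and macro F in the table (e.g., 0.649 vs. 0.413 for zero-shot ChatGPT) reflects macro averaging's equal weighting of classes regardless of frequency; a minimal sketch of the metric:

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores.

    Under class imbalance this can sit well below accuracy, since rare
    classes count as much as frequent ones.
    """
    labels = sorted(set(y_true) | set(y_pred))
    f1s = []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)
```

A classifier that always predicts the majority class can score well on accuracy while its macro F1 collapses, which is why the benchmark reports both.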
Prompt design substantially boosts clarity accuracy (e.g., GPT-5.2 with CoT + 3-shot: 0.59 → 0.60 → 0.64). For fine-grained evasion, the highest accuracy is 0.34 (GPT-5.2 CoT). The explicit class is most reliably detected (precision 0.82, recall 0.42); other categories, such as “Partial/Half-Answer” and “Dodging,” remain error-prone.
5. Prompt Engineering and Automatic Model Evaluation
Research demonstrates that chain-of-thought and in-context demonstration prompts systematically improve LLM clarity classification. Explicit reasoning steps facilitate fine-grained distinctions—especially between Ambivalent and Clear Non-Reply categories. However, prompt engineering alone is insufficient to resolve confusions among nuanced evasion strategies (e.g., Implicit vs Dodging) (Prahallad et al., 13 Jan 2026).
Topic identification is also evaluated. Accuracy for topic assignment increases from 60% (simple prompt) to 74% (CoT), but overlapping categories (e.g., Economy vs. Healthcare) remain problematic.
6. Limitations, Observed Patterns, and Future Work
CLARITY is restricted to English and U.S. political discourse; non-verbal and paralinguistic signals, which often delineate evasion, are not captured. The reliance on LLM-driven preprocessing for QA decomposition can introduce errors. Annotator expertise is intentionally generalized, but future releases may explore multi-perspective or expert-only annotation protocols.
Patterns noted include higher explicit-reply rates for single-part QAs, and presidency-based variability (e.g., Trump providing the most explicit replies). Answer grounding with multi-part QAs disproportionately challenges LLMs—in some cases lowering F by up to 0.16—whereas human annotator agreement is less affected (Δκ ≈ 0.03). Use of person names increases model error, suggesting significant reliance on pre-encoded knowledge in LLMs (Thomas et al., 2024).
Directions for further study include cross-lingual extension, multimodal (audio/video) integration for richer evasion detection, and adversarial settings where models themselves generate ambiguous answers.
7. Applications and Usage
CLARITY enables fine-grained, large-scale analysis of evasive and ambiguous tactics in political interviews, robust AI fact-checking and monitoring pipelines, and research into QA model robustness. The taxonomy and self-contained evaluation framework provide a rigorous empirical foundation for measuring “clarity” and “evasion” across models, datasets, and languages, with all code, prompts, and data splits made publicly available (Thomas et al., 2024, Prahallad et al., 13 Jan 2026).