Conditional Divergent Association Task (CDAT)
- CDAT is an automated evaluation paradigm that integrates context cues and statistical controls to assess creativity in LLM outputs based on both novelty and appropriateness.
- It employs a minimal contextual constraint—a single-word cue—to ensure generated outputs remain semantically associated, thereby preventing off-topic word selection.
- CDAT utilizes embedding-based metrics and robust statistical tests to benchmark creative trade-offs across various LLM families in a reproducible, high-throughput manner.
The Conditional Divergent Association Task (CDAT) is an automated evaluation paradigm that operationalizes a multidimensional theory of creativity—incorporating both novelty and appropriateness—within the assessment of LLMs. CDAT addresses key theoretical and methodological limitations found in traditional divergent association metrics, especially the standard Divergent Association Task (DAT), by introducing contextual constraints and dedicated statistical controls. Its design facilitates reproducible, high-throughput benchmarking of model outputs, enabling nuanced analysis of creative trade-offs and systematic comparison across model classes (Nakajima et al., 28 Jan 2026).
1. Motivation and Theoretical Foundations
Research in human creativity consistently emphasizes the dual requirements of novelty (originality, unexpectedness) and appropriateness (relevance, contextuality) for outputs to be deemed genuinely creative (cf. Runco & Jaeger, 2012; Cropley, 2006; Diedrich et al., 2015). The original Divergent Association Task (DAT), operationalized in computational studies as well as psychological work (Olson et al., 2021), quantifies creativity solely as the average semantic distance among a list of generated words. This focus on divergence neglects appropriateness, making DAT vulnerable to spurious inflation: random word samplers or models explicitly optimizing divergence can outperform humans or LLMs, despite providing outputs that are devoid of meaning or relevance.
CDAT introduces a minimal yet robust contextual constraint—a prompt cue—to enforce thematic association among output terms. High novelty is only rewarded when coupled with sufficient appropriateness, thus separating random noise from genuine creative behavior in both humans and machines. This reformulation directly mirrors psychological findings that true creativity occurs between the extremes of randomness (maximum novelty, minimum appropriateness) and conformity (maximum appropriateness, minimum novelty), addressing the core theoretical gap in previous automated measures of creativity (Nakajima et al., 28 Jan 2026).
2. Formal Definition and Task Mechanics
Let $c$ denote a single-word cue noun, and let $W = \{w_1, \ldots, w_n\}$ be a list of valid generated single-word common nouns (with $n = 10$, following Olson et al., 2021). Words are embedded using a sentence-BERT model, producing vectors $\mathbf{v}_c, \mathbf{v}_1, \ldots, \mathbf{v}_n$. CDAT employs the following metrics:
- Novelty: For each unordered pair $(w_i, w_j)$, the pairwise semantic distance is $d_{ij} = 1 - \cos(\mathbf{v}_i, \mathbf{v}_j)$.
The list’s overall novelty is the mean over all $\binom{n}{2}$ unordered pairs:

$$\mathrm{Nov}(W) = \binom{n}{2}^{-1} \sum_{1 \le i < j \le n} \left(1 - \cos(\mathbf{v}_i, \mathbf{v}_j)\right)$$
- Appropriateness: For each word $w_i$, the cue association is $a_i = \cos(\mathbf{v}_c, \mathbf{v}_i)$.
The list’s overall appropriateness is:

$$\mathrm{App}(W \mid c) = \frac{1}{n} \sum_{i=1}^{n} \cos(\mathbf{v}_c, \mathbf{v}_i)$$
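These two metrics can be sketched directly from the definitions above. The sketch below assumes pre-computed sentence-BERT embeddings; plain Python lists stand in for the vectors, and the function names are illustrative, not from the paper.

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def novelty(word_vecs):
    """Mean pairwise cosine distance over all unordered pairs."""
    n = len(word_vecs)
    dists = [1 - cosine(word_vecs[i], word_vecs[j])
             for i in range(n) for j in range(i + 1, n)]
    return sum(dists) / len(dists)

def appropriateness(cue_vec, word_vecs):
    """Mean cosine similarity of each word to the cue."""
    return sum(cosine(cue_vec, v) for v in word_vecs) / len(word_vecs)
```

For orthogonal word embeddings, novelty is maximal (distance 1) while appropriateness to a cue aligned with one of them averages out, illustrating the trade-off the task is designed to probe.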
To prevent the pursuit of novelty from incentivizing off-topic word selection, CDAT applies a minimal appropriateness gate. For each model $m$ and cue $c$, a two-sided Welch’s t-test (with FDR correction across cues) must establish that

$$\overline{\mathrm{App}}_{m,c} > \overline{\mathrm{App}}_{\mathrm{Random},c},$$

where “Random” refers to uniform sampling of nouns from WordNet, independent of the cue.
Only models passing this appropriateness criterion are scored for creative novelty. The scalar CDAT score for model $m$ is the conditional mean novelty over the set $\mathcal{C}_m$ of cues for which the gate is passed:

$$\mathrm{CDAT}(m) = \frac{1}{|\mathcal{C}_m|} \sum_{c \in \mathcal{C}_m} \overline{\mathrm{Nov}}_{m,c}$$
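The gate and conditional score can be sketched as follows, assuming per-cue samples of appropriateness values for the model and the Random baseline. SciPy supplies the Welch's t-test; the Benjamini-Hochberg step is hand-rolled. The data layout is an illustrative assumption, not the paper's implementation.

```python
import numpy as np
from scipy.stats import ttest_ind

def bh_reject(pvals, alpha=0.05):
    """Benjamini-Hochberg step-up procedure: boolean rejection mask."""
    p = np.asarray(pvals)
    order = np.argsort(p)
    m = len(p)
    thresholds = alpha * np.arange(1, m + 1) / m
    below = p[order] <= thresholds
    k = int(np.max(np.nonzero(below)[0])) + 1 if below.any() else 0
    reject = np.zeros(m, dtype=bool)
    reject[order[:k]] = True
    return reject

def cdat_score(model_app, random_app, model_nov, alpha=0.05):
    """Conditional mean novelty: average per-cue novelty over cues whose
    appropriateness significantly exceeds the Random baseline
    (Welch's t-test, FDR-controlled across cues)."""
    pvals = []
    for i in range(len(model_app)):
        _, p = ttest_ind(model_app[i], random_app[i], equal_var=False)
        # require the difference to be in the right direction
        higher = np.mean(model_app[i]) > np.mean(random_app[i])
        pvals.append(p if higher else 1.0)
    gate = bh_reject(pvals, alpha)
    passed = [model_nov[i] for i, g in enumerate(gate) if g]
    return float(np.mean(passed)) if passed else float("nan")
```

Cues failing the gate contribute nothing to the score, so a model cannot inflate its CDAT value by emitting divergent but off-topic lists.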
Model performance may also be visualized in the 2D plane of (appropriateness, novelty), extracting a Pareto front of non-dominated points to analyze optimal trade-offs (Nakajima et al., 28 Jan 2026).
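Extracting the Pareto front of non-dominated points in the (appropriateness, novelty) plane is straightforward; this sketch uses hypothetical model coordinates.

```python
def pareto_front(points):
    """Return points not dominated by any other point, where (a, n)
    dominates (a', n') if a >= a' and n >= n' with at least one strict."""
    front = []
    for i, (a_i, n_i) in enumerate(points):
        dominated = any(
            a_j >= a_i and n_j >= n_i and (a_j > a_i or n_j > n_i)
            for j, (a_j, n_j) in enumerate(points) if j != i
        )
        if not dominated:
            front.append((a_i, n_i))
    return front
```

The quadratic scan is adequate here since only a handful of models are compared per plot.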
3. Methodological Workflow and Evaluation Protocol
Cue Selection: Start from the 10,000 most frequent Brown-corpus words, filtering to singular common nouns via NLTK and WordNet. Remaining verbs or proper nouns are removed using GPT-4.1 nano, yielding a CDAT cue set of 539 nouns.
Prompting and Sampling: Sixteen LLMs (including GPT-3.5, GPT-4, Claude 3 Opus, Gemini 2.0 Flash, Llama 3.1 8B Instruct) are evaluated with the prompt: “Please enter 10 words that are as different from each other as possible … yet semantically associated with the following cue word: {cue}.” Sampling is performed at a range of temperatures; each model responds to all cues at least once.
Baselines:
- Random: 500 repetitions of 10 nouns drawn uniformly from WordNet, independent of the cue.
- Common: 1 run per cue, with GPT-4.1 nano instructed to list the 10 words most semantically associated with the cue.
Scoring: Responses undergo a validity filter (single-word, common nouns, no duplicates, etc.), followed by computation of novelty and appropriateness, application of the appropriateness gate, and reporting of the conditional mean novelty (CDAT score). Model points in (Appropriateness, Novelty) space and their Pareto fronts are visualized. Two non-primary diagnostics—elbow distance to the Random–Common trade-off line and Euclidean distance to an aggregated human reference—offer further qualitative insight (Nakajima et al., 28 Jan 2026).
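The validity filter can be sketched as a simple predicate. This version checks only list length, single-token form, and duplicates; the common-noun check would require a POS tagger and is omitted here.

```python
def valid_response(words, n=10):
    """Accept a response only if it has exactly n entries, each a
    single alphabetic token, with no case-insensitive duplicates."""
    if len(words) != n:
        return False
    if len(set(w.lower() for w in words)) != n:
        return False  # duplicates present
    return all(w.isalpha() for w in words)  # rejects multi-word entries
```

Invalid responses are dropped before novelty and appropriateness are computed, so malformed output cannot distort a model's score.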
| CDAT Component | Description | Implementation |
|---|---|---|
| Cue selection | 539 corpus-derived, filtered cue nouns | Corpus, NLTK, GPT |
| Models evaluated | 16 LLMs, various sizes/families | Prompt-based |
| Appropriateness gate | Welch’s t-test vs. random baseline | FDR-corrected |
4. Rationale for CDAT Over DAT
DAT rewards semantic dispersion among word choices with no constraint on contextual relevance. Empirical findings show both the random baseline (uniform WordNet sampling) and a DAT-maximizing baseline outperform all state-of-the-art LLMs on DAT, demonstrating the metric's lack of discriminative power regarding genuine creativity. CDAT’s novelty is conditional: only lists exceeding the contextual appropriateness of random word lists are scored, eliminating "cheating" by off-topic dispersion and preventing models from gaming novelty at the cost of meaning.
This structure mirrors findings in human creativity research where random outputs rate high for novelty but low for appropriateness, and highly associated outputs display the reverse. By applying an appropriateness gate, CDAT ensures that model outputs must balance novelty and relevance, restoring the core duality of creativity into automated evaluation frameworks (Nakajima et al., 28 Jan 2026).
5. Empirical Findings and Interpretive Implications
Key findings under the CDAT evaluation protocol include:
- DAT Invalidity: Under SBERT embeddings, no LLM family surpasses the Random baseline on DAT, and the DAT-maximization prompt outperforms nearly all LLMs.
- Universal Appropriateness Gate Passing: Every model tested significantly exceeds the appropriateness of the Random baseline (FDR-corrected Welch’s t-tests).
- Family-Level Trade-offs: In the (Appropriateness, Novelty) plane at a fixed sampling temperature, small and latency-optimized models (e.g., Gemini 2.0 Flash, Llama 3.1 8B Instruct) cluster at high novelty/low appropriateness, while large, instruction-tuned models (e.g., GPT-4, Gemini 2.5 Pro) cluster at high appropriateness/low novelty. Both occupy the Pareto front, highlighting systemic trade-offs between contextuality and divergence.
- Scalar Index Trends: Smaller models yield the highest conditional novelty; larger models, impacted by instruction tuning and safety alignment, show reduced novelty despite higher temperature sampling. This suggests that safety alignment and instruction tuning may "collapse" the output distribution, constraining creative diversity.
CDAT partitions noise from genuine creative divergence, exposes systematic differences across model spectra, and indicates that higher appropriateness—favored by advanced alignment—co-occurs with suppressed creative divergence (Nakajima et al., 28 Jan 2026).
6. Implementation, Reproducibility, and Extensions
The CDAT framework is modular and transparent: all metric computations are embedding-based, with objective rules for scoring and gating. This simplicity affords easy adaptation: alternative cue sets, embedding spaces, or even multimodal extensions can be substituted with minimal changes to the evaluation pipeline.
Pareto-front analyses and visualization of model behavior in appropriateness-vs-novelty spaces permit diagnostic comparisons and family-level benchmarking, supporting both operational model selection and theoretical investigations into the effects of tuning, alignment, and scaling on creative capacities.
This suggests that leveraging embedding-based, context-conditional creativity metrics may inform future model alignment strategies and benchmark development. CDAT’s codebase and dataset are publicly released to facilitate these directions (Nakajima et al., 28 Jan 2026).