Consensual Assessment Technique (CAT)

Updated 30 January 2026
  • Consensual Assessment Technique (CAT) is a creativity evaluation protocol that relies on subjective, holistic judgments by domain experts using numeric scales.
  • The method aggregates independent ratings into consensus scores and verifies panel reliability with statistics such as Cronbach’s α and the intraclass correlation coefficient (ICC).
  • Applications range from artistic and literary assessment to LLM-based automated evaluation, which improves scalability and replicability.

The Consensual Assessment Technique (CAT) is a creativity evaluation protocol predicated on holistic, domain-informed judgments by independent experts. Originating with Amabile (1983), CAT eschews decomposition of creativity into constituent sub-measures, resting instead on the aggregated consensus arising from multiple expert raters scoring artifacts along a simple numeric scale. This approach is considered a benchmark for empirical creativity assessment, validated across domains and implemented in both human and automated settings (Sawicki et al., 26 Feb 2025).

1. Conceptual Foundations

CAT is defined by its reliance on subjective, holistic evaluation by domain-knowledgeable raters. The technique posits that creativity is best apprehended through the internalized schemas and tacit expertise of practitioners rather than prescriptive rubrics. Raters assess artifacts independently, applying their own conception of "creativity" without intervention or fixed definitions. Judgments are anonymized, both with respect to the artifact producer and across raters themselves. Evaluations are typically performed using at least 3-point, but more commonly 5- or 7-point, numeric scales. Reliability of the panel is quantitatively assessed, most frequently using Cronbach’s α or the intraclass correlation coefficient (ICC), with values above 0.70 regarded as sufficient.

2. Standard CAT Methodology and Extensions

A classical CAT protocol proceeds as follows:

  • Selection of a panel of domain experts (frequently 2–40, mean ≈ 10).
  • Independent rating of all submitted artifacts by each judge, who applies internal holistic criteria without external guidance.
  • Each artifact receives a numeric rating; criteria commonly include quality, novelty, and typicality, each verbally but not operationally defined.
  • Judgments are codified anonymously; no rater knows the identity of the artifact producer or other raters’ scores.
  • Consensus is achieved by aggregating artifact scores (mean or rank order). Panel reliability is validated by inter-rater statistics (ICC or α).
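
As a concrete illustration of the aggregation and reliability steps above, the following sketch computes Cronbach’s α for a panel and the mean-score consensus ranking. It assumes NumPy; the 4-artifact × 3-judge ratings matrix is invented for the example and is not data from any cited study.

```python
# Minimal sketch of CAT aggregation: rows = artifacts, columns = independent judges.
# The ratings below are illustrative, not real study data.
import numpy as np

ratings = np.array([
    [4, 5, 4],   # artifact 1, judges A-C
    [2, 3, 2],   # artifact 2
    [5, 5, 4],   # artifact 3
    [3, 2, 3],   # artifact 4
], dtype=float)

def cronbach_alpha(r: np.ndarray) -> float:
    """Cronbach's alpha with judges treated as 'items' and artifacts as 'subjects'."""
    k = r.shape[1]                               # number of judges
    judge_var = r.var(axis=0, ddof=1).sum()      # sum of per-judge variances
    total_var = r.sum(axis=1).var(ddof=1)        # variance of per-artifact totals
    return k / (k - 1) * (1 - judge_var / total_var)

consensus = ratings.mean(axis=1)                 # per-artifact consensus score
ranking = np.argsort(-consensus)                 # artifact indices, best first

print(f"Cronbach's alpha = {cronbach_alpha(ratings):.2f}")   # panel reliability check
print("consensus scores:", consensus)
print("rank order (artifact indices):", ranking)
```

A panel is typically accepted when α (or the corresponding ICC) exceeds roughly 0.70, as noted in Section 1.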

Sawicki et al. (Sawicki et al., 26 Feb 2025) extend CAT to the automated domain using LLMs as pseudo-judges. Notably, they treat independent API runs (repeated calls at a non-zero temperature, so that sampling randomness makes each run's output differ) as distinct raters and instruct the models to output ratings or ordered rankings on a numeric scale for specified creativity criteria, upholding the central CAT principle of holistic, independent judgment.

3. Empirical Implementation: Poetry Evaluation

In Sawicki et al. (Sawicki et al., 26 Feb 2025), a CAT-inspired methodology is applied to a corpus of 90 English-language poems, stratified by publication venue (Poetry magazine, "Good"; mid-tier journals, "Medium"; an unreviewed online forum, "Bad"), which serves as a proxy for ground truth. Comparisons are made between crowdsourced, non-expert human judges and two LLMs (Claude-3-Opus, GPT-4o). Human raters apply nine criteria (quality, typicality, novelty, etc.) via Likert-style scales, while LLMs assess five criteria (creativity, quality, innovativeness, similarity, poeticness) on a 1–5 scale, using forced-choice ranking and the run-to-run variability of sampling at temperature 1.0 to simulate independent, holistic judgments.

| Evaluator | Scale / criteria | Independence simulation |
|---|---|---|
| Human non-experts | 9 criteria, Likert (1–5) | Crowdsourced, independent |
| Claude-3-Opus | 5 criteria, integer (1–5), rank | API runs, temperature = 1.0 |
| GPT-4o | 5 criteria, integer (1–5), rank | API runs, temperature = 1.0 |

Model outputs are aggregated such that each poem is scored across multiple "judges" (LLM runs or random subsets), then ranked by mean score per criterion. Outlier removal is not explicitly conducted; incomplete outputs are re-run.
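
A minimal sketch of this per-criterion aggregation, assuming pandas and an illustrative long-format table of per-run scores (the column names, poem identifiers, and values are assumptions for the example, not the paper's data), could look like:

```python
# Aggregate per-run ratings into consensus rankings: average over runs ("judges"),
# then rank poems within each criterion. All values below are illustrative.
import pandas as pd

scores = pd.DataFrame(
    [
        ("p01", 0, "creativity", 4), ("p01", 1, "creativity", 5),
        ("p02", 0, "creativity", 2), ("p02", 1, "creativity", 3),
        ("p03", 0, "creativity", 5), ("p03", 1, "creativity", 4),
    ],
    columns=["poem_id", "run_id", "criterion", "rating"],
)

mean_scores = scores.groupby(["criterion", "poem_id"])["rating"].mean()
rankings = mean_scores.groupby(level="criterion").rank(ascending=False, method="min")

print(mean_scores)   # mean score per poem and criterion
print(rankings)      # 1 = highest-rated poem for that criterion
```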

4. Consensus, Reliability, and Disagreement Handling

Consensus in CAT is operationalized by averaging independent ratings per artifact. In the cited study, the raw numeric scores or implicit ranks generated by each LLM run are averaged for each poem, enabling robust consensus rankings. Reliability is quantified by computing k-judge variants of the ICC (ICC1k, ICC2k, ICC3k); for example, ICC(1,k) = (BMS − EMS) / BMS, where BMS is the between-subjects (between-poem) mean square and EMS the error (within-subject) mean square of the one-way model. Observed ICC values in the range 0.90–0.99 across all criteria and models indicate near-perfect reliability.
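
The ICC(1,k) figure can be reproduced directly from the one-way ANOVA mean squares; the sketch below uses NumPy with an invented poems × runs ratings matrix (values are illustrative only).

```python
# ICC(1,k) = (BMS - EMS) / BMS, from the one-way random-effects model.
# Rows = poems (targets), columns = independent LLM runs (judges); values are made up.
import numpy as np

ratings = np.array([
    [5, 4, 5, 5],
    [3, 3, 2, 3],
    [4, 4, 5, 4],
    [1, 2, 1, 2],
    [2, 2, 3, 2],
], dtype=float)

n, k = ratings.shape
grand_mean = ratings.mean()
poem_means = ratings.mean(axis=1)

bms = k * ((poem_means - grand_mean) ** 2).sum() / (n - 1)          # between-poems mean square
ems = ((ratings - poem_means[:, None]) ** 2).sum() / (n * (k - 1))  # within-poem (error) mean square

icc_1k = (bms - ems) / bms
print(f"ICC(1,k) = {icc_1k:.3f}")
```

In practice, packages such as pingouin compute all six ICC variants from the same ratings table supplied in long format.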

Divergences from classical CAT mainly involve the number of judges (hundreds of LLM runs rather than a small expert panel) and the use of forced-choice ranking, the latter of which increases reliability by imposing explicit order constraints, in the spirit of paired-comparison approaches such as Bradley–Terry models.

5. Alignment and Adaptation for Automated Evaluators

LLM-based CAT adaptation maintains several core principles:

  • Each LLM run (API call with specified prompt and temperature) represents an independent judge.
  • Models are not told the identity of the artifact's author, nor shown the outputs of other runs.
  • Numeric rating scales (1–5) match typical CAT protocol.
  • Brief textual descriptions define each criterion, preserving holistic judgment.

Innovative departures include large-scale parallelization (hundreds to thousands of runs), prompt engineering to specify criteria with minimal guidance, and forced ranking to bolster statistical robustness.
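
A hedged sketch of this pseudo-judge setup, assuming the openai Python SDK, is shown below; the prompt wording, criterion, model name, and run count are illustrative choices, not the study's exact configuration.

```python
# Each independent API call at temperature 1.0 is treated as one "judge":
# sampling randomness makes repeated calls produce (slightly) different verdicts.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT = (
    "You will see several poems. For each poem, output an integer rating from 1 "
    "(lowest) to 5 (highest) for the criterion 'creativity', one rating per line.\n\n{poems}"
)

def run_pseudo_judges(poems_text: str, runs: int = 10) -> list[str]:
    """Collect one raw response per independent run (pseudo-judge)."""
    outputs = []
    for _ in range(runs):
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": PROMPT.format(poems=poems_text)}],
            temperature=1.0,
        )
        outputs.append(response.choices[0].message.content)
    return outputs
```

The raw responses would then be parsed into numeric scores and aggregated exactly as in the consensus and ICC sketches above.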

6. Empirical Strengths, Limitations, and Applications

The primary strengths observed in the study are:

  • LLMs surpass non-expert human judges in matching the venue-based ground truth (Spearman ρ ≈ 0.87 for LLMs vs. 0.38 for the human Novelty criterion; see the sketch after this list).
  • Forced-choice, in-context ranking outperforms classification approaches.
  • High reliability (ICC > 0.90) persists despite run-to-run variability in model responses.
  • Smaller, focused batches (15 poems) yield higher accuracy than all-in-one prompts.
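
The venue-alignment figures above reduce to a rank correlation between judge-derived scores and the venue proxy; a minimal sketch with SciPy (all scores and venue labels invented for illustration) is:

```python
# Spearman correlation between consensus scores and the venue-based ground-truth proxy.
# Values are illustrative; the study's actual data and rho values differ.
from scipy.stats import spearmanr

venue_quality = [3, 3, 2, 2, 1, 1]                 # 3 = "Good", 2 = "Medium", 1 = "Bad"
consensus_scores = [4.6, 4.1, 3.3, 3.5, 1.8, 2.2]  # mean judge score per poem

rho, p_value = spearmanr(venue_quality, consensus_scores)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")
```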

Notable limitations and caveats include:

  • Absence of true domain experts in the human panel; only non-expert crowdsourced ratings are compared.
  • The ground truth (publication venue) is an operational proxy rather than a direct measure of poetic quality.
  • LLMs may exploit superficial cues; holistic "understanding" remains controversial.
  • Prompt-sensitivity and future model updates may alter performance.

CAT-inspired automated assessment demonstrates potential for extension to story evaluation, narrative judgments, visual and musical artifact evaluation (potentially with multi-modal model architectures), peer review triage in scientific publishing, and educational feedback for writing assignments, settings where high consistency, scalability, and cost-effectiveness are critical.

7. Theoretical and Practical Significance

CAT operationalizes expert consensus in creativity assessment without prescriptive analytic deconstruction, facilitating reliable evaluation that adapts flexibly across domains. Automated CAT with LLMs substantially augments scalability and inter-rater reliability, though further validation with true domain experts and critical scrutiny of LLM mechanisms are warranted. The approach illuminates a pathway toward efficient, replicable creativity assessment in digital environments, applicable to both research and practical deployment scenarios (Sawicki et al., 26 Feb 2025).

References (1)
