Art or Artifice? Large Language Models and the False Promise of Creativity
Abstract: Researchers have argued that LLMs exhibit high-quality writing capabilities from blogs to stories. However, objectively evaluating the creativity of a piece of writing is challenging. Inspired by the Torrance Test of Creative Thinking (TTCT), which measures creativity as a process, we use the Consensual Assessment Technique [3] and propose the Torrance Test of Creative Writing (TTCW) to evaluate creativity as a product. TTCW consists of 14 binary tests organized into the original dimensions of Fluency, Flexibility, Originality, and Elaboration. We recruit 10 creative writers and implement a human assessment of 48 stories written either by professional authors or LLMs using TTCW. Our analysis shows that LLM-generated stories pass 3-10x fewer TTCW tests than stories written by professionals. In addition, we explore the use of LLMs as assessors to automate the TTCW evaluation, revealing that none of the LLMs positively correlate with the expert assessments.
- Muhammad M Mahmoud Abdel Latif. 2013. What do we mean by writing fluency and how can it be validly measured? Applied linguistics 34, 1 (2013), 99–105.
- Joan Acocella. 2012. On Bad Endings. The New Yorker (2012). https://www.newyorker.com/books/page-turner/on-bad-endings
- Teresa M Amabile. 1982. Social psychology of creativity: A consensual assessment technique. Journal of personality and social psychology 43, 5 (1982), 997.
- Anthropic. 2022. Introducing Claude. (2022). https://www.anthropic.com/index/introducing-claude
- John Baer. 2014. Creativity and divergent thinking: A task-specific approach. Psychology Press.
- John Baer and Sharon S McKool. 2009. Assessing creativity using the consensual assessment technique. In Handbook of research on assessment technologies, methods, and applications in higher education. IGI Global, 65–77.
- Roger E Beaty and Dan R Johnson. 2021. Automating creativity assessment with SemDis: An open platform for computing semantic distance. Behavior research methods 53, 2 (2021), 757–780.
- Soylent: a word processor with a crowd inside. In Proceedings of the 23rd annual ACM symposium on User interface software and technology. 313–322.
- John B Biggs and Kevin F Collis. 1982. The psychological structure of creative writing. Australian Journal of Education 26, 1 (1982), 59–70.
- Michael M Boardman. 1992. Narrative Innovation and Incoherence: Ideology in Defoe, Goldsmith, Austen, Eliot, and Hemingway. Duke University Press.
- When Design Novices and LEGO® Meet: Stimulating Creative Thinking for Interface Design. In Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems (Honolulu, HI, USA) (CHI ’20). Association for Computing Machinery, New York, NY, USA, 1–14. https://doi.org/10.1145/3313831.3376495
- Writing fiction: A guide to narrative craft. University of Chicago Press.
- Rüdiger Campe and Julia Weber. 2014. Rethinking Emotion: Interiority and Exteriority in Premodern, Modern, and Contemporary Thought. Vol. 15. Walter de Gruyter GmbH & Co KG.
- How is ChatGPT’s behavior changing over time? arXiv preprint arXiv:2307.09009 (2023).
- Ambient Adventures: Teaching ChatGPT on Developing Complex Stories. arXiv preprint arXiv:2308.01734 (2023).
- The Intersection of Users, Roles, Interactions, and Technologies in Creativity Support Tools. In Proceedings of the 2021 ACM Designing Interactive Systems Conference (Virtual Event, USA) (DIS ’21). Association for Computing Machinery, New York, NY, USA, 1817–1833. https://doi.org/10.1145/3461778.3462050
- All That’s ‘Human’ Is Not Gold: Evaluating Human Evaluation of Generated Text. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). Association for Computational Linguistics, Online, 7282–7296. https://doi.org/10.18653/v1/2021.acl-long.565
- Roy Peter Clark. 2008. Writing tools: 55 essential strategies for every writer. Little, Brown Spark.
- Gregory Currie. 1990. The nature of fiction. Cambridge University Press.
- Mark Doty. 2014. The art of description: World into word. Graywolf Press.
- David Fishelov. 1990. Types of character, characteristics of types. Style (1990), 422–439.
- Linda Flower and John R Hayes. 1981. A cognitive process theory of writing. College composition and communication 32, 4 (1981), 365–387.
- Edward Morgan Forster. 1927. Aspects of the Novel. Harcourt, Brace.
- Nigel Fountain. 2012. Clichés: Avoid them like the plague. Michael O’Mara Books.
- Norman Friedman. 1955. Point of view in fiction: the development of a critical concept. PMLA 70, 5 (1955), 1160–1184.
- Human-like summarization evaluation with chatgpt. arXiv preprint arXiv:2304.02554 (2023).
- Social Dynamics of AI Support in Creative Writing. In Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems (Hamburg, Germany) (CHI ’23). Association for Computing Machinery, New York, NY, USA, Article 245, 15 pages. https://doi.org/10.1145/3544548.3580782
- Content Planning for Neural Story Generation with Aristotelian Rescoring. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics, Online, 4319–4338. https://doi.org/10.18653/v1/2020.emnlp-main.351
- Joy Paul Guilford. 1967. The nature of human intelligence. (1967).
- Norman Norwood Holland. 2009. Literature and the Brain. PsyArt Foundation.
- Creative writing with an ai-powered writing assistant: Perspectives from professional writers. arXiv preprint arXiv:2211.05030 (2022).
- Fredric Jameson. 1991. Postmodernism, or, the cultural logic of late capitalism. Duke university press.
- The Perils of Using Mechanical Turk to Evaluate Open-Ended Text Generation. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Online and Punta Cana, Dominican Republic, 1265–1285. https://doi.org/10.18653/v1/2021.emnlp-main.97
- Essentials of creativity assessment. John Wiley & Sons.
- The future of crowd work. In Proceedings of the 2013 conference on Computer supported cooperative work. 1301–1318.
- Maria Kochis. 2007. Baxter, Charles. The Art of Subtext: Beyond Plot. Library Journal 132, 14 (2007), 135–136.
- LLMs as Factual Reasoners: Insights from Existing Benchmarks and Beyond. arXiv preprint arXiv:2305.14540 (2023).
- CoAuthor: Designing a Human-AI Collaborative Writing Dataset for Exploring Language Model Capabilities. In Proceedings of the 2022 CHI Conference on Human Factors in Computing Systems (New Orleans, LA, USA) (CHI ’22). Association for Computing Machinery, New York, NY, USA, Article 388, 19 pages. https://doi.org/10.1145/3491102.3502030
- Gpteval: Nlg evaluation using gpt-4 with better human alignment. arXiv preprint arXiv:2303.16634 (2023).
- Self-refine: Iterative refinement with self-feedback. arXiv preprint arXiv:2303.17651 (2023).
- Michael S. Matell and Jacob Jacoby. 1971. Is There an Optimal Number of Alternatives for Likert Scale Items? Study I: Reliability and Validity. Educational and Psychological Measurement 31, 3 (1971), 657–674. https://doi.org/10.1177/001316447103100307
- Tim Mayers. 2007. (Re) Writing craft: composition, creative writing, and the future of English studies. University of Pittsburgh Press.
- Individual characteristics and creativity in the marketing classroom: Exploratory insights. Journal of Marketing Education 25, 2 (2003), 143–149.
- Co-Writing Screenplays and Theatre Scripts with Language Models: Evaluation by Industry Professionals. In Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems (Hamburg, Germany) (CHI ’23). Association for Computing Machinery, New York, NY, USA, Article 355, 34 pages. https://doi.org/10.1145/3544548.3581225
- Donald M Murray. 2012. The craft of revision. Cengage Learning.
- WearWrite: Crowd-assisted writing from smartwatches. In Proceedings of the 2016 CHI conference on human factors in computing systems. 3834–3846.
- Collaborative Storytelling with Large-Scale Neural Language Models. In Proceedings of the 13th ACM SIGGRAPH Conference on Motion, Interaction and Games (Virtual Event, SC, USA) (MIG ’20). Association for Computing Machinery, New York, NY, USA, Article 17, 10 pages. https://doi.org/10.1145/3424636.3426903
- Martha Nussbaum. 1997. Poetic justice: The literary imagination and public life. Beacon Press.
- OpenAI. 2022. ChatGPT: Optimizing language models for dialogue. (2022). https://openai.com/blog/chatgpt/
- OpenAI. 2023. GPT-4 Technical Report. ArXiv abs/2303.08774 (2023).
- James Phelan. 1996. Narrative as rhetoric: Technique, audiences, ethics, ideology. Ohio State University Press.
- Assessment of creativity. The Cambridge handbook of creativity (2010), 48–73.
- Can foundation models label data like humans? Hugging Face Blog (2023). https://huggingface.co/blog/llm-leaderboard.
- PlotMachines: Outline-Conditioned Generation with Dynamic Plot State Tracking. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics, Online, 4274–4295. https://doi.org/10.18653/v1/2020.emnlp-main.349
- Alicia Rodríguez. 2008. The ‘problem’ of creative writing: using grading rubrics based on narrative theory as solution. New Writing 5, 3 (2008), 167–177.
- Melissa Roemmele and Andrew Gordon. 2018a. Linguistic Features of Helpfulness in Automated Support for Creative Writing. In Proceedings of the First Workshop on Storytelling. Association for Computational Linguistics, New Orleans, Louisiana, 14–19. https://doi.org/10.18653/v1/W18-1502
- Melissa Roemmele and Andrew S Gordon. 2018b. Automated assistance for creative writing with an rnn language model. In Proceedings of the 23rd international conference on intelligent user interfaces companion. 1–2.
- Assessing creativity with divergent thinking tasks: exploring the reliability and validity of new subjective scoring methods. Psychology of Aesthetics, Creativity, and the Arts 2, 2 (2008), 68.
- Learning to summarize with human feedback. Advances in Neural Information Processing Systems 33 (2020), 3008–3021.
- David R Thomas. 2006. A general inductive approach for analyzing qualitative evaluation data. American journal of evaluation 27, 2 (2006), 237–246.
- Ellis Paul Torrance. 1966. Torrance tests of creative thinking: Norms-technical manual: Verbal tests, forms A and B: Figural tests, forms A and B. Personnel Press, Incorporated.
- Development of Torrance test creativity thinking (TTCT) instrument in science learning. In AIP Conference Proceedings, Vol. 2194. AIP Publishing.
- Maryam Vaezi and Saeed Rezaei. 2019. Development of a rubric for evaluating creative writing: a multi-phase research. New Writing 16, 3 (2019), 303–317.
- Artificial Artificial Artificial Intelligence: Crowd Workers Widely Use Large Language Models for Text Production Tasks. arXiv preprint arXiv:2306.07899 (2023).
- Large Language Models Enable Few-Shot Clustering. arXiv preprint arXiv:2307.00524 (2023).
- Creative cognition. Handbook of creativity 189 (1999), 212.
- Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems 35 (2022), 24824–24837.
- Sara Cushing Weigle. 2002. Assessing writing. Cambridge University Press.
- DOC: Improving Long Story Coherence With Detailed Outline Control. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, Toronto, Canada, 3378–3465. https://doi.org/10.18653/v1/2023.acl-long.190
- Re3: Generating Longer Stories With Recursive Reprompting and Revision. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Abu Dhabi, United Arab Emirates, 4393–4479. https://doi.org/10.18653/v1/2022.emnlp-main.296
- Plan-and-Write: Towards Better Automatic Storytelling. In Proceedings of the Thirty-Third AAAI Conference on Artificial Intelligence and Thirty-First Innovative Applications of Artificial Intelligence Conference and Ninth AAAI Symposium on Educational Advances in Artificial Intelligence (Honolulu, Hawaii, USA) (AAAI’19/IAAI’19/EAAI’19). AAAI Press, Article 906, 8 pages. https://doi.org/10.1609/aaai.v33i01.33017378
- Wordcraft: story writing with large language models. In 27th International Conference on Intelligent User Interfaces. 841–852.
- Large language models are human-level prompt engineers. arXiv preprint arXiv:2211.01910 (2022).