
Divergent-Convergent Thinking in Large Language Models for Creative Problem Generation

Published 29 Dec 2025 in cs.AI | (2512.23601v1)

Abstract: LLMs have significant potential for generating educational questions and problems, enabling educators to create large-scale learning materials. However, LLMs are fundamentally limited by the "Artificial Hivemind" effect, where they generate similar responses within the same model and produce homogeneous outputs across different models. As a consequence, students may be exposed to overly similar and repetitive LLM-generated problems, which harms diversity of thought. Drawing inspiration from Wallas's theory of creativity and Guilford's framework of divergent-convergent thinking, we propose CreativeDC, a two-phase prompting method that explicitly scaffolds the LLM's reasoning into distinct phases. By decoupling creative exploration from constraint satisfaction, our method enables LLMs to explore a broader space of ideas before committing to a final problem. We evaluate CreativeDC for creative problem generation using a comprehensive set of metrics that capture diversity, novelty, and utility. The results show that CreativeDC achieves significantly higher diversity and novelty compared to baselines while maintaining high utility. Moreover, scaling analysis shows that CreativeDC generates a larger effective number of distinct problems as more are sampled, increasing at a faster rate than baseline methods.

Summary

  • The paper presents CreativeDC, a two-phase mechanism that decouples divergent ideation from convergent constraint satisfaction to enhance creative problem generation.
  • It employs persona simulation and iterative selection, achieving up to a 72% larger effective number of distinct problems (Vendi score) than baseline methods without compromising solution utility.
  • The study utilizes lexical and semantic metrics to quantify diversity and novelty, demonstrating the method’s promise for automated educational and creative content generation.

Divergent-Convergent Thinking in LLMs for Creative Problem Generation

Introduction and Motivation

LLMs have achieved proficiency in automatic problem generation for educational applications but are subject to an "Artificial Hivemind" effect, marked by intramodel repetition and intermodel homogeneity. This effect leads to the generation of highly similar, low-diversity problems, even as the underlying models increase in scale and capability. Prior work has established that methods such as naive decoding temperature increases, persona simulation, or contextualized prompting only marginally mitigate this limitation and often fail to fundamentally alter the solution space explored by LLMs [hivemind, DBLP:conf/iclr/LuSHM0HEJCD025].

This paper introduces CreativeDC, a two-phase prompting mechanism for LLM-based creative problem generation. CreativeDC operationalizes Wallas’s theory of staged creativity and Guilford’s framework for divergent–convergent cognitive modes, decoupling unconstrained ideation (divergent thinking) from the phase of constraint satisfaction (convergent thinking). The method thus scaffolds reasoning trajectories in a manner consistent with established human creativity theories, enabling exploration of a broader semantic manifold prior to commitment to a constraint-compliant solution (Figure 1).

Figure 1: CreativeDC employs a two-phase reasoning process—divergent ideation followed by convergent constraint satisfaction—to scaffold creative programming problem generation.

Methodology

CreativeDC decomposes prompt-driven problem generation into distinct divergent and convergent thinking stages. In the divergent phase, the LLM is instructed to enumerate unconventional, thematically relevant entities or situations, eschewing all task constraints. Prompts explicitly reinforce the generation of unorthodox, out-of-distribution ideas (e.g., “list down wildly different and underexplored scenarios relevant to the theme”), thereby suppressing the drive for immediate constraint satisfaction.

In the convergent phase, the model is tasked with selecting a single idea and transforming it into a fully specified problem that meets explicit requirements (e.g., target programming concepts). This step is iterative—the model is instructed to resample from the divergent list if a given idea cannot be reconciled with structural constraints (such as requiring only Python lists).
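The two phases above can be sketched as a small wrapper around any LLM API. This is an illustrative sketch, not the paper's exact prompt templates: the `generate` callable, the prompt wording, and the `INFEASIBLE` retry convention are all assumptions introduced here.

```python
def creative_dc(generate, theme, concept, max_tries=3):
    """Two-phase (divergent -> convergent) problem generation sketch.

    `generate` is any callable wrapping an LLM: prompt string -> response string.
    """
    # Divergent phase: brainstorm unconventional ideas, ignoring all constraints.
    ideas_text = generate(
        f"List wildly different and underexplored scenarios relevant to the "
        f"theme '{theme}'. One scenario per line, no other text."
    )
    ideas = [line.strip("- ").strip() for line in ideas_text.splitlines() if line.strip()]

    # Convergent phase: pick one idea and satisfy the constraints;
    # resample from the divergent list if an idea cannot be reconciled.
    for idea in ideas[:max_tries]:
        problem = generate(
            f"Turn the idea '{idea}' into a programming problem that requires "
            f"only the concept '{concept}'. Include a description, tests, and "
            f"a solution. Reply INFEASIBLE if the constraint cannot be met."
        )
        if "INFEASIBLE" not in problem:
            return problem
    return None  # no idea survived convergence within the retry budget
```

Because `generate` is injected, the same scaffold composes with persona simulation by prepending a sampled persona description to each prompt.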

Persona simulation further augments CreativeDC by inserting a sampled persona description from the Persona Hub dataset into the prompt, leveraging prior results indicating increased output diversity via persona role-play [DBLP:journals/corr/abs-2406-20094].

Evaluation Framework

The effectiveness of CreativeDC is assessed in the domain of programming problem generation along three axes:

  • Diversity: Lexical diversity is computed as the ratio of unique n-grams to all n-grams across a sample, while semantic diversity is the mean pairwise embedding cosine distance within a set of K generated problems.
  • Novelty: Lexical novelty measures the out-of-corpus n-gram rate relative to a strong in-domain reference set, while semantic novelty is the average minimal semantic distance to any reference problem.
  • Utility: Utility aggregates binary indicators of solution validity (test suite correctness), contextual relevance, and comprehensibility.
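Under the definitions above, the two diversity metrics can be sketched in plain Python. This is illustrative only: the paper's tokenization, choice of n, and embedding model may differ.

```python
import math

def distinct_ngram_ratio(texts, n=2):
    """Lexical diversity: unique n-grams / all n-grams across the sample."""
    total, unique = 0, set()
    for text in texts:
        tokens = text.lower().split()
        for i in range(len(tokens) - n + 1):
            unique.add(tuple(tokens[i:i + n]))
            total += 1
    return len(unique) / total if total else 0.0

def mean_pairwise_cosine_distance(embeddings):
    """Semantic diversity: mean (1 - cosine similarity) over all pairs."""
    def cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))
    pairs = [(i, j) for i in range(len(embeddings)) for j in range(i + 1, len(embeddings))]
    if not pairs:
        return 0.0
    return sum(1 - cosine(embeddings[i], embeddings[j]) for i, j in pairs) / len(pairs)
```

Semantic novelty would analogously take, for each generated problem, the minimum distance to any problem in the reference set, then average those minima.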

The primary LLM is Qwen3-235B-A22B-Instruct-2507 [DBLP:journals/corr/abs-2505-09388], semantic metrics employ Qwen3-Embedding-0.6B [DBLP:journals/corr/abs-2506-05176], and utility is standardized using Gemini 2.5 Flash-Lite as an LLM-as-a-judge [DBLP:journals/corr/abs-2507-06261]. CreativeDC and baselines (Base, Chain-of-Thought) are compared on 20 contexts (4 themes × 5 programming concepts) with K = 100 generations each.

Experimental Results

CreativeDC yields statistically significant improvements in both diversity and novelty relative to all baselines. Without persona simulation, CreativeDC achieves 16.7% higher semantic diversity and 63.5% higher semantic novelty than the Chain-of-Thought baseline (all p < 0.001, Mann-Whitney U test). With persona simulation, these advantages are slightly moderated but remain substantial (8.5% higher diversity and 32.9% higher novelty than CoT) (Figure 2).

Figure 2: Kernel density estimates for per-problem semantic diversity and novelty; CreativeDC shows both higher means and lower variance, indicating reliable creativity improvements.

Notably, CreativeDC induces no statistically significant utility degradation compared to the baselines. This establishes that explicit divergent-convergent scaffolding enhances creative breadth without compromising correctness or contextual adequacy.

Scaling analysis with the Vendi Score [DBLP:journals/tmlr/FriedmanD23] demonstrates that CreativeDC generates a larger effective number of semantically distinct problems as K increases (24.0% higher than CoT at K = 10; 72.0% higher at K = 100) (Figure 3).

Figure 3: Effective number of distinct generated problems (Vendi score) as a function of sample size K; CreativeDC advances more rapidly to higher diversity.
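The Vendi score is the exponential of the Shannon entropy of the eigenvalues of the normalized pairwise-similarity matrix, i.e., the "effective number" of distinct items in a set. A minimal sketch, assuming unit-comparable embeddings and cosine similarity (the paper's exact kernel and preprocessing may differ):

```python
import numpy as np

def vendi_score(embeddings):
    """Effective number of distinct items: exp(entropy of eigenvalues of K/n),
    where K is the pairwise cosine-similarity matrix of the embeddings."""
    X = np.asarray(embeddings, dtype=float)
    X = X / np.linalg.norm(X, axis=1, keepdims=True)  # unit-normalize rows
    n = X.shape[0]
    K = X @ X.T                       # cosine-similarity matrix (diagonal = 1)
    lam = np.linalg.eigvalsh(K / n)   # eigenvalues of K/n sum to 1
    lam = lam[lam > 1e-12]            # drop numerical zeros before log
    return float(np.exp(-np.sum(lam * np.log(lam))))
```

Mutually orthogonal embeddings give a score equal to the set size, while identical embeddings collapse it to 1, which is why the score reads as an "effective number of distinct problems."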

Contextual analysis shows that creativity and utility trade off: mundane themes (e.g., "Cooking") yield high utility but limited diversity, while open-ended themes (e.g., "Superheroes") facilitate greater unconstrained creative expressiveness (Figure 4).

Figure 4: Per-context performance breakdown showing that creativity is maximized in open-ended themes and simple programming concepts.

Theoretical and Practical Implications

CreativeDC provides an operational demonstration that explicitly structuring LLM prompting according to psychological models of creativity significantly enhances output originality and effective diversity, supporting the hypothesis that LLM output homogeneity is partially due to premature constraint convergence in standard prompts. As a method, it is modular, requires no model fine-tuning, and can be composed with other generative diversity-inducing interventions (e.g., persona role-play, multi-LLM debate frameworks).

This approach suggests that future LLM evaluation and deployment in creative, open-ended tasks should incorporate explicit cognitive process scaffolding at inference time. The results also highlight limitations of purely surface-driven diversity enhancements—such as temperature sampling—which fail to access novel solution regions without explicit ideation phases [lu2024discussion].

Potential extensions include application to broader creative domains (narrative, poetic, design generation), evaluation with direct human creativity ratings, and adaptation for multistage or multiagent collaborative creativity settings [wen2025explorationvsfixationscaffolding]. There is also scope for model-agnostic adoption, with likely positive transfer to other architectures and scales.

Conclusion

CreativeDC establishes that divergent-convergent scaffolding at the prompt level robustly elevates the creative capacity of LLMs for problem generation, as quantified by both semantic and lexical novelty/diversity metrics, without reduction in deterministic utility. The findings have immediate relevance for the design of automated educational content pipelines and broader creative generation tasks, and motivate further investigation of inference-time reasoning process control to counteract the inherent "hivemind" effect of current LLMs.


Explain it Like I'm 14

Overview

This paper is about helping AI tools (large language models, or LLMs) make more creative and varied educational problems, especially programming tasks. The authors noticed that many AI-generated problems end up sounding very similar, which can be boring and limit students’ thinking. They propose a new way to prompt the AI, called CreativeDC, that guides the model to first brainstorm wild ideas and then carefully pick and shape one idea into a solid, useful problem.

What are the main questions?

The researchers wanted to find out:

  • Can a two-step thinking process (brainstorm first, refine later) make AI-generated problems more diverse and original?
  • Will these more creative problems still be useful and clear for students?
  • Does this approach scale well—meaning, does it keep producing many distinct problems as you generate more and more?

How did they do it?

The core idea: Two-phase thinking

Think of creativity like building a story:

  • Divergent phase: First, you brainstorm lots of unusual, surprising ideas without worrying about rules. It’s like throwing paint on a canvas to see what patterns appear.
  • Convergent phase: Next, you pick the best idea and shape it carefully to fit the rules (like the required programming concept). If it doesn’t fit, you try another idea.

CreativeDC tells the AI to:

  1. Explore the theme freely (e.g., “Superheroes”) and list unconventional ideas.
  2. Choose one idea and turn it into a valid programming problem that matches the required concept (e.g., “Lists”), including a description, tests, and a solution.

They also tried “persona simulation,” which is like asking the AI to write from different points of view (wearing different “hats”) to boost variety—like thinking as a historian, a comedian, or a game designer.

What did they measure?

To judge creativity and usefulness, they used simple, practical checks:

  • Diversity: How different the problems are from each other.
    • Lexical (word-level) diversity: Do they use varied wording?
    • Semantic (meaning-level) diversity: Do the ideas differ in substance, not just wording? They measure this by turning each problem into numbers (embeddings) and checking how far apart they are, like measuring distance between planets in space.
  • Novelty: How different the problems are from problems made by other methods.
    • Lexical novelty: Do the problems use fresh phrases not found in others?
    • Semantic novelty: Is each problem meaningfully far from its most similar neighbor in the other methods’ outputs?
  • Utility (quality): Are the problems valid, relevant to the theme and required concept, and understandable? They use a strong AI “judge” and code tests to check this.

They also used a “Vendi score,” which is like asking, “How many truly distinct problems are in this set?” A higher Vendi score means more unique ideas, not just slight variations.

What did they test?

  • Themes: Cooking, Science Fiction, Superheroes, Board Games.
  • Programming concepts: Variables, Selection Statements (if/else), Loops, Lists, Strings.
  • They generated 100 problems per theme–concept pair for each method and compared CreativeDC to two baselines: a standard prompt (“Base”) and a “think step by step” prompt (Chain of Thought, “CoT”).

What did they find and why is it important?

  • CreativeDC made problems that were more diverse and more original than the baselines—both in wording and in meaning. This held true even without personas, and got even better with personas.
  • Importantly, utility (quality and correctness) stayed high and about the same as the baselines. So the problems weren’t just weird—they were usable and clear.
  • As they generated more problems, CreativeDC kept producing more truly distinct ones. Its Vendi score grew faster than the baselines, meaning it scales well for large sets of exercises.
  • Themes and concepts matter:
    • Familiar themes like Cooking had higher utility but lower diversity/novelty (likely because they’re more constrained and predictable).
    • Creative themes like Science Fiction or Superheroes boosted diversity and novelty.
    • Simpler concepts (like Variables) gave more room for creativity than more complex ones (like Loops or Lists).

This is important because it shows a simple change in how we prompt AI—separating brainstorming from rule-following—can beat the “Artificial Hivemind” effect (where AIs tend to produce similar answers) and keep educational content fresh.

What does this mean going forward?

  • For teachers and course creators: Using CreativeDC can help build large collections of programming problems that feel different and interesting, keeping students engaged and encouraging creative thinking.
  • For AI design: Structuring the model’s thinking (first explore, then refine) can improve creativity without retraining the model or needing expensive datasets.
  • Beyond programming: The same idea could help in story writing, design tasks, or art prompts—anywhere you want both creativity and usefulness.

The authors note a few next steps: test this approach on more kinds of AI models, run human studies to confirm that students find the problems creative and helpful, and try it in other creative domains.

Knowledge Gaps

Knowledge gaps, limitations, and open questions

The following list identifies what remains missing, uncertain, or unexplored in the paper, framed as concrete items future researchers can address:

  • Generalizability across models: The approach is evaluated with a single generator (Qwen3-235B). It is unknown whether CreativeDC’s gains hold across different architectures (dense vs. mixture-of-experts), sizes, pretraining corpora, and RLHF alignment levels.
  • Cross-model “Artificial Hivemind” mitigation: The paper cites cross-model homogeneity but measures only one generator; there is no evidence that CreativeDC reduces homogeneity across multiple LLMs used as generators.
  • Domain transfer beyond programming: Applicability to creative tasks in story writing, poetry, design, or math word problems is untested; it is unclear what prompt adaptations are required, and whether gains in novelty/diversity persist.
  • Human-grounded evaluation: Creativity and utility are judged primarily by automated metrics and an LLM-as-a-judge; there is no human study to validate perceived originality, educational value, clarity, and appropriateness.
  • Robustness of semantic metrics: Semantic diversity/novelty rely on a single embedding model; sensitivity analyses with alternative embeddings (e.g., multilingual, code-aware, sentence vs document-level) are missing.
  • Metric scope ambiguity: It is unclear which components are embedded or n-grammed (problem description only vs description+tests+solution vs reasoning traces), risking confounds if CreativeDC’s divergent/convergent traces are included.
  • n-gram settings and preprocessing: The paper does not specify n values, tokenization, case normalization, or stopword handling; results may be sensitive to these choices.
  • Novelty reference corpus choice: Novelty is computed against other methods’ outputs for the same context; performance against real-world corpora (e.g., public programming problem banks, textbooks) is untested.
  • Utility verification rigor: “Requires only concept X” is judged by an LLM; there is no static analysis or automated code instrumentation to enforce/disallow constructs (e.g., dicts, classes), nor coverage analysis for test suites.
  • LLM-as-a-judge reliability: No calibration, agreement analysis, or benchmarking of judge model decisions against human annotations; error modes (false positives/negatives in validity and context relevance) are unknown.
  • Pre-filtering bias and cost: The generation consistency check discards inconsistent problems until K=100, but acceptance rates, retries, and computational cost per method are not reported; this may bias comparisons or mask efficiency trade-offs.
  • Efficiency and latency: Token counts, runtime, and inference cost for the two-phase process vs baselines are not measured; real-world feasibility in classroom-scale generation remains unclear.
  • Decoding hyperparameters: Only temperature=1.0 is used; interactions between CreativeDC and sampling strategies (top-k/top-p, repetition penalties) are unexplored.
  • Divergent-phase design: The number, breadth, and structure of brainstormed ideas are not controlled or ablated; the impact of idea count, prompt framing, and time spent in divergence on final creativity is unknown.
  • Convergent-phase selection mechanism: There is no explicit criterion or re-ranking strategy for choosing the “best” idea; integrating self-evaluation, multi-idea synthesis, or external scoring might improve outcomes, but is untested.
  • Persona simulation effects: Personas are uniformly sampled, but the per-persona impact on creativity/utility is not analyzed; potential bias, stereotyping, or inappropriate content arising from certain personas is not assessed.
  • Context sensitivity and control: While some themes increase novelty at the cost of utility, there is no mechanism to adjust the “creativity intensity” (a controllable knob) or to adapt prompts to specific contexts to balance the trade-off.
  • Educational impact: The claim that diversity benefits learners is not empirically validated; effects on engagement, learning gains, cognitive load, and fairness are unmeasured.
  • Language and platform coverage: The study focuses on Python and English; transfer to other programming languages, paradigms (functional, object-oriented), and non-English instructions/problems is untested.
  • Multi-concept and complexity: Only single-concept problems are evaluated; performance on multi-concept tasks, data structures, algorithmic complexity, and open-ended project briefs is unknown.
  • Safety and appropriateness: There is no analysis of potentially harmful, biased, or age-inappropriate content, especially when using personas or creative themes; content filters or guardrails are not discussed.
  • Statistical reporting: Significance tests are provided, but effect sizes, confidence intervals for key differences, and corrections for multiple comparisons are absent; reproducibility of random persona sampling and seeds is unclear.
  • Release and reproducibility: The paper references prompt templates, but does not state whether full code, prompts, seeds, and generated datasets are released to enable replication and independent evaluation.
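One way to close the "utility verification rigor" gap noted above would be a static check of generated solutions. The sketch below is hypothetical (not part of the paper): it uses Python's `ast` module to flag constructs outside an allow-list, e.g., to enforce "lists only" by disallowing dicts and classes; the specific construct list is illustrative.

```python
import ast

# Illustrative deny-list: node types a "lists only" problem should not use.
DISALLOWED = {
    ast.Dict: "dict literal",
    ast.DictComp: "dict comprehension",
    ast.Set: "set literal",
    ast.ClassDef: "class definition",
}

def violates_constraints(source):
    """Return sorted labels of disallowed constructs found in the code."""
    tree = ast.parse(source)
    return sorted({
        label
        for node in ast.walk(tree)
        for node_type, label in DISALLOWED.items()
        if isinstance(node, node_type)
    })
```

Such a check could complement, rather than replace, the LLM-as-a-judge: the AST pass enforces hard structural constraints deterministically, while the judge handles softer criteria like relevance and comprehensibility.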

Practical Applications

Immediate Applications

The following applications can be deployed with today’s LLMs by wrapping existing generation workflows with the paper’s divergent–convergent (DC) prompting and by adopting its diversity/novelty/utility evaluation stack.

  • Diverse programming exercise generation in LMS/EdTech
    • Sector: education, software
    • Use case: Auto-generate large, diverse sets of programming problems (description + test suite + sample solution) that practice specific concepts (e.g., loops, lists) while avoiding repetition.
    • Tools/products/workflows: LTI plugin for Canvas/Moodle; GitHub Classroom and CodeRunner integration; API that returns N distinct items per concept/topic using CreativeDC; built-in utility checks via code execution and LLM-as-judge.
    • Assumptions/dependencies: Access to a capable LLM and a code execution sandbox; light content review for curriculum fit; per-student data handling compliant with privacy policies.
  • Question bank diversification and de-duplication audits
    • Sector: education, assessment/psychometrics
    • Use case: Audit existing item banks for “Artificial Hivemind” collapse; remove near-duplicates; backfill gaps with high-novelty items measured by SemDiv/SemNov and Vendi.
    • Tools/products/workflows: One-click “diversity audit” dashboard; embeddings-based clustering; Vendi score targets; CreativeDC sampler to fill clusters with sparse coverage.
    • Assumptions/dependencies: Embedding model availability; item ownership/IP cleared for transformation; psychometric re-calibration may be needed after edits.
  • Secure multi-form exam variant generation
    • Sector: education, assessment
    • Use case: Create many parallel forms of the same assessment constraints to reduce item exposure and cheating risk without sacrificing utility.
    • Tools/products/workflows: CreativeDC-controlled generation with utility gating (pass/fail on tests); SemNov thresholds to ensure non-overlap; export to QTI/CSV.
    • Assumptions/dependencies: Human validation for high-stakes exams; policy alignment with test security standards.
  • Corporate L&D technical training content
    • Sector: HR/L&D, software
    • Use case: Scenario-based coding katas and knowledge checks tailored to tech stacks; produce diverse practice to improve engagement.
    • Tools/products/workflows: Team portal that ingests learning objectives, emits batches of varied exercises; persona augmentation to mirror different user profiles or industries.
    • Assumptions/dependencies: Domain customization prompts; internal review; data governance for proprietary contexts.
  • Coding challenge platforms and hackathons
    • Sector: software, developer platforms
    • Use case: Generate fresh, non-redundant challenges with validated tests, reducing content authoring bottlenecks.
    • Tools/products/workflows: Scheduled CreativeDC jobs; novelty thresholds; automatic difficulty tagging using solution telemetry; moderation queue.
    • Assumptions/dependencies: Execution sandbox; abuse/malware checks in sample solutions; fairness/clarity review.
  • Instructor co-pilot for creative problem authoring
    • Sector: education
    • Use case: Teachers brainstorm unusual contexts in a divergent pass, then converge to a constrained, solvable exercise—mirroring the paper’s scaffold.
    • Tools/products/workflows: Authoring UI that displays divergent idea lists, supports “try another idea,” and shows live utility checks; persona toggles to broaden perspectives.
    • Assumptions/dependencies: Teacher time and preferences; local content standards.
  • Student-facing daily practice with novelty protections
    • Sector: education, consumer learning
    • Use case: Personalized “problem of the day” streams that minimize repetition over weeks; track personal Vendi scores to ensure variety.
    • Tools/products/workflows: Mobile/app integration; novelty budget per learner; lightweight judge model for on-device or server-side validation.
    • Assumptions/dependencies: Cost controls for daily generation; safe-mode filters for themes; opt-in telemetry.
  • Editorial/content teams: diversified copy within constraints
    • Sector: content marketing, media
    • Use case: Generate many distinct but on-brief variations (headlines, blurbs, calls-to-action) by decoupling ideation and requirement satisfaction.
    • Tools/products/workflows: CreativeDC prompt wrappers for briefs; semantic novelty targets to avoid near-duplicates; A/B testing harness.
    • Assumptions/dependencies: Brand safety/approval workflows; legal review for claims.
  • Requirements and user-story ideation
    • Sector: software/product
    • Use case: Divergent exploration of solution concepts/user stories followed by convergent filtering against engineering constraints or acceptance criteria.
    • Tools/products/workflows: Backlog assistant that proposes varied user stories; semantic clustering to reduce redundancy; “converge” step produces INVEST-compliant stories.
    • Assumptions/dependencies: Team calibration; integration with Jira/Linear.
  • Creativity metrics as a quality gate in AI pipelines
    • Sector: platform engineering, MLOps
    • Use case: Add SemDiv/SemNov/Vendi thresholds to CI/CD for content generation; block deploys that regress diversity/novelty; track utility rate.
    • Tools/products/workflows: Embedding service; nightly regression tests; dashboard with per-batch creativity KPIs.
    • Assumptions/dependencies: Stable embedding model choice; monitoring and alerting infra; acceptance thresholds tuned to domain.
  • Lightweight “Diversity-as-a-Wrap” API
    • Sector: software, AI tooling
    • Use case: A drop-in wrapper around any LLM provider that runs the two-phase prompt, batches generation, and returns filtered, diverse outputs with metrics.
    • Tools/products/workflows: REST/SDK; persona sampling; rate-limit and cost controls; retry-on-utility-fail.
    • Assumptions/dependencies: Provider-agnostic prompt compatibility; token budget and latency acceptable for batch generation.
  • Faculty/research lab dataset creation
    • Sector: academia
    • Use case: Generate creative, diverse benchmarks (e.g., programming tasks) with built-in ground truth to study LLM problem solving and creativity trade-offs.
    • Tools/products/workflows: Reproducible CreativeDC scripts; published SemNov/SemDiv/Vendi statistics per release; open-source evaluation harness.
    • Assumptions/dependencies: Model access; compute budget; licensing for dataset release.
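The "creativity metrics as a quality gate" item above could be sketched as a simple threshold check in a CI/CD pipeline. The function and threshold values below are hypothetical and would need tuning per domain, as the list itself notes.

```python
def creativity_gate(metrics, min_vendi=20.0, min_sem_div=0.30, min_utility=0.90):
    """Hypothetical CI gate: fail a generated batch if any KPI falls below
    its floor. `metrics` maps KPI name -> measured value; thresholds are
    illustrative defaults, not values from the paper."""
    floors = (("vendi", min_vendi), ("sem_div", min_sem_div), ("utility", min_utility))
    failures = [name for name, floor in floors if metrics.get(name, 0.0) < floor]
    return (len(failures) == 0, failures)
```

A nightly job would compute Vendi, semantic diversity, and utility rate for each generated batch, call the gate, and block deployment (or page a human) on failure, mirroring how test-coverage gates work today.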

Long-Term Applications

These scenarios require further validation, scaling, or standardization, but are natural extensions of the paper’s findings.

  • Diversity standards for AI-generated educational content
    • Sector: policy, education
    • Use case: Procurement/accreditation guidelines that mandate creativity audits (e.g., minimum Vendi/SemNov) for large-scale item banks in public education.
    • Tools/products/workflows: Third-party audit services; compliance reports; reference thresholds per grade/subject.
    • Assumptions/dependencies: Consensus on metrics and thresholds; stakeholder adoption; fairness and accessibility reviews.
  • Personalized creativity coach for learners and creators
    • Sector: education, consumer apps, writing/design tools
    • Use case: An interactive tutor that teaches divergent–convergent thinking by scaffolding human ideation (not just the model’s), offering novelty feedback and iteration.
    • Tools/products/workflows: Mixed-initiative UI; creativity rubrics; reflections on idea breadth; gamified “novelty streaks.”
    • Assumptions/dependencies: Human-subjective measures of creativity; UX research; content safety and age appropriateness.
  • Cross-domain DC generation (stories, design briefs, math proofs, hypotheses)
    • Sector: media, design, science
    • Use case: Use CreativeDC to generate story arcs, visual briefs, proof sketches, or research hypotheses that first explore many directions before converging to constraints.
    • Tools/products/workflows: Plugins for Figma/Notion/Docs; domain-specific utility checks (e.g., logical consistency, citation grounding).
    • Assumptions/dependencies: Domain validators beyond code execution; stronger reasoning and fact-checking models.
  • Psychometrics-aware, diversity-constrained adaptive testing
    • Sector: assessment/psychometrics
    • Use case: CAT systems that factor novelty/diversity when selecting items, controlling exposure while maintaining IRT properties.
    • Tools/products/workflows: Dual-objective item selection (information + SemNov); post-generation calibration; secure item lifecycle pipelines.
    • Assumptions/dependencies: Sufficient item volumes; rigorous calibration; regulatory acceptance.
  • Agentic content factories with closed-loop creativity control
    • Sector: AI platforms, EdTech
    • Use case: Autonomous agents that generate, judge, execute, filter, and curate content at scale, maintaining diversity metrics over time.
    • Tools/products/workflows: Multi-stage pipelines; cost-aware sampling; drift detection on diversity; human-in-the-loop escalation.
    • Assumptions/dependencies: Reliable LLM-as-judge; robust sandboxing; governance to prevent error propagation.
  • Creativity-aware RLHF/fine-tuning
    • Sector: AI research, foundation models
    • Use case: Train models to internalize DC reasoning and optimize for novelty/diversity without utility loss.
    • Tools/products/workflows: Preference datasets for creativity; multi-objective RLHF; evaluation on SemDiv/SemNov/Vendi.
    • Assumptions/dependencies: High-quality human preference data; careful trade-off tuning to avoid unsafe or incoherent outputs.
  • Curriculum planning via “idea-space maps”
    • Sector: education
    • Use case: Visualize clusters/gaps in generated problems across themes/concepts; plan coverage and spiral learning with diversity-aware sequencing.
    • Tools/products/workflows: Embeddings-based maps; coverage analytics; generator that targets underrepresented clusters.
    • Assumptions/dependencies: Robust clustering; teacher adoption; alignment with standards (e.g., CS curricula).
  • Simulation scenario generation for safety-critical domains
    • Sector: robotics/autonomous driving, cybersecurity
    • Use case: Generate diverse corner-case scenarios (creative adversarial situations) via divergent exploration, then converge to simulatable constraints.
    • Tools/products/workflows: Scenario language templates; simulators-in-the-loop; novelty thresholds to avoid near-duplicate cases.
    • Assumptions/dependencies: High-fidelity simulators; domain-specific validators; safety governance.
  • Diversity certification and auditing services
    • Sector: policy, enterprise compliance
    • Use case: Certify vendors’ AI content for diversity/novelty/utility; periodic re-audits to prevent homogenization over time.
    • Tools/products/workflows: Standard test suites; reference corpora; signed reports; remediation playbooks.
    • Assumptions/dependencies: Market demand; regulator or industry body endorsement.
  • Marketplace for diversity-guaranteed content packs
    • Sector: EdTech, content marketplaces
    • Use case: Sell vetted “problem packs” or content sets with published Vendi/SemNov scores and utility rates for transparent quality.
    • Tools/products/workflows: Metadata manifests; versioning; buyer-side importers to LMS/platforms.
    • Assumptions/dependencies: IP rights; shared metric standards; curation labor.
  • Research benchmarks for measuring LLM creativity
    • Sector: academia, AI evaluation
    • Use case: Standardized tasks and metric suites for creativity that stress-test models’ tendency toward premature convergence.
    • Tools/products/workflows: Public corpora; leaderboards; cross-model comparisons; ablations on DC scaffolding.
    • Assumptions/dependencies: Community adoption; careful dataset design to avoid leakage.
  • Organization-wide “anti-hivemind” policies for AI content
    • Sector: enterprise policy
    • Use case: Internal guidelines that require two-phase prompting and creativity audits for generated learning, support, or marketing content.
    • Tools/products/workflows: Prompt libraries; governance checklists; periodic diversity score reporting.
    • Assumptions/dependencies: Change management; toolchain integration; training for content teams.

Notes on feasibility across applications

  • Model access and cost: Batch generation plus judging/embedding incurs tokens and latency; budget controls and caching are essential.
  • Reliability: Utility checks (LLM-as-judge, code execution) reduce but do not eliminate errors; high-stakes use requires human review.
  • Data privacy and safety: When personal or proprietary contexts are used, ensure compliant handling and robust safety filters.
  • Metrics choice: SemDiv/SemNov/Vendi depend on embedding models; metric drift can occur if embeddings change—pin versions and monitor.
  • Persona simulation: Improves diversity but may impact utility; tune persona mix and apply guardrails for sensitive contexts.
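The code-execution utility check mentioned above can be approximated by running a model-generated solution together with test assertions in a subprocess and treating a zero exit code as passing. A minimal sketch (the function name and pass criterion are assumptions; high-stakes use would need real sandboxing and human review, as noted above):

```python
import subprocess
import sys
import tempfile

def passes_utility_check(solution_code, test_code, timeout=5):
    """Execute a generated solution plus its test assertions in a fresh
    Python subprocess; exit code 0 counts as a passing validity check."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(solution_code + "\n" + test_code)
        path = f.name
    try:
        result = subprocess.run(
            [sys.executable, path], capture_output=True, timeout=timeout
        )
        return result.returncode == 0
    except subprocess.TimeoutExpired:
        return False  # runaway solutions fail the check
```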

Glossary

  • Artificial Hivemind effect: A phenomenon where LLMs produce repetitive and homogeneous outputs within and across models. "LLMs are fundamentally limited by the ``Artificial Hivemind'' effect, where they generate similar responses within the same model and produce homogeneous outputs across different models."
  • CoT: Short for Chain-of-Thought prompting; a technique that elicits step-by-step reasoning before final answers. "The {CoT} adds a "Think step by step" instruction to the {Base} prompt."
  • Convergent thinking phase: The phase where ideas are narrowed and refined to meet constraints and produce a single solution. "(2) Convergent thinking phase: narrowing down to one idea and align it with the constraints (required programming concept) in the given context."
  • Cosine distance: A measure of dissimilarity between vectors based on the angle between them; used to quantify semantic differences. "To capture diversity beyond surface lexical overlap, we evaluate the semantic diversity of the set $\mathcal{S}$ as the average pairwise cosine distance between problem embeddings."
  • Cosine similarity: A measure of similarity between vectors given by their normalized dot product. "where $\mathbf{K}_{ij}$ is the cosine similarity between $\mathbf{e}(\mathcal{P}_i)$ and $\mathbf{e}(\mathcal{P}_j)$."
  • Decoding temperature: A sampling parameter controlling randomness in generation; higher values increase variability. "A naive method is to use a high decoding temperature, which increases surface-level diversity but does not improve originality and can even reduce creativity~\cite{lu2024discussion}."
  • Divergent thinking phase: The phase focused on generating many varied, unconventional ideas without early constraint satisfaction. "(1) Divergent thinking phase: exploring diverse and novel ideas related to the theme."
  • Eigenvalues: Scalars characterizing a matrix’s linear transformation in eigendecomposition; used in diversity metrics like Vendi. "where $\lambda_1, \ldots, \lambda_K$ are the eigenvalues of $\mathbf{K}/K$."
  • Functional fixedness: A cognitive bias that restricts creative use of familiar concepts or tools. "One of the reasons for this lack of creativity is ``functional fixedness,'' a cognitive bias that limits unconventional thinking."
  • Gaussian kernels: Smooth bell-shaped kernel functions used in nonparametric density estimation. "Figure \ref{fig:evaluation_semantic} visualizes the distribution of per-problem scores using kernel density estimation (KDE) with Gaussian kernels, where each method's distribution is independently normalized."
  • Greedy decoding: A deterministic decoding strategy that selects the highest-probability token at each step. "For utility evaluation, we use Gemini 2.5 Flash-Lite \cite{DBLP:journals/corr/abs-2507-06261} as the judge model with greedy decoding (i.e. temperature $= 0.0$)."
  • Kernel density estimation (KDE): A nonparametric method to estimate a probability density function from samples. "Figure \ref{fig:evaluation_semantic} visualizes the distribution of per-problem scores using kernel density estimation (KDE) with Gaussian kernels, where each method's distribution is independently normalized."
  • LLM-as-a-judge: An evaluation paradigm where an LLM assesses outputs for quality criteria. "We evaluate utility using an LLM-as-a-judge approach by using a strong LLM to assess context relevance and comprehensibility, and generates a solution program that we execute against $\mathcal{P}$ to verify validity, following prior work~\cite{DBLP:conf/nips/ZhengC00WZL0LXZ23,hivemind,DBLP:conf/aied/NguyenPGTS25,padmakumar2025measuringllmnoveltyfrontier}."
  • Mann-Whitney U test: A nonparametric statistical test for comparing differences between two independent distributions. "In the without persona setting, {CreativeDC} achieves 16.7% higher semantic diversity and 63.5% higher semantic novelty than {CoT} (all $p < 0.001$, Mann-Whitney U test)."
  • Mixture-of-experts model: An architecture that routes inputs to specialized expert subnetworks for improved capacity and efficiency. "We use an open-source model, Qwen3-235B-A22B-Instruct-2507 \cite{DBLP:journals/corr/abs-2505-09388}, a state-of-the-art mixture-of-experts model as our generation model."
  • n-grams: Contiguous sequences of n items (e.g., tokens) used to analyze lexical patterns. "We measure lexical diversity as the ratio of unique $n$-grams to the total number of $n$-grams across the set."
  • Persona augmentation: Adding persona instructions to prompts to diversify outputs. "In our experiments in Section \ref{sec:evaluation}, we conduct an experiment to evaluate the effect of persona augmentation by applying the same persona simulation to all evaluated methods."
  • Persona simulation: Instructing an LLM to adopt a specific persona to influence its perspective and style. "To further enhance the diversity of generated problems, we augment {CreativeDC} with persona simulation."
  • Reinforcement Learning from Human Feedback (RLHF) alignment: Training/aligning LLMs using human preference signals via RL, often leading to safer but more average outputs. "Moreover, Reinforcement Learning from Human Feedback (RLHF) alignment amplifies this convergence toward statistically average responses, reducing creativity~\cite{DBLP:conf/iclr/LuSHM0HEJCD025} and producing lower-entropy distributions~\cite{zhang2025noveltybench}."
  • Semantic diversity: Variation in meaning across outputs, typically measured via embedding-space distances. "To capture diversity beyond surface lexical overlap, we evaluate the semantic diversity of the set $\mathcal{S}$ as the average pairwise cosine distance between problem embeddings."
  • Semantic novelty: The degree to which an output is semantically distinct from a reference corpus. "Semantic Novelty. To capture novelty beyond surface level, we compute the minimum cosine distance between a problem embedding and the embeddings of all problems in the reference corpus: $\text{SemNov}(\mathcal{P}, \mathcal{R}) = \min_{\mathcal{P}' \in \mathcal{R}} d_{\cos}(\mathbf{e}(\mathcal{P}), \mathbf{e}(\mathcal{P}'))$."
  • Vendi Score: A diversity metric equal to the exponential of the entropy of the similarity matrix’s eigenvalue distribution; interpretable as the effective number of distinct items. "We report the Vendi Score~\cite{DBLP:journals/tmlr/FriedmanD23}, which can be interpreted as the effective number of distinct problems in a set."
  • Wilcoxon Signed-Rank Test: A nonparametric statistical test for paired samples to assess median differences. "In both settings without and with persona simulation, {CreativeDC} significantly outperforms all baselines on all lexical and semantic diversity and novelty metrics (all $p < 0.01$, Wilcoxon Signed-Rank Test), while maintaining utility comparable to the baselines ($p > 0.01$)."
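The metrics defined in this glossary can be reproduced from embeddings in a few lines. A minimal NumPy sketch, assuming nonzero embedding vectors; the function names are my own, and `lexical_diversity` uses simplified whitespace tokenization rather than the paper's tokenizer:

```python
import numpy as np

def _unit(E):
    """Row-normalize an embedding matrix to unit length."""
    E = np.asarray(E, dtype=float)
    return E / np.linalg.norm(E, axis=1, keepdims=True)

def semantic_diversity(E):
    """SemDiv: average pairwise cosine distance over all embedding pairs."""
    K = _unit(E) @ _unit(E).T                    # cosine similarity matrix
    iu = np.triu_indices(len(K), k=1)            # each unordered pair once
    return float(np.mean(1.0 - K[iu]))

def semantic_novelty(e, R):
    """SemNov: minimum cosine distance from embedding e to reference corpus R."""
    e = np.asarray(e, dtype=float)
    sims = _unit(R) @ (e / np.linalg.norm(e))
    return float(1.0 - sims.max())

def vendi_score(E):
    """Vendi Score: exp of the Shannon entropy of the eigenvalues of K/K,
    interpretable as the effective number of distinct items."""
    K = _unit(E) @ _unit(E).T
    lam = np.linalg.eigvalsh(K / len(K))
    lam = lam[lam > 1e-12]                       # drop numerical zeros
    return float(np.exp(-np.sum(lam * np.log(lam))))

def lexical_diversity(texts, n=2):
    """Ratio of unique n-grams to total n-grams across a set of texts."""
    grams = []
    for t in texts:
        toks = t.split()
        grams += [tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)]
    return len(set(grams)) / len(grams) if grams else 0.0
```

As sanity checks, a set of identical embeddings has a Vendi Score of 1, while mutually orthogonal embeddings yield a Vendi Score equal to the set size.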

Open Problems

We found no open problems mentioned in this paper.

Authors (2)
