Creative Beam Search: LLM-as-a-Judge For Improving Response Generation

Published 30 Apr 2024 in cs.AI, cs.CL, cs.HC, and cs.LG | (2405.00099v4)

Abstract: LLMs are revolutionizing several areas, including artificial creativity. However, the process of generation in machines profoundly diverges from that observed in humans. In particular, machine generation is characterized by a lack of intentionality and an underlying creative process. We propose a method called Creative Beam Search that uses Diverse Beam Search and LLM-as-a-Judge to perform response generation and response validation. The results of a qualitative experiment show how our approach can provide better output than standard sampling techniques. We also show that the response validation step is a necessary complement to the response generation step.

Summary

  • The paper introduces Creative Beam Search (CBS), which integrates Diverse Beam Search with an LLM-as-a-Judge mechanism to enhance the creativity of generated responses.
  • CBS employs a multi-step process that mimics human creativity by using self-evaluation to select outputs based on subjective preference rather than mere probability.
  • Experimental results demonstrate that CBS improved creative preference by 45% compared to standard sampling, highlighting its potential in advancing computational creativity.

Introduction

The paper, "Creative Beam Search: LLM-as-a-Judge For Improving Response Generation," presents a novel methodology designed to bridge the gap between LLMs and human-like creativity. Traditional generative models often fall short in capturing elements of human creativity due to their inherent lack of intentionality and absence of a systematic creative process. The authors propose a method named Creative Beam Search (CBS) that leverages Diverse Beam Search (DBS) alongside LLM-as-a-Judge to enhance both response generation and validation phases. Through qualitative experiments, CBS is demonstrated to produce responses that are subjectively judged to be more creative than those generated by conventional sampling techniques.

Creative Beam Search Methodology

The Creative Beam Search method is inspired by the componential model of creativity, which involves steps such as task presentation, preparation, response generation, and response validation. CBS incorporates these steps by:

  1. Using Diverse Beam Search to simulate the response generation phase, promoting diversity among generated solutions.
  2. Implementing an LLM-as-a-Judge mechanism to conduct a self-evaluation and select the final output based on preference rather than mere probability maximization.

Response Generation: CBS employs Diverse Beam Search, which partitions the beam budget into groups and penalizes each group for repeating the choices of earlier groups. Standard beam search tends to converge on a narrow set of near-identical candidates; DBS counteracts this and so yields the more varied candidate pool that a creativity-oriented method needs.
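The group-wise penalty can be illustrated with a toy, single-step sketch. This is not the authors' implementation: it assumes one beam per group and a single shared next-token distribution (real Diverse Beam Search keeps separate prefixes per beam), and the function names `hamming_penalized` and `diverse_step` are hypothetical.

```python
from collections import Counter


def hamming_penalized(log_probs, chosen_by_earlier_groups, strength):
    """Subtract `strength` for every earlier group that already chose a token."""
    counts = Counter(chosen_by_earlier_groups)
    return {tok: lp - strength * counts[tok] for tok, lp in log_probs.items()}


def diverse_step(log_probs, num_groups, strength):
    """One decoding step: groups pick in turn, each seeing earlier picks.

    Simplifying assumption: every group has one beam and shares `log_probs`.
    """
    chosen = []
    for _ in range(num_groups):
        scores = hamming_penalized(log_probs, chosen, strength)
        chosen.append(max(scores, key=scores.get))
    return chosen


# Toy next-token log-probabilities (illustrative values only).
toy_log_probs = {"the": -0.1, "a": -0.5, "one": -2.0}

no_penalty = diverse_step(toy_log_probs, num_groups=3, strength=0.0)
strong_penalty = diverse_step(toy_log_probs, num_groups=3, strength=10.0)
```

With no penalty every group picks the same most likely token; with a large penalty each group is pushed towards a token that earlier groups have not used.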

Response Validation: The CBS method uses LLM-as-a-Judge for self-assessment, allowing the model to rank candidates and mitigate positional bias through balanced position calibration. The candidate with the highest cumulative preference is selected as the final output.

Figure 1: The Creative Beam Search method. Given a user prompt (step 0), DBS samples K candidate solutions from a pre-trained LLM (step 1). Then, K evaluative prompts are composed by altering the order of the candidates and are passed to the model as inputs (step 2). The candidate with the most preferences is finally outputted.
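The voting step in Figure 1 can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's code: the K orderings are produced by rotating the candidate list (the source only says the order is altered), `judge` is a hypothetical callable standing in for the LLM that returns the position of its preferred candidate, and ties are broken in favour of the earlier candidate.

```python
from collections import Counter
from typing import Callable, List, Sequence


def rotations(items: Sequence[str]) -> List[List[str]]:
    """All K cyclic orderings, so each candidate appears once in each position."""
    return [list(items[i:]) + list(items[:i]) for i in range(len(items))]


def select_by_votes(candidates: Sequence[str],
                    judge: Callable[[List[str]], int]) -> str:
    """Ask the judge once per ordering and return the most-voted candidate."""
    votes = Counter()
    for ordering in rotations(candidates):
        picked_position = judge(ordering)  # index into `ordering`
        votes[ordering[picked_position]] += 1
    # max() keeps the first maximal element, so ties go to the earlier candidate.
    return max(candidates, key=lambda c: votes[c])


def always_first(ordering: List[str]) -> int:
    """A purely position-biased stand-in judge: always prefers slot 0."""
    return 0


def prefers_longest(ordering: List[str]) -> int:
    """A content-based stand-in judge: prefers the longest candidate."""
    return max(range(len(ordering)), key=lambda i: len(ordering[i]))


pool = ["short", "a much longer reply", "medium one"]
biased_choice = select_by_votes(pool, always_first)
content_choice = select_by_votes(pool, prefers_longest)
```

Because every candidate occupies every position exactly once, a judge that only follows position spreads its votes evenly and cannot favour any candidate, while a judge with a real content preference still produces a clear winner.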

Experiments and Results

The experiments ran CBS on the 7B-parameter, RLHF-tuned version of Llama 2. Graduate students supplied prompts and, for each one, chose the more creative of two outputs: one generated by CBS and one by standard sampling.

Setup: Generation used a beam budget of 8 and a diversity scaling factor of 10. Candidates were generated with Diverse Beam Search, and the top K of them were then self-evaluated via LLM-as-a-Judge to select the most creative one.
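For readers who want to reproduce a similar setup, the stated hyperparameters could be collected in a small configuration object. This is only a sketch: `CBSConfig` and its field names are hypothetical, the number of groups is not given in this summary and is therefore left as a required argument, and the divisibility check reflects how Diverse Beam Search splits its budget evenly across groups.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CBSConfig:
    """Hypothetical container for the decoding settings described above."""

    num_groups: int                   # not stated in this summary; caller chooses
    beam_budget: int = 8              # beam budget from the setup
    diversity_strength: float = 10.0  # diversity scaling factor from the setup

    def __post_init__(self):
        # Diverse Beam Search gives every group the same number of beams.
        if self.num_groups < 1 or self.beam_budget % self.num_groups != 0:
            raise ValueError("beam_budget must split evenly into num_groups")

    @property
    def beams_per_group(self) -> int:
        return self.beam_budget // self.num_groups


config = CBSConfig(num_groups=4)  # 4 is an arbitrary illustrative choice
```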

Findings: CBS was preferred in 45% of cases, a noticeable improvement over standard sampling. Although the two outputs were sometimes too similar for users to tell apart, self-evaluation showed a clear benefit in refining the creativity of the final response.

Figure 2: The interface presented to the end-users during our experiment. After inserting a prompt with a creative request, two options are shown in a random order: the CBS output and the standard sampling output. The user is then asked to indicate which is the most creative in their opinion (or if the two options are too similar to decide).

The study also revealed that self-evaluation meaningfully altered the choice among candidates, with CBS achieving distinct outcomes compared to the naive application of DBS.

Figure 3: Percentage of end-users' preferences comparing when CBS output is equal to DBS output and when it is not.

Discussion

While CBS represents a step towards aligning generative models with creative processes, significant challenges persist. Because Diverse Beam Search relies on Hamming diversity, it may still produce sequences that are overly similar. The LLM-as-a-Judge paradigm, despite its advantages, does not emulate a genuinely intentional evaluation process, since LLMs lack consciousness.
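One possible way to see the limitation of a position-wise diversity measure is the toy comparison below. It is an illustration only, with hypothetical helper names and made-up token lists: two sequences that differ at every position, and so look maximally diverse under Hamming distance, can still share most of their words.

```python
def hamming_distance(a, b):
    """Number of positions at which two equal-length token lists differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def token_overlap(a, b):
    """Jaccard overlap of the two token sets, ignoring position."""
    set_a, set_b = set(a), set(b)
    return len(set_a & set_b) / len(set_a | set_b)


first = ["the", "cat", "sat", "down"]
second = ["so", "the", "cat", "sat"]  # same words, shifted by one position

position_wise_difference = hamming_distance(first, second)
shared_vocabulary = token_overlap(first, second)
```

Here every position differs, yet three of the five distinct tokens are shared, so the two candidates would read as near-duplicates to a human.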

Future exploration could focus on extending this framework to more sophisticated LLMs or incorporating broader sets of diverse candidates for evaluation. Furthermore, aligning CBS with models specifically fine-tuned for creativity could provide deeper insights into potential gains in the field of computational creativity.

Conclusion

Creative Beam Search offers a promising approach towards incorporating creativity-oriented mechanisms in LLM response generation, as evidenced by qualitative preferences amongst users. Although challenges remain, the potential for CBS to enhance creative collaboration with AI systems suggests fruitful avenues for future research in computational creativity and generative modeling.
