- The paper introduces TELeR as a standardized taxonomy that categorizes LLM prompts based on turn, expression, level of detail, and role.
- The framework is demonstrated through two practical use cases, meta-review generation and narrative braiding, which illustrate how prompt detail shapes LLM outputs and supports consistent benchmarking.
- The study emphasizes the role of detailed prompt engineering in enhancing reproducibility and enabling meaningful comparisons in LLM evaluations.
TELeR: A General Taxonomy of LLM Prompts for Benchmarking Complex Tasks
The paper "TELeR: A General Taxonomy of LLM Prompts for Benchmarking Complex Tasks" (2305.11430) addresses limitations in the evaluation and benchmarking of LLMs on real-world complex tasks. While LLMs perform well on traditional, well-defined tasks, prompts for complex and often ill-defined tasks remain under-explored and unstandardized. The paper proposes a structured taxonomy, TELeR, to address these evaluation challenges.
Introduction to TELeR Taxonomy
The primary challenge in benchmarking LLMs on complex tasks is that performance depends heavily on the specificity and style of the prompt. Because complex-task prompts can vary widely in detail and style, results are hard to reproduce and compare across studies. TELeR addresses this by providing a standardized framework for categorizing and designing prompts; consolidating disparate prompting techniques under a unified taxonomy enables accurate and meaningful comparison of LLM capabilities across studies.
Prompt Engineering for Complex Tasks
Prompt engineering for complex tasks involves iteratively refining input prompts to maximize LLM efficacy in generating desired outputs. This requires a comprehensive understanding of the nuances of prompt design, such as specifying clear goals, breaking the task into distinct sub-tasks, and formulating few-shot examples. Given the complexity of these tasks, rich task context, varied expression styles, and an appropriate level of interactivity are crucial components of successful prompt engineering.
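As an illustration of these components, the sketch below assembles a prompt from a goal, optional sub-tasks, and few-shot examples. The helper `build_prompt` is hypothetical, not something defined in the paper; it simply shows how the pieces named above can be composed.

```python
def build_prompt(goal, sub_tasks=None, examples=None):
    """Assemble a complex-task prompt from its components.

    goal:      one-sentence description of the desired output.
    sub_tasks: optional list of distinct steps the LLM should perform.
    examples:  optional list of (input, output) few-shot pairs.
    """
    parts = [f"Goal: {goal}"]
    if sub_tasks:
        parts.append("Perform the following sub-tasks:")
        # Number the sub-tasks so the model can address them in order.
        parts += [f"  {i}. {task}" for i, task in enumerate(sub_tasks, 1)]
    if examples:
        parts.append("Examples:")
        parts += [f"  Input: {x}\n  Output: {y}" for x, y in examples]
    return "\n".join(parts)

prompt = build_prompt(
    "Summarize the peer reviews into a meta-review.",
    sub_tasks=[
        "Extract each reviewer's main concerns",
        "Note points of agreement and disagreement",
    ],
)
print(prompt)
```

Adding or removing components in this way is what moves a prompt up or down the detail levels discussed below.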
The Dimensions of TELeR
TELeR classifies prompts along four dimensions that provide a granular understanding of how different prompt designs influence LLM performance:
- Turn: Classifies the interaction style into single or multi-turn sessions, reflecting dialog history and sequence of interactions.
- Expression: Distinguishes between question-style and instruction-style directives, depending on the nature of task descriptions.
- Role: Evaluates the explicit definition of system roles within prompts, affecting the interpretative context.
- Level of Details: Spans seven distinct levels based on the depth of directive specifics, ranging from minimal detail to comprehensive directives with explanations, evaluation criteria, and illustrative examples.
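The four dimensions can be summarized as a simple record. The encoding below is an illustrative sketch, not something the paper prescribes; the paper defines the dimensions, while the field values here are assumptions for demonstration.

```python
from dataclasses import dataclass

@dataclass
class TelerPrompt:
    turn: str        # interaction style: "single" or "multi"
    expression: str  # directive style: "question" or "instruction"
    role: bool       # whether an explicit system role is defined
    level: int       # 0-6, minimal detail up to full directives (seven levels)

# Example classification of one prompt under the taxonomy.
p = TelerPrompt(turn="single", expression="instruction", role=True, level=3)
```

Tagging every prompt in a study with such a record is what makes cross-study comparisons of LLM results meaningful.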
Practical Use Cases
The paper illustrates TELeR's practical application through two use cases: meta-review generation from peer reviews and narrative braiding. These cases highlight the taxonomy's relevance in structuring prompts for tasks exceeding simple information retrieval or synthesis. These examples demonstrate how applying the TELeR framework enables consistent task formulation for complex scenarios, ensuring meaningful LLM performance evaluation.
Use Case 1: Meta-Review Generation
Meta-review generation exemplifies TELeR's utility in collating peer-review sentiments into a cohesive narrative. Varying the prompt level, from a basic directive to a detailed task breakdown, illustrates how each level influences the quality and relevance of the LLM's output.
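To make the idea of increasing detail levels concrete, the snippet below sketches meta-review prompts at three levels. The wording is illustrative only, not the paper's exact prompts, and only a subset of the seven levels is shown.

```python
# Illustrative meta-review prompts at increasing detail levels
# (hypothetical wording; the paper defines seven levels, 0-6).
meta_review_prompts = {
    1: "Prepare a meta-review from the given peer reviews.",
    2: ("Prepare a meta-review by summarizing the reviewers' main points, "
        "including strengths, weaknesses, and the overall recommendation."),
    3: ("Prepare a meta-review. Perform these sub-tasks: "
        "(1) extract each reviewer's key criticisms; "
        "(2) identify points of agreement and disagreement; "
        "(3) synthesize them into a single cohesive recommendation."),
}

for level, text in sorted(meta_review_prompts.items()):
    print(f"Level {level}: {text}")
```

Comparing outputs across such a ladder of prompts is what lets a study attribute quality differences to prompt detail rather than to the model alone.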
Use Case 2: Narrative Braiding
In narrative braiding, multiple storylines are interwoven into a coherent narrative. By utilizing the structured decomposition outlined in TELeR, prompts can better guide LLMs in synthesizing overlapping and unique narrative elements into a unified storyline.
Conclusion
TELeR promises to advance LLM benchmarking by promoting a standardized approach to prompt categorization, improving the validity of comparisons, and enhancing reproducibility in research. While not exhaustive, TELeR represents an initial step toward a comprehensive taxonomy tailored to complex-task evaluation. Researchers and developers are encouraged to leverage this framework to foster consensus on LLM performance assessments and to contribute to its iterative refinement for broader applicability.