
EvalAgent: Discovering Implicit Evaluation Criteria from the Web

Published 21 Apr 2025 in cs.CL (arXiv:2504.15219v1)

Abstract: Evaluation of LLM outputs on structured writing tasks is typically conducted with a number of desirable criteria presented to human evaluators or LLMs. For instance, on a prompt like "Help me draft an academic talk on coffee intake vs research productivity", a model response may be evaluated for criteria like accuracy and coherence. However, high-quality responses should do more than just satisfy basic task requirements. An effective response to this query should include quintessential features of an academic talk, such as a compelling opening, clear research questions, and a takeaway. To help identify these implicit criteria, we introduce EvalAgent, a novel framework designed to automatically uncover nuanced and task-specific criteria. EvalAgent first mines expert-authored online guidance. It then uses this evidence to propose diverse, long-tail evaluation criteria that are grounded in reliable external sources. Our experiments demonstrate that the grounded criteria produced by EvalAgent are often implicit (not directly stated in the user's prompt), yet specific (high degree of lexical precision). Further, EvalAgent criteria are often not satisfied by initial responses but they are actionable, such that responses can be refined to satisfy them. Finally, we show that combining LLM-generated and EvalAgent criteria uncovers more human-valued criteria than using LLMs alone.

Summary

  • The paper introduces EvalAgent, a framework that leverages expert web data to uncover implicit, task-specific evaluation criteria for LLM outputs.
  • It employs query generation, expert retrieval, and criteria synthesis to identify nuanced writing qualities beyond explicit prompts.
  • Experiments show that EvalAgent criteria are more specific, implicit, and actionable, leading to improved response refinement and alignment with human judgment.

This paper introduces EvalAgent, a framework designed to automatically discover implicit evaluation criteria for LLM outputs on complex writing tasks (2504.15219). The core problem addressed is that standard evaluation often relies on criteria explicitly stated in the prompt (e.g., "write an academic talk") or very obvious unstated ones (e.g., "be coherent"), missing the nuanced, task-specific qualities that define high-quality writing (e.g., an academic talk should have a compelling opening, clear research questions, and a takeaway).

EvalAgent aims to uncover these implicit criteria—unstated but desired properties specific to the task—by leveraging expert knowledge available on the web. The framework operates in several steps:

  1. Query Generator: Given a user prompt, an LLM generates conceptual search queries (e.g., "how to draft an academic talk," "how to write an engaging talk") designed to retrieve instructional web documents relevant to the type of writing requested, not just the topic.
  2. Expert Retriever: For each query, it searches the web, retrieves URLs, and filters them based on expertise and relevance to the original prompt using an LLM scorer. It then extracts answers to the query from the top-ranked filtered documents (e.g., university websites, expert blogs) and summarizes these answers into a query-specific list of criteria.
  3. Criteria Generator: It aggregates the criteria lists generated for all queries, synthesizes them into a unified list, and rewrites them to be specific evaluation points aligned with the original user prompt (e.g., "the response should focus on big picture questions").
  4. Ranking: The generated criteria are ranked based on their relevance to the user prompt using an LLM.
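
The four steps above can be sketched as a single pipeline. Everything in this sketch is an illustrative assumption: the prompt strings, the score threshold, and the injected `llm` and `search` callables stand in for the paper's actual models, prompts, and retrieval stack.

```python
# Illustrative sketch of the EvalAgent pipeline; prompts, thresholds,
# and the injected `llm`/`search` callables are assumptions, not the
# paper's implementation.

def generate_queries(prompt: str, llm) -> list[str]:
    """Step 1: conceptual 'how to' search queries about the writing type."""
    out = llm("List web search queries for expert advice on this "
              f"writing task:\n{prompt}")
    return [q.strip() for q in out.splitlines() if q.strip()]

def score_doc(doc: str, prompt: str, llm) -> int:
    """Step 2a: LLM-scored expertise and relevance of a retrieved page."""
    return int(llm("Score 0-10 the expertise and relevance of this page "
                   f"to '{prompt}':\n{doc}"))

def evalagent_criteria(prompt: str, llm, search, min_score: int = 7,
                       top_k: int = 10) -> list[str]:
    """Steps 1-4: query, retrieve and filter, synthesize, rank."""
    per_query_criteria = []
    for query in generate_queries(prompt, llm):
        expert_docs = [d for d in search(query)
                       if score_doc(d, prompt, llm) >= min_score]
        if expert_docs:  # Step 2b: summarize answers into criteria
            per_query_criteria.append(
                llm("Summarize this advice into evaluation criteria:\n"
                    + "\n".join(expert_docs)))
    # Step 3: aggregate, dedupe, and rewrite against the original prompt
    merged = llm("Merge and rewrite as specific evaluation criteria for "
                 f"'{prompt}':\n" + "\n".join(per_query_criteria))
    criteria = [c.strip() for c in merged.splitlines() if c.strip()]
    # Step 4: rank by LLM-judged relevance to the prompt
    criteria.sort(key=lambda c: -int(
        llm(f"Score 0-10 the relevance of '{c}' to '{prompt}'")))
    return criteria[:top_k]
```

Injecting the `llm` and `search` callables keeps the sketch testable with deterministic fakes, which is also a reasonable way to unit-test any agentic pipeline of this shape.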

The paper proposes that ideal criteria should possess three properties:

  • Specificity (S): Using precise, less common terms (measured by Normalized Inverse Word Frequency).
  • Implicitness (I): Not directly overlapping with the words in the original prompt (measured by 1 - Word Overlap).
  • Actionability (A): Enabling tangible improvements when used to guide response revision (measured by the success rate of revising an initial response to satisfy the criterion).
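
The first two measures can be rendered as short, executable functions. The exact normalization and tokenization in the paper may differ; `word_freq` below is a stand-in for corpus word-frequency counts, and the NIWF form shown is one common variant.

```python
import math
import re

def _words(text: str) -> list[str]:
    # Crude tokenizer; the paper's tokenization may differ.
    return re.findall(r"[a-z']+", text.lower())

def specificity(criterion: str, word_freq: dict[str, int]) -> float:
    """Normalized Inverse Word Frequency (one common form): rare words
    push the score toward 1, very frequent words toward 0.
    Assumes word_freq has at least one word with frequency > 1."""
    words = _words(criterion)
    if not words:
        return 0.0
    max_freq = max(word_freq.values())
    niwf = [math.log(max_freq / word_freq.get(w, 1)) / math.log(max_freq)
            for w in words]
    return sum(niwf) / len(niwf)

def implicitness(criterion: str, prompt: str) -> float:
    """1 - word overlap: share of criterion words absent from the prompt."""
    cw, pw = _words(criterion), set(_words(prompt))
    if not cw:
        return 0.0
    return 1.0 - sum(w in pw for w in cw) / len(cw)
```

For example, a criterion that merely echoes the prompt scores 0 on implicitness, while one built from rare, precise vocabulary scores near 1 on specificity.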

Experiments were conducted across nine datasets, including a newly collected dataset, Ask-then-Critique, in which users evaluated LLM responses to their own prompts. EvalAgent's generated criteria (EA-Web) were compared against baselines including Instruction Decomposition (ID; criteria stated explicitly in the prompt) and LLM-prompted criteria (LLM, plus LLM-n, which generates a larger pool of criteria and then ranks them).

Key findings include:

  • EvalAgent criteria exhibit higher specificity and implicitness scores compared to LLM-generated criteria.
  • Human evaluations rated EvalAgent criteria as less obvious than LLM-generated ones while maintaining high utility.
  • EvalAgent criteria demonstrated higher actionability: they produced larger improvements when used to guide response refinement, and they surfaced requirements that initial model outputs often failed to meet.
  • Combining EvalAgent criteria with LLM-generated criteria (EA-Full) resulted in higher recall of human-written criteria compared to purely LLM-based methods generating a similar number of criteria.
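
The actionability measure behind these findings can be pictured as a revise-and-judge loop. The `revise` and `judge` callables below are hypothetical stand-ins for the LLM reviser and LLM evaluator, and the paper reports the success rate of such a loop over many criteria, not a single boolean.

```python
def is_actionable(criterion: str, response: str, revise, judge,
                  max_rounds: int = 2) -> bool:
    """Try to revise `response` to satisfy `criterion`; succeed if any
    revision passes the judge. `revise(criterion, response) -> str` and
    `judge(criterion, response) -> bool` are assumed LLM-backed calls."""
    for _ in range(max_rounds):
        response = revise(criterion, response)
        if judge(criterion, response):
            return True
    return False

def actionability(criteria: list[str], response: str, revise, judge) -> float:
    """Success rate of revision across a set of criteria."""
    if not criteria:
        return 0.0
    wins = sum(is_actionable(c, response, revise, judge) for c in criteria)
    return wins / len(criteria)
```

A criterion that is specific and grounded ("state a clear takeaway") tends to pass this loop, while a vague one ("be good") gives the reviser nothing concrete to change.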

The main contributions are the EvalAgent framework itself, the introduction of metrics to evaluate criteria quality (Specificity, Implicitness, Actionability), and the demonstration that mining web-based expert advice allows for the scalable generation of nuanced, actionable, and human-aligned evaluation criteria for LLMs.
