- The paper introduces CodeVisionary, a dual-stage framework that automates the evaluation of LLM-generated code by integrating multisource knowledge analysis with negotiation-based scoring.
- The framework outperforms baseline evaluators by achieving higher correlation scores (Pearson, Spearman, Kendall-Tau) with human judgments on challenging benchmarks.
- The approach produces detailed Markdown reports covering environment setup, task decomposition, execution results, and optimization suggestions for developers.
This paper introduces CodeVisionary, an agent-based framework designed to evaluate the code generation capabilities of LLMs (arXiv:2504.13472). It addresses the limitations of existing evaluation methods: human evaluation is costly and time-consuming, metric-based methods often require hard-to-obtain reference code or tests, and current LLM-based methods lack access to diverse knowledge sources (such as up-to-date documentation, runtime information, or visual feedback) and struggle to comprehend complex code.
CodeVisionary operates in two main stages:
- Multisource Knowledge Analysis Stage: This stage aims to gather comprehensive information needed for evaluation. An LLM agent orchestrates this process through a four-phase cycle:
- Construct: Sets up an isolated, executable environment (using Docker) based on the code generation task and the LLM's response. This includes installing necessary language interpreters, dependencies, and configuration files.
- Comprehend: The agent breaks down the initial code generation task into smaller, specific requirements to better understand the evaluation scope.
- Plan: The agent formulates a step-by-step evaluation plan. Each step has a goal (e.g., "Static Linter Analysis", "Dynamic Execution Analysis") and guidance on the action to take. Possible actions include:
  - Dynamic Execution Analysis: Running the code (e.g., python test.py, gcc test.c && ./output).
  - Static Linter Analysis: Checking syntax, style, and potential issues using appropriate linters (linter_analysis -f 'path').
  - Unit Tests Analysis: Writing and executing unit tests to check functionality and reliability.
  - Screenshot Analysis: Rendering front-end code (e.g., HTML/CSS) into an image and using a multimodal LLM for visual analysis (screenshot_analysis -f 'path' -q 'query').
  - Interaction Analysis: Simulating user interactions (clicks, hovers, input) on front-end code before screenshotting (screenshot_analysis -a 'actions').
  - Web Browsing Analysis: Searching the web for information, such as documentation for new technologies (web_browse -q 'query').
  - General Semantic Analysis: Leveraging the LLM's own understanding to evaluate code logic, complexity, security, etc.
  - Bash Command: Performing file system operations (writing files, reading files, etc.).
- Analyze: The agent executes the plan step-by-step, interacting with the environment. It alternates between an Execute State (performing the planned action) and an Analyze State (analyzing the observation returned by the environment and generating a report for that step using predefined templates). Hints guide the agent based on its current state, and the per-step reports are collected.
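The four-phase cycle above can be sketched as a simple control loop. All class and method names below are illustrative stand-ins for the paper's components (Docker-backed environment, controlling LLM agent), not its actual interface:

```python
from dataclasses import dataclass

@dataclass
class Step:
    goal: str      # e.g., "Static Linter Analysis"
    action: str    # e.g., "linter_analysis -f 'main.py'"

class DummyEnv:
    """Stand-in for the isolated, Docker-backed execution environment."""
    def construct(self, task, response):
        self.ready = True                      # Construct: env, deps, configs
    def execute(self, action):
        return f"ran: {action}"                # observation from the environment

class DummyAgent:
    """Stand-in for the controlling LLM agent."""
    def comprehend(self, task):
        return [f"requirement derived from: {task}"]   # Comprehend: decompose
    def plan(self, task, requirements):
        return [Step("Static Linter Analysis", "linter_analysis -f 'main.py'"),
                Step("Dynamic Execution Analysis", "python main.py")]
    def analyze(self, step, observation):
        return {"goal": step.goal, "observation": observation}  # templated report

def run_analysis_stage(task, response, agent, env):
    env.construct(task, response)              # Construct
    requirements = agent.comprehend(task)      # Comprehend
    plan = agent.plan(task, requirements)      # Plan
    reports = []
    for step in plan:                          # Analyze: Execute/Analyze loop
        observation = env.execute(step.action)         # Execute State
        reports.append(agent.analyze(step, observation))  # Analyze State
    return reports
```

The collected `reports` list is what the scoring stage consumes.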
- Negotiation-based Scoring Stage: To address potential biases and improve the assessment of complex code, this stage employs multiple LLM agents (e.g., 3 judges) who discuss and debate the evaluation.
- Each judge A_i initially provides a score S_i and reasoning R_i based on the information gathered in the first stage and predefined criteria (correctness, functionality, clarity).
- Scores and reasons are shared among judges.
- Judges engage in multiple rounds of discussion (e.g., up to 5 rounds). In each round, a judge can maintain their score, change their score with justification, or query another judge.
- The process terminates when consensus is reached or the maximum number of rounds is exceeded.
- The final evaluation score is the average of the judges' final scores.
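A minimal sketch of the negotiation protocol, assuming a toy revision policy (each judge moves halfway toward the shared group mean); the real judges are LLM agents whose revisions come from free-form discussion, not a fixed formula:

```python
class ToyJudge:
    """Hypothetical judge with a deterministic revision policy."""
    def __init__(self, name, score):
        self.name, self.score = name, score
    def initial_score(self, evidence):
        return self.score
    def revise(self, shared, evidence):
        # toy policy: move halfway toward the group mean of shared scores
        mean = sum(shared.values()) / len(shared)
        self.score = (self.score + mean) / 2
        return self.score

def negotiate(judges, evidence, max_rounds=5, tol=0.5):
    """Score, share, and revise until consensus or max_rounds; return the average."""
    scores = {j.name: j.initial_score(evidence) for j in judges}
    for _ in range(max_rounds):
        if max(scores.values()) - min(scores.values()) <= tol:
            break                              # consensus reached
        shared = dict(scores)                  # scores broadcast to all judges
        scores = {j.name: j.revise(shared, evidence) for j in judges}
    return sum(scores.values()) / len(scores)  # final score = average
```

With three judges starting at 2, 4, and 6, the toy policy converges toward the mean of 4 within a few rounds.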
Evaluation Report Generation:
CodeVisionary generates a detailed Markdown report summarizing the entire evaluation process. This report includes:
- The original code task and the LLM's response.
- The final evaluation score.
- Details of the environment setup.
- The decomposed task requirements.
- Step-by-step results from the analysis stage (including execution outputs, linter messages, screenshots, test results, etc.).
- The final evaluation reasoning derived from the negotiation stage.
- Optimization suggestions for the evaluated code.
The report is automatically formatted using Prettier and can be converted to PDF using Pandoc.
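Assembling the report might look like the following sketch. The section names mirror this summary rather than the paper's exact template, and the Prettier/Pandoc commands are shown only as comments:

```python
def render_report(task, response, score, env_details, requirements,
                  step_reports, reasoning, suggestions):
    """Illustrative sketch: assemble the Markdown evaluation report."""
    lines = [
        "# CodeVisionary Evaluation Report",
        "## Task", task,
        "## LLM Response", response,
        "## Final Score", f"{score:.1f}",
        "## Environment Setup", env_details,
        "## Decomposed Requirements",
        *[f"- {r}" for r in requirements],
        "## Step-by-Step Analysis",
        *[f"### {r['goal']}\n{r['observation']}" for r in step_reports],
        "## Evaluation Reasoning", reasoning,
        "## Optimization Suggestions",
        *[f"- {s}" for s in suggestions],
    ]
    return "\n\n".join(lines)

# The rendered file can then be formatted and converted, e.g.:
#   prettier --write report.md
#   pandoc report.md -o report.pdf
```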
Implementation & Experiments:
- The framework uses an LLM (GPT-4o in experiments) as the controlling agent.
- Interactions involve the agent outputting "thought" (reasoning) and "action" (command to execute).
- Experiments were conducted on a benchmark derived from CodeArena (hard tasks), with responses generated by GPT-3.5-turbo, Claude-3.5-Sonnet, and GPT-4o, and manually scored by experts.
- CodeVisionary significantly outperformed baseline LLM-based evaluators (VANILLA, ICE-Score, CODEJUDGE) on correlation metrics (Pearson, Spearman, Kendall-Tau) against human judgments.
- Ablation studies confirmed the positive impact of both the Multisource Knowledge Analysis and Negotiation-based Scoring stages.
- The framework showed strong performance across various programming languages and coding scenarios, particularly excelling in evaluating UI-related tasks (leveraging Screenshot and Interaction Analysis) and tasks involving newer technologies (leveraging Web Browsing Analysis).
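For reference, the three reported correlation metrics can be computed in pure Python as below. The Kendall implementation is the tie-free tau-a variant, an assumption on my part; the paper does not specify which variant it uses:

```python
import math

def pearson(x, y):
    """Pearson correlation: covariance over the product of std deviations."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def ranks(v):
    """Ranks with tied values receiving the average of their positions."""
    order = sorted(range(len(v)), key=lambda i: v[i])
    r = [0.0] * len(v)
    i = 0
    while i < len(v):
        j = i
        while j + 1 < len(v) and v[order[j + 1]] == v[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

def kendall_tau(x, y):
    """Kendall tau-a: (concordant - discordant) pairs over all pairs."""
    n = len(x)
    c = d = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (x[i] - x[j]) * (y[i] - y[j])
            if s > 0:
                c += 1
            elif s < 0:
                d += 1
    return (c - d) / (n * (n - 1) / 2)
```

In the paper's setting, `x` would be the framework's scores and `y` the expert scores over the benchmark tasks.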
In summary, CodeVisionary provides a structured, automated, and comprehensive approach to evaluating LLM-generated code. By integrating external tools, multi-source knowledge gathering, and a multi-agent negotiation process, it aims to produce more accurate, reliable, and interpretable evaluations compared to existing methods, complete with detailed reports useful for developers.