- The paper introduces CodeVisionary, a dual-stage framework that automates the evaluation of LLM-generated code by integrating multisource knowledge analysis with negotiation-based scoring.
- The framework outperforms baseline evaluators by achieving higher correlation scores (Pearson, Spearman, Kendall-Tau) with human judgments on challenging benchmarks.
- The approach produces detailed Markdown reports covering environment setup, task decomposition, execution results, and optimization suggestions for developers.
This paper introduces CodeVisionary, an agent-based framework designed to evaluate the code generation capabilities of LLMs (arXiv:2504.13472). It addresses the limitations of existing evaluation methods: human evaluation is costly and time-consuming, metric-based methods often require hard-to-obtain reference code or tests, and current LLM-based methods lack access to diverse knowledge sources (such as up-to-date documentation, runtime information, or visual feedback) and struggle to comprehend complex code.
CodeVisionary operates in two main stages:
- Multisource Knowledge Analysis Stage: This stage aims to gather comprehensive information needed for evaluation. An LLM agent orchestrates this process through a four-phase cycle:
- Construct: Sets up an isolated, executable environment (using Docker) based on the code generation task and the LLM's response. This includes installing necessary language interpreters, dependencies, and configuration files.
- Comprehend: The agent breaks down the initial code generation task into smaller, specific requirements to better understand the evaluation scope.
- Plan: The agent formulates a step-by-step evaluation plan. Each step has a goal (e.g., "Static Linter Analysis", "Dynamic Execution Analysis") and guidance on the action to take. Possible actions include:
  - Dynamic Execution Analysis: Running the code (e.g., python test.py, gcc test.c && ./output).
  - Static Linter Analysis: Checking syntax, style, and potential issues using appropriate linters (linter_analysis -f 'path').
  - Unit Tests Analysis: Writing and executing unit tests to check functionality and reliability.
  - Screenshot Analysis: Rendering front-end code (e.g., HTML/CSS) into an image and using a multimodal LLM for visual analysis (screenshot_analysis -f 'path' -q 'query').
  - Interaction Analysis: Simulating user interactions (clicks, hovers, input) on front-end code before screenshotting (screenshot_analysis -a 'actions').
  - Web Browsing Analysis: Searching the web for information, such as documentation for new technologies (web_browse -q 'query').
  - General Semantic Analysis: Leveraging the LLM's own understanding to evaluate code logic, complexity, security, etc.
  - Bash Command: Performing file system operations (writing files, reading files, etc.).
- Analyze: The agent executes the plan step-by-step, interacting with the environment. It alternates between an Execute State (performing the planned action) and an Analyze State (analyzing the observation returned by the environment and generating a report for that step using predefined templates). Hints guide the agent based on its current state, and the per-step reports are collected.
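The four-phase cycle above can be sketched as a simple control loop. All class and method names below are illustrative stand-ins for the paper's components (Docker-backed environment, controlling LLM agent), not its actual interface:

```python
from dataclasses import dataclass

@dataclass
class Step:
    goal: str      # e.g., "Static Linter Analysis"
    action: str    # e.g., "linter_analysis -f 'main.py'"

class DummyEnv:
    """Stand-in for the isolated, Docker-backed execution environment."""
    def construct(self, task, response):
        self.ready = True                      # Construct: env, deps, configs
    def execute(self, action):
        return f"ran: {action}"                # observation from the environment

class DummyAgent:
    """Stand-in for the controlling LLM agent."""
    def comprehend(self, task):
        return [f"requirement derived from: {task}"]   # Comprehend: decompose
    def plan(self, task, requirements):
        return [Step("Static Linter Analysis", "linter_analysis -f 'main.py'"),
                Step("Dynamic Execution Analysis", "python main.py")]
    def analyze(self, step, observation):
        return {"goal": step.goal, "observation": observation}  # templated report

def run_analysis_stage(task, response, agent, env):
    env.construct(task, response)              # Construct
    requirements = agent.comprehend(task)      # Comprehend
    plan = agent.plan(task, requirements)      # Plan
    reports = []
    for step in plan:                          # Analyze: Execute/Analyze loop
        observation = env.execute(step.action)         # Execute State
        reports.append(agent.analyze(step, observation))  # Analyze State
    return reports
```

The collected `reports` list is what the scoring stage consumes.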
- Negotiation-based Scoring Stage: To address potential biases and improve the assessment of complex code, this stage employs multiple LLM agents (e.g., 3 judges) who discuss and debate the evaluation.
- Each judge A_i initially provides a score S_i and reasoning R_i based on the information gathered in the first stage and predefined criteria (correctness, functionality, clarity).
- Scores and reasons are shared among judges.
- Judges engage in multiple rounds of discussion (e.g., up to 5 rounds). In each round, a judge can maintain their score, change their score with justification, or query another judge.
- The process terminates when consensus is reached or the maximum number of rounds is exceeded.
- The final evaluation score is the average of the judges' final scores.
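A minimal sketch of the negotiation protocol, assuming a toy revision policy (each judge moves halfway toward the shared group mean); the real judges are LLM agents whose revisions come from free-form discussion, not a fixed formula:

```python
class ToyJudge:
    """Hypothetical judge with a deterministic revision policy."""
    def __init__(self, name, score):
        self.name, self.score = name, score
    def initial_score(self, evidence):
        return self.score
    def revise(self, shared, evidence):
        # toy policy: move halfway toward the group mean of shared scores
        mean = sum(shared.values()) / len(shared)
        self.score = (self.score + mean) / 2
        return self.score

def negotiate(judges, evidence, max_rounds=5, tol=0.5):
    """Score, share, and revise until consensus or max_rounds; return the average."""
    scores = {j.name: j.initial_score(evidence) for j in judges}
    for _ in range(max_rounds):
        if max(scores.values()) - min(scores.values()) <= tol:
            break                              # consensus reached
        shared = dict(scores)                  # scores broadcast to all judges
        scores = {j.name: j.revise(shared, evidence) for j in judges}
    return sum(scores.values()) / len(scores)  # final score = average
```

With three judges starting at 2, 4, and 6, the toy policy converges toward the mean of 4 within a few rounds.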
Evaluation Report Generation:
CodeVisionary generates a detailed Markdown report summarizing the entire evaluation process. This report includes:
- The original code task and the LLM's response.
- The final evaluation score.
- Details of the environment setup.
- The decomposed task requirements.
- Step-by-step results from the analysis stage (including execution outputs, linter messages, screenshots, test results, etc.).
- The final evaluation reasoning derived from the negotiation stage.
- Optimization suggestions for the evaluated code.
The report is automatically formatted using Prettier and can be converted to PDF using Pandoc.
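Assembling the report might look like the following sketch. The section names mirror this summary rather than the paper's exact template, and the Prettier/Pandoc commands are shown only as comments:

```python
def render_report(task, response, score, env_details, requirements,
                  step_reports, reasoning, suggestions):
    """Illustrative sketch: assemble the Markdown evaluation report."""
    lines = [
        "# CodeVisionary Evaluation Report",
        "## Task", task,
        "## LLM Response", response,
        "## Final Score", f"{score:.1f}",
        "## Environment Setup", env_details,
        "## Decomposed Requirements",
        *[f"- {r}" for r in requirements],
        "## Step-by-Step Analysis",
        *[f"### {r['goal']}\n{r['observation']}" for r in step_reports],
        "## Evaluation Reasoning", reasoning,
        "## Optimization Suggestions",
        *[f"- {s}" for s in suggestions],
    ]
    return "\n\n".join(lines)

# The rendered file can then be formatted and converted, e.g.:
#   prettier --write report.md
#   pandoc report.md -o report.pdf
```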
Implementation & Experiments:
- The framework uses an LLM (GPT-4o in experiments) as the controlling agent.
- Interactions involve the agent outputting "thought" (reasoning) and "action" (command to execute).
- Experiments were conducted on a benchmark derived from CodeArena (hard tasks), with responses generated by GPT-3.5-turbo, Claude-3.5-Sonnet, and GPT-4o, and manually scored by experts.
- CodeVisionary significantly outperformed baseline LLM-based evaluators (VANILLA, ICE-Score, CODEJUDGE) on correlation metrics (Pearson, Spearman, Kendall-Tau) against human judgments.
- Ablation studies confirmed the positive impact of both the Multisource Knowledge Analysis and Negotiation-based Scoring stages.
- The framework showed strong performance across various programming languages and coding scenarios, particularly excelling in evaluating UI-related tasks (leveraging Screenshot and Interaction Analysis) and tasks involving newer technologies (leveraging Web Browsing Analysis).
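For reference, the three reported correlation metrics can be computed in pure Python as below. The Kendall implementation is the tie-free tau-a variant, an assumption on my part; the paper does not specify which variant it uses:

```python
import math

def pearson(x, y):
    """Pearson correlation: covariance over the product of std deviations."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def ranks(v):
    """Ranks with tied values receiving the average of their positions."""
    order = sorted(range(len(v)), key=lambda i: v[i])
    r = [0.0] * len(v)
    i = 0
    while i < len(v):
        j = i
        while j + 1 < len(v) and v[order[j + 1]] == v[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

def kendall_tau(x, y):
    """Kendall tau-a: (concordant - discordant) pairs over all pairs."""
    n = len(x)
    c = d = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (x[i] - x[j]) * (y[i] - y[j])
            if s > 0:
                c += 1
            elif s < 0:
                d += 1
    return (c - d) / (n * (n - 1) / 2)
```

In the paper's setting, `x` would be the framework's scores and `y` the expert scores over the benchmark tasks.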
In summary, CodeVisionary provides a structured, automated, and comprehensive approach to evaluating LLM-generated code. By integrating external tools, multi-source knowledge gathering, and a multi-agent negotiation process, it aims to produce more accurate, reliable, and interpretable evaluations compared to existing methods, complete with detailed reports useful for developers.