BixBench Computational Biology Benchmark

Updated 25 January 2026
  • BixBench is a large-scale, open-source benchmark that assesses LLM-based agents on practical, hypothesis-driven biological analyses using real-world datasets.
  • It features 296 expert-curated questions across open-answer and multiple-choice regimes to evaluate multi-step reasoning and data manipulation skills.
  • By challenging current LLM agents on complex bioinformatics workflows, BixBench identifies critical performance gaps and guides future improvements.

BixBench is a large-scale, open-source benchmark specifically designed to evaluate the end-to-end capabilities of LLM-based agents in computational biology. Its development addresses the absence of comprehensive standards to quantify and analyze the ability of autonomous agents to perform practical, hypothesis-driven biological data analysis. BixBench comprises 53 real-world analytical scenarios, each encapsulated as a self-contained, executable unit, and a total of 296 rigorously curated open-answer questions, collectively probing the multi-step reasoning, data manipulation, and interpretation skills required for computational biology research (Mitchener et al., 28 Feb 2025).

1. Motivation and Scope

BixBench is motivated by the inherently multi-step and hypothesis-driven workflows of modern bioinformatics. Unlike prior benchmarks focusing on recall-based or constrained data science tasks, BixBench is designed to capture the open-ended, iterative nature of real-world biological data analysis. It quantifies whether LLM-based agents can autonomously load diverse omics or structural datasets, construct and execute valid analytic pipelines, and derive scientifically relevant conclusions from their analyses. This paradigm provides a rigorous progress metric and diagnostic tool for the development of “AI scientists” in computational biology.

2. Dataset Construction and Scenario Design

The BixBench dataset consists of 53 “capsules,” each representing a canonical bioinformatics task such as differential expression analysis, genome assembly quality control, or cell-type annotation. Each capsule contains:

  • A free-form executable code notebook (Google Colab–based)
  • Raw and processed data files (e.g., FASTA, FASTQ, CSV, RDS, structural files, metadata tables)
  • A statement of hypothesis, a summary of results, and a set of expert-validated ground-truth answers

Data sources include published studies, whose pipelines were re-run or recapitulated by expert analysts, as well as de novo analyses designed by PhD-level bioinformaticians. Capsules are frozen snapshots, ensuring reproducibility and encapsulating all materials required for independent validation.

Each scenario contains between 3 and 8 open-answer questions (mean 5.6) targeting both intermediate analytic steps and final scientific interpretations. The questions address challenges such as file format discovery, workflow planning, execution, and interpretation of results.
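The capsule contents described above can be sketched as a simple data structure. This is an illustrative model only, with hypothetical field names; it is not the official BixBench schema.

```python
from dataclasses import dataclass, field

@dataclass
class Capsule:
    """One BixBench analysis capsule (field names are illustrative, not the
    official schema): a notebook, its data files, and expert ground truth."""
    capsule_id: str
    hypothesis: str
    data_files: list            # e.g. FASTA/FASTQ/CSV/RDS inputs and metadata
    notebook_path: str          # free-form executable notebook
    result_summary: str
    questions: list = field(default_factory=list)     # 3-8 per capsule
    ground_truth: list = field(default_factory=list)  # expert-validated answers

# Hypothetical example capsule for a differential-expression task.
capsule = Capsule(
    capsule_id="deg-analysis-01",
    hypothesis="Gene X is differentially expressed between conditions",
    data_files=["counts.csv", "metadata.csv"],
    notebook_path="analysis.ipynb",
    result_summary="Gene X upregulated between conditions",
    questions=["How many genes pass FDR < 0.05?"],
    ground_truth=["412"],
)
assert len(capsule.questions) == len(capsule.ground_truth)
```

Freezing such a capsule as a snapshot (notebook, data, and answers together) is what makes each scenario independently reproducible.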

3. Task Regimes and Evaluation Protocols

BixBench defines two distinct evaluation regimes: open-answer and multiple-choice (MCQ).

Open-Answer Regime:

Agents interact with a notebook environment containing only the raw data and free-text questions. They are required to write and execute code (Python, R, or bash) to arrive at and submit a textual answer.

Multiple-Choice Regime:

Eight draft MCQs per capsule are programmatically generated and then curated by domain experts, yielding 3–8 approved MCQs per scenario. In this regime, after completing the open-answer step, agents are presented with the MCQ options and may select “Insufficient information” to decline to answer.
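The option list shown to an agent in this regime might be assembled as follows. This is a hedged sketch of the idea only; BixBench's actual prompt format and refusal wording may differ.

```python
import random

def build_mcq(question: str, options: list, allow_refusal: bool,
              seed: int = 0) -> list:
    """Assemble the option list shown to the agent (illustrative only).
    Options are shuffled and, in the refusal-permitted regime, an
    'Insufficient information' choice is appended."""
    shuffled = options[:]
    random.Random(seed).shuffle(shuffled)  # deterministic shuffle for the sketch
    if allow_refusal:
        shuffled.append("Insufficient information to answer the question")
    return shuffled

# Hypothetical question with four distractor/correct options plus refusal.
opts = build_mcq(
    "Which statistical test was applied?",
    ["t-test", "Wilcoxon", "ANOVA", "chi-squared"],
    allow_refusal=True,
)
assert len(opts) == 5 and opts[-1].startswith("Insufficient")
```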

Multi-Step Workflows:

All scenarios necessitate planning (for example, deciding which files to load or which test to apply), iterative data processing (such as data cleaning, normalization, and clustering), and final interpretation. Questions probe not only ultimate conclusions but also key intermediate computational decisions.
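The plan → process → interpret pattern can be made concrete with a miniature, dependency-free sketch. The dataset and threshold below are invented for illustration; a real capsule would involve far larger files and a proper statistical test (e.g. Welch's t-test) rather than a bare mean difference.

```python
import csv
import io
import statistics

# Hypothetical miniature stand-in for a capsule's counts file: expression of
# one gene across control and treated samples.
raw = """sample,group,expr
s1,control,5.1
s2,control,4.8
s3,control,5.0
s4,treated,7.9
s5,treated,8.3
s6,treated,8.1
"""

# Step 1: plan -- decide which file and columns to load.
rows = list(csv.DictReader(io.StringIO(raw)))

# Step 2: process -- split samples by group and summarize each.
control = [float(r["expr"]) for r in rows if r["group"] == "control"]
treated = [float(r["expr"]) for r in rows if r["group"] == "treated"]

# Step 3: interpret -- a simple effect-size check (a real analysis would
# apply a significance test; the plain mean difference keeps this runnable
# with the standard library alone).
effect = statistics.mean(treated) - statistics.mean(control)
conclusion = "upregulated" if effect > 1.0 else "no clear change"
print(effect, conclusion)
```

BixBench's questions probe exactly these intermediate decisions (which file, which grouping, which test) as well as the final conclusion.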

4. Quantitative Metrics and Baselines

Each regime is evaluated over 10 independent runs per question to capture the stochastic nature of agent responses.

Open-Answer Metric:

Agent accuracy is computed as

$$A_{\text{open}} = \frac{1}{N} \sum_{i=1}^{Q} \sum_{r=1}^{R} \delta_{i,r}$$

where $N = Q \times R$ is the total number of question–run pairs ($Q = 296$ questions, $R = 10$ runs per question), and $\delta_{i,r}$ is 1 if run $r$'s answer to question $i$ exactly matches the ground truth as judged by Claude 3.5 Sonnet, and 0 otherwise.
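The metric reduces to a simple average over all question–run pairs. In this minimal sketch, a boolean judgment matrix stands in for the LLM judge's per-run exact-match decisions:

```python
def open_answer_accuracy(judged: list) -> float:
    """Open-answer accuracy: average of delta_{i,r} over N = Q*R pairs.
    judged[i][r] is True if run r of question i matched the ground truth
    (in BixBench, that judgment is made by an LLM judge)."""
    flat = [d for per_question in judged for d in per_question]
    return sum(flat) / len(flat)

# Toy example: 3 questions x 2 runs, 2 correct answers out of 6.
acc = open_answer_accuracy([[True, False], [False, False], [True, False]])
assert acc == 2 / 6
```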

Multiple-Choice Metrics:

  • Accuracy ($A_{\text{mcq}}$): fraction of majority-voted answers matching the ground truth
  • Precision ($P$):

$$P = \frac{|\{i : \hat{y}_i = y_i \wedge \hat{y}_i \neq \text{Refusal}\}|}{|\{i : \hat{y}_i \neq \text{Refusal}\}|}$$

  • Recall ($R$):

$$R = \frac{|\{i : \hat{y}_i = y_i\}|}{Q}$$

The random-guessing baseline for MCQs is $1/K$ for $K$ answer choices (roughly 25–33%, depending on the number of choices). A “pure recall” baseline is also included by presenting questions without access to the data or notebook.
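The MCQ metrics above can be sketched end to end: majority-vote the parallel runs, then score with refusals excluded from the precision denominator but counted against recall. The run data below is invented for illustration.

```python
from collections import Counter

REFUSAL = "refusal"

def majority_vote(answers: list):
    """Most common answer across parallel runs (ties broken arbitrarily)."""
    return Counter(answers).most_common(1)[0][0]

def mcq_metrics(predictions: list, truths: list) -> dict:
    """Mirror the formulas above: precision over attempted (non-refusal)
    answers, recall = correct / Q. Since a refusal never matches the ground
    truth, accuracy over all Q questions coincides with recall here."""
    q = len(truths)
    correct = sum(p == t for p, t in zip(predictions, truths))
    attempted = [(p, t) for p, t in zip(predictions, truths) if p != REFUSAL]
    precision = sum(p == t for p, t in attempted) / len(attempted) if attempted else 0.0
    return {"accuracy": correct / q, "precision": precision, "recall": correct / q}

# Toy example: 3 questions, 3 runs each.
runs = [["A", "A", "B"], [REFUSAL, REFUSAL, "C"], ["B", "B", "B"]]
preds = [majority_vote(r) for r in runs]   # ["A", "refusal", "B"]
m = mcq_metrics(preds, truths=["A", "C", "C"])
```

The refusal on question 2 lowers recall (1/3) but not precision (1/2), which is exactly why the two metrics diverge in the refusal-permitted regime.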

5. Empirical Findings

BixBench exposes substantial shortcomings in current LLM-based agents’ ability to perform meaningful end-to-end bioinformatics.

  • Open-Answer Regime:
    • GPT-4o achieves approximately 9% accuracy.
    • Claude 3.5 Sonnet achieves approximately 17%.
    • Pure recall performance is at or below these levels, indicating minimal effective use of analytical workflows.
  • Multiple-Choice Regime:
    • When permitting “Refusal,” both models perform below random (≈20–25%), due to frequent refusals.
    • Disallowing “Refusal” increases accuracy marginally above random (≈30–35%).
    • Majority voting over 10 runs yields only slight improvements: GPT-4o at ~32%, Claude 3.5 at ~36%.

Error analysis shows prevalent modes of failure:

  • Incorrect file loading/parsing due to heterogeneity in formats and paths
  • Misapplication of statistical tests (e.g., inappropriate normalization or test selection)
  • Inaccurate or ignored plot interpretation

A vision ablation—disallowing plot generation and relying solely on tables—improves MCQ accuracy by roughly 5–7 percentage points, indicating current multimodal limitations in plot analysis. Increased parallel runs in the “Refusal” setting amplify refusal rates, slightly reducing accuracy; in the “No-Refusal” regime, ensemble voting yields only marginal benefit.

6. Agent Framework and Implementation

BixBench provides an open-source agent framework built atop Aviary, a gymnasium-style environment for language agents. The framework employs ReAct-style prompting, in which the LLM chooses among three available tools with explicit chain-of-thought reasoning:

  1. list_workdir(): recursively discovers dataset files and structure
  2. edit_cell(cell_index, new_code): modifies and executes code within a specified notebook cell
  3. submit_answer(json_dict): ends the analytic trajectory and records the agent’s answer

The entire stack runs in a Dockerized execution environment (“BixBench-env:v1.0”) pre-configured with Python, R (via rpy2), and typical Unix bioinformatics packages. Each tool invocation re-executes the notebook from top to bottom, capturing intermediate tables, figures, and error messages for agent observation.
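The tool loop and full-notebook re-execution can be sketched with a toy environment. Everything here is mocked for illustration; the real framework drives an LLM against a Dockerized Jupyter-style kernel rather than a bare `exec`.

```python
import json
import pathlib

def list_workdir(root: str = "."):
    """Recursively list files under root (mirrors the list_workdir() tool)."""
    return sorted(str(p) for p in pathlib.Path(root).rglob("*") if p.is_file())

class NotebookEnv:
    """Toy stand-in for the notebook environment: every edit re-executes
    all cells top to bottom, capturing per-cell results or errors."""

    def __init__(self):
        self.cells, self.outputs, self.answer = [], [], None

    def edit_cell(self, cell_index: int, new_code: str):
        while len(self.cells) <= cell_index:
            self.cells.append("")
        self.cells[cell_index] = new_code
        # Re-execute the whole notebook from the top, as the framework does.
        self.outputs, ns = [], {}
        for code in self.cells:
            try:
                exec(code, ns)
                self.outputs.append(ns.get("_result"))  # convention for this sketch
            except Exception as exc:
                self.outputs.append(f"error: {exc}")
        return self.outputs

    def submit_answer(self, json_dict: dict):
        """Ends the trajectory and records the agent's answer."""
        self.answer = json.dumps(json_dict)
        return self.answer

env = NotebookEnv()
env.edit_cell(0, "_result = 2 + 2")
env.submit_answer({"answer": env.outputs[0]})
```

The `_result` convention and the bare `exec` are shortcuts of this sketch; the key behavior being illustrated is that each tool call triggers a full top-to-bottom re-run whose outputs become the agent's next observation.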

7. Limitations, Significance, and Future Directions

Empirical results demonstrate that current frontier LLMs are not capable of reliably executing complex, multi-stage bioinformatics workflows. Agents struggle with data heterogeneity, multimodal plot interpretation, and context-sensitive statistical reasoning, and frequently revert to recall or refusal. The findings highlight major gaps between current code-generating LLMs and the goal of self-sufficient AI bioinformaticians.

Recommended future directions include:

  • Expansion of BixBench to incorporate broader data modalities (e.g., proteomics, single-cell spatial transcriptomics) and more complex workflows (e.g., de novo genome assembly and annotation)
  • Establishment of a human-expert baseline, having skilled bioinformaticians solve the same capsules under timed conditions
  • Evaluation of next-generation “reasoning” LLMs (such as OpenAI o1 and DeepSeek-R1) and exploration of closed-loop reinforcement learning agents utilizing correctness feedback
  • Advancement of multimodal reasoning, potentially by furnishing machine-readable summaries of visualizations
  • Domain-specific fine-tuning focused on statistical best-practices and code-driven analysis tutorials

A plausible implication is that, by open-sourcing both the dataset and the agent framework, BixBench establishes a foundation for systematic advances in LLM-agent design and computational biology autonomy. The benchmark defines a clear, measurable challenge to the field and is anticipated to catalyze method development and objective performance assessment (Mitchener et al., 28 Feb 2025).
