Jr. AI Scientist: A Modular Research Platform

Updated 15 January 2026
  • The Jr. AI Scientist System is a modular, agentic framework that simulates a junior scientist's workflow by generating hypotheses and planning experiments.
  • It employs staged agent modules and progressive tree-search methods to optimize experiment scheduling and debug code in parallel.
  • The system autonomously analyzes data, visualizes outputs, and drafts manuscripts, achieving peer-review success in benchmark tests.

A Jr. AI Scientist System is a modular, agentic software platform that replicates the workflow of a human junior scientist: proposing hypotheses, planning and running experiments, analyzing results, visualizing findings, and autonomously drafting scholarly manuscripts. Such systems combine a staged agent architecture, advanced tree-search methods for experiment scheduling, and automated review and refinement mechanisms, closely following the "AI Scientist-v2" system, which produced the first AI-generated paper to pass peer review without template code or human orchestration (Yamada et al., 10 Apr 2025).

1. Agentic System Architecture

The Jr. AI Scientist System comprises five dedicated agent modules, each fulfilling a core scientific role:

  • Hypothesis Generator (HG): Synthesizes k candidate research questions and experimental blueprints.
  • Experiment Manager (EM): Executes and schedules code generation and experiment runs via progressive agentic tree search.
  • Data Analyzer (DA): Applies statistical or ML methods to experiment outputs and creates summary tables/figures.
  • Manuscript Author (MA): Crafts the scientific manuscript, integrating textual results, tables, and figures.
  • AI Reviewer (AR): Evaluates manuscript drafts (text and figures) on clarity, novelty, rigor, reproducibility, and impact; invokes a Vision-LLM (VLM) feedback loop for iterative figure refinement.
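These roles can be sketched as minimal Python interfaces. The class and method names below are hypothetical illustrations of the five roles, not identifiers from the released codebase, and the LLM-backed behavior is stubbed:

```python
from dataclasses import dataclass

# Hypothetical module interfaces; class and method names illustrate the
# roles above and are not taken from the released codebase.
@dataclass
class Hypothesis:
    question: str
    blueprint: str        # experimental plan
    novelty: float = 0.0

class HypothesisGenerator:
    def generate(self, k: int) -> list[Hypothesis]:
        # In the real system an LLM proposes k research questions.
        return [Hypothesis(f"Q{i}", f"plan {i}") for i in range(k)]

class ExperimentManager:
    def run(self, h: Hypothesis) -> dict:
        return {"metric": 0.0, "status": "non-buggy"}  # stubbed experiment

class DataAnalyzer:
    def analyze(self, results: dict) -> dict:
        return {"tables": [], "figures": []}           # stubbed analysis

class ManuscriptAuthor:
    def draft(self, summaries: list[dict]) -> str:
        return "manuscript text"                       # stubbed LLM call

class AIReviewer:
    def review(self, manuscript: str) -> dict[str, int]:
        # Five evaluation axes, scored 1-10 in the real system.
        return {axis: 5 for axis in
                ("clarity", "novelty", "rigor", "reproducibility", "impact")}
```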

The agent interaction protocol follows a sequential cycle:

  1. HG outputs {H_i}.
  2. EM launches parallel tree searches for each H_i.
  3. Each EM node generates code, runs experiments, passes results to DA.
  4. DA analyzes outputs; figures are sent to VLM for critique.
  5. EM marks nodes as buggy/non-buggy, prunes/expands per LLM evaluation.
  6. When the budget is exhausted, EM submits the best checkpoints and DA summaries to MA.
  7. MA drafts manuscript, embedding DA outputs.
  8. AR reviews across five axes and VLM feedback.
  9. MA revises to obtain final manuscript.
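The nine-step cycle can be sketched as a single orchestration function over stub agents. All names are hypothetical, and the parallel tree search is collapsed to a plain loop for brevity:

```python
# Minimal sketch of the agent cycle above, with stub callables standing in
# for the LLM-backed modules (all names hypothetical).
def run_cycle(hg, em, da, ma, ar, k=3):
    hypotheses = hg(k)                       # 1. HG outputs {H_i}
    checkpoints = []
    for h in hypotheses:                     # 2. tree search per H_i
        result = em(h)                       # 3. generate code, run experiment
        summary = da(result)                 # 4. analysis + figure critique
        if result["status"] == "non-buggy":  # 5. prune buggy branches
            checkpoints.append(summary)      # 6. keep best checkpoints
    draft = ma(checkpoints)                  # 7. MA drafts manuscript
    scores = ar(draft)                       # 8. AR reviews on five axes
    return draft, scores                     # 9. MA would revise on feedback
```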

2. Progressive Agentic Tree-Search Methodology

The experiment manager implements a best-first parallel tree search, progressing through four explicit stages:

Algorithmic flow:

\begin{algorithmic}[1]
\State \mathcal{T} \leftarrow \mathrm{Tree}(root: H)
\While{budget remains}
  \State \mathcal{C} \leftarrow \{\text{select nodes by scoring policy}\}
  \ForAll{n \in \mathcal{C} \textbf{ in parallel}}
    \If{n.status == buggy}
      \State n' \leftarrow \mathrm{DebugNode}(n)
    \Else
      \State n' \leftarrow \mathrm{RefineNode}(n)
    \EndIf
    \State \mathrm{ExecuteCode}(n')   \Comment{experiment run}
    \State DA.analyze(n')
    \State VLM.review_figures(n')
    \State n'.status \leftarrow buggy/non-buggy
    \State \mathrm{Insert\ child\ }n'\mathrm{\ under\ }n\mathrm{\ in\ }\mathcal{T}
  \EndFor
  \State budget \leftarrow budget - cost(\mathcal{C})
\EndWhile
\State \Return top-k non-buggy leaves
\end{algorithmic}
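A toy Python rendering of this best-first loop follows, with DebugNode/RefineNode and the experiment run collapsed into a single `expand` stub (a sketch of the control flow only, not the released implementation):

```python
import heapq
import itertools

# Best-first search sketch: nodes are scored, the highest-scoring frontier
# node is expanded each step, and non-buggy children become candidate leaves.
def tree_search(root_score, expand, budget=10, top_k=3):
    """expand(score) -> (child_score, buggy) simulates one node expansion."""
    counter = itertools.count()  # tie-breaker so the heap never compares nodes
    frontier = [(-root_score, next(counter), root_score, False)]
    leaves = []
    while budget > 0 and frontier:
        _, _, score, buggy = heapq.heappop(frontier)  # best-first selection
        child_score, child_buggy = expand(score)      # Debug or Refine
        if not child_buggy:
            leaves.append(child_score)                # candidate checkpoint
        heapq.heappush(frontier, (-child_score, next(counter),
                                  child_score, child_buggy))
        budget -= 1                                   # charge the budget
    return sorted(leaves, reverse=True)[:top_k]       # top-k non-buggy leaves
```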

The search state is s = (code_script, metrics, error_log, visuals, VLM_feedback), and the available actions are {Debug, Refine, HyperparamSweep, Ablation, Replicate}. The selection policy follows UCT (upper confidence bounds applied to trees):

\pi(s) = \arg\max_a \left[ Q(s,a) + c \sqrt{\frac{\ln N(s)}{N(s,a)}} \right]

with value backup and reward combination:

Q(s,a) \leftarrow (1-\alpha)\, Q(s,a) + \alpha\, r(s,a)

where r(s,a) combines an experiment metric (e.g., validation accuracy) and the VLM figure score.

Hypothesis scoring and pruning employ:

\text{score}(n) = \lambda_1\, \text{perf}(n) + \lambda_2\, \text{novelty}(H) - \lambda_3\, \text{bugs}(n)

Nodes with low score or repeated failures are pruned.
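The UCT rule and value backup can be written directly from the formulas above; symbol names follow the equations, and the values in the test are purely illustrative:

```python
import math

# Sketch of the selection rule and value backup defined above.
def uct_select(Q, N_s, N_sa, actions, c=1.4):
    """pi(s) = argmax_a [ Q(s,a) + c * sqrt(ln N(s) / N(s,a)) ]."""
    def ucb(a):
        return Q[a] + c * math.sqrt(math.log(N_s) / N_sa[a])
    return max(actions, key=ucb)

def backup(Q, a, reward, alpha=0.1):
    """Q(s,a) <- (1 - alpha) * Q(s,a) + alpha * r(s,a)."""
    Q[a] = (1 - alpha) * Q[a] + alpha * reward
    return Q[a]
```

Rarely-tried actions get a large exploration bonus, so a once-visited `Refine` node can outrank a heavily-visited `Debug` node even with a lower Q value.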

3. Experimental Design, Analysis, and Visualization

Experimental control encompasses dynamic node scheduling and rigorous Design-of-Experiments principles.

  • Compute is dynamically allocated to nodes; buggy nodes are debugged before the depth limit is reached.
  • Hyperparameter sweeps sample x ∼ N(μ, Σ), optimized via Bayesian strategies:

\max_{\mathbf{x}} \text{perf}(\mathbf{x}) \quad \text{s.t.} \quad \sum \text{runtime} \leq C_{\max}

Ablation studies toggle components c_i ∈ {0, 1} for systematic robustness evaluation.
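A minimal sketch of a budget-constrained sweep, substituting plain random search for the Bayesian strategy the text names (function and parameter names are hypothetical):

```python
import random

# Toy budget-constrained sweep: sample x ~ N(mu, sigma) per dimension and
# keep the best configuration whose cumulative runtime fits C_max.
def sweep(perf, runtime, mu, sigma, c_max, seed=0):
    rng = random.Random(seed)
    spent, best_x, best_p = 0.0, None, float("-inf")
    while True:
        x = [rng.gauss(m, s) for m, s in zip(mu, sigma)]
        cost = runtime(x)
        if spent + cost > c_max:   # enforce sum(runtime) <= C_max
            break
        spent += cost
        p = perf(x)
        if p > best_p:             # keep the best configuration seen
            best_p, best_x = p, x
    return best_x, best_p
```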

Data Analyzer processes outputs using regression and clustering:

  • Regression: y = \beta_0 + \beta_1 x + \varepsilon
  • Clustering: k-means minimizes \sum_{i=1}^{n} \min_j \|x_i - \mu_j\|^2
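The k-means objective and one Lloyd update step can be sketched in pure Python for the 1-D case (illustrative only; the Data Analyzer would use a library implementation):

```python
# Objective minimized by k-means: sum_i min_j ||x_i - mu_j||^2 (1-D case).
def kmeans_objective(points, centers):
    return sum(min((x - m) ** 2 for m in centers) for x in points)

def lloyd_step(points, centers):
    # Assign each point to its nearest center, then recompute the means.
    clusters = {j: [] for j in range(len(centers))}
    for x in points:
        j = min(range(len(centers)), key=lambda j: (x - centers[j]) ** 2)
        clusters[j].append(x)
    return [sum(c) / len(c) if c else centers[j]
            for j, c in clusters.items()]
```

Each Lloyd step is guaranteed not to increase the objective, which is why iterating it converges to a local minimum.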

Visual outputs are immediately checked by the VLM loop; image and text embeddings f_img and f_txt are scored via

s_{\mathrm{VLM}} = \sigma \left( f_{\mathrm{img}}^\top W f_{\mathrm{txt}} + b \right)

Outputs scoring below a threshold τ are marked buggy and queued for revision.
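The bilinear figure score and threshold check can be written directly from the formula; the embeddings and weight matrix below are toy values, whereas real ones would come from the VLM:

```python
import math

# s_VLM = sigmoid(f_img^T W f_txt + b), with toy embeddings and weights.
def vlm_score(f_img, f_txt, W, b=0.0):
    bilinear = sum(f_img[i] * W[i][j] * f_txt[j]
                   for i in range(len(f_img))
                   for j in range(len(f_txt)))
    return 1.0 / (1.0 + math.exp(-(bilinear + b)))   # sigmoid

def flag_buggy(score, tau=0.5):
    return score < tau   # below threshold -> queued for revision
```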

4. Autonomous Manuscript Authoring and Review

Manuscript Author is tasked to:

  1. Generate structured outline (sections, fig/table placeholders).
  2. Fill text via LLM, conditioned on DA outputs.
  3. Compile to PDF, check page constraints.
  4. Self-review and style correction.
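Steps 1-4 can be sketched as a small pipeline with the LLM call and PDF compilation stubbed out (all names hypothetical):

```python
# Hypothetical authoring pipeline mirroring steps 1-4; fill_text stands in
# for the LLM and compile_pdf for the LaTeX build.
def author_manuscript(da_outputs, fill_text, compile_pdf, max_pages=4):
    outline = ["Abstract", "Introduction", "Method",
               "Experiments", "Conclusion"]                    # step 1
    sections = {s: fill_text(s, da_outputs) for s in outline}  # step 2
    pages = compile_pdf(sections)                              # step 3
    if pages > max_pages:                                      # step 4
        # Self-revision pass: condense the longest section and rebuild.
        sections["Experiments"] = fill_text("Experiments (condensed)",
                                            da_outputs)
        pages = compile_pdf(sections)
    return sections, pages
```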

AI Reviewer evaluates with:

Axis             Range
Clarity          1–10
Novelty          1–10
Rigor            1–10
Reproducibility  1–10
Impact           1–10

VLM iteratively refines figures for readability, color harmony, and aspect ratio, and can request code-level changes.

5. Empirical Results and Comparative Benchmarks

Evaluation involved three fully autonomous ICLR workshop submissions:

Paper                          Score  Decision
Compositional Regularization   6.33   Accepted
Label Noise Calibration        ~3     Rejected
Real-world Pest Detection      ~3     Rejected

  • AI Scientist-v2: 1/3 accepted; the accepted paper surpassed the average human acceptance threshold.
  • Compared to v1 (linear pipeline, template code, 0/3 accepted), v2 achieved deeper domain-general code generation and peer-review success.
  • Percentile achieved: ≈45% among workshop submissions.

6. Implementation, Reproducibility, and Limitations

Open-source code is structured with modular agent scripts and configuration files; rapid deployment and reproducibility are supported via:

git clone https://github.com/SakanaAI/AI-Scientist-v2.git
cd AI-Scientist-v2
pip install -r requirements.txt
python -m ai_scientist.main \
  --config configs/jr_scientist.json \
  --random-seed 42 \
  --max-time 6h

Core limitations:

  • LLM hallucinations in citation and method sections.
  • Shallow novelty vs. human experts.
  • Compute- and resource-intensive tree search.
  • Risks: Flawed code or misleading results may be generated; human audit advised prior to publication.

Future avenues include code snippet verification, leveraging domain-specific knowledge graphs, and a junior variant limiting tree-search branching and compute—tailored for educational labs and close oversight (Yamada et al., 10 Apr 2025).

7. Jr. AI Scientist Variant Design Guidance

A scaled-down Jr. AI Scientist is recommended to:

  • Reduce tree-search branching, restrict to two stages and ten total nodes.
  • Maintain all agent modules, but employ stage caps for computational tractability.
  • Encourage human auditing and close oversight in teaching and research environments.
  • Retain modularity and VLM feedback for experimental robustness, with configurations specified in concise JSON for reproducibility.
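A concise JSON configuration along these lines might look as follows; all field names are hypothetical sketches, not drawn from the released configuration files:

```json
{
  "agents": ["HG", "EM", "DA", "MA", "AR"],
  "tree_search": { "stages": 2, "max_nodes": 10, "branching": 2 },
  "budget": { "max_time": "6h", "random_seed": 42 },
  "review": {
    "axes": ["clarity", "novelty", "rigor", "reproducibility", "impact"],
    "vlm_feedback": true
  },
  "human_audit": true
}
```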

This Jr. AI Scientist design inherits and translates all core architectural, algorithmic, and evaluative features of the AI Scientist-v2 platform to a domain-general, resource-conservative system suitable for broad adoption in educational, research, and computational science settings (Yamada et al., 10 Apr 2025).
