Papers
Topics
Authors
Recent
Search
2000 character limit reached

KLong: Training LLM Agent for Extremely Long-horizon Tasks

Published 19 Feb 2026 in cs.AI and cs.CL | (2602.17547v1)

Abstract: This paper introduces KLong, an open-source LLM agent trained to solve extremely long-horizon tasks. The principle is to first cold-start the model via trajectory-splitting SFT, then scale it via progressive RL training. Specifically, we first activate basic agentic abilities of a base model with a comprehensive SFT recipe. Then, we introduce Research-Factory, an automated pipeline that generates high-quality training data by collecting research papers and constructing evaluation rubrics. Using this pipeline, we build thousands of long-horizon trajectories distilled from Claude 4.5 Sonnet (Thinking). To train with these extremely long trajectories, we propose a new trajectory-splitting SFT, which preserves early context, progressively truncates later context, and maintains overlap between sub-trajectories. In addition, to further improve long-horizon task-solving capability, we propose a novel progressive RL, which schedules training into multiple stages with progressively extended timeouts. Experiments demonstrate the superiority and generalization of KLong, as shown in Figure 1. Notably, our proposed KLong (106B) surpasses Kimi K2 Thinking (1T) by 11.28% on PaperBench, and the performance improvement generalizes to other coding benchmarks like SWE-bench Verified and MLE-bench.

Summary

  • The paper demonstrates that decomposing long trajectories into overlapping sub-trajectories with trajectory-splitting SFT increases feasible assistant turns from 114.9 to 732.7.
  • The paper introduces progressive RL with incrementally extended timeouts, effectively stabilizing training for tasks with delayed rewards and complex dependencies.
  • The paper leverages an automated Research-Factory pipeline to generate high-quality training data, enabling KLong to outperform larger models across multiple benchmarks.

Authoritative Analysis of "KLong: Training LLM Agent for Extremely Long-horizon Tasks"

Problem Motivation and Task Definition

The paper addresses the challenge of training LLM agents for extremely long-horizon tasks, specifically those that far exceed typical context windows and demand sustained agentic reasoning and interaction over hundreds to thousands of turns. Examples include research paper replication (PaperBench) and machine learning engineering (MLE-bench), which require the agent to operate in a constrained environment, plan, and execute sequences of tool calls, often under time and resource limits. The complexity and scope encountered in these tasks represent a frontier in agentic LLM research, as most prior work deals only with shorter-horizon or context-managed tasks. The central goal is to elevate LLM agents from solving short, well-defined tasks to reliably handling scenarios involving extended multi-stage workflows and iterative experimentation.

Research-Factory: Automated Data Pipeline

A significant contribution is the Research-Factory pipeline for scalable and systematic training data generation tailored to the research replication task. Figure 1

Figure 1: Research-Factory pipeline automates data collection, filtering, PDF-to-Markdown conversion, decontamination, and rubric construction for reproducible research benchmarks.

This multi-component system performs:

  • Search: Automated retrieval of accepted papers from top conferences (ICML, NeurIPS, ICLR) supplemented with metadata.
  • Filtering: Ensures quality, impact, and removes test leakage.
  • Conversion: Transforms the PDF to Markdown, standardizing data for ease of processing.
  • Rubric Creation: The evaluation agent constructs hierarchical rubrics and addenda after analyzing both the paper and reference implementations.
  • Anti-cheating: Registers official code repositories in a blacklist to prevent direct copying.

Together, these mechanisms generate thousands of high-quality long-horizon trajectories, distilled using Claude 4.5 Sonnet, enabling robust supervised and RL training. Rubric-based evaluation ensures the reward signals are explicit and aligned with task goals.

Training Paradigm: Trajectory-splitting SFT and Progressive RL

The core training methodology tackles context-overrun and credit assignment by a two-stage protocol:

  1. Trajectory-Splitting SFT: Cold-starts the agent by decomposing extremely long trajectories into overlapping sub-trajectories, pinning early context (e.g., paper content/specification) across slices, with progressive truncation of later context. This design maintains global task awareness while ensuring model inputs respect context limitations. Each sub-trajectory receives both prefix context and relevant history, substantially increasing feasible assistant turns from 114.9 to 732.7 and enabling the model to acquire extended reasoning capabilities. Figure 2

    Figure 2: Ablation results isolating the impact of trajectory-splitting SFT, progressive RL, and showing assistant turns, runtime, and performance.

  2. Progressive RL: Schedules RL training into multiple stages, each with incrementally extended timeouts (e.g., 2, 4, 6 hours). This mitigates the instability and sparsity inherent in RL for extremely long tasks, allowing the agent to gradually adapt to tasks with highly delayed reward structures and extended dependencies. The training objective leverages a group-relative advantage, KL regularization, and combines rollouts iteratively split into manageable sub-trajectories. Infrastructure-level optimizations ensure pipeline balance and judge scalability. Figure 3

    Figure 3: RL training curves on PaperBench showing steady performance improvements, increased assistant turns, and entropy reduction as RL horizons are extended.

Performance and Generalization Results

KLong, at 106B parameters, achieves best average open-source performance (62.59 on PaperBench), surpassing Kimi K2 Thinking (1T) by 11.28%, despite a much smaller model size. The performance advantage generalizes to other long-horizon coding benchmarks:

  • SWE-bench Verified: 62.80% pass rate vs baseline's 60.80%
  • Terminal-Bench Hard: 16.67% success vs 14.58%
  • SEC-bench: 7.67% overall vs 5.00%
  • MLE-bench: Higher achievement rates (Above Median, Bronze, Gold) across a range of competitions

Ablation analysis confirms that the splitting SFT provides substantial gains (+17.29 points) and progressive RL yields further improvements (+6.67 points at RL$_{6\mathrm{H}$), validating the necessity of both components for scaling agentic abilities. Figure 4

Figure 4

Figure 4: SFT loss trajectory confirming stable supervised initialization for downstream RL.

Comparisons against both closed- and open-source baselines underscore KLong's ability to close the gap with proprietary models and occasionally outperform them on specific tasks, especially those requiring adaptation, sustained reasoning, or complex sequential execution.

Implementation and Infrastructure Insights

The system leverages advanced sandboxing (Kubernetes clusters, ephemeral containers, pre-installed ML frameworks), optimized scaffolding for context-tracking and error handling, and priority queuing for evaluation to ensure efficient experimentation at scale. Judge models and rubrics are carefully designed to minimize bias and maximize evaluation consistency. The RL pipeline is engineered to exploit partial rollouts and asynchronous judging to prevent resource bottlenecks typical in extremely long-horizon rollouts.

Implications, Limitations, and Future Directions

KLong demonstrates that context-aware trajectory decomposition, rubric-driven RL schedules, and scalable infrastructure enable precise, high-performing LLM agents for extremely long-horizon tasks. Practically, this unlocks AI-driven automation for research replication, large-scale ML engineering, multi-stage code security, and other complex workflows, suggesting a pathway toward generalist agents capable of handling real-world, multi-hour/multi-day scientific projects.

Theoretically, KLong's results challenge assumptions about agentic limitations imposed by context window and delayed reward, showing progressive RL and trajectory slicing outs perform simple SFT or naive RL in this regime. Limitations include the approximate nature of trajectory decomposition (potential loss of distant dependencies), resource requirements, and judge-induced reward biases. Addressing these may entail research into hierarchical memory architectures, hybrid RL-HF approaches, and scalable human feedback integration.

Conclusion

KLong establishes a new standard for training LLM agents on extremely long-horizon tasks. The combination of Research-Factory for dataset generation, trajectory-splitting SFT for context preservation, and progressive RL for stabilization yields an agent with robust generalization across diverse challenging benchmarks. This methodology provides a foundation for further development of scalable, generalist LLM agents targeting automation in complex, real-world engineering and scientific domains.

Whiteboard

No one has generated a whiteboard explanation for this paper yet.

Explain it Like I'm 14

Overview

This paper introduces KLong, a new open-source AI “agent” built to handle extremely long tasks, like reading a full research paper, writing code to recreate its results, and running long experiments. These tasks can take many hours and hundreds of steps, which is hard for most AI models. KLong is trained with a special recipe so it can remember important parts, plan across many steps, and improve through practice.

Goals

The paper asks a simple question: How can we train an AI to successfully complete very long, complicated tasks that:

  • take a lot of time and many back-and-forth actions, and
  • are longer than the AI can “fit” into its memory at once?

The main goal is to make the AI better at:

  • reproducing machine learning research papers (reading, coding, running experiments),
  • fixing software bugs,
  • writing and running code in a computer terminal, and
  • tackling security-related coding tasks.

How They Did It

To help readability, here are the main parts of their approach, explained in everyday language.

Building the training data: Research-Factory

  • Imagine a factory that gathers lots of high-quality homework tasks. The team built “Research-Factory,” an automated pipeline that:
    • collects papers from top conferences (like ICML, NeurIPS, ICLR),
    • filters them for quality,
    • converts the PDFs into a friendly text format,
    • creates a grading checklist (a “rubric”) for each paper, and
    • blocks the official GitHub code so the AI can’t just copy answers.
  • They then used a strong model (Claude 4.5 Sonnet, Thinking mode) to produce thousands of example solution paths for these tasks. Think of these as “gold standard” study guides showing how to read the paper, plan, code, and test.

Teaching the model basic skills: Supervised Fine-Tuning (SFT)

  • First, they give KLong a “starter course” using SFT. This is like having a teacher show many examples of correct answers in:
    • general knowledge,
    • coding,
    • math, and
    • online search.
  • This step helps KLong learn basic skills before tackling very long tasks.

Handling long tasks: Trajectory-splitting SFT

  • AI models have a limited “context window”—like a memory buffer. Very long tasks don’t fit.
  • Their trick: split the long task into overlapping chunks, like breaking a long movie into episodes with small recaps between them.
    • They always keep the “paper reading and task description” pinned at the start of each chunk (so the AI remembers the big picture).
    • They then include overlapping slices of later steps, so the story flows smoothly between chunks.
  • This lets the AI train on very long processes without forgetting the important early parts.
  • Result: the AI handled many more “assistant turns” (steps). In numbers, average steps went up from about 115 to about 733, showing much longer, sustained work.

Making the model better with practice: Progressive Reinforcement Learning (RL)

  • Reinforcement Learning is like practicing a video game and getting points based on how well you do, then adjusting your strategy.
  • The team used “progressive RL,” which slowly increases time limits for each practice session:
    • Start with shorter timeouts (e.g., 2 hours), then move to longer ones (e.g., 4 hours, 6 hours).
    • This reduces frustration from very long, hard tasks and helps the AI learn steadily.
  • They also split these long practice runs into chunks (like before) and used a “judge” to score results based on the rubric.
  • This boosted performance further—about +6.67% improvement after RL versus SFT-only.

Engineering improvements to make it work

  • Sandbox: a secure, scalable computer environment to run code at large scale, with popular ML packages pre-installed.
  • Better scaffolding: rules and tools that make the AI read the paper first, handle long text safely, and avoid “ending early.”
  • Smarter training pipeline: avoid traffic jams when many practice runs finish at once by staggering evaluations and using priority queues so important checks get done first.
  • Cost-saving judge: they used a strong open-source judge model during training to score progress and avoid overfitting to the official benchmark.

Main Findings

  • On PaperBench (a challenge where AI must recreate the results of real AI papers), KLong performed best among open-source models. It beat Kimi K2 Thinking (a much larger model) by 11.28% on average, even though KLong is smaller.
  • KLong’s improvements also carried over to other tasks:
    • SWE-bench Verified (bug fixing): higher pass rate,
    • Terminal-Bench Hard (command-line coding): higher success rate,
    • SEC-bench (software security tasks): better overall success,
    • MLE-bench (ML competitions): stronger results, including Above Median, Bronze, and even Gold medals in some competitions.
  • Training tricks mattered:
    • Trajectory-splitting SFT enabled much longer multi-step behavior.
    • Progressive RL further improved performance and stability over longer time horizons.

Why This Is Important

  • Real-world work is rarely a one-shot question. It’s long, messy, and multi-step—like building a project, doing research, or debugging complex software. KLong shows how to train AI to handle that kind of work.
  • The method (splitting long tasks into overlapping chunks and gradually extending practice time) gives a practical recipe other teams can use.
  • Because KLong is open-source, researchers and developers can study, improve, and adapt it for their own long, complex tasks.
  • In the big picture, this helps move AI from quick answers to sustained problem-solving—reading deeply, planning carefully, and finishing tough jobs end-to-end.

Knowledge Gaps

Knowledge Gaps, Limitations, and Open Questions

Below is a consolidated list of what remains missing, uncertain, or unexplored in the paper, articulated as concrete, actionable directions for future work.

  • Data provenance and teacher bias: The method distills thousands of trajectories from Claude 4.5 Sonnet (Thinking), but the paper does not quantify coverage, diversity, or bias of these trajectories. Evaluate how teacher choice affects learned behaviors (e.g., style imitation, tool-use patterns, failure modes) via ablations with multiple teachers and mixed-teacher datasets.
  • Decontamination rigor: The paper states “careful decontamination” to avoid test-set papers but provides no methodology. Detail exact decontamination checks (title/author/code URL matching, semantic similarity, time cutoff), false positive/negative rates, and publish audit scripts to ensure benchmark integrity.
  • Rubric construction inconsistency: The text alternately claims rubrics are “written by original authors” and “constructed by an evaluation agent analyzing official code.” Resolve this inconsistency by specifying the exact rubric source per task, the proportion of human-authored vs agent-generated rubrics, and measure inter-rater reliability or human validation quality.
  • Risk of rubric leakage/overfitting: Using official code to construct rubrics may seed task-specific implementation details into the evaluation signal. Study whether training rewards inadvertently reflect privileged knowledge and whether models learn to exploit rubric patterns. Include controls where rubrics are derived solely from papers without code inspection.
  • Trajectory-splitting hyperparameters: No ablation is provided on segment length L, overlap O, or prefix p size. Conduct sensitivity analyses to quantify trade-offs between contextual continuity, compute overhead, and learning stability, and recommend default settings per context window.
  • Credit assignment under splitting: Splitting long trajectories into overlapping sub-trajectories may break long-range dependencies. Evaluate credit assignment fidelity on synthetic tasks with known delayed rewards and on real tasks where early decisions affect late outcomes. Compare to alternatives (e.g., hierarchical credit assignment, memory-augmented policies).
  • Inference-time state management: Training uses trajectory splitting to fit the context window, but the paper does not describe how inference maintains state across 700+ turns without sophisticated memory/context management. Detail the inference-time state strategy (buffering, retrieval, external memory) and quantify its impact on success rates and latency.
  • Progressive RL schedule design: The choice of timeouts (2H, 4H, 6H) is heuristic and not justified. Explore principled schedule design (e.g., curriculum based on observed return/variance, adaptive timeout setting), include a 12H stage to match task horizons, and compare to asynchronous RL baselines optimized for long tasks.
  • RL algorithm details and ablations: The paper uses a PPO-like loss with group-relative advantage, but does not report epsilon, KL coefficient β, rollout count n, or stage-specific Km. Provide full hyperparameters, ablations on advantage normalization schemes, and stability metrics (variance, KL drift) to enable reproducibility and principled tuning.
  • Reward hacking and judge dependency: Rewards are assigned by a single judge model (gpt-oss-120b) using constructed rubrics. Assess susceptibility to reward hacking by testing across multiple independent judges (including o3-mini), human-in-the-loop verification on a subset, and cross-judge correlation analyses across all tasks (not just 3 models).
  • Statistical rigor: Results lack uncertainty estimates, multiple seeds, and significance testing. Report per-task confidence intervals, seed variability, and effect sizes to establish robustness, especially for small improvements on SWE-bench Verified and Terminal-Bench Hard.
  • Compute, cost, and efficiency: The paper does not disclose training compute, GPU-hours, parallel rollout counts, or cost. Provide end-to-end efficiency metrics (tokens processed, wall-clock time, energy/cost), and analyze the trade-off between increased assistant turns/running hours and performance gains.
  • Scalability beyond 6-hour horizons: Progressive RL tops out at 6H while tasks can run 12H+. Investigate feasibility and stability of training at 12H+ horizons, and whether additional stages yield diminishing or compounding returns.
  • Generalization breadth: Evaluation focuses on ML/code-heavy tasks. Test on other extremely long-horizon domains (web/GUI agents, scientific workflows beyond ML, multi-modal environments) to characterize domain generality and identify gaps in tool use or planning.
  • Overfitting to scaffolding/harness: Training is “tailored to PaperBench” with custom scaffolding. Measure how performance transfers to alternative scaffolds (e.g., OpenHands variants, Terminus configurations, different tool contracts) to assess harness-specific overfitting.
  • Partial rollout continuation and priority queue bias: The infrastructure prioritizes evaluation-set requests in a judge queue and continues partial rollouts across iterations. Analyze whether these policies skew training distribution, induce evaluation-set overfitting, or bias sample selection; consider randomized scheduling and fairness checks.
  • MLE-bench failure analysis: Many competitions show “-” (no validated submission). Perform systematic failure-mode analysis (dataset availability, network limits, environment mismatch, timeouts) and quantify how Research-Factory and training affect submission validity rates.
  • Impact of tool bans and scaffold rules: The paper bans the end_task tool early and imposes mandatory paper reading. Ablate these constraints to evaluate whether rules vs learned behavior drive performance, and identify minimal scaffold interventions for portability.
  • Base model dependency: The approach is instantiated on GLM-4.5-Air-Base (106B) without comparison to other bases (e.g., Qwen, Llama). Study portability and how foundation model priors (context length, tokenizer, tool-use training) interact with splitting SFT and progressive RL.
  • Scaling laws across model sizes: The paper contrasts 106B vs a 1T closed model but does not report internal scaling behavior. Produce scaling curves (e.g., 7B/34B/70B/106B) to characterize parameter-efficiency of long-horizon training and identify optimal size-cost regimes.
  • Long-term memory retention and forgetting: Assistant turns increase dramatically, but the paper does not measure retention of early information or systematic forgetting. Introduce diagnostics (e.g., recall probes, dependency tracking) to quantify information persistence across hundreds of turns.
  • Dataset release and quality controls: Trajectory dataset construction uses rejection sampling by judge scores, but thresholds and diversity controls are unspecified. Document acceptance criteria, balance across task types, and release artifacts (prompts, rubrics, trajectories) for community verification.
  • Safety, ethics, and licensing: Distillation from proprietary models and analysis of official code for rubric creation raise licensing and ethical questions. Clarify usage rights, document safety measures in sandboxed execution (e.g., package vulnerabilities, network policies), and assess potential misuse or leakage risks.
  • Comparison to alternative long-horizon methods: The paper does not benchmark against context-folding, memory-augmented agents, or large-scale asynchronous RL. Include head-to-head comparisons to isolate gains from trajectory-splitting SFT and progressive RL versus state-of-the-art training-free and training-based baselines.

Practical Applications

Immediate Applications

Below is a set of actionable use cases that can be deployed now, grounded in KLong’s methods (Research-Factory, trajectory-splitting SFT, progressive RL) and infrastructure optimizations (Kubernetes sandbox, scaffolding, rollout/judge scheduling).

  • Reproducibility Assistant for AI/ML Papers
    • Sector: Academia, R&D in Industry
    • What it does: Semi-automated reproduction of published ML papers (reading, implementation, experiment runs, rubric-based judging).
    • Tools/workflows: Research-Factory (paper collection, PDF→Markdown, decontamination, rubric tree construction), KLong agent, Kubernetes sandbox, rubric-based judging (gpt-oss-120b or equivalent).
    • Assumptions/dependencies: Access to papers/datasets; accurate rubric design; reliable judge model; sufficient compute/time windows; legal/privacy compliance for data use.
  • ML Engineering Autopilot for Experiments
    • Sector: Software/ML, Academia, Competitive ML (e.g., Kaggle)
    • What it does: Builds baselines, tunes hyperparameters, runs long experiments end-to-end; participates in competitions (MLE-bench tasks).
    • Tools/workflows: Preinstalled sandbox environments (torch, TensorFlow, scikit-learn, einops), trajectory-splitting SFT-enabled agent; experiment tracking; timeout scheduling.
    • Assumptions/dependencies: GPU/CPU availability; dataset licensing; long-running job orchestration; reliable package management.
  • Bug-Fixing and Issue Triage Bot
    • Sector: Software Engineering
    • What it does: Multi-hour debugging and patching across repositories; triage issues that require extended exploration and tests.
    • Tools/workflows: SWE-bench Verified scaffolding (e.g., OpenHands), KLong agent, CI/CD integration, test harnesses.
    • Assumptions/dependencies: Repository access and tests; guardrails for code changes; organizational approval for agent-driven commits.
  • SecOps Vulnerability Hunter
    • Sector: Cybersecurity
    • What it does: Long-horizon CVE triage, reproducing OSS-Fuzz cases, generating patches; sustained multi-step codebase exploration.
    • Tools/workflows: SEC-bench scenarios; sandbox isolation; rubric-based evaluation of exploitability/fix quality.
    • Assumptions/dependencies: Secure execution environments; access to vulnerable code and reproduction steps; policy-approved remediation workflows.
  • TerminalOps Automation
    • Sector: DevOps/IT Operations
    • What it does: Extended terminal interactions for environment setup, log analysis, data processing, and infrastructure maintenance.
    • Tools/workflows: Terminus 2 scaffolding; Kubernetes sandbox; prompt caching; improved context-length error handling.
    • Assumptions/dependencies: Appropriate credentials and access control; rate limits; stable connectivity; reliable audit logs.
  • Deep Research Assistant for Knowledge Work
    • Sector: Enterprise/Consulting
    • What it does: Long-run literature search, synthesis, and method replication to support decisions and due diligence.
    • Tools/workflows: Research-Factory’s search/evaluation agents; rubric trees; trajectory-splitting context handling for lengthy sessions.
    • Assumptions/dependencies: Source access (papers, code); quality rubrics; accurate judges; policies for handling proprietary materials.
  • Long-Horizon Agent Training Toolkit
    • Sector: AI/ML Engineering (Industry and Labs)
    • What it does: Adopts trajectory-splitting SFT and progressive RL schedules to upgrade existing agents for long-horizon tasks.
    • Tools/workflows: Prefix pinning, overlapping sub-trajectories, staged timeouts, group-relative advantage PPO, partial rollouts and priority judge queues.
    • Assumptions/dependencies: Base model availability; high-quality trajectories; compute budget; reliable judges; careful reward design.
  • Secure Agent Execution Platform (Sandbox-as-a-Pattern)
    • Sector: Platform/MLOps
    • What it does: Production-ready, Python-accessible, Kubernetes-based ephemeral sandbox with sidecar image management and scalable environments.
    • Tools/workflows: Ephemeral use-and-destroy containers; 25k+ Docker images management; environment preinstalls; prompt caching; tool bans.
    • Assumptions/dependencies: Cluster provisioning; cost control; isolation and observability; image maintenance pipeline.
  • Course Repro Labs and Auto-Grading
    • Sector: Education
    • What it does: Instructor-ready “reproduce-a-paper” labs with rubric-driven auto-grading of students’ replication attempts.
    • Tools/workflows: Research-Factory rubrics; sandboxed execution; KLong agent for assistance and evaluation.
    • Assumptions/dependencies: Curated paper sets; fairness and anti-cheating measures; compute quotas; accessible datasets.
  • Compliance and Audit for AI Claims
    • Sector: Policy/Enterprise Governance
    • What it does: Rubric-based reproducibility checks for vendor claims (models/papers), producing documented evidence for compliance audits.
    • Tools/workflows: PaperBench-style rubrics; independent judge models; standardized reports; audit logs from sandbox.
    • Assumptions/dependencies: Organizational buy-in; legal frameworks for verification; access to artifacts; reproducibility standards.
  • Long-Context Augmentation for Existing Agents
    • Sector: AI Product Development
    • What it does: Drop-in training approach to extend agents’ effective horizons without expanding model context windows.
    • Tools/workflows: Trajectory splitting (prefix pinning, progressive truncation, overlap), staged RL timeouts, partial rollouts.
    • Assumptions/dependencies: Availability of long trajectories; logging tooling; careful decontamination; data provenance.
  • RL Pipeline Resource Optimization
    • Sector: MLOps/Infrastructure
    • What it does: Reduces idle nodes and judge congestion via partial rollouts and priority queues for evaluations.
    • Tools/workflows: Rollout scheduling, priority judge queues, carry-over of unfinished rollouts, monitoring dashboards.
    • Assumptions/dependencies: RL orchestration platform; concurrency management; reliable evaluator throughput.

Long-Term Applications

The following use cases require further research, scaling, standardization, or cross-domain integration (e.g., multi-modal data, regulated environments, organizational adoption).

  • Autonomous R&D Agent (Read → Reproduce → Extend → Innovate)
    • Sector: Academia, Industrial R&D
    • What it could do: Systematically reproduce results, propose variations, run ablations, and generate new research contributions.
    • Tools/workflows: Expanded Research-Factory across disciplines; longer timeouts; asynchronous RL; multi-modal ingestion (figures/data).
    • Assumptions/dependencies: Access to lab resources and datasets; author cooperation; robust multi-modal reasoning; institutional acceptance.
  • Continuous Software Modernization Agent
    • Sector: Software/Enterprise IT
    • What it could do: Repository-scale upgrades (framework migrations, dependency refactors), sustained CI fixes over days/weeks.
    • Tools/workflows: Long-horizon planning; code-review integration; staged deployment; rollback safety nets.
    • Assumptions/dependencies: Organization-wide change management; comprehensive test coverage; strict guardrails; human-in-the-loop governance.
  • Reproducibility Certification Program
    • Sector: Policy/Standards, Publishing
    • What it could do: Standardized reproducibility scoring for publications/products; badges/certifications mandated by journals/regulators.
    • Tools/workflows: Community-maintained rubric trees; independent evaluator pools; registry (PaperBench-like) of certified artifacts.
    • Assumptions/dependencies: Consensus on standards; funding and governance; cross-institution interoperability; transparency requirements.
  • Secure Multi-tenant Agent Cloud (Sandbox-as-a-Service)
    • Sector: Platform/Cloud
    • What it could do: Public service for long-horizon agent execution with strong isolation, compliance, and auditability.
    • Tools/workflows: Hardened Kubernetes patterns; policy-enforced tool bans; tenant isolation; cost-aware scheduling.
    • Assumptions/dependencies: Regulatory compliance (e.g., SOC2/ISO); cost models; demand aggregation; SLAs for long-running jobs.
  • Healthcare Protocol Verification and Data Pipeline Automation
    • Sector: Healthcare
    • What it could do: Agent-driven reproducibility checks for clinical ML papers; validation of guidelines; long-run ETL and cohort building.
    • Tools/workflows: Domain-specific rubric trees; privacy-preserving sandboxes; secure EHR connectors; audit trails.
    • Assumptions/dependencies: HIPAA/GDPR compliance; stakeholder buy-in; curated medical ontologies; multi-modal medical data support.
  • Energy System Modeling and Grid Simulation
    • Sector: Energy/Utilities
    • What it could do: Long-horizon simulations for grid planning, demand forecasting, and policy impact analysis.
    • Tools/workflows: HPC integration; domain datasets and model libraries; trajectory-splitting for simulation logs.
    • Assumptions/dependencies: Access to high-fidelity models; compute scaling; safety and resilience requirements.
  • Robotics Experiment Orchestration
    • Sector: Robotics/Automation
    • What it could do: Multi-day experiment sequences in simulation and real-world; parameter sweeps; sustained evaluation loops.
    • Tools/workflows: Real-time control integration; safety constraints; multi-modal (vision/sensor) inputs; asynchronous RL extensions.
    • Assumptions/dependencies: Hardware access; strict safety governance; real-time constraints and fail-safes.
  • Financial Risk Modeling Audit and Backtesting
    • Sector: Finance
    • What it could do: Reproducible audit trails for strategies; long-horizon backtests; regulatory reporting generation.
    • Tools/workflows: Market data ingestion; rubric-based validation; secure sandbox; compliance controls.
    • Assumptions/dependencies: Licensed datasets; regulatory approval; strict traceability; controlled release processes.
  • MOOCs and University Programs with Auto-Graded Long Projects
    • Sector: Education
    • What it could do: Scalable grading of multi-week agentic projects (research reproduction, system builds) with transparent rubrics.
    • Tools/workflows: Rubric trees; sandbox orchestration; evaluator pools; fairness audits.
    • Assumptions/dependencies: Budget for compute; robust anti-cheating measures; bias mitigation in judging; inclusive access policies.
  • Training-as-a-Service for Long-Horizon Agents
    • Sector: AI Industry
    • What it could do: Managed service offering trajectory-splitting SFT, progressive RL, data distillation, and evaluation pipelines.
    • Tools/workflows: Dataset generation (Research-Factory), staged timeout coaching, priority evaluation queues; benchmark suites (PaperBench, MLE-bench, SWE-bench).
    • Assumptions/dependencies: GPU clusters; licensing for distilled data; client-specific domain rubrics; security/compliance.
  • Government Reproducibility Watchdog for AI
    • Sector: Public Policy/Regulation
    • What it could do: Independent verification of AI claims, standardized scoring, and public registries of reproducible artifacts.
    • Tools/workflows: Certified rubrics; standardized judge models; trusted execution environments; reporting APIs.
    • Assumptions/dependencies: Legal mandate; vendor cooperation; funding; protections against benchmark gaming.
  • Scientific Publishing with Executable Rubrics and Pre-Publication Replication
    • Sector: Academia/Publishing
    • What it could do: Authors submit rubric trees and replication bundles; agents verify claims prior to acceptance; readers reproduce results post-publication.
    • Tools/workflows: Author-facing rubric tooling; journal-integrated agent pipelines; artifact registries; DOI-linked reproducibility records.
    • Assumptions/dependencies: Editorial policy changes; tooling adoption; shared repositories; incentives for authors and reviewers.

Glossary

  • Addendum: Supplementary material appended to a paper or rubric to clarify evaluation criteria or constraints. "the evaluation agent designs the addendum and the rubric tree"
  • Agentic abilities: Capabilities of an LLM agent to plan, use tools, remember, and act autonomously in multi-step tasks. "activate basic agentic abilities of a base model"
  • Assistant turns: The number of agent responses or actions in a multi-turn interaction or trajectory. "assistant turns, 114.90\rightarrow732.70."
  • Context management: System-level methods to select, compress, and retrieve relevant context to handle long sequences beyond model limits. "context management"
  • Context window: The maximum number of tokens a model can attend to in a single input. "extremely long-horizon tasks inevitably exceed the context window"
  • Credit assignment: In RL, attributing returns or rewards to actions taken earlier in long sequences. "due to sparse rewards, high variance, and unstable credit assignment."
  • Decontamination: Removing test-set or benchmark content from training data to prevent leakage and overfitting. "We conduct careful decontamination to avoid including the papers in the test set."
  • Distillation: Training a model or creating data by imitating or extracting behavior from a stronger model. "long-horizon trajectories distilled from Claude 4.5 Sonnet (Thinking)."
  • Ephemeral use-and-destroy model: An infrastructure pattern where instances are short-lived and destroyed after use to improve scalability and security. "With an ephemeral use-and-destroy model, it efficiently supports 10,000+ concurrent instances."
  • Frontier judge model: A state-of-the-art evaluator model used to score outputs according to a rubric. "a frontier judge model J\mathcal{J} is used judge the replicated code based on the rubric tree K\mathcal{K}"
  • Group-relative advantage: An advantage estimate normalized across a group of trajectories to stabilize policy updates. "is the group-relative advantage across all nK(m)n \cdot K^{(m)} trajectories"
  • Judge model: The evaluation model that scores agent outputs against specified criteria or rubrics. "The official judge model of PaperBench is the closed-source o3-mini."
  • Kubernetes: A container orchestration platform for deploying and scaling workloads across clusters. "using Kubernetes"
  • KL regularization: Penalizing divergence between current and reference policies via Kullback–Leibler divergence to stabilize RL. "$\beta \, \mathbb{E}_{t} \Big[ \mathrm{KL} \big( \pi_\theta(\cdot \mid s_t) \,\|\, \pi_{\theta_{\mathrm{ref}(\cdot \mid s_t) \big) \Big]$" and "β weights the KL regularization."
  • Likelihood ratio: In PPO-style RL, the ratio of current policy likelihood to reference policy likelihood for an action. "rt(m,i,j)(θ)r_t^{(m,i,j)}(\theta) is the likelihood ratio"
  • Long-horizon tasks: Tasks that require many steps, extended interactions, and sustained reasoning. "Long-horizon tasks, such as bug fixing"
  • Overlapping sub-trajectories: Trajectory segments that share context to preserve continuity when splitting long sequences. "overlapping sub-trajectories for contextual continuity."
  • Partial rollouts: Truncated RL episodes used to reduce synchronization bottlenecks and improve resource utilization. "We mitigate this issue via partial rollouts and a priority-based judge queue."
  • Policy entropy: A measure of uncertainty in a policy’s action distribution; lower entropy indicates more confident actions. "assistant turns increase, and policy entropy decreases"
  • Prefix: A fixed initial context included at the start of each sub-trajectory to preserve global information. "we fix a prefix pp containing the task specification and paper-reading content"
  • Priority queue: An evaluation scheduling mechanism that processes higher-priority items first to avoid failures. "we introduce a priority queue to ensure that evaluations on evaluation set are processed with higher priority."
  • Progressive reinforcement learning: A training strategy that gradually increases task timeouts across stages to stabilize learning on very long tasks. "we propose a novel progressive reinforcement learning by gradually increasing the task timeout during training."
  • Reference policy: The previous or fixed policy used as a baseline in KL regularization and likelihood ratios. "$\pi_{\theta_{\mathrm{ref}$ is the reference policy from the previous training iteration"
  • Rejection sampling: Filtering generated trajectories by quality or score thresholds to keep only high-quality samples. "We conduct rejection sampling by checking the quality and judge scores of trajectories."
  • Rollouts: Executions of the agent policy in the environment that produce trajectories for training or evaluation. "Each rollout is split into K(m)K^{(m)} overlapping sub-trajectories"
  • Rubric tree: A structured hierarchy of evaluation criteria used to judge research reproduction quality. "rubric tree K\mathcal{K}"
  • Sandbox environment: An isolated, secure environment where the agent runs code and tools safely. "Within the sandbox environment, agent A\mathcal{A} iteratively plans"
  • Scaffolding: The auxiliary framework of tools and procedures that structure an agent’s interactions and workflows. "Scaffolding. We build on the basic scaffolding provided by PaperBench"
  • Sidecar container pattern: A deployment design where a helper container runs alongside the main application to provide auxiliary functions. "A sidecar container pattern manages 25,000+ Docker images"
  • Supervised fine-tuning (SFT): Training a model on input–output pairs to imitate desired behavior before RL. "standard supervised fine-tuning (SFT)"
  • Timeout: A maximum allowed duration for a task or rollout after which it is forcibly terminated. "a 12-hour timeout"
  • Trajectory-splitting SFT: An SFT technique that splits long trajectories into overlapping chunks with a fixed prefix to fit within the context limit. "we propose a new trajectory-splitting SFT"
  • Wall-clock time: Real elapsed time measured externally (not steps or tokens), used to bound training or task duration. "bounds the maximum wall-clock time allowed for a task at stage mm."

Open Problems

We found no open problems mentioned in this paper.

Collections

Sign up for free to add this paper to one or more collections.

Tweets

Sign up for free to view the 8 tweets with 164 likes about this paper.