ROME Model in ALE
- ROME Model in ALE is an open-source agentic LLM built on a sparse 30B MoE-transformer architecture that enables dynamic tool-use and multi-turn interactions.
- It integrates ALE’s components—ROLL, ROCK, and iFlow CLI—to orchestrate data synthesis, RL optimization through IPA, and secure, reproducible trajectory generation.
- Empirical evaluations demonstrate ROME’s superior performance on programming and agentic tasks, achieving faster convergence and reliable results across diverse benchmarks.
The ROME model, formally "ROME is Obviously an Agentic Model," is an open-source agentic LLM explicitly designed for multi-turn, tool-using, real-world agentic tasks, and is developed and grounded within the Agentic Learning Ecosystem (ALE). ALE provides a unified infrastructure for agent LLM training and deployment, facilitating large-scale data pipeline integration, environment roll-outs, policy optimization, and end-to-end reproducibility. ROME is distinguished by its architecture (30B MoE-transformer, ~3B activated parameters per token), an IPA (Interaction-based Policy Alignment) RL algorithm, a principled multi-tier data composition pipeline, and close coupling of model, environment, and context via ALE's ROLL, ROCK, and iFlow CLI components. ROME demonstrates leading benchmark performance on programming, agentic tool-use, and multi-domain terminal environments, establishing a canonical workflow for building and evaluating large agent models in an open ecosystem (Wang et al., 31 Dec 2025).
1. Agentic Learning Ecosystem (ALE) Infrastructure
ALE's architecture is comprised of three tightly integrated subsystems:
- ROLL (Reinforcement Learning Optimization for Large-Scale Learning): A distributed RL post-training framework decomposing agentic optimization into fine-grained rollout (concurrent LLM-token generation, environment steps, reward calculation), asynchronous training through staleness-bounded sample buffers, and dynamic GPU multiplexing for minimizing idle time. ROLL is responsible for orchestration of large-scale RL data generation, synchronizing weights across workers post-optimization, and scheduling jobs with complete traceability.
- ROCK (Reinforcement Open Construction Kit): A secure, sandboxed execution and validation environment manager. ROCK supports multi-process, multi-language tool execution within isolated Docker containers, standardizing the RL interaction API (provision/reset/step/close) and maintaining reproducibility through environment image registries (EnvHub). ROCK features a ModelProxyService that transparently routes LLM inference through iFlow CLI, ensuring strict context fidelity, fault isolation, and optional multi-agent orchestration.
- iFlow CLI: The agent orchestrator and context engineering engine, responsible for context construction, compression, memory management, retrieval, task-spec injection, loop and workflow management, and tool invocation. All inference and RL trajectory interactions are managed through iFlow CLI, enabling persistent memory (e.g., TODO files), explicit workflow specs, and user-supplied tool suites (via MCP interfaces).
ALE's components enable joint control over model optimization, environment grounding, and contextual prompt/interaction engineering, ensuring reproducibility, modularity, and scalability of agentic LLM training and inference (Wang et al., 31 Dec 2025).
2. ROME Model Architecture and Grounding
ROME is instantiated as a 30B-parameter sparse MoE transformer derived from Qwen3-MoE, with 48 transformer layers (every other layer replaced by a MoE block with 64 experts) and approximately 3B parameters activated per token by expert gating. Each layer combines multi-head self-attention (32 heads) with MoE-FFN sublayers.
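The sparse-activation pattern behind the 30B-total/~3B-active parameter count can be illustrated with a top-k gating sketch. The dimensions, expert functions, and routing weights below are illustrative stand-ins, not ROME's actual routing code: per token, only the top-k experts selected by the router ever execute, so most parameters stay inactive.

```python
import math

def moe_forward(x, gate_w, experts, top_k=2):
    # gate_w: one routing row per expert; experts: list of callables (FFN stand-ins).
    logits = [sum(g * xi for g, xi in zip(row, x)) for row in gate_w]
    top = sorted(range(len(logits)), key=lambda i: logits[i])[-top_k:]
    m = max(logits[i] for i in top)
    weights = [math.exp(logits[i] - m) for i in top]
    z = sum(weights)
    weights = [w / z for w in weights]      # softmax over the selected experts only
    out = [0.0] * len(x)
    for w, i in zip(weights, top):
        for j, e in enumerate(experts[i](x)):
            out[j] += w * e                 # only top_k experts ever execute
    return out

# Toy layer: 3 experts, top-2 routing; the unselected expert costs nothing.
experts = [lambda x: x,                     # identity
           lambda x: [2 * xi for xi in x],  # doubler
           lambda x: [-xi for xi in x]]     # negator
gate_w = [[2.0, 0.0], [1.0, 0.0], [0.0, 0.0]]
y = moe_forward([1.0, 0.0], gate_w, experts, top_k=2)
```

The same mechanism scales to 64 experts per MoE block: compute cost tracks the number of selected experts, not the total expert count.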
ROME's model development pipeline is strictly routed through ALE: pretraining and RL hyperparameters are controlled by ROLL's “Cluster” abstraction, and all interaction (generation in trajectory environments, evaluation rollouts, inference) is run in ROCK's sandboxed environments with context engineered by iFlow CLI to guarantee prompt and tool call fidelity.
The architecture supports native agentic actions (tool calls, command-line environment interaction, internal memory updates) through the ALE interface, with structured context specification, end-to-end reproducibility, and modular system prompt configuration (Wang et al., 31 Dec 2025).
3. Data Composition and Trajectory Pipeline
ROME leverages a two-tier data synthesis protocol:
Tier I: Basic Data (≈100B tokens)
- Curated from ~1M high-star GitHub repositories: project concatenation, filtered issues, and PRs.
- Fine-grained tasks: minimal file localization, code repair via search-and-replace, unit-test synthesis, PR comment-response chains.
- Chain-of-Thought augmentation, enforced by rejection sampling, ensures label/trace fidelity.
Tier II: Agentic Data (≈30B tokens)
- Synthetic tool-use dialogues with explicit tool-call execution traces and feedback across diverse settings (single-/multi-turn, single-/multi-tool).
- Programming-centric data generated via a four-stage multi-agent pipeline: Explore Agent (drafting), Instance Builder (self-play with build/test and ROCK validation), Review Agent (LLM-judgement plus functional/coverage checks), and Trajectory Agent (final multi-turn scaffolds).
- Robust multi-stage filter (syntax, LLM-based relevance, sandbox execution, expert audits) ensures high-fidelity, executable agentic traces.
Data validation emphasizes semantic, tool-grounded behavior, forming a trajectory corpus (>1M trajectories) that supports stable RL optimization and generalization beyond fixed tool regimes (Wang et al., 31 Dec 2025).
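The multi-stage filter can be sketched as a chain of predicates that a candidate trace must fully pass. The stage implementations below are illustrative stand-ins, not the pipeline's actual code: a real syntax check, plus a toy "execution" stage standing in for ROCK-sandboxed validation.

```python
def run_filters(candidates, stages):
    """A candidate survives only if every stage predicate passes."""
    return [c for c in candidates if all(stage(c) for stage in stages)]

def syntax_ok(code):
    try:
        compile(code, "<cand>", "exec")
        return True
    except SyntaxError:
        return False

def executes_ok(code):
    # Stand-in for sandboxed execution: run in an empty namespace.
    try:
        exec(code, {})
        return True
    except Exception:
        return False

cands = ["x = 1 + 1", "def f(:", "raise ValueError('bad')"]
survivors = run_filters(cands, [syntax_ok, executes_ok])  # -> ['x = 1 + 1']
```

Ordering cheap stages (syntax) before expensive ones (sandbox execution, audits) lets the pipeline discard most bad candidates early.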
4. Policy Optimization: Interaction-based Policy Alignment (IPA)
ROME's RL fine-tuning is governed by the IPA algorithm, which defines a chunked MDP in which each tool-call interaction chunk, rather than each token, forms an atomic RL action.
- Chunking: A trajectory is segmented into chunks, each ending with a tool invocation or episode termination.
- Rewards: Rewards are sparse, assigned only to complete, successful trajectories.
- Discounted Return: Each chunk is assigned a discounted return, weighted by a chunk-level importance-sampling ratio, and masked if the policy shift is excessive.
- Gradient Estimation: REINFORCE gradients are applied chunk-wise, splitting positive/negative outcomes and clipping importance-sampling ratios.
- Chunk-level Resampling and Sequential Rollback: At "crucial forks," expert states are reset and rollouts are resumed with the online policy to prevent zero-gradient collapse; these resampled rollouts are mixed with imitation-learning updates.
- Total Loss: The chunk-level policy-gradient objective combined with the imitation-learning term from the resampled rollouts.
IPA yields stable gradients, enhanced credit assignment, and faster convergence relative to token-level RL, particularly on long-horizon, sparse reward agentic tasks (Wang et al., 31 Dec 2025).
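The chunked-MDP mechanics can be sketched as follows. This is a minimal illustration under standard REINFORCE-with-importance-sampling assumptions; the tool-call delimiter, discount, and clip values are illustrative, not ROME's actual hyperparameters.

```python
import math

def segment_chunks(tokens, is_tool_call):
    """Split a trajectory into chunks, each ending at a tool invocation
    (or at episode end) -- the atomic actions of the chunked MDP."""
    chunks, cur = [], []
    for t in tokens:
        cur.append(t)
        if is_tool_call(t):
            chunks.append(cur)
            cur = []
    if cur:
        chunks.append(cur)
    return chunks

def chunk_returns(n_chunks, final_reward, gamma=0.99):
    # Sparse reward: only the terminal success signal, discounted back per chunk.
    return [final_reward * gamma ** (n_chunks - 1 - i) for i in range(n_chunks)]

def ipa_loss(logp_new, logp_old, returns, clip=0.2):
    """Chunk-wise REINFORCE with clipped importance-sampling ratios; a chunk
    whose policy shift exceeds the clip range contributes only a truncated term."""
    loss = 0.0
    for ln, lo, G in zip(logp_new, logp_old, returns):
        rho = math.exp(ln - lo)                      # chunk-level IS ratio
        rho = max(1.0 - clip, min(1.0 + clip, rho))  # bound excessive shift
        loss -= rho * G * ln
    return loss / len(returns)

chunks = segment_chunks(["ls", "<tool>", "cat f", "<tool>", "done"],
                        lambda t: t == "<tool>")
returns = chunk_returns(len(chunks), final_reward=1.0, gamma=0.5)
loss = ipa_loss([-1.0, -1.0, -1.0], [-1.0, -1.2, -0.5], returns)
```

Because credit is assigned per tool-call chunk rather than per token, the gradient signal concentrates on the decisions that actually change the environment.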
5. ROLL: Distributed RL and Post-Training Optimization
ROLL coordinates RL optimization by parallelizing LLM generation, environment interaction, and reward computation as atomic units. Key mechanisms include:
- Fine-grained sample-level parallelization backed by a bounded-staleness asynchronous buffer.
- Dynamic Train–Rollout multiplexing to maximize GPU utilization by reallocating resources from idle rollout workers to active trainers.
- Weight synchronization ensures global policy coherence across all workers after each RL epoch.
These mechanisms reduce post-training RL wall time by 30–40% without degrading final policy quality, underpinning scalable agentic RL (Wang et al., 31 Dec 2025).
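The bounded-staleness buffer can be sketched with a small data structure, an illustrative stand-in rather than ROLL's implementation: each rollout carries the policy version that generated it, and samples more than a fixed number of versions behind the trainer are evicted rather than trained on.

```python
from collections import deque

class StalenessBoundedBuffer:
    """Evicts rollouts generated by a policy more than `max_staleness`
    versions behind the current trainer weights."""
    def __init__(self, max_staleness=2):
        self.max_staleness = max_staleness
        self.version = 0                     # trainer's current policy version
        self.buf = deque()

    def put(self, sample, policy_version):
        self.buf.append((sample, policy_version))

    def advance(self):
        # Called after each trainer update: bump version, drop stale samples.
        self.version += 1
        self.buf = deque((s, v) for s, v in self.buf
                         if self.version - v <= self.max_staleness)

    def get_batch(self, n):
        return [self.buf.popleft()[0] for _ in range(min(n, len(self.buf)))]

buf = StalenessBoundedBuffer(max_staleness=1)
buf.put("rollout-a", policy_version=0)
buf.advance()             # update 1: staleness 1, still trainable
buf.advance()             # update 2: staleness 2 > 1, evicted
batch = buf.get_batch(4)  # -> []
```

The staleness bound is what lets rollout workers run asynchronously from the trainer without the policy learning from arbitrarily off-policy data.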
6. ROCK: Sandboxed Trajectory Generation and Evaluation
ROCK orchestrates the sandboxed environment required for agentic trajectory generation:
- Each RL step triggers a sandbox instance (Docker-based) with egress restrictions (Rocklet).
- All model inference is routed via ModelProxyService → iFlow CLI for context completeness.
- EnvHub guarantees environment reproducibility; supports multi-agent workflows and fault isolation.
- Trajectory traces are stored with fine-grained tool call/output/reward/error codes for further RL and IPA error masking.
This infrastructure separates code/execution context from model, supporting rigorous, secure, and reproducible benchmarking (Wang et al., 31 Dec 2025).
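The standardized interaction API can be sketched as an in-memory stand-in; the class and method bodies below are illustrative only, whereas a real ROCK instance backs each call with an isolated Docker container provisioned from an EnvHub image.

```python
class SandboxEnv:
    """Illustrative sketch of the provision/reset/step/close lifecycle
    that ROCK standardizes for RL environments."""
    def __init__(self, image):
        self.image = image        # EnvHub image tag (the reproducibility key)
        self.trace = []
        self.open = False

    def provision(self):
        self.open = True          # a real impl would start a container here
        return self

    def reset(self, task):
        assert self.open, "call provision() first"
        self.task = task
        self.trace.clear()
        return {"observation": f"task: {task}"}

    def step(self, tool_call):
        # Record tool call, output, reward, and error code for later RL use.
        output = f"ran {tool_call}"   # stand-in for sandboxed execution
        record = {"call": tool_call, "output": output,
                  "reward": 0.0, "error": 0}
        self.trace.append(record)
        return record

    def close(self):
        self.open = False
        return self.trace

env = SandboxEnv("envhub/demo:1").provision()
obs = env.reset("fix failing test")
rec = env.step("pytest -q")
trace = env.close()
```

Returning the full trace on close is what makes each step's tool call, output, reward, and error code available downstream for IPA's error masking.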
7. iFlow CLI: Context Engineering and Agent Orchestration
iFlow CLI manages:
- Memory and Retrieval: Persistent todo lists and semantic retrieval from project knowledge bases or vector DBs.
- Context Compression/Pruning: Lossy and lossless context reduction to fit the model's context window.
- Workflow/Spec Injection: Modular system prompt/workflow configuration, supporting custom toolchains (via MCP) and specialized task handling.
- Task Decomposition: Sub-agent partitioning for modular subtask context isolation and orchestration.
Optimized context fit yields 20–30% inference speedup by eliminating redundant token consumption and maximizing context-window utilization (Wang et al., 31 Dec 2025).
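One form of the lossy pruning can be sketched as a recency-based budget fit. The whitespace token counting is an illustrative simplification; iFlow CLI's actual compression also spans lossless reduction, retrieval, and memory, which this sketch omits.

```python
def fit_context(messages, budget, n_tokens=lambda m: len(m.split())):
    """Keep the first (system) message and as many of the most recent
    messages as fit the token budget; older middle turns are dropped."""
    system, rest = messages[0], messages[1:]
    kept, used = [], n_tokens(system)
    for msg in reversed(rest):
        cost = n_tokens(msg)
        if used + cost > budget:
            break                 # budget exhausted: drop older turns
        kept.append(msg)
        used += cost
    return [system] + kept[::-1]  # restore chronological order

msgs = ["sys prompt", "a b c", "d e", "f"]
kept = fit_context(msgs, budget=5)  # -> ['sys prompt', 'd e', 'f']
```

Pinning the system message while trimming from the middle preserves both the task specification and the most recent tool feedback, which is typically what the next action depends on.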
8. Empirical Evaluation and Benchmark Results
ROME's performance is validated across several agentic and programming benchmarks:
| Benchmark | ROME | Qwen3-30B | Devstral-24B | GPT-OSS-120B | Gemini-2.5 | GLM-4.5 | GPT-5-Mini | Avg |
|---|---|---|---|---|---|---|---|---|
| Terminal-Bench 1.0 | 41.50 | 28.50 | 28.33 | 31.25 | 23.75 | 30.00 | 33.75 | 32.68 |
| Terminal-Bench 2.0 | 24.72 | 13.48 | 18.20 | 21.12 | 16.40 | 17.30 | 20.97 | 19.39 |
| SWE-Bench Verified | 57.40 | 46.33 | 51.87 | 43.93 | 28.73 | 56.20 | 59.30 | 50.71 |
| SWE-Bench Multilingual | 40.00 | 30.00 | 27.00 | 34.84 | 11.50 | 38.16 | 49.67 | 35.81 |
| Terminal-Bench Pro (pub) | 40.50 | 26.00 | 32.17 | 32.00 | 23.67 | 33.00 | 34.75 | 32.73 |
| Terminal-Bench Pro (priv) | 21.50 | 11.33 | 17.00 | 27.83 | 15.17 | 15.83 | 29.50 | 20.53 |
| Average | 37.60 | 25.94 | 29.10 | 31.83 | 19.87 | 31.75 | 37.99 | 32.14 |
ROME leads comparable 30B–120B open models on most Terminal-Bench and SWE-Bench variants (GPT-OSS-120B and GPT-5-Mini score higher on the private Terminal-Bench Pro set, and GPT-5-Mini on the SWE-Bench splits), and demonstrates robust generalization to tool-use and agentic evaluation (TAU²-Bench, BFCL-v3, MTU-Bench).
A plausible implication is that ALE's end-to-end integration of modeling, data, RL, and environment delivers consistent, scalable agentic performance with models substantially smaller than bespoke LLMs, suggesting a decisive advantage of the ALE pipeline for open-source, high-fidelity agentic development (Wang et al., 31 Dec 2025).