SE-Bench: Benchmarking Self-Evolution with Knowledge Internalization

Published 4 Feb 2026 in cs.CL, cs.AI, and cs.LG | (2602.04811v1)

Abstract: True self-evolution requires agents to act as lifelong learners that internalize novel experiences to solve future problems. However, rigorously measuring this foundational capability is hindered by two obstacles: the entanglement of prior knowledge, where "new" knowledge may appear in pre-training data, and the entanglement of reasoning complexity, where failures may stem from problem difficulty rather than an inability to recall learned knowledge. We introduce SE-Bench, a diagnostic environment that obfuscates the NumPy library and its API doc into a pseudo-novel package with randomized identifiers. Agents are trained to internalize this package and evaluated on simple coding tasks without access to documentation, yielding a clean setting where tasks are trivial with the new API doc but impossible for base models without it. Our investigation reveals three insights: (1) the Open-Book Paradox, where training with reference documentation inhibits retention, requiring "Closed-Book Training" to force knowledge compression into weights; (2) the RL Gap, where standard RL fails to internalize new knowledge completely due to PPO clipping and negative gradients; and (3) the viability of Self-Play for internalization, proving models can learn from self-generated, noisy tasks when coupled with SFT, but not RL. Overall, SE-Bench establishes a rigorous diagnostic platform for self-evolution with knowledge internalization. Our code and dataset can be found at https://github.com/thunlp/SE-Bench.

Summary

The paper demonstrates that closed-book SFT significantly outperforms RL methods in encoding novel, obfuscated API knowledge.
It introduces SE-Bench, a diagnostic platform that obfuscates NumPy functions to measure pure knowledge internalization in self-evolving agents.
Experimental results reveal that withholding documentation during training is essential for robust parametric memory and autonomous learning.

SE-Bench: A Diagnostic Platform for Knowledge Internalization in Self-Evolving Agents

Motivation and Problem Setting

Self-evolution—defined as a model's capacity to autonomously acquire, internalize, and consolidate novel knowledge through interaction with its environment—is foundational to AGI. Existing methods evaluate fragments of this ability, such as incremental adaptation or tool-use, but fail to cleanly disentangle pure knowledge internalization from confounds like prior exposure and reasoning complexity. The two core obstacles are: (1) pre-training contamination, which precludes certainty that an agent’s knowledge is not memorized from training data; and (2) entanglement with task complexity, where failures may reflect reasoning limitations rather than deficits in experiential knowledge assimilation.

SE-Bench addresses this with an adversarially clean, synthetic domain: it obfuscates the well-known NumPy API and its documentation into a novel, randomized package (ZWC), thus precluding any leakage from public corpora or base knowledge. Agents are required to internalize ZWC's logic and API purely through exposure to its documentation during training, but must subsequently complete tasks in a closed-book regime, operating without documentation at test-time.

Figure 1: Overview of the SE-Bench construction pipeline. Steps include function and docstring obfuscation (creating ZWC), automatic task generation, and consensus filtering with multiple LLMs plus human verification.

This experimental design cleanly isolates the target phenomenon: task completion at test time is algorithmically trivial given knowledge of ZWC, but effectively impossible without it, because the randomized identifiers cannot be guessed. Thus, performance directly measures parametric knowledge internalization capacity.
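The renaming idea can be illustrated with a minimal sketch. It is not the SE-Bench implementation: it uses the standard-library `math` module as a dependency-free stand-in for NumPy, the helper names `random_name` and `obfuscate` are hypothetical, and it omits the docstring rewriting and input/output wrapping that the benchmark describes.

```python
import math
import random
import string
import types


def random_name(rng, length=6):
    """Return a meaningless lowercase identifier."""
    return "".join(rng.choice(string.ascii_lowercase) for _ in range(length))


def obfuscate(module, names, seed=0):
    """Expose `names` from `module` under random identifiers.

    Returns (package, mapping), where mapping goes from original to new name.
    Hypothetical helper: the real benchmark also rewrites documentation.
    """
    rng = random.Random(seed)
    package = types.SimpleNamespace()
    mapping = {}
    used = set()
    for name in names:
        new_name = random_name(rng)
        while new_name in used:  # avoid collisions between random names
            new_name = random_name(rng)
        used.add(new_name)
        setattr(package, new_name, getattr(module, name))
        mapping[name] = new_name
    return package, mapping


# `math` stands in for NumPy here so the sketch has no third-party dependency.
zwc_like, mapping = obfuscate(math, ["sqrt", "floor", "gcd"])
hidden_sqrt = getattr(zwc_like, mapping["sqrt"])
print(mapping)
print(hidden_sqrt(16.0))  # same behaviour as math.sqrt, reachable only by the new name
```

Without the mapping, a caller has no way to tell which random identifier computes a square root, which is the property the benchmark relies on.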

Benchmark Construction and Evaluation Protocol

The SE-Bench construction pipeline has three critical phases:

Obfuscation: 268 NumPy functions are systematically renamed to semantically void ZWC identifiers, and all documentation is rewritten to refer only to ZWC, erasing any original NumPy tokens.
Task/Case Generation: State-of-the-art LLMs generate single- and multi-function programming tasks along with ground-truth test cases, mapped to ZWC's API.
Filtering: Three distinct strong models (Qwen3-Coder-480B, Gemini-2.5-Pro, and GPT-OSS-120B) must each solve the generated tasks using the original NumPy API. Only tasks solved by all three are retained, with additional human filtering for clarity.
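The consensus rule in the filtering phase can be sketched as follows. This is an illustration under stated assumptions: the solvers are toy lookup functions rather than real model calls, and the task names and `consensus_filter` helper are hypothetical.

```python
from typing import Callable, Dict, List

Solver = Callable[[str], bool]


def consensus_filter(tasks: List[str], solvers: Dict[str, Solver]) -> List[str]:
    """Keep only tasks that every solver reports as solved."""
    return [t for t in tasks if all(solve(t) for solve in solvers.values())]


# Toy stand-ins: each "solver" looks the task up in the set of tasks it can solve.
solved_by = {
    "model_a": {"mean_of_array", "sort_rows", "clip_values"},
    "model_b": {"mean_of_array", "sort_rows"},
    "model_c": {"mean_of_array", "clip_values", "sort_rows"},
}
solvers = {name: (lambda task, ok=ok: task in ok) for name, ok in solved_by.items()}

tasks = ["mean_of_array", "sort_rows", "clip_values", "fft_peak"]
kept = consensus_filter(tasks, solvers)
print(kept)  # ['mean_of_array', 'sort_rows']
```

Requiring unanimity trades dataset size for confidence that a retained task is solvable and unambiguous when the API is known.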

Training is conducted with access to documentation; evaluation is strictly closed-book. The final 1,417-task SE-Bench comprises both "Single" (single-API) and "Multiple" (multi-step compositional) splits.

The evaluation metric strictly enforces three constraints: (a) all test cases must pass; (b) abstract syntax tree analysis must show exclusive use of ZWC; and (c) the original NumPy must never be imported.
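Constraints (b) and (c) can be approximated with Python's standard `ast` module. The sketch below is an assumption about how such a check might look, not the benchmark's evaluator: the function name `check_solution`, the package name `zwc`, and the identifier `kocito` are illustrative, and it only inspects static `import` statements, so dynamic imports would slip through.

```python
import ast


def check_solution(source: str, package: str = "zwc", banned: str = "numpy") -> dict:
    """Statically inspect a candidate solution's import statements.

    Reports whether the banned library is imported and whether the obfuscated
    package is imported at all. Running the test cases is out of scope here.
    """
    tree = ast.parse(source)
    imported = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            imported.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            imported.add(node.module.split(".")[0])
    return {
        "imports_banned": banned in imported,
        "uses_package": package in imported,
    }


good = "import zwc\n\ndef solve(xs):\n    return zwc.kocito(xs)\n"
bad = "import numpy as np\n\ndef solve(xs):\n    return np.mean(xs)\n"

print(check_solution(good))  # {'imports_banned': False, 'uses_package': True}
print(check_solution(bad))   # {'imports_banned': True, 'uses_package': False}
```

A solution would count only if `uses_package` is true, `imports_banned` is false, and every test case passes. Because the source is parsed and never executed, the check works even when the `zwc` package is not installed.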

Experimental Results and Major Empirical Insights

Extensive baselines from three families are evaluated:

Memory-based: External memory retrieval methods (ACE and ExpeL)
Parameter-update: Supervised fine-tuning (SFT), reinforcement learning (RL), and hybrids, each run under "Open" (documentation present during update) and "Closed" (documentation withheld) regimes
Self-play/autonomous: Fully self-generated data with downstream SFT or RL

Key results:

Zero-shot accuracy using ZWC without documentation is 0.0%, confirming the benchmark’s immunity to training leakage.
Closed-SFT and Closed-SFT-RL substantially outperform Open variants and memory retrieval, achieving up to 54.4% accuracy, but still trail "oracle" NumPy usage (>90%).
All RL-only and Open-book SFT methods fail completely (0% accuracy), including state-of-the-art methods such as Absolute Zero.
Self-play is viable if coupled with SFT, but not RL: Closed-SFT on self-generated curricula achieves 22.5% accuracy, whereas RL again yields 0%.

Figure 2: Withholding documentation during SFT (Closed-SFT) induces true parametric internalization; Open-SFT fails without documentation, exposing brittle context dependence.
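The difference between the Open and Closed regimes comes down to what the prompt contains at update time. A minimal data-preparation sketch follows, assuming rollouts were collected with documentation in context; the `Rollout` fields, the `to_sft_examples` helper, and the choice to keep only passing rollouts are assumptions for illustration, not details confirmed by the paper.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Rollout:
    documentation: str  # API doc shown to the model while it produced the answer
    question: str
    response: str
    passed: bool        # whether the response passed all test cases


def to_sft_examples(rollouts: List[Rollout], closed_book: bool = True) -> List[dict]:
    """Turn passing rollouts into prompt/completion pairs.

    With closed_book=True the documentation is dropped from the prompt, so the
    update cannot lean on context for the API knowledge.
    """
    examples = []
    for r in rollouts:
        if not r.passed:
            continue  # assumption: train only on verified solutions
        prompt = r.question if closed_book else f"{r.documentation}\n\n{r.question}"
        examples.append({"prompt": prompt, "completion": r.response})
    return examples


rollouts = [
    Rollout("DOC: zwc.kocito(x) returns the mean of x", "Average a list.", "zwc.kocito(xs)", True),
    Rollout("DOC: zwc.kocito(x) returns the mean of x", "Sort a list.", "zwc.kocito(xs)", False),
]
closed = to_sft_examples(rollouts, closed_book=True)
opened = to_sft_examples(rollouts, closed_book=False)
print(closed[0]["prompt"])
print(opened[0]["prompt"])
```

The completions are identical in both regimes; only the prompt changes, which is what makes the comparison between Open-SFT and Closed-SFT a controlled one.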

Analysis of Optimization Dynamics

SFT and RL exhibit starkly divergent optimization characteristics for knowledge internalization.

Open-Book Paradox: When documentation remains accessible during parameter update, no true internalization occurs. Test-time performance collapses without documentation, exposing a strict reliance on context.
Closed-Book Superiority: Withholding documentation during update enforces genuine parametric memory: any performance achieved without context must come from the weights, not from short-term memory.
RL Gap: Classic RL methods (PPO, GRPO) fail to internalize obfuscated APIs due to two factors:
- PPO’s clipping precludes the large, informative weight updates needed to encode novel factual knowledge
- Use of normalized advantages introduces negative gradients early, erasing transient, fragile associations before they consolidate
An ablation shows that SFT-style settings for RL (positive-reward only, large learning rate and batch size, no clipping) can partially close the gap, but as soon as standard RL mechanisms are reintroduced, learning collapses.
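The two mechanisms named in the RL Gap can be seen in a small numerical sketch. It uses the standard PPO clipped surrogate and one common form of group-normalized advantage, (reward - mean) / std; the exact objective and normalization used in the paper's experiments are not stated here, so treat the function names and the default `eps=0.2` as assumptions.

```python
import statistics


def clipped_surrogate(ratio: float, advantage: float, eps: float = 0.2) -> float:
    """PPO clipped objective for one sample: min(r*A, clip(r, 1-eps, 1+eps)*A)."""
    clipped_ratio = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)


def surrogate_slope(ratio: float, advantage: float, eps: float = 0.2, h: float = 1e-6) -> float:
    """Finite-difference slope of the objective with respect to the ratio."""
    up = clipped_surrogate(ratio + h, advantage, eps)
    down = clipped_surrogate(ratio - h, advantage, eps)
    return (up - down) / (2 * h)


def normalized_advantages(rewards):
    """Group-normalized advantages: (reward - mean) / population std."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0:
        return [0.0 for _ in rewards]  # identical rewards carry no learning signal
    return [(r - mean) / std for r in rewards]


# Inside the trust region a good sample still pushes the policy...
print(surrogate_slope(1.0, advantage=1.0))
# ...but once the ratio exceeds 1 + eps the slope is zero, capping the update.
print(surrogate_slope(1.5, advantage=1.0))

# With a sparse binary reward, one success makes every failure negative.
print(normalized_advantages([1.0, 0.0, 0.0, 0.0]))
# If no rollout succeeds, all advantages are zero and nothing is learned.
print(normalized_advantages([0.0, 0.0, 0.0, 0.0]))
```

The first pair of calls shows how clipping bounds how far a single update can move probability mass toward a newly rewarded token sequence; the second pair shows where negative gradients come from under sparse rewards.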

Figure 3: Distribution of error types. Closed-SFT is dominated by API hallucinations; Hybrid Closed-SFT-RL suppresses hallucinations and encourages robust fallback strategies.

Autonomous Self-Play and Internalization

A crucial question is whether agents can autonomously generate and learn from their own (potentially noisy) curricula. Direct self-play with RL (Absolute Zero) fails entirely, mirroring the RL Gap. However, using SFT on self-generated tasks recovers significant non-trivial performance, indicating that the optimization method, not data quality, is the bottleneck for self-evolving internalization.

Behavior Consolidation and Error Shift

Closed-SFT yields "probabilistic" memory, where agents often fill in gaps with plausible hallucinations (e.g., inventing missing ZWCArray attributes). When RL is applied after SFT ("Closed-SFT-RL"), there is a dramatic shift: RL prunes uncertain, lazy guesses and favors disciplined, more robust code, frequently replacing implausible calls with primitive constructs (loops, built-ins). However, RL is unable to repair fundamental memorization errors or incorrect API signatures inherited from SFT—these persist.

Diversity Effects

Figure 4: Impact of question vs. response diversity on knowledge internalization. Diverse question exposure is far more consequential than response diversity.

The diversity of training questions is crucial for efficient internalization—merely increasing solution diversity per question has minor effects relative to the breadth of encountered problem templates.

Self-Play Internalization Dynamics

Figure 5: Self-play SFT (Closed-SFT$_\text{self}$) outperforms open-book self-play, conforming to the Open-Book Paradox even in a fully autonomous regime.

The Open-Book Paradox persists under self-generated curricula: documentation exposure during update inhibits generalization to documentation-free test-time.

Implications and Future Directions

SE-Bench exposes and quantifies multiple previously underexplored phenomena essential to the development of robust self-evolving machine intelligence:

Closed-book learning as a requirement: Accessible context during parameter update is actively detrimental to long-term retention and parametric compression.
Limitations of RL for factual internalization: Standard RL protocols optimize utilization and behavioral adaptation, but cannot efficiently encode novel, structured knowledge unless modified to remove clipping and negative gradients.
Autonomous learning is viable, contingent on optimization: Self-generated data is sufficient for parametric internalization in LLMs—charting a clear path for self-guided AGI learning, if architectural and algorithmic changes (towards SFT-style updates) are implemented.
Behavior refinement pipeline: SFT can be leveraged for acquisition and RL for consolidation and error pruning, but not for de novo retention.

Looking forward, SE-Bench offers a minimal diagnostic test for self-evolution—a "needle-in-a-haystack" for parametric memory. It will serve as an important baseline for further advances in continual learning, hybrid update protocols, and the mechanistic elucidation of memory consolidation in large-scale architectures.

Conclusion

SE-Bench establishes an empirically robust, disentangled environment for evaluating and studying knowledge internalization in AI agents. Through rigorous ablation and diverse experimental settings, it demonstrates that parametric memory formation depends on withholding documentation during parameter updates, that classic RL is inadequate for factual acquisition, and that closed-book SFT (possibly seeded with autonomous data) is essential for robust self-evolving agents. This work provides a necessary lower bound for future research on self-improving AI and informs both the limitations of current training paradigms and the prospects for open-ended, autonomous machine learning (2602.04811).

Explain it Like I'm 14

Overview

This paper is about teaching and testing AI models to truly “learn and remember” new knowledge, not just look things up. The authors build a simple but clever test called SE-Bench. It takes a well-known Python library called NumPy (a toolbox for math with arrays) and hides it behind made-up names, turning it into a “new” package. Models must learn these new names and how to use them during training, then solve coding tasks later without any notes. If they really internalize the knowledge, the tasks are easy; if not, they’re impossible.

Key Questions

The paper asks a few clear questions:

Can we fairly measure whether an AI actually learned something new, instead of just recalling something it already knew?
Does training “with notes open” (having the documentation visible) help or hurt long-term memory?
Can reinforcement learning (RL)—learning by trial-and-error with rewards—make models internalize new facts?
Can a model teach itself with its own practice problems?

How SE-Bench Works (Explained Simply)

Think of a workshop full of tools with familiar names (like “hammer,” “wrench,” “screwdriver”). Now imagine we rename them to nonsense words like “kocito,” “brifla,” and “zendri.” The tools still do the same things, but all their labels and manuals use the new names. We then:

Build the “fake” package by wrapping NumPy so that its functions have random, nonsense names. Inputs and outputs are also wrapped, so the model can’t secretly call the old tools.
Generate simple coding problems that use these functions (like “find the average” or “sort a list”) and lots of test cases.
Filter and check all problems with multiple strong models and humans to make sure they’re clear and easy if you know the new names.
Train models with access to the documentation (a “cheat sheet” showing what each nonsense name does).
Test models later without the documentation. The only way to succeed is to remember which nonsense name matches which NumPy function.

To evaluate solutions, they do more than check the output. They also verify that the code actually used the obfuscated package and didn’t import the real NumPy (no cheating!).

Main Findings and Why They Matter

Here are the major discoveries from the experiments:

The Open-Book Paradox: Training with the documentation visible can make memory worse. When models see the docs during training, they rely on them and don’t store the knowledge in their “brains.” Removing the docs during the parameter updates (“closed-book training”) forces the model to truly remember. Result: Closed-book training worked much better than open-book.
The RL Gap: Standard reinforcement learning (like PPO/GRPO) failed to internalize the new names and facts. Two RL “safety” features caused trouble:
- Clipping (which limits how much the model can change with each update) prevents the big shifts needed to memorize new names.
- Negative gradients (penalties for below-average attempts) can wipe out fragile new memories before they stick.
Self-Play Works with SFT, Not RL: If a model creates its own practice problems and solutions (self-play), it can learn—but only if you train it with supervised fine-tuning (SFT), which is like learning from worked examples. Using RL on self-play problems didn’t help. Using SFT did.
Best Combo: First do closed-book SFT to learn and store the new facts. Then add a bit of RL to make the model’s code more robust (it reduces “guessing” and encourages reliable techniques), even though RL alone can’t teach the facts from scratch.

These results are important because they show a clean way to test whether an AI can truly learn new knowledge and highlight training methods that actually help models remember.

Implications and Potential Impact

A simple, fair “unit test” for self-evolving AI: SE-Bench shows clearly whether a model can internalize knowledge. If the model learned the secret names, tasks are easy; if not, it fails. This removes confusion about “did it learn, or did it just recall something it knew before?”
Training advice:
- Use closed-book SFT to make models store new facts in their parameters.
- Be careful with standard RL for learning new facts; it’s better at polishing behavior after the facts are learned.
- Self-generated practice can help, as long as you use SFT to learn from it.
Research direction: SE-Bench offers a safe, controlled environment to study how AI learns and remembers. It can guide the community toward better training strategies for future self-evolving agents.

In short, SE-Bench gives a clear way to check whether an AI truly learns. It shows that “learning with the book closed” helps memory, RL isn’t great at learning new facts, and self-play can work if paired with the right training method.

Knowledge Gaps

Knowledge gaps, limitations, and open questions

Below is a focused list of what remains uncertain or unexplored, framed to guide actionable future work:

External validity beyond NumPy renaming: Do SE-Bench’s conclusions (Open-Book Paradox, RL Gap, SFT/RL roles) hold for other libraries (e.g., Pandas, PyTorch), programming paradigms (stateful/OOP APIs), languages (JavaScript, Java), and non-code domains (textual knowledge, mathematics, tools)?
Novel-concept learning vs symbol remapping: SE-Bench evaluates a 1-to-1 identifier remap of known semantics. How well do methods internalize genuinely new concepts/algorithms whose semantics are not present in pretraining?
Depth of compositionality: Current multi-function tasks test shallow composition. How does performance scale with compositional depth/branching (e.g., 3→10+ function chains, nested/broadcasting semantics, dtype/shape edge cases)?
Continual and lifelong learning: What happens when agents sequentially learn multiple obfuscated packages? Measure catastrophic interference/forgetting, cumulative capacity, and retention over time and further training.
Sample efficiency and scaling laws: How many exposures per function are required for stable internalization? Provide learning curves vs number of examples, model size, and training steps; estimate scaling exponents.
Generalizability across model families and sizes: Are the core findings robust across Llama/Mistral/GPT/CodeLlama families and larger models (≥70B)? Do instruction-tuned vs base models behave differently?
On-policy, closed-book RL: The RL experiments are off-policy (rollouts with doc, training without) and omit importance sampling. Can properly on-policy, closed-book RL (collecting trajectories without docs after warm-up) succeed?
Off-policy correction validity: Ignoring the importance ratio (due to vanishing numerator) biases gradients. Can practical off-policy estimators (e.g., conservative corrections, behavior cloning regularizers) enable RL internalization?
RL algorithmic variants: Does the RL Gap persist for algorithms that avoid PPO clipping and negative advantages (e.g., unclipped REINFORCE with ReLU advantages, NPG/TRPO with enlarged trust regions, V-MPO, Phasic PG, P3O, UPGO)? What is the role of KL regularization magnitude?
Reward shaping and credit assignment: The reward is sparse and binary. Do shaped rewards (per-test-case partial credit, correctness of API choice, signature validation, static-analysis-based partials) enable RL to internalize mappings?
Mechanistic analysis of the RL Gap: Provide gradient-level/mechanistic evidence (e.g., logit tracking for ZWC tokens, activation patching, causal tracing) that clipping/negative advantages prevent probability mass transfer necessary for new-token adoption.
Verification robustness and bypasses: AST checks may miss dynamic imports, reflection, exec/eval, __import__, or solving via pure Python without ZWC reliance. Quantify and mitigate bypasses via runtime sandboxing, syscall/module tracing, taint analysis, and docstring/source introspection blocking.
Documentation leakage channels: At test time, can models access zwc.__doc__, inspect.getsource, help(), or other introspection APIs? Enforce/verify that such channels are blocked and report leakage rates.
Self-play data quality: Quantify correctness/coverage/diversity of self-generated tasks and tests, their error modes, and how noise impacts SFT. Explore automatic filtering (cross-model consensus, mutation testing, fuzzing), and curriculum strategies.
Signature memorization and name recall: The dominant residual errors include function-name and parameter-signature mistakes. Do targeted curricula (flashcards, contrastive signature drills, auxiliary losses on API tokens) improve these failure modes?
Hybrid learning design: The hybrid Closed-SFT-RL improves utilization but not factual corrections. Can alternating schedules (SFT for acquisition, RL for consolidation, retrieval-assisted remediation for signature errors) further reduce hallucinations?
Impact of closed-book training on general capabilities: Does Closed-SFT harm broader instruction following/coding ability or induce catastrophic forgetting elsewhere? Evaluate before/after on standard code and reasoning benchmarks.
Robustness to noisy/misleading docs: How do methods respond to partially incorrect, outdated, or adversarial documentation during training? Can models detect, correct, or remain robust to doc noise?
Cross-obfuscation generalization: After learning one random mapping, can the model adapt more quickly to a new mapping (meta-learning effects)? Does it learn general mechanisms for API mapping vs memorizing specific tokens?
Memory vs parametric internalization: Memory-based methods help but are far from perfect. Can explicit memory + distillation (retrieve mapping at train time, then SFT into weights) outperform either alone? What retrieval and consolidation policies work best?
Online, gradient-free adaptation: Can agents internalize mappings via test-time adaptation without gradient updates (e.g., iterative reasoning, external memory writes, tool-augmented self-explanation), and retain it across episodes?
Benchmark breadth and difficulty control: Provide systematically parameterized hardness knobs (e.g., compositional depth, broadcasting quirks, dtype corner cases), formally document coverage, and report per-category performance.
Reproducibility and variance: Many results are averaged over 5 rollouts/pass@64. Report variance across seeds, mapping randomness, and model sampling to ensure claims are statistically robust.
Security/threat model specification: Clearly state the adversary model (what is allowed/forbidden at test time), enumerate potential bypass vectors, and provide a hardened reference evaluator.
Parameter-efficient training: Do LoRA/IA3/QLoRA reproduce Closed-SFT gains? What are memory/compute trade-offs for internalization at scale?
Extending to multi-modal/tool ecosystems: Apply SE-Bench-like obfuscation to tool APIs, bash utilities, HTTP endpoints, or vision/audio libraries to test internalization beyond code-only array operations.
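For the reproducibility item above, a short sketch of how pass@k and its spread across seeds might be reported. The source does not say which estimator the authors used; the formula below is the widely used unbiased estimator 1 - C(n-c, k) / C(n, k), and the per-seed counts are made-up numbers for illustration only.

```python
import statistics
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n samples of which c are correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k samples are incorrect.
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical counts of correct samples out of n=64, one entry per random seed.
correct_per_seed = [3, 5, 2, 4, 6]
scores = [pass_at_k(64, c, 8) for c in correct_per_seed]

print(f"mean pass@8 = {statistics.fmean(scores):.3f}")
print(f"std  pass@8 = {statistics.stdev(scores):.3f}")
```

Reporting the standard deviation next to the mean makes it visible whether a gap between two methods is larger than seed-to-seed noise.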

Practical Applications

Immediate Applications

Below are concrete, deployable use cases that leverage SE-Bench’s dataset, evaluation protocol, and empirical findings (Open-Book Paradox, RL Gap, Self-Play viability). Each item notes sectors, potential tools/workflows, and key dependencies that affect feasibility.

Release-gating for “self-evolving” claims — software, AI vendors, policy/compliance Use SE-Bench as a standardized unit test for knowledge internalization when vendors claim “continual learning” or “self-evolution.” Require near-ceiling scores on Single-Function tasks and strong multi-function generalization before shipping agentic features. Tools/workflows: Integrate SE-Bench into CI/CD; pass/fail gates akin to security test suites; AST-based adherence checks to prevent bypassing constraints. Dependencies/assumptions: Access to post-training evaluation; buy-in from product and risk teams; acceptance of SE-Bench as a recognized benchmark in procurement/standards.
Closed-book SFT training recipe for internalization — software/DevTools, MLOps Adopt the “Closed-Book Training” recipe: use rich documentation/hints during trajectory collection, strip them during parameter updates. This reliably compresses new API knowledge into weights (resolving context-dependence). Tools/workflows: Data pipeline that logs doc-rich rollouts and automatically redacts docs for SFT; fine-tune orchestration (e.g., LoRA/QLoRA) with documentation-off batches. Dependencies/assumptions: Fine-tunable base models and compute; careful prompt/data engineering to avoid inadvertently leaking doc content into training text.
Internal SDK/API onboarding for code assistants — software/platform engineering Fine-tune internal code assistants to truly learn proprietary SDKs/microservices rather than rely on in-context docs. Use SE-Bench-like obfuscation and AST-enforced evaluation to detect hallucinations and confirm weight-level internalization. Tools/workflows: Obfuscation wrappers for internal libraries; AST-based linters to forbid external imports or banned namespaces; regression dashboards on single- vs multi-function tasks. Dependencies/assumptions: Clean, high-quality internal docs; permission to fine-tune on company data; guardrails for IP handling.
Positive-only “internalization RL” ablation for research/dev — software/AI research For teams experimenting with RL, replicate the paper’s findings: remove PPO-style clipping and negative advantage if the goal is factual internalization, or preferentially use SFT for that phase. Use RL after SFT for utilization/robustness gains, not for first-time memorization. Tools/workflows: Toggleable RL objectives (log-prob vs probability ratios), advantage shaping (binary/positive-only), curriculum schedulers that shift from SFT to RL. Dependencies/assumptions: RL infrastructure (GRPO/PPO variants), careful monitoring to mitigate instability when removing safety mechanisms (clipping).
Self-play SFT for autonomous curriculum building — software/AI research, education Let the model draft its own tasks and tests from documentation, then apply Closed-Book SFT on that self-generated data (the paper’s “Closed-SFT_self”). This reduces reliance on curated labels while still achieving meaningful internalization. Tools/workflows: Self-play task/test generators, quality filters (consensus among multiple models), automatic doc redaction during updates. Dependencies/assumptions: Base model capable of generating minimally valid tasks; basic filtering/verification; compute to iterate.
Vendor-neutral benchmark extensions for other stacks — software, cloud, robotics Reuse the obfuscation + wrapper + AST verification pattern to build SE-Bench variants for Pandas, PyTorch, ROS/robotics drivers, cloud SDKs, or internal tools. This creates durable, leakage-proof evaluations for new domains. Tools/workflows: Domain-specific wrappers and object proxies (like ZWCArray); doc translators; AST or bytecode inspectors tailored to the target language. Dependencies/assumptions: Stable API surfaces; language-specific static analysis; minimal performance overhead from wrappers for realistic testing.
Hallucination diagnostics and remediation — software/DevTools, safety Use SE-Bench error taxonomy (e.g., attribute and function hallucination, signature misalignment) to create targeted linting and coaching for code assistants. Post-SFT RL can reduce “lazy” guesses by encouraging robust, grounded implementations. Tools/workflows: Error classifiers, example rewriters, fallback strategies to primitives, run-time checks for banned namespaces. Dependencies/assumptions: Telemetry of generation errors; safe RL configurations for consolidation (not memorization).
Compliance-style audits for data leakage and provenance — policy/safety, regulated industries Demonstrate absence of pretraining leakage: zero-shot on obfuscated APIs should be at or near 0%. Use this as part of model provenance documentation in regulated procurement. Tools/workflows: SE-Bench zero-shot tests; tamper-proof logs; third-party audit bundles. Dependencies/assumptions: Willingness to disclose test results; acceptance by auditors/regulators.
Course/lab modules on “internalization vs. in-context use” — academia/education Teach modern post-training principles with hands-on labs: closed-book SFT vs open-book SFT, RL ablations, self-play SFT. Students replicate and interpret the Open-Book Paradox and RL Gap. Tools/workflows: Course notebooks, lightweight SE-Bench subsets, small open LLMs (e.g., 1–7B) to keep costs manageable. Dependencies/assumptions: Compute quotas; permissive model licenses for teaching.
CI checks for tool-use adherence — software/tooling Enforce company standards (e.g., disallow direct NumPy in certain repos). AST-based checks derived from SE-Bench’s evaluator ensure model-generated code uses approved APIs. Tools/workflows: AST/IR-level policy engines, pre-commit hooks, PR bots. Dependencies/assumptions: Language and framework coverage; developer acceptance of automated gating.

Long-Term Applications

These rely on further research, scaling, or domain adaptation beyond the current NumPy-based SE-Bench. They outline product and policy directions likely to emerge as the field matures.

Enterprise “Knowledge Internalization Platform” — enterprise software, knowledge management A service that ingests proprietary docs/SOPs, creates obfuscated curricula, runs closed-book SFT, and certifies internalization before deploying updated assistants across orgs. Potential product: Managed “Closed-Book Trainer” with self-play SFT, audit logs, and pass/fail gates. Dependencies/assumptions: Robust domain-general obfuscation and verification; strong privacy guarantees; integration with enterprise IAM and data governance.
Continual learning pipelines with safe consolidation loops — MLOps, safety Production pipelines that schedule periodic closed-book SFT on new corpora, followed by carefully constrained RL for consolidation and hallucination reduction. Potential workflow: SFT (internalize) → RL (consolidate/utilize) → regression audits (SE-Bench-like tasks). Dependencies/assumptions: Catastrophic forgetting mitigation; safe RL variants; automated drift detection.
Sector-specific internalizers for rapidly changing rules — finance, healthcare, legal Assistants that internalize new regulations, coding standards, or clinical protocols while providing auditable proof of internalization (no reliance on always-visible context). Potential product: “Policy Internalizer” with change-diffing, test generation, and certification artifacts for auditors. Dependencies/assumptions: High-quality, up-to-date canonical sources; rigorous approval workflows; sector-specific liability frameworks.
Robotics/tooling adapters for new hardware APIs — robotics, manufacturing Benchmarks and training kits that obfuscate hardware drivers or tool APIs so robots can truly learn new capabilities and compose them safely. Closed-book SFT ensures internalization; post-SFT RL enhances robust execution. Potential product: “Robot API Learner” SDK with simulator-based AST-equivalent constraints and safety interlocks. Dependencies/assumptions: High-fidelity simulators; safe transfer to real hardware; fail-safe design.
Education platforms that teach models and humans together — education/edtech Co-learning systems where the platform generates curricula (for both students and models), uses closed-book phases to enforce mastery, and provides explainable internalization diagnostics. Potential product: Dual-agent tutors that can both learn new syllabi and prove mastery (benchmarks per module). Dependencies/assumptions: Pedagogical validation; robust, interpretable mastery indicators.
Industry-wide standards for “self-evolution” claims — policy/standards Formal definitions and tests (SE-Bench-style) embedded in certification programs, procurement templates, and regulatory guidance for autonomous learning systems. Potential tools: Standardized Internalization Scorecard; third-party test suites; compliance marks. Dependencies/assumptions: Multi-stakeholder consensus; portability to non-code domains; governance for updates.
Cross-language and multi-modal internalization benchmarks — software, multimodal AI Extend the obfuscation-plus-verification recipe to Java/C++ APIs, SQL dialects, GUI automation toolkits, and multimodal tool use (vision, audio). Potential product: “SE-Bench Suite” covering major enterprise stacks and modalities. Dependencies/assumptions: Language-specific static analysis; comparable “no-bypass” constraints; scalable task/test generation.
Safer RL for knowledge use (not memorization) — AI research, safety New RL algorithms specifically tuned to consolidate and utilize internalized knowledge without destabilizing learning, preserving the empirical benefit found in the paper’s SFT→RL pipeline. Potential tools: Utilization-RL objectives, anti-hallucination reward shaping, advantage clipping/positivity schedules. Dependencies/assumptions: Theoretical advances in stability; reproducible baselines across domains; robust evals that separate “factual internalization” from “behavioral optimization.”
Privacy-preserving internalization for sensitive domains — healthcare, finance, govtech Closed-book SFT pipelines that internalize sensitive knowledge with strong privacy guarantees (federated or on-prem), plus proofs that models no longer require sensitive context at inference. Potential product: On-prem “Internalization Appliance” with audit trails and secret scanning. Dependencies/assumptions: Private fine-tuning infrastructure; legal clarity on model updates; defense against membership inference.
Personalized assistants that genuinely learn user ecosystems — daily life, productivity Over time, assistants internalize a user’s tools/workflows (e.g., home automation schemas, personal scripts) and stop needing explicit prompts or docs to act. Potential product: Local fine-tuning on-device or in a private cloud with periodic closed-book updates. Dependencies/assumptions: Lightweight fine-tuning on edge; user consent and control; safeguards against drift or misgeneralization.
Benchmark-driven procurement and SLAs — enterprise IT, policy Contracts specify internalization targets (e.g., pass@K thresholds on domain-specific SE-Bench variants) and remediation actions if models regress after updates. Potential tools: SLA clauses tied to benchmark metrics; continuous monitoring dashboards. Dependencies/assumptions: Legal frameworks recognizing technical benchmarks; automated, repeatable test harnesses.
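A pass@K threshold of the kind mentioned in the last item could be checked roughly as follows. This is a minimal sketch, assuming the widely used unbiased pass@k estimator from code-generation evaluation; the source does not say how its Pass@64 numbers are computed. The function names `pass_at_k` and `meets_sla` and the threshold values are hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one task.

    n: samples drawn, c: samples that passed the tests, k: attempt budget.
    Equals 1 - C(n - c, k) / C(n, k), the probability that a random
    size-k subset of the n samples contains at least one passing sample.
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failures exist, so every size-k subset has a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def meets_sla(per_task_counts, k: int, threshold: float):
    """Hypothetical SLA check: mean pass@k over tasks versus a contract threshold.

    per_task_counts: iterable of (n, c) pairs, one per benchmark task.
    Returns (passed, mean_score).
    """
    scores = [pass_at_k(n, c, k) for n, c in per_task_counts]
    mean_score = sum(scores) / len(scores)
    return mean_score >= threshold, mean_score


if __name__ == "__main__":
    # Two tasks, two samples each: one half-solved, one fully solved.
    print(meets_sla([(2, 1), (2, 2)], k=1, threshold=0.75))
```

A regression clause would then compare `mean_score` before and after a model update rather than against a fixed number.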

Notes on key assumptions and dependencies across applications

Fine-tuning access: Many applications assume access to model weights, or at least to parameter-efficient adapters such as LoRA, plus sufficient compute.
Domain adaptation: SE-Bench’s approach generalizes, but wrappers, obfuscation, and validators must be rebuilt per language/domain.
Safety vs. capability: Removing PPO clipping or negative advantages may destabilize RL; use SFT for internalization and reserve RL, applied with care, for robust utilization.
Evaluation integrity: AST/IR checks are language-specific; bypass-resistant evaluation must be designed per stack.
Data/IP governance: Training on proprietary docs requires strong privacy, logging, and compliance processes.
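To make the "bypass-resistant evaluation" note concrete, here is a minimal sketch of an AST-based check for Python submissions. It is not the paper's verification protocol, whose rules are not given here; it only flags the most direct ways of reaching the original library instead of the obfuscated wrapper. The banned-module set and the function name `find_bypasses` are assumptions made for illustration.

```python
import ast

# Assumption: this banned list is illustrative, not the benchmark's actual rule set.
BANNED_MODULES = frozenset({"numpy"})


def _root(name: str) -> str:
    """Return the top-level package of a dotted module path."""
    return name.split(".")[0]


def find_bypasses(source: str, banned=BANNED_MODULES) -> list[str]:
    """List statements in `source` that import a banned module."""
    violations = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            for alias in node.names:
                if _root(alias.name) in banned:
                    violations.append(f"line {node.lineno}: import {alias.name}")
        elif isinstance(node, ast.ImportFrom):
            # node.module is None for relative imports such as `from . import x`.
            if node.module and _root(node.module) in banned:
                violations.append(f"line {node.lineno}: from {node.module} import ...")
        elif isinstance(node, ast.Call):
            # Catch only the simplest dynamic form: __import__("numpy").
            func, args = node.func, node.args
            if (isinstance(func, ast.Name) and func.id == "__import__"
                    and args and isinstance(args[0], ast.Constant)
                    and isinstance(args[0].value, str)
                    and _root(args[0].value) in banned):
                violations.append(f"line {node.lineno}: __import__({args[0].value!r})")
    return violations


if __name__ == "__main__":
    submission = "import numpy as np\nprint(np.zeros(3))\n"
    print(find_bypasses(submission))
```

A static check like this misses `importlib.import_module`, `exec`, and similar routes, which is one reason the validator has to be designed per stack rather than reused as-is.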


Glossary

Abstract Syntax Tree (AST): A tree-structured representation of code used for structural verification and constraint checking. "we employ a strict Abstract Syntax Tree (AST) verification protocol."
Advantage (RL): A reinforcement learning signal measuring how much better an action is relative to a baseline, guiding policy updates. "Ablation of RL components. Internalization collapses when using PPO clip loss or including negative advantage."
Closed-Book Training: A training setup where reference materials are removed during updates to force knowledge into model parameters. "requiring 'Closed-Book Training' to force knowledge compression into weights;"
Closed-RL: Reinforcement learning performed without access to documentation during training (closed-book). "Closed-RL achieves zero performance."
Closed-SFT: Supervised fine-tuning without documentation in the training context to promote internalization. "Closed-SFT successfully internalizes knowledge, maintaining performance even when documentation is absent."
Compositional complexity: The difficulty arising from composing multiple functions/APIs to solve a task. "we stratify the test set by compositional complexity:"
Compositional Generalization: The ability to combine previously learned components or functions to solve new, multi-step tasks. "By retaining the library's original structure, we can evaluate whether agents can compose internalized functions to solve multi-step problems beyond their specific training examples that involves only a single function."
Consensus Filtering: A quality-control process that retains tasks only if multiple strong models independently agree on correct solutions. "We employ a strict Consensus Filtering protocol."
Deliberative Alignment: OpenAI's alignment approach in which models are trained to reason explicitly over written safety specifications before responding. "OpenAI's Deliberative Alignment"
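GRPO algorithm: Group Relative Policy Optimization, a PPO variant for post-training LLMs that replaces the learned value baseline with advantages normalized within a group of sampled responses. "All RL-based methods utilize the GRPO algorithm"
Group-normalized advantage: An advantage computed by normalizing each trajectory's reward against the other trajectories sampled in the same group (typically subtracting the group mean and dividing by the group standard deviation), producing both positive and negative signals. "and $A_i$ is the group-normalized advantage (containing both positive and negative values)."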
Hallucination: Model-generated content that invents non-existent APIs or behaviors, often due to uncertainty. "primarily due to hallucination: qualitative analysis reveals that Qwen3-8B frequently tries to use NumPy namespace"
Importance sampling ratio: A correction factor used in off-policy RL to account for differences between behavior and target policies. "theoretically necessitates an importance sampling ratio"
In-Context: A setting where relevant documentation or hints are provided in the model’s input at inference/training time. "ZWC In-Context"
Isomorphic: Structurally equivalent in a way that preserves functionality or logic under a mapping. "the logic is isomorphic to standard NumPy"
KL regularization: A penalty term based on Kullback–Leibler divergence to constrain policy updates during RL. "we omit the KL regularization."
Knowledge obfuscation: A process that hides or renames known APIs/identifiers to create a pseudo-novel domain. "We employ a knowledge obfuscation mechanism"
Needle in a Haystack test: A diagnostic that checks whether a model can find or use specific information that makes tasks trivial if known and impossible if not. "the community needs a 'Needle in a Haystack' test"
Off-policy: An RL training regime where learning uses data generated under a different policy (or prompt condition) than the one being optimized. "applying RL in the similar off-policy settings"
Open-Book Paradox: The phenomenon where access to documentation during training inhibits long-term knowledge retention. "the Open-Book Paradox, where training with reference documentation inhibits retention"
Open-RL: Reinforcement learning conducted with documentation present in the training context. "Open-RL"
Open-SFT: Supervised fine-tuning with documentation present in the training context. "Closed-SFT vs. Open-SFT."
Parametric memory: Knowledge encoded directly in model parameters rather than stored/retrieved from context. "rely on parametric memory rather than context."
Pass@64: A code-generation metric indicating the probability that at least one of 64 attempts solves the task. "Pass@64 of Qwen3-8B."
PPO clipping: The Proximal Policy Optimization mechanism that clips the policy probability ratio in the surrogate objective, limiting how far a single update can move the policy. "due to PPO clipping and negative gradients"
Prefix-RL: An RL approach that leverages privileged prefixes or prompts during training to improve generalization. "Prefix-RL"
Pre-training leakage: Undesired reliance on knowledge already present in pre-training data, confounding evaluation of novelty. "confirms no pre-training leakage."
Self-Play: A training paradigm where a model generates its own tasks and learns from self-produced data. "the viability of Self-Play"
Supervised Fine-Tuning (SFT): Updating model parameters by maximizing likelihood on labeled examples (teacher-forced trajectories). "Supervised Fine-Tuning (SFT)"
Trust region violation: Making parameter updates that exceed the small, safe update region assumed by trust-region methods, risking instability. "effectively a 'trust region violation.'"
veRL: A reinforcement learning framework/tooling used to implement training algorithms like GRPO. "within the veRL framework."
Wrapper package: A software layer that renames or proxies underlying library functions while preserving behavior. "we implement ZWC as a wrapper package."
Zero-Shot: Performing a task without any task-specific training or examples provided to the model. "ZWC Zero-Shot"
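Several glossary entries (advantage, group-normalized advantage, importance sampling ratio, PPO clipping) describe pieces of one objective. The sketch below shows how they fit together in the standard PPO/GRPO formulation. It is not the paper's training code: the use of population standard deviation, the clip range of 0.2, and the helper `keep_positive` (one possible reading of "excluding negative advantage") are assumptions.

```python
from statistics import mean, pstdev


def group_normalized_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of responses to the same prompt.

    Returns (r_i - group mean) / (group std + eps). Using the population
    standard deviation is an assumption; implementations differ.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def clipped_surrogate(ratio, advantage, clip_eps=0.2):
    """Standard PPO clipped surrogate for a single token or sample.

    ratio is the importance sampling ratio pi_new / pi_old. The objective
    takes the minimum of the unclipped and clipped terms, so moving the
    ratio further in the rewarded direction past the clip range earns
    no additional objective.
    """
    clipped_ratio = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
    return min(ratio * advantage, clipped_ratio * advantage)


def keep_positive(advantages):
    """Hypothetical ablation helper: zero out negative advantages."""
    return [max(a, 0.0) for a in advantages]


if __name__ == "__main__":
    advantages = group_normalized_advantages([1.0, 0.0, 0.0, 1.0])
    print(advantages)                    # two positive, two negative signals
    print(keep_positive(advantages))     # negatives replaced by 0.0
    print(clipped_surrogate(1.5, 1.0))   # clipped: the gain is capped
    print(clipped_surrogate(1.5, -1.0))  # unclipped: the penalty is kept
```

With a positive advantage, the clip caps the objective once the ratio exceeds 1 + clip_eps, so a large shift toward a newly rewarded answer stops contributing gradient. This is the mechanism the glossary quotes associate with failed internalization.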
