How SE‑Bench Uncovers the Hidden Challenges of Knowledge Internalization in Self‑Evolving AI
The paper introduces SE‑Bench, a code‑based benchmark that isolates knowledge internalization by obfuscating NumPy APIs, and uses it to reveal the Open‑Book paradox, the RL gap, and the potential of self‑play for true self‑evolution in large language models.
Background
Self‑Evolution is considered a crucial capability for achieving artificial general intelligence (AGI). A truly general agent must not only solve problems and use tools but also retain newly acquired knowledge over long time horizons so that it can be applied to future tasks without starting from scratch. Existing evaluation methods cannot reliably measure knowledge internalization because they cannot fully exclude prior‑knowledge contamination or separate failures caused by reasoning difficulty from those caused by a lack of internalized knowledge.
SE‑Bench: Knowledge‑Internalization Testbed
SE‑Bench is a benchmark that isolates prior knowledge and tests pure memory of a newly introduced API. The benchmark is built by heavily obfuscating the NumPy library to create a synthetic library ZWC that the model has never seen during pre‑training.
Absolute Prior Isolation: All NumPy function names are replaced with random, meaningless tokens, preventing the model from exploiting any pre‑trained NumPy knowledge.
Pure Memory Isolation: Tasks consist of simple NumPy‑style operations that any competent LLM can solve if it remembers the new API, ensuring that success depends solely on internalized knowledge rather than reasoning ability.
Construction Pipeline
Obfuscation. Using 268 common NumPy functions, a wrapper library ZWC is created. Each function receives a random name (e.g., zwc.kocito), and all inputs/outputs are wrapped in a custom ZWCArray to block direct NumPy calls. The original NumPy documentation is regenerated with Gemini‑2.5‑Pro to produce a brand‑new API reference for ZWC.
Question Generation. Claude‑4.5‑Sonnet generates programming questions based on the original NumPy API. Each question is paired with at least eight test cases. Two families of tasks are produced:
Single‑Function Tasks: One‑function problems covering every function in the library (259 training + 259 test instances).
Multi‑Function Tasks: Problems that require composing three or more functions (440 test instances) to assess compositional generalization.
Filtering. All generated questions are independently solved by Qwen3‑Coder‑480B, Gemini‑2.5‑Pro, and GPT‑OSS‑120B; only questions for which all three models pass every test case are retained. A random 10% sample is manually inspected for clarity.
The final dataset contains 259 single‑function test tasks, 440 multi‑function test tasks, and 718 single‑function training tasks.
Evaluation Methodology
SE‑Bench uses an abstract‑syntax‑tree (AST) based evaluator with three strict criteria for a solution to be counted as correct:
All test cases must pass.
The AST must show that the returned value depends on ZWC APIs (no hidden NumPy usage).
The submitted code must not import or call NumPy directly.
This ensures that the benchmark measures true internalization of the new API rather than a workaround using the original library.
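The import-level check can be sketched with Python's standard `ast` module. This is a simplified illustration of the idea, not the paper's evaluator: the real system also traces whether the returned value actually depends on ZWC calls, which the sketch below omits:

```python
import ast

FORBIDDEN = {"numpy"}  # module roots the solution must not import

def uses_zwc_only(source: str) -> bool:
    """Reject solutions that import NumPy directly and require that the
    zwc module is imported at all. A minimal sketch of the AST-based
    criteria; dataflow tracing of the return value is not shown."""
    tree = ast.parse(source)
    saw_zwc = False
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            roots = {alias.name.split(".")[0] for alias in node.names}
            if roots & FORBIDDEN:
                return False          # direct NumPy import: workaround attempt
            if "zwc" in roots:
                saw_zwc = True
        elif isinstance(node, ast.ImportFrom):
            root = (node.module or "").split(".")[0]
            if root in FORBIDDEN:
                return False
            if root == "zwc":
                saw_zwc = True
    return saw_zwc
```

Static checks like this make NumPy smuggling cheap to detect before any test case is run.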
Key Experimental Findings
The Open‑Book Paradox
When the API documentation is kept in the prompt during both trajectory sampling and parameter update (Open‑SFT), models become heavily dependent on the context and lose all capability once the documentation is removed. Removing the documentation during the update phase (Closed‑SFT) forces the model to compress the information into its weights, yielding significantly higher closed‑book performance.
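The distinction between the two regimes comes down to what the prompt contains at parameter-update time. A minimal sketch, assuming a generic prompt/completion SFT format (the paper's exact template is not given):

```python
def build_sft_example(doc: str, question: str, solution: str, closed_book: bool):
    """Assemble one SFT pair. Open-SFT keeps the API documentation in the
    prompt at update time, so the model can rely on context; Closed-SFT
    drops it, so minimizing the loss forces the API into the weights.
    Illustrative format, not the paper's actual prompt template."""
    if closed_book:
        prompt = question                  # documentation removed at update time
    else:
        prompt = doc + "\n\n" + question   # documentation kept in context
    return {"prompt": prompt, "completion": solution}
```

In both regimes the documentation can still be present while sampling trajectories; the paradox is that leaving it in place during the update is exactly what prevents internalization.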
The RL Gap
Standard reinforcement learning with PPO fails to internalize knowledge in both the Open and Closed settings. Ablation experiments show that increasing the learning rate and batch size helps, but removing the PPO clipping loss and the advantage term is essential for knowledge internalization, indicating that the clipping mechanism and negative‑gradient signals hinder writing new knowledge into parameters.
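The ablation can be made concrete with per-token loss formulas. The first function is the standard PPO clipped surrogate; the second is the ablated objective implied by the findings, with clipping and advantage weighting removed, which reduces to a plain negative log-likelihood on sampled tokens. Both are illustrative sketches, not the paper's training code:

```python
import math

def ppo_token_loss(logp_new, logp_old, advantage, clip_eps=0.2):
    """Standard PPO clipped surrogate for one token (loss to minimize)."""
    ratio = math.exp(logp_new - logp_old)
    unclipped = ratio * advantage
    clipped = max(min(ratio, 1 + clip_eps), 1 - clip_eps) * advantage
    return -min(unclipped, clipped)

def internalization_loss(logp_new, logp_old, advantage):
    """Ablated objective: no clipping, no advantage weighting. What
    remains is an SFT-like negative log-likelihood, which the ablations
    suggest is what actually writes new knowledge into the parameters."""
    return -logp_new
```

Under the ablated loss, negative-advantage samples no longer push probability mass away from the new API tokens, which is one reading of why removing the advantage term helps.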
Self‑Play Viability
Self‑Play methods that generate synthetic data from the model itself can internalize knowledge when combined with Closed‑SFT (instead of RL). In Closed‑SFT‑Self‑Play, the model reaches 22.5 % accuracy on single‑function tasks and 8.7 % on multi‑function tasks, demonstrating that self‑generated data is a promising avenue for self‑evolution.
Error‑Type Analysis
Failures are categorized into five types: (1) ZWCArray attribute hallucination (e.g., calling a non‑existent method), (2) ZWC function hallucination (using an incorrect function name), (3) parameter errors, (4) return‑value misunderstandings, and (5) incompatibility with native Python. Closed‑SFT‑RL reduces attribute hallucinations dramatically (from 37% to 10%) by encouraging more conservative code, but it does not significantly affect function hallucinations or parameter errors.
Conclusions
SE‑Bench provides a clean, controllable environment for measuring knowledge internalization in self‑evolving agents. Experiments uncover two critical phenomena—the Open‑Book paradox and the RL gap—and show that parameter‑based learning with closed‑book training can achieve true internalization, while standard RL struggles without specific modifications. Self‑Play remains a viable, under‑explored direction for future research.
Resources
Paper: https://arxiv.org/pdf/2602.04811
Dataset: https://huggingface.co/datasets/jintailin/SE-Bench
Code: https://github.com/thunlp/SE-Bench
