DeepSeek Researcher Co‑authors Two New Papers on Autonomous AI Research and Continual Learning
This article summarizes two recent DeepSeek papers. The first presents an L1–L5 taxonomy and four architecture patterns for autonomous research agents. The second proposes a three-dimensional taxonomy for continual learning and covers method families, a self-improvement phase diagram, experimental comparisons, and an impossibility theorem. The article closes with production statistics from the Deli AutoResearch framework, which assisted in writing both papers.
1. From Copilots to Colleagues – A Panorama of Autonomous Research Agents
1.1 Core proposition: grading “AI research colleagues”
The paper introduces an L1–L5 autonomous‑research‑agent taxonomy, analogous to SAE levels for self‑driving, providing a precise measurement scale for the field.
L1 Automatic Completion: GitHub Copilot, only code completion.
L2 Task Execution: ChatGPT + tools, each step requires human approval.
L3 Multi‑step Autonomy: Claude Code, Cursor Agent, human intervenes only at checkpoints.
L4 Full Autonomy: Devin, AI Scientist, SWE-Agent; runs for hours to days, with only the final output evaluated.
L5 Self‑Directed: Hypothetical stage where AI selects topics, plans research agendas, and accumulates cross‑domain knowledge.
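The five levels above form an ordered scale, so they map naturally onto an ordered enumeration. The following is a minimal sketch, not code from the paper: the names `AutonomyLevel`, `EXAMPLE_SYSTEMS`, and `level_of` are hypothetical, and the example systems are simply those listed above.

```python
from enum import IntEnum
from typing import Optional


class AutonomyLevel(IntEnum):
    """L1-L5 autonomy levels; member names are illustrative, not the paper's."""
    AUTOMATIC_COMPLETION = 1  # code completion only
    TASK_EXECUTION = 2        # each step requires human approval
    MULTI_STEP_AUTONOMY = 3   # human intervenes only at checkpoints
    FULL_AUTONOMY = 4         # only the final output is evaluated
    SELF_DIRECTED = 5         # hypothetical stage: selects its own topics


# Example systems per level, copied from the list above.
EXAMPLE_SYSTEMS = {
    AutonomyLevel.AUTOMATIC_COMPLETION: ["GitHub Copilot"],
    AutonomyLevel.TASK_EXECUTION: ["ChatGPT + tools"],
    AutonomyLevel.MULTI_STEP_AUTONOMY: ["Claude Code", "Cursor Agent"],
    AutonomyLevel.FULL_AUTONOMY: ["Devin", "AI Scientist", "SWE-Agent"],
    AutonomyLevel.SELF_DIRECTED: [],  # no existing system is listed at L5
}


def level_of(system: str) -> Optional[AutonomyLevel]:
    """Return the level a system is listed under, or None if it is not listed."""
    for level, systems in EXAMPLE_SYSTEMS.items():
        if system in systems:
            return level
    return None


if __name__ == "__main__":
    # IntEnum members compare like integers, so levels can be ranked directly.
    print(level_of("Claude Code") < level_of("Devin"))
```

Using `IntEnum` rather than a plain `Enum` keeps the SAE-style ordering explicit: a comparison such as "is this system below L4?" becomes an ordinary integer comparison.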
1.2 Four major architecture patterns
The survey systematically reviews four dominant patterns: single-agent (ReAct), cross-episode (Reflexion), hierarchical orchestration (LATS/ToT), and multi-agent chat. It also covers tool-enhanced approaches and provides a quantitative comparison.
1.3 Evaluation of 17 mainstream systems
The authors analyze 17 systems across six dimensions (autonomy level, architecture, domain, tooling, evaluation, openness). Key findings:
Code agents (SWE-Agent, Devin, Claude Code, OpenHands) are the most mature, with SWE-bench success rates rising from ~5 % in early 2024 to >70 %.
Scientific discovery agents (Coscientist, ChemCrow, FunSearch) reach L4 in constrained domains.
General research agents (AI Scientist) can generate full papers at a cost of $15 per paper but cannot autonomously choose topics.
The current frontier stalls at L4: the bottlenecks are continuous knowledge accumulation, reliable self-evaluation, and scalable architecture rather than model capability.
1.4 Six open challenges
Cognitive Loops: AI can fall into infinite loops without recognizing failure.
Context Limits: Long‑term research exceeds context windows, causing loss of early key information.
Novelty Evaluation: No reliable automated metric to judge true innovation.
Reproducibility: Nondeterministic outputs, prompt sensitivity, and model‑version dependence hinder repeatability.
Safety: Dual‑use risks, uncontrolled self‑improvement, and potential scientific fraud.
Cost: Token consumption for long‑term autonomous research is huge, exacerbating academic inequality.
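The "Cognitive Loops" challenge above can be illustrated with a simple guard around an agent's action stream. This is a minimal sketch under stated assumptions, not a mechanism described in the paper: the function name `run_with_loop_guard`, the step budget, and the repeat threshold are all hypothetical, and real agents would need a fuzzier notion of "the same action" than exact equality.

```python
from collections import Counter
from typing import Hashable, Iterable, Tuple


def run_with_loop_guard(
    actions: Iterable[Hashable],
    max_steps: int = 50,
    max_repeats: int = 3,
) -> Tuple[int, str]:
    """Consume actions until they run out, the budget is spent, or one repeats too often.

    Returns (steps_executed, stop_reason). Both thresholds are illustrative.
    """
    seen = Counter()
    steps = 0
    for action in actions:
        if steps >= max_steps:
            return steps, "step_budget_exhausted"
        seen[action] += 1
        if seen[action] > max_repeats:
            # The agent keeps proposing an identical action: treat it as a loop.
            return steps, "repeated_action"
        steps += 1  # the action would be executed here
    return steps, "finished"


if __name__ == "__main__":
    # An agent alternating between the same two actions is stopped early.
    print(run_with_loop_guard(["edit", "test"] * 5))
```

A hard step budget also bounds token spend, which ties this guard to the "Cost" challenge as well.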
2. Never Stop Learning – Survey of LLM Continual Learning and Self‑Improvement
2.1 Unified three‑dimensional taxonomy: What × How × When
The paper’s major contribution is a taxonomy that simultaneously covers continual learning (CL) and self‑improvement (SI) along three axes:
What (knowledge, skills, alignment)
How (external supervision, self‑generated signals, architectural adaptation)
When (offline batch, online streaming, test‑time adaptation)
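Three axes with three values each give 27 possible cells. The sketch below enumerates them, assuming the axes can be combined freely; the summary does not say whether the paper populates every cell, and the names `WHAT`, `HOW`, `WHEN`, `CELLS`, and `describe` are hypothetical.

```python
from itertools import product

# Axis values as listed above.
WHAT = ("knowledge", "skills", "alignment")
HOW = ("external supervision", "self-generated signals", "architectural adaptation")
WHEN = ("offline batch", "online streaming", "test-time adaptation")

# Every (what, how, when) combination; assumes the axes are independent.
CELLS = list(product(WHAT, HOW, WHEN))


def describe(cell):
    """Render one taxonomy cell as a short phrase."""
    what, how, when = cell
    return f"{what} via {how} ({when})"


if __name__ == "__main__":
    print(len(CELLS))
    print(describe(CELLS[0]))
```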
2.2 Five method families and quantitative comparison
Over 100 papers are grouped into five families, including parameter isolation (Adapter/LoRA), regularization (EWC/SI, where SI here denotes Synaptic Intelligence rather than self-improvement), replay, and architectural (MoE) methods, with a quantitative suitability assessment for LLM scenarios.
2.3 Three‑phase diagram of self‑improvement
The authors formalize self‑improvement trajectories into three fates:
Convergent: External validation or human data anchors the loop; performance improves monotonically to a fixed point (e.g., STaR, SPIN).
Plateau: After 3–5 rounds, returns diminish as the system exhausts its correctable errors.
Collapse: Without external signals, the model's output distribution collapses over generations and performance degrades.
2.4 Original experiment: interaction of continual learning and self‑improvement
A multi‑model comparison (DeepSeek‑Chat, GPT‑5.2, Claude Sonnet 4.6, Gemini 3 Flash) reveals:
SI‑only: GPT‑5.2 exhibits deterministic collapse on GSM8K—after three self‑refinement rounds accuracy locks at 80 %.
CL + SI with replay buffer: Adding prompt-based replay restores GPT-5.2 accuracy to 88.3 %, escaping the collapse attractor.
Validator quality matters: Strong validators (Claude) yield smaller CL gains; weaker validators (Gemini) produce larger gains (+15 pp).
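The summary does not describe how the prompt-based replay is implemented. One plausible reading is a buffer of earlier question/answer pairs that are sampled and prepended to each new prompt. The sketch below follows that assumption only; the class `PromptReplayBuffer`, its methods, and the `Q:`/`A:` prompt format are hypothetical and are not the paper's experimental code.

```python
import random
from collections import deque


class PromptReplayBuffer:
    """Bounded store of earlier (question, answer) pairs replayed inside prompts."""

    def __init__(self, capacity: int = 100, seed: int = 0):
        self._items = deque(maxlen=capacity)  # oldest pairs are evicted first
        self._rng = random.Random(seed)       # seeded for repeatable sampling

    def __len__(self) -> int:
        return len(self._items)

    def add(self, question: str, answer: str) -> None:
        """Store a pair, ideally one an external validator has accepted."""
        self._items.append((question, answer))

    def build_prompt(self, new_question: str, k: int = 2) -> str:
        """Prepend up to k replayed pairs to the new question."""
        k = min(k, len(self._items))
        replayed = self._rng.sample(list(self._items), k)
        blocks = [f"Q: {q}\nA: {a}" for q, a in replayed]
        blocks.append(f"Q: {new_question}\nA:")
        return "\n\n".join(blocks)


if __name__ == "__main__":
    buf = PromptReplayBuffer(capacity=3)
    buf.add("2 + 2?", "4")
    buf.add("3 * 5?", "15")
    print(buf.build_prompt("7 - 4?", k=2))
```

Because the replay lives in the prompt rather than in the weights, this style of replay needs no gradient updates, which is presumably why it can be applied to API-only models.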
2.5 Formal theorem: impossibility triangle for CL and SI
When inter‑task gradient conflict (γ‑diversity) is sufficiently large, no parameter solution can simultaneously achieve optimal continual learning (forgetting ≤ ε) and optimal self‑improvement (gain ≥ δ) unless ε + δ ≥ γ(T‑1)/(d/r).
Interpretation: larger model capacity d and smaller SI update rank r (e.g., LoRA) alleviate the trade‑off, explaining why large models with parameter‑efficient fine‑tuning are currently the most practical path.
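To make the interpretation concrete, the right-hand side of the bound can be evaluated for a few settings. This is a worked-arithmetic sketch only: the summary does not define T, so it is assumed here to be the number of tasks, and all numeric values are illustrative rather than taken from the paper. The function name `tradeoff_lower_bound` is hypothetical.

```python
def tradeoff_lower_bound(gamma: float, num_tasks: int, d: float, r: float) -> float:
    """Evaluate gamma * (T - 1) / (d / r), the lower bound on epsilon + delta.

    gamma: inter-task gradient conflict; num_tasks: T (assumed to be the task count);
    d: model capacity; r: rank of the self-improvement update.
    """
    if num_tasks < 1 or d <= 0 or r <= 0:
        raise ValueError("need num_tasks >= 1 and positive d, r")
    return gamma * (num_tasks - 1) / (d / r)


if __name__ == "__main__":
    # Same conflict and task count, different update ranks (illustrative numbers).
    low_rank = tradeoff_lower_bound(gamma=0.5, num_tasks=5, d=4096, r=8)
    high_rank = tradeoff_lower_bound(gamma=0.5, num_tasks=5, d=4096, r=64)
    print(low_rank, high_rank)
```

With these numbers the bound is 0.00390625 at r = 8 and 0.03125 at r = 64: raising the rank eightfold raises the unavoidable forgetting-plus-shortfall budget eightfold, while doubling d would halve it.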
3. Behind the Scenes – How the Deli AutoResearch Framework Co‑authored the Papers
Both papers disclose that the Deli AutoResearch framework assisted in their generation, using DeepSeek-V4-Pro for text and reasoning and GPT-Image-2 for figure creation.
Total time: Paper 1 – 6 days; Paper 2 – ~11 hours.
First-draft time: Paper 1 – 76 minutes; Paper 2 – not applicable.
Research iteration rounds: Paper 1 – 6; Paper 2 – 10.
Agent interaction rounds: Paper 1 – ~108; Paper 2 – 18+.
Estimated token consumption: Paper 1 – ~648 K; Paper 2 – ~1.58 M.
BibTeX entries: Paper 1 – 103; Paper 2 – 151.
Citation verification rate: 100 % for both.
Hallucinated citation entries: 0 for both.
Pages: Paper 1 – 45; Paper 2 – 47.
Figures: Paper 1 – 7; Paper 2 – 8.
Original experiments: Paper 1 – 0; Paper 2 – 1 (multi-model × 3 reproductions).
Surprising details:
Paper 1’s first draft produced a 42‑page LaTeX document in 76 minutes (~27 lines/minute).
Paper 2 underwent three rounds of LLM peer review, raising its score from 6.0/10 to 8.0/10.
All 254 citations were cross‑validated by AI, resulting in zero hallucinated references.
Continual learning: https://victorchen96.github.io/continual_learning_survey.pdf
Automated research: https://victorchen96.github.io/auto_research_survey.pdf
