AutoResearch SKILL Open‑Source: Framework for Long‑Horizon Autonomous Research
The open‑sourced Deli AutoResearch SKILL is a three‑layer framework for long‑horizon tasks. It tackles cognitive loops, stalling, and runtime fragility by persisting state to files, detecting stalls, and running a heartbeat watchdog. It also includes a paper‑writing skill, whose released papers reach in‑framework self‑rated scores of up to 8.6.
The Deli AutoResearch SKILL has been released as an open‑source protocol for long‑horizon autonomous tasks such as academic paper writing, continual learning, and self‑play research.
Framework Mechanism
AutoResearch addresses three recurring failure modes observed in long‑running agents: (1) cognitive loops – repeated similar directions with diminishing returns; (2) stalling – agents finish a chunk and wait for user feedback; (3) runtime fragility – context compaction silently breaks the loop. The solution is a three‑layer architecture:
Orchestrator layer reads progress, detects stalls, and generates new directions.
Work Agent layer executes the concrete task.
Guardian layer runs a heartbeat watchdog that checks task liveness, restarts stalled loops, and nudges agents.
State is persisted to files (progress.json, findings.jsonl, directions_tried.json, etc.) and the orchestrator injects only curated state into fresh sessions, avoiding context accumulation.
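The skill itself ships no code, so the following is only a minimal Python sketch of how an adopter might persist state and build a curated context for a fresh session. The file names come from the skill; the function names, the `"running"` status value, and the `last_n` cutoff are illustrative assumptions.

```python
import json
from pathlib import Path


def init_state(state_dir: Path) -> None:
    """Create the state directory and an initial progress.json if missing."""
    state_dir.mkdir(parents=True, exist_ok=True)
    progress = state_dir / "progress.json"
    if not progress.exists():
        # Field names follow the skill; the "running" value is an assumption.
        progress.write_text(json.dumps(
            {"iteration": 0, "total_findings": 0,
             "status": "running", "stale_count": 0}
        ))


def append_finding(state_dir: Path, finding: dict) -> None:
    """Append one finding as a JSON line; findings.jsonl is append-only."""
    with (state_dir / "findings.jsonl").open("a", encoding="utf-8") as f:
        f.write(json.dumps(finding) + "\n")


def curated_context(state_dir: Path, last_n: int = 5) -> dict:
    """Collect only what a fresh session needs, not the full history."""
    progress = json.loads((state_dir / "progress.json").read_text())
    findings_path = state_dir / "findings.jsonl"
    lines = (findings_path.read_text(encoding="utf-8").splitlines()
             if findings_path.exists() else [])
    recent = [json.loads(line) for line in lines[-last_n:] if line.strip()]
    tried_path = state_dir / "directions_tried.json"
    tried = json.loads(tried_path.read_text()) if tried_path.exists() else []
    return {"progress": progress,
            "recent_findings": recent,
            "directions_tried": tried}
```

Because each session starts from the returned dictionary rather than from prior conversation, the context size stays bounded by `last_n` regardless of how many iterations have run.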
Stall Detection & Direction Pivot
Stall detection increments a stale_count when an iteration yields no new findings or a metric drops. When stale_count reaches a threshold, the system forces a pivot by changing a structural constraint rather than tweaking tactical parameters. Direction diversity is enforced by requiring each new direction to differ from all previously tried ones.
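The counter update described above could look roughly like this in Python. This is a sketch under assumptions: the function name and signature are hypothetical, and resetting `stale_count` to zero after a productive iteration is my assumption, since the skill only states when the counter increases.

```python
from typing import Optional


def update_stale_count(progress: dict, new_findings: int,
                       metric: Optional[float] = None,
                       prev_metric: Optional[float] = None) -> dict:
    """Return an updated copy of the progress record after one iteration."""
    # An iteration is stale if it found nothing new or the metric dropped.
    stalled = new_findings == 0
    if metric is not None and prev_metric is not None and metric < prev_metric:
        stalled = True

    updated = dict(progress)
    updated["iteration"] = progress.get("iteration", 0) + 1
    updated["total_findings"] = progress.get("total_findings", 0) + new_findings
    if stalled:
        updated["stale_count"] = progress.get("stale_count", 0) + 1
    else:
        # Assumption: a productive iteration clears the counter.
        updated["stale_count"] = 0
    return updated
```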
Paper‑Writing Skill
The paper‑writing skill decomposes a research paper into five sub‑processes: literature search, structure design, experiment planning, figure/table generation, and simulated peer review. Quality gates require passing checks on citations, PDF compilation, review score, and non‑regression of fixed issues before advancing to the next phase.
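One way to picture the quality gates is as a function that lists which checks are still failing before a phase may advance. The sketch below is illustrative only: the `GateInputs` fields, the gate names, and the `min_score` threshold are hypothetical, since the skill does not publish concrete values.

```python
from dataclasses import dataclass


@dataclass
class GateInputs:
    unverified_citations: int   # citations that failed the mechanical check
    pdf_compiles: bool          # whether the paper builds to a PDF
    review_score: float         # simulated peer-review score
    regressed_issues: int       # previously fixed issues that reappeared


def failed_gates(g: GateInputs, min_score: float) -> list[str]:
    """Return the names of gates that block advancing to the next phase."""
    failures = []
    if g.unverified_citations > 0:
        failures.append("citations")
    if not g.pdf_compiles:
        failures.append("pdf_compilation")
    if g.review_score < min_score:
        failures.append("review_score")
    if g.regressed_issues > 0:
        failures.append("non_regression")
    return failures
```

An empty list means every gate passed and the next phase can start.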
Self‑Play Paper (285B GRPO)
The fourth released paper, "Self‑Play in the Age of Foundation Models," investigates how the quality of the verification signal affects self‑play gains. Experiments on a 285B model trained with GRPO show that with a clean signal (ε=0) performance improves by 4.8 %, while with high noise (ε=0.45) it falls by 6.6 %. In the KL ablation, adding a KL penalty (KL=0.01) recovers a +0.8 % gain and raises the held‑out IMOAnswerBench score from 0.525 to 0.686, whereas removing the KL term leads to a 10.9 % drop.
Review scores progressed from 6 to 8, then to 8.5 after adding the 285B experiment, and finally to 8.6 after theoretical hardening. Scores are derived from an in‑framework multi‑persona simulated review and are comparable only within the same protocol.
Usage Example
---
name: Deli_AutoResearch
description: A protocol framework for long-horizon autonomous tasks. Targets three empirically‑observed failure modes — cognitive loops, stalling, runtime fragility — by prescribing state management, stall detection, and watchdog mechanisms. Validated on multiple task types including paper writing (4 ICLR‑format surveys, in‑framework self‑rating 8.0‑8.6/10).
type: Agent Framework
tags: autonomous, long-horizon, zero-interaction, anti-loop, heartbeat-watchdog, loop, multi-agent, unattended, orchestration
---
# Deli_AutoResearch
This skill is a protocol framework for long‑horizon autonomous tasks (days to weeks). It ships no executable code; instead it prescribes a set of battle‑tested conventions: how state is persisted, how stalls are detected, how guardians are layered, and what constraints bind agent behavior. Implementation details are left to the adopter's environment.
## 1. Motivation
Long‑running code agents exhibit three recurring failure modes:
1. Cognitive loop — successive iterations try similar directions with diminishing returns.
2. Stalling — the agent finishes a chunk, outputs a summary, and waits for user feedback.
3. Runtime fragility — context compaction silently breaks the loop.
The common cause is missing engineering scaffolding, not insufficient model capability.
## 2. Behavioral Constraints
1. Zero interaction — no prompting the user during a run.
2. Ready means execute — finishing preparation must lead to execution without asking for confirmation.
3. Callback means report‑alive — each callback updates its own `last_seen` and restarts if stale.
4. Persist state to files — all progress is written to `state/` files, never to conversation memory.
5. Guardian / worker separation — the watchdog only checks liveness, restarts, or nudges.
## 3. Architecture
┌── Orchestrator (current session / durable cron) ──┐
│ monitor state files → detect stalls → inject direction │
└────┬─────────────┬─────────────┬────────────┘
[Task A] [Task B] [Task C] ← each its own fresh session
Core decisions: separate execution from evaluation, start each iteration with a fresh session, enforce direction diversity.
## 4. State Files
{task}/state/
├── task_spec.md # goal / milestones / success criteria
├── progress.json # {iteration, total_findings, status, stale_count}
├── findings.jsonl # accumulated findings (append‑only)
├── directions_tried.json # directions already tried
└── iteration_log.jsonl # per‑iteration summary
## 5. Usage
1. Initialize the task directory, then write `task_spec.md` and an initial `progress.json`.
2. Start the orchestrator loop: `/loop 2h check all tasks …` – read progress, generate a fresh direction if `stale_count≥3`, launch a work agent, write results back.
3. Register a durable heartbeat watchdog that writes a timestamp hourly and restarts any task whose `last_seen` is older than three times the interval.
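
The liveness rule in step 3 can be sketched in Python as follows. This is not part of the skill, which ships no code; the function names and the representation of `last_seen` as a Unix timestamp are assumptions, and the actual restart mechanism is left to the adopter's environment.

```python
import time
from typing import Optional

HEARTBEAT_INTERVAL_S = 3600  # hourly heartbeat, per step 3
STALE_FACTOR = 3             # restart when older than three intervals


def is_stale(last_seen: float, now: Optional[float] = None,
             interval_s: float = HEARTBEAT_INTERVAL_S,
             factor: float = STALE_FACTOR) -> bool:
    """True if the last heartbeat is older than factor * interval_s seconds."""
    now = time.time() if now is None else now
    return (now - last_seen) > factor * interval_s


def tasks_to_restart(last_seen_by_task: dict,
                     now: Optional[float] = None) -> list:
    """Return the sorted names of tasks whose heartbeat has gone stale."""
    return sorted(name for name, ts in last_seen_by_task.items()
                  if is_stale(ts, now))
```

Passing `now` explicitly keeps the check deterministic in tests; a watchdog would call it without that argument.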
## 6. Stall Detection & Pivoting
| Mechanism | Rule |
|-----------|------|
| Stall detection | iteration with 0 new findings or metric drop → `stale_count + 1` |
| Forced pivot | `stale_count≥2` → change a structural constraint; `stale_count≥4` → flag for human attention |
| Direction diversity | new direction must differ from every tried one |
| Round cap | max 15 rounds or 30 min per work session |
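
As a rough illustration of the table, the pivot and diversity rules might be coded as below. The thresholds are parameters because the table uses `stale_count≥2` while the Usage section generates a fresh direction at `stale_count≥3`. The function names are hypothetical, and the diversity check is a deliberate simplification: it only rejects exact repeats after normalizing case and whitespace, whereas the skill does not define how "differ" is measured.

```python
def next_action(stale_count: int, pivot_at: int = 2,
                escalate_at: int = 4) -> str:
    """Map the stall counter to an orchestrator action."""
    if stale_count >= escalate_at:
        return "flag_for_human"
    if stale_count >= pivot_at:
        return "pivot"  # change a structural constraint, not a parameter
    return "continue"


def _normalize(direction: str) -> str:
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(direction.lower().split())


def is_new_direction(candidate: str, tried: list) -> bool:
    """True if the candidate differs from every previously tried direction."""
    return _normalize(candidate) not in {_normalize(t) for t in tried}
```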
## 7. Heartbeat Watchdog
Three layers (L0 resident shell guard, L1 durable cron, L2 business loop) monitor each other; any dead layer is revived by another.
## 8. Validation & Limits
| Paper | Pages | Citations | Self‑rated |
|-------|-------|-----------|-----------|
| Autonomous Research Agents | 59 | 228 | 8.0/10 |
| Continual Learning | 65 | 326 | 8.0/10 |
| Long‑Horizon Decision‑Making | 55 | 384 | 8.0/10 |
| Self‑Play (285B RL experiment + theory hardening) | 75 | 217 | 8.6/10 |
Limits: scores come from in‑framework simulated review; the longest continuous run was 72 h with six directional inputs; fabricated citations originate from the LLM and are mechanically checked.
Conclusion
The open‑source SKILL provides a reproducible, engineering‑first approach to autonomous research, exposing detailed metrics, quality‑gate checkpoints, and experimental results that can be directly fed to large language models or adapted to custom task domains.
Machine Learning Algorithms & Natural Language Processing
