How to Engineer Reliable Long‑Running AI Coding Tasks: Harnessing Agents for Scale
This article analyzes the challenges of using AI coding agents for large‑scale, long‑running tasks such as bulk file migration or code review, and presents a systematic engineering approach—including task decomposition, parallel execution, persistent progress files, resumable workflows, and multi‑level retry strategies—backed by concrete script examples and real‑world case studies.
AI coding agents excel at well‑defined, small‑scale tasks, but when faced with thousands of files the problems of context exhaustion, interruptions, and uncontrollable behavior become severe. Drawing from practical experience, this article proposes a "Harness Engineering" methodology to make long‑running tasks reliable.
Characteristics of Long‑Running Tasks
Long‑running tasks share three traits: they involve hundreds to thousands of files, exceed a single session’s time limit, and consume tens of millions of tokens.
Core Concerns
Effectiveness: will the task finish correctly?
Speed: how do we reduce total execution time?
Cost: how do we avoid wasteful token consumption?
Key Difficulties
Context exhaustion – the model’s context window is limited; as more files are processed the history grows, forcing compression that loses detail and leads to "context anxiety" where the agent prematurely declares completion.
Interruptions – network failures, token limits, or timeouts are common; without cross‑session memory a crash forces a full restart.
Uncontrollable behavior at scale – a failure on a single file can cascade, breaking the whole pipeline.
Core Principles
Task decomposition: split the large job into independent sub‑tasks that fit within a single session’s context.
Parallel execution: run many sub‑tasks concurrently to improve speed.
Resumable progress (File‑as‑Progress): persist each sub‑task’s state to disk so a new session can resume from the last checkpoint.
Completion criteria: define programmatic success checks for each sub‑task and enforce them before marking the task done.
Implementation Details
Task Granularity
Determine sub‑task size based on the model’s context window (e.g., Claude Sonnet ~200K tokens). A typical sub‑task includes ~1K tokens for the prompt, 30–60K for the source files, and 60–180K for the agent’s multi‑round reasoning, totaling 90–240K tokens. Adjust granularity by testing token usage against an 80% window threshold.
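For example, a pre‑dispatch budget check might look like the minimal sketch below; the 4‑characters‑per‑token ratio and the 2x reasoning multiplier are rough assumptions to calibrate against observed usage, not fixed constants.

// Rough pre-dispatch budget check (minimal sketch; ratios are assumptions).
const fs = require('fs');

const CONTEXT_WINDOW = 200_000;   // e.g., a Claude Sonnet-class model
const SAFE_FRACTION = 0.8;        // leave 20% headroom
const PROMPT_TOKENS = 1_000;
const REASONING_MULTIPLIER = 2;   // multi-round agent output ~ 2x the input

function estimateTokens(filePaths) {
  const chars = filePaths
    .map((p) => fs.readFileSync(p, 'utf8').length)
    .reduce((sum, n) => sum + n, 0);
  const inputTokens = Math.ceil(chars / 4);   // crude chars-per-token heuristic
  return PROMPT_TOKENS + inputTokens * (1 + REASONING_MULTIPLIER);
}

const fitsInWindow = (filePaths) =>
  estimateTokens(filePaths) <= CONTEXT_WINDOW * SAFE_FRACTION;

If a candidate file group fails the check, split it further until each group fits.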
CLI‑Based Sub‑Task Execution
Each sub‑task runs as an independent CLI process, launched by external scripts. This isolates context, ensures deterministic prompts, and allows precise control of concurrency.
Prompt generation is programmatic: a script such as build-prompt.js assembles the task description, constraints, input file list, output format, and verification criteria into a deterministic prompt.
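A minimal sketch of such a script; the section layout and the task‑file fields are assumptions, not a fixed format:

// build-prompt.js (minimal sketch): deterministic prompt assembly per sub-task.
const fs = require('fs');

function buildPrompt(task) {
  return [
    `## Task\n${task.description}`,
    `## Constraints\n${task.constraints.map((c) => `- ${c}`).join('\n')}`,
    `## Input files\n${task.files.join('\n')}`,
    `## Output\nWrite results to ${task.outputPath} as JSON.`,
    `## Verification\n${task.verification}`,
  ].join('\n\n');
}

// Usage: node build-prompt.js task-0042.json > prompts/task-0042.md
const task = JSON.parse(fs.readFileSync(process.argv[2], 'utf8'));
process.stdout.write(buildPrompt(task));

Because the same task file always yields the same prompt, failed sub‑tasks can be re‑dispatched with identical inputs.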
Dispatch and Poll Scripts
The orchestration consists of two scripts:
dispatch.js – prepares the first batch of sub‑tasks (creates Git worktrees, generates prompts, spawns agents) and records their status.
poll.js – repeatedly checks running tasks, records successes or failures, and fills empty slots with pending tasks until all are complete.
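A minimal sketch of a single poll.js pass follows; the task‑record fields (status, pid, promptPath, outputPath) and the agent-cli launcher are assumptions, stand‑ins for whatever agent CLI you actually run.

// poll.js (minimal sketch): one polling pass over the task list.
const fs = require('fs');
const { spawn } = require('child_process');

const listPath = process.argv[process.argv.indexOf('--task-list') + 1];
const tasks = JSON.parse(fs.readFileSync(listPath, 'utf8'));
const MAX_CONCURRENT = 4;

const isAlive = (pid) => {
  try { process.kill(pid, 0); return true; } catch { return false; }
};

function spawnAgent(task) {
  // 'agent-cli' is a placeholder for your real agent launcher.
  const child = spawn('agent-cli', ['--prompt-file', task.promptPath],
    { detached: true, stdio: 'ignore' });
  child.unref();   // let the agent outlive this polling pass
  return child.pid;
}

// 1. Reap finished agents: exited process + output file present => DONE.
for (const t of tasks) {
  if (t.status === 'IN_PROGRESS' && !isAlive(t.pid)) {
    t.status = fs.existsSync(t.outputPath) ? 'DONE' : 'FAILED';
  }
}

// 2. Fill free slots with pending tasks.
let running = tasks.filter((t) => t.status === 'IN_PROGRESS').length;
for (const t of tasks) {
  if (t.status === 'TODO' && running < MAX_CONCURRENT) {
    t.pid = spawnAgent(t);
    t.status = 'IN_PROGRESS';
    running++;
  }
}

fs.writeFileSync(listPath, JSON.stringify(tasks, null, 2));

// Exit code 2 tells the shell driver below that everything is finished.
const pending = tasks.some((t) => t.status === 'TODO' || t.status === 'IN_PROGRESS');
process.exit(pending ? 0 : 2);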
Example loop:
while true; do
node scripts/poll.js --task-list task_list.json
if [ $? -eq 2 ]; then break; fi
sleep 60
done
Progress Persistence
All state is written to files (TSV, JSON, or plain text). The state machine follows:
TODO → IN_PROGRESS → DONE
                   → FAILED
                   → SKIPPED
More granular states (ANALYZING, EXECUTING, VERIFYING, etc.) can be added when intermediate artifacts exist.
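A minimal sketch of enforcing these transitions when persisting state; the record fields are assumptions:

// Validate and persist a state transition (file-as-progress).
const fs = require('fs');

const VALID = {
  TODO: ['IN_PROGRESS'],
  IN_PROGRESS: ['DONE', 'FAILED', 'SKIPPED'],
};

function transition(listPath, taskId, next) {
  const tasks = JSON.parse(fs.readFileSync(listPath, 'utf8'));
  const task = tasks.find((t) => t.id === taskId);
  if (!(VALID[task.status] || []).includes(next)) {
    throw new Error(`illegal transition ${task.status} -> ${next} for ${taskId}`);
  }
  task.status = next;
  task.updatedAt = new Date().toISOString();   // audit trail for resumption
  fs.writeFileSync(listPath, JSON.stringify(tasks, null, 2));
}

Because every transition is written back to disk immediately, a crashed orchestrator can be restarted and simply re‑read the file to continue.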
Multi‑Round Retry Strategy
Inner layer: resume the same conversation after a crash.
Middle layer: feed error output into a new sub‑agent session for targeted fixes (e.g., re‑run tsc --strict and fix the reported type errors).
Outer layer: the main orchestrator decides whether to re‑dispatch permanently failed files based on cost vs. benefit; a sketch of this outer decision follows below.
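A minimal sketch of the outer‑layer decision, with the attempt limit and the cost heuristic as assumptions:

// Outer-layer retry decision (minimal sketch; thresholds are assumptions).
const MAX_ATTEMPTS = 3;

function redispatchFailed(tasks) {
  for (const t of tasks) {
    if (t.status !== 'FAILED') continue;
    const attempts = t.attempts || 0;
    // Retry while cheap; past the limit the token cost outweighs the benefit.
    if (attempts < MAX_ATTEMPTS) {
      t.status = 'TODO';          // re-enter the queue; poll.js picks it up
      t.attempts = attempts + 1;
      // t.lastError stays on the record so the middle layer can feed it
      // into the next sub-agent session for a targeted fix.
    } else {
      t.status = 'SKIPPED';       // permanently failed; flag for human review
    }
  }
}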
Real‑World Scenarios
Full‑Scale Code Review
21 front‑end modules are grouped by directory; each group becomes a sub‑task. The dispatch script launches up to N concurrent agents, each writing its review results to segments/{chunkId}.json. An independent Evaluator agent validates the subjective quality of the reviews.
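A minimal sketch of the grouping step; the path convention (src/<module>/…) and the chunk naming are assumptions:

// Group source files by top-level module directory into review chunks.
const path = require('path');

function groupByModule(files) {
  const groups = new Map();
  for (const f of files) {
    const dir = f.split(path.sep)[1] || 'root';   // assumes src/<module>/...
    if (!groups.has(dir)) groups.set(dir, []);
    groups.get(dir).push(f);
  }
  return [...groups.entries()].map(([dir, members], i) => ({
    chunkId: `${String(i).padStart(2, '0')}-${dir}`,
    files: members,
    outputPath: `segments/${String(i).padStart(2, '0')}-${dir}.json`,
  }));
}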
JS‑to‑TS Migration
All JavaScript/JSX files are migrated to TypeScript. Files are grouped by directory and limited to ~3,000 lines per group. Dependency analysis enforces a topological order so leaf modules are converted first. Completion is verified by AST comparison and successful tsc compilation before marking a sub‑task DONE.
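A minimal sketch of the ordering step; the dependency map (group to the groups it imports from) is assumed to come from an import scanner:

// Topologically order module groups so leaves convert first.
function topoOrder(deps) {
  const order = [];
  const state = new Map();   // 0 = unvisited, 1 = visiting, 2 = done

  function visit(node) {
    if (state.get(node) === 2) return;
    if (state.get(node) === 1) throw new Error(`import cycle at ${node}`);
    state.set(node, 1);
    for (const dep of deps[node] || []) visit(dep);
    state.set(node, 2);
    order.push(node);        // dependencies land before dependents
  }

  for (const node of Object.keys(deps)) visit(node);
  return order;              // leaves first => safe conversion order
}

// Example: utils has no dependencies, so it is migrated first.
console.log(topoOrder({ pages: ['components'], components: ['utils'], utils: [] }));
// -> [ 'utils', 'components', 'pages' ]

Converting leaves first means every group is typed against dependencies that are already TypeScript, so tsc can verify each sub‑task in isolation.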
Meta‑Skill Framework
A repository, long-term-task-orchestration (https://github.com/hixuanxuan/long-running-agent-tasks), provides a meta‑skill that generates a full skill skeleton for any repetitive large‑scale task. Given a description of the desired goal, the agent creates SKILL.md, script directories, reference phases, and status handling automatically, turning the engineering pattern itself into reusable tooling.
Conclusion
Harness Engineering bridges the gap between powerful LLM capabilities and reliable production workflows. By continuously reassessing which steps belong to the model and which to the surrounding framework, teams can maintain stable, measurable pipelines even as model abilities evolve.
