Make Agents Survive Crashes and Restarts: Building a Persistent Task Engine with Durable Execution
The article explains how durable execution, exemplified by Temporal’s Workflow and Activity model, transforms long‑running Agent tasks—such as refund approvals that involve human sign‑off, external APIs, and overnight processing—into recoverable, auditable pipelines that survive crashes, restarts, and timeouts.
1. Agents Face Long‑Running Task Chains
In production, agents often fail when a process restarts during a pending approval, an overnight refund, or a long API chain. Ordinary microservices handle a single request in milliseconds to seconds and release their context afterwards; agents may run for minutes, hours, or days. Their execution path can change based on model output, tool results, or human input, and they may wait for human approval, third-party async callbacks, or timed triggers. When a crash occurs, the cost is lost context, duplicate execution, missing audit records, and untraceable state. Avoiding a "run-from-scratch" fallback takes a runtime that treats the whole task chain as a managed, persisted object.
Execution duration: ordinary microservice – milliseconds to seconds; long-task agent – minutes, hours, or overnight.
Path stability: ordinary microservice – limited branches and fixed contracts; long-task agent – model output, tool results, and human input can rewrite the next step.
Waiting object: ordinary microservice – database, cache, or downstream service; long-task agent – human approval, async callback, or timer.
Crash cost: ordinary microservice – replay the request or compensate the transaction; long-task agent – lost context, duplicate work, and audit gaps.
2. Temporal’s Core Design Splits Orchestration and Side Effects
Temporal introduces two abstractions:
Workflow records progress, decides when to wait, and determines the next path based on received signals. It must be deterministic: given the same input and history, replay always yields the same commands.
Activity performs unreliable work—model calls, payment APIs, database writes, message dispatches, etc.—and handles timeouts, retries, idempotency, and compensation.
Workflow responsibilities: orchestration, waiting, branching, state advancement, and dialogue with external events (e.g., refund state machine, approval wait, timeout branch, replay/reset criteria).
Activity responsibilities: all actions that may fail or have side effects (e.g., LLM call, order read, send approval card, execute refund, sync CRM).
A practical rule: code that decides the next step belongs in a Workflow; code that actually does work and may need retries belongs in an Activity.
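The split can be illustrated with a toy model of replay. This is a minimal sketch, not Temporal's API: `ToyRuntime`, its `history` list, and the two stand-in activities are all hypothetical names invented here. It assumes the runtime persists each Activity result, so that after a crash the deterministic orchestration code can be re-run and completed steps are read from history instead of being executed again.

```python
from typing import Any, Callable, List, Optional


class ToyRuntime:
    """Toy stand-in for a durable runtime. This is not Temporal's API."""

    def __init__(self, history: Optional[List[Any]] = None) -> None:
        self.history: List[Any] = list(history or [])  # persisted Activity results
        self._cursor = 0

    def execute_activity(self, fn: Callable[..., Any], *args: Any) -> Any:
        if self._cursor < len(self.history):
            result = self.history[self._cursor]  # replay: reuse the recorded result
        else:
            result = fn(*args)                   # first execution: do the real work
            self.history.append(result)
        self._cursor += 1
        return result


calls: List[str] = []          # records which side effects really ran
state = {"crash_refund": True}  # makes the first refund attempt fail


def load_order(refund_id: str) -> dict:
    calls.append("load_order")
    return {"refund_id": refund_id, "amount": 120}


def issue_refund(order: dict) -> str:
    if state["crash_refund"]:
        state["crash_refund"] = False
        raise RuntimeError("worker crashed before the refund was sent")
    calls.append("issue_refund")
    return f"refunded:{order['refund_id']}"


def refund_workflow(rt: ToyRuntime, refund_id: str) -> str:
    # Orchestration only: decides the order of steps and performs no I/O itself.
    order = rt.execute_activity(load_order, refund_id)
    return rt.execute_activity(issue_refund, order)


first = ToyRuntime()
try:
    refund_workflow(first, "r-1")
except RuntimeError:
    pass  # simulated crash; first.history still holds the load_order result

# Simulated restart: a fresh runtime is handed the persisted history.
recovered = ToyRuntime(history=first.history)
result = refund_workflow(recovered, "r-1")
```

After the simulated restart, `load_order` is not called a second time; only the step that never completed runs. This works only because `refund_workflow` issues the same calls in the same order on every run, which is the reason the Workflow side has to stay deterministic.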
3. Refactoring a Refund‑Approval Agent into Durable Execution
Typical failure modes of a naïve synchronous refund agent include lost tasks when the process restarts during approval waiting, and duplicate refunds when the task is replayed without preserved context.
After refactoring, the agent is split into five steps:
Workflow starts and invokes an Activity to load the order, after‑sale record, and risk features.
An Activity runs a model or rule check, producing a decision: auto‑approve, auto‑reject, or require human review.
If human review is needed, the Workflow sends an approval card and waits; the approver’s action is written back via a Signal or Update to the same Workflow.
Upon approval, an Activity calls the financial refund API using a business idempotency key.
Finally, Activities sync the CRM, notify customer service, and write audit records, completing the chain.
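from datetime import timedelta

from temporalio import workflow

# Assumption: the Activities named in this article live in a module called activities.
with workflow.unsafe.imports_passed_through():
    from activities import (issue_refund, load_order, score_refund_risk,
                            send_approval_card, sync_crm)

# Illustrative value; the SDK requires a timeout on every Activity call.
TIMEOUT = {"start_to_close_timeout": timedelta(seconds=30)}

@workflow.defn
class RefundWorkflow:
    def __init__(self) -> None:
        self.approval = None  # set by the approve Signal

    @workflow.signal
    def approve(self, decision: str) -> None:
        self.approval = decision

    @workflow.run
    async def run(self, refund_id: str) -> str:
        order = await workflow.execute_activity(load_order, refund_id, **TIMEOUT)
        risk = await workflow.execute_activity(score_refund_risk, order, **TIMEOUT)
        if risk["needs_manual_review"]:
            # Multiple Activity arguments are passed through args=[...].
            await workflow.execute_activity(send_approval_card, args=[order, risk], **TIMEOUT)
            # Durable wait: the Workflow parks here until the Signal arrives.
            await workflow.wait_condition(lambda: self.approval is not None)
            if self.approval == "reject":
                return "rejected"
        # The second argument is the business idempotency key.
        await workflow.execute_activity(issue_refund, args=[refund_id, refund_id], **TIMEOUT)
        await workflow.execute_activity(sync_crm, refund_id, **TIMEOUT)
        return "done"

The wait condition allows the task to remain alive for hours or days, survive deployments, and recover after failures without losing the approval state.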
4. Human Approval, Minute‑Level APIs, and Overnight Batches Belong to the Same Workflow
Once the boundary between Workflow and Activity is clear, seemingly different scenarios collapse into the same pattern: they are all "the task is not finished yet, but don’t lose it."
Human approval
Workflow: persist state, wait for Signal/Update, handle timeout and fallback.
Activity: send approval card, send messages, record approval.
Minute‑level API chain
Workflow: control ordering, failure branches, compensation, concurrency limits.
Activity: call risk service, payment gateway, logistics, CRM, etc.
Overnight batch
Workflow: persist batch context, schedule wake‑up, continue subsequent steps.
Activity: run bulk queries, generate reports, write to DB, send notifications.
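The "wait for Signal/Update, handle timeout and fallback" responsibility can be sketched with plain asyncio. This is an in-process illustration only: `ApprovalState`, the `"escalate"` fallback value, and the timings are hypothetical, and nothing here is persisted. The Temporal Python SDK exposes a comparable pattern by passing a timeout to `workflow.wait_condition`; the exact behavior is worth checking against the SDK reference.

```python
import asyncio
from typing import Optional, Tuple


class ApprovalState:
    """Holds the approver's decision and wakes up whoever is waiting for it."""

    def __init__(self) -> None:
        self.decision: Optional[str] = None
        self._changed = asyncio.Event()

    def signal(self, decision: str) -> None:
        # Plays the role of the Signal/Update handler.
        self.decision = decision
        self._changed.set()

    async def wait_for_decision(self, timeout_s: float) -> str:
        try:
            await asyncio.wait_for(self._changed.wait(), timeout=timeout_s)
        except asyncio.TimeoutError:
            return "escalate"  # fallback branch when nobody answers in time
        return str(self.decision)


async def demo() -> Tuple[str, str]:
    # Case 1: the approver responds before the deadline.
    answered = ApprovalState()
    asyncio.get_running_loop().call_later(0.01, answered.signal, "approve")
    on_time = await answered.wait_for_decision(timeout_s=1.0)

    # Case 2: nobody responds, so the fallback branch is taken.
    ignored = ApprovalState()
    timed_out = await ignored.wait_for_decision(timeout_s=0.05)
    return on_time, timed_out


results = asyncio.run(demo())
```

In a durable runtime the same wait and the same deadline would survive a process restart, which an in-memory `asyncio.Event` cannot do.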
5. Time‑Travel Debugging with Event History
When a long‑task Agent misbehaves, ordinary logs only show isolated snapshots. Temporal records every Workflow command and state transition in an Event History, enabling full reconstruction of the execution timeline.
A practical debugging path:
Locate the problematic event node in the Web UI History view.
Inspect inputs, pending Activities, and call stack around that node to understand what the Workflow was waiting for.
Download the Event History JSON and replay it locally to verify whether nondeterminism or an external Activity caused the issue.
When fixing, send an Update, Signal, or Reset from the UI or operations endpoint to resume the task from the appropriate point without discarding the whole chain.
Replay of any historical node requires two guarantees: the runtime retains a complete event history, and the Workflow remains deterministic.
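Why both guarantees matter can be shown with a toy replay check. The sketch below assumes a simplified, hypothetical history format (a JSON list of `command`/`result` pairs), which is not Temporal's real export format, and `ToyReplayContext` is an invented name. The idea is that replay feeds recorded results back into the Workflow code and fails as soon as the code asks for a step the history does not contain at that position.

```python
import json
from typing import Callable, Iterator, List

# Hypothetical, simplified history; a real Event History export looks different.
recorded_history = json.loads("""
[
  {"command": "load_order", "result": {"amount": 120}},
  {"command": "score_refund_risk", "result": {"needs_manual_review": false}},
  {"command": "issue_refund", "result": "ok"}
]
""")


class NonDeterminismError(Exception):
    """Raised when the code diverges from the recorded history."""


class ToyReplayContext:
    def __init__(self, events: List[dict]) -> None:
        self._events: Iterator[dict] = iter(events)

    def activity(self, name: str):
        event = next(self._events, None)
        if event is None or event["command"] != name:
            recorded = event["command"] if event else "end of history"
            raise NonDeterminismError(
                f"code asked for {name!r}, history has {recorded!r}")
        return event["result"]  # recorded result, nothing is re-executed


def workflow_v1(ctx: ToyReplayContext) -> str:
    ctx.activity("load_order")
    risk = ctx.activity("score_refund_risk")
    if not risk["needs_manual_review"]:
        ctx.activity("issue_refund")
    return "done"


def workflow_v2(ctx: ToyReplayContext) -> str:
    # A later code change inserted a step that old histories never recorded.
    ctx.activity("load_order")
    ctx.activity("check_blacklist")
    risk = ctx.activity("score_refund_risk")
    if not risk["needs_manual_review"]:
        ctx.activity("issue_refund")
    return "done"


def replays_cleanly(workflow_fn: Callable, events: List[dict]) -> bool:
    try:
        workflow_fn(ToyReplayContext(events))
        return True
    except NonDeterminismError:
        return False


v1_ok = replays_cleanly(workflow_v1, recorded_history)
v2_ok = replays_cleanly(workflow_v2, recorded_history)
```

Here `workflow_v1` replays against the recorded history while `workflow_v2` diverges at the second step, which is the kind of mismatch a local replay of a downloaded history is meant to surface before a fix is deployed.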
6. Four Warning Signals That Indicate the Need for Durable Execution
Warning sign: embedding the entire Agent chain in a single HTTP request or background thread. More appropriate: the entry point only creates a Workflow and returns a task ID; the persistent runtime executes the rest.
Warning sign: mixing LLM calls, approval waits, and refund actions into one Activity. More appropriate: the Workflow handles orchestration and waiting; each side effect lives in its own Activity with isolated failure semantics.
Warning sign: using a single approval_status column in a database and polling to assemble the flow. More appropriate: write human actions directly to the running Workflow via Signal or Update, keeping state and execution chain unified.
Warning sign: treating refund, coupon issuance, and shipping writes as ordinary retryable endpoints. More appropriate: design Activities with business idempotency keys, timeouts, compensation, and audit records to avoid duplicate side effects on recovery.
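The last point, business idempotency keys, can be sketched as follows. `RefundLedger` is a hypothetical in-memory stand-in for whatever store the finance system uses, and the key format `refund:r-1` is an assumption; the point is only that a retried Activity presents the same key and gets the original receipt back instead of moving money twice.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class RefundLedger:
    """In-memory stand-in for the finance system's idempotency store."""

    processed: Dict[str, str] = field(default_factory=dict)
    money_moved: int = 0

    def issue_refund(self, idempotency_key: str, amount: int) -> str:
        if idempotency_key in self.processed:
            # Retry or replay: return the original receipt, move no money.
            return self.processed[idempotency_key]
        self.money_moved += amount
        receipt = f"receipt-{len(self.processed) + 1}"
        self.processed[idempotency_key] = receipt
        return receipt


ledger = RefundLedger()
first_receipt = ledger.issue_refund("refund:r-1", 120)
retry_receipt = ledger.issue_refund("refund:r-1", 120)  # e.g. after a timeout
```

A production version would need the key check and the money movement to be atomic on the provider side, which this in-memory sketch does not attempt.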
7. Comprehensive System Diagram of a Persistent Refund‑Approval Agent
The architecture places the entry service, Workflow, Activities, human‑approval interface, financial system, and Event History together. The entry service only hands off a task ID; the Workflow persists the task’s life; Activities perform external actions; human approvals write back via Signals/Updates; and the Event History supports time‑travel debugging, replay, and audit.
When an Agent spans approvals, external systems, and overnight processing, it is a task chain that must be persisted. The Workflow stores the task's life, Activities execute side effects, human actions feed back into the running Workflow, and the Event History enables debugging and recovery. Durable Execution thus provides a reliability foundation that matches the business value of long-running Agents.
