Make Agents Survive Crashes and Restarts: Building a Persistent Task Engine with Durable Execution

The article explains how durable execution, exemplified by Temporal’s Workflow and Activity model, transforms long‑running Agent tasks—such as refund approvals that involve human sign‑off, external APIs, and overnight processing—into recoverable, auditable pipelines that survive crashes, restarts, and timeouts.

AI Step-by-Step

1. Agents Face Long‑Running Task Chains

In production, agents often fail when a process restarts during a pending approval, an overnight refund, or a long API chain. Ordinary microservices handle a single request in milliseconds to seconds and release their context once the request completes; agents may run for minutes, hours, or days. Their execution path can change based on model output, tool results, or human input, and they may wait for human approval, third-party async callbacks, or timed triggers. When a crash occurs, the cost is lost context, duplicate execution, missing audit records, and untraceable state. Avoiding a "run-from-scratch" fallback requires a runtime that treats the whole task chain as a managed, persisted object.

| Dimension | Ordinary microservice | Long-running agent |
| --- | --- | --- |
| Execution duration | Milliseconds to seconds | Minutes, hours, or overnight |
| Path stability | Limited branches, fixed contracts | Model output, tool results, and human input rewrite the next step |
| Waiting object | Database, cache, downstream service | Human approval, async callback, timer |
| Crash cost | Replay the request or compensate the transaction | Lost context, duplicate work, audit gaps |

2. Temporal’s Core Design Splits Orchestration and Side Effects

Temporal introduces two abstractions:

Workflow records progress, decides when to wait, and determines the next path based on received signals. It must be deterministic: given the same input and history, replay always yields the same commands.

Activity performs unreliable work—model calls, payment APIs, database writes, message dispatches, etc.—and handles timeouts, retries, idempotency, and compensation.

Workflow responsibilities: orchestration, waiting, branching, state advancement, and dialogue with external events. Examples: the refund state machine, the approval wait, the timeout branch, and replay/reset criteria.

Activity responsibilities: all actions that may fail or have side effects. Examples: LLM call, order read, sending the approval card, executing the refund, syncing the CRM.

A practical rule: code that decides the next step belongs in a Workflow; code that actually does work and may need retries belongs in an Activity.
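To make the split concrete, here is a minimal, library-free sketch of the idea behind replay. It is a toy model, not Temporal's implementation: the generator `refund_flow`, the `run` driver, and the activity names are all hypothetical. The flow only yields commands; the driver executes activities, appends each result to a history list, and on a later run feeds recorded results back instead of calling the activity again.

```python
from typing import Any, Callable, Dict, Generator, List, Tuple

Command = Tuple[str, tuple]  # (activity name, positional arguments)


def refund_flow(refund_id: str) -> Generator[Command, Any, str]:
    """Orchestration only: decides the next step and performs no I/O."""
    order = yield ("load_order", (refund_id,))
    risk = yield ("score_risk", (order,))
    if risk == "high":
        return "needs_review"
    yield ("issue_refund", (refund_id,))
    return "done"


def run(flow: Generator[Command, Any, str],
        activities: Dict[str, Callable[..., Any]],
        history: List[Tuple[str, Any]]) -> str:
    """Drive a flow. Steps already in `history` are replayed, not re-executed."""
    step = 0
    try:
        command = next(flow)
        while True:
            name, args = command
            if step < len(history):
                result = history[step][1]         # replay the recorded result
            else:
                result = activities[name](*args)  # real side effect
                history.append((name, result))    # record it before moving on
            step += 1
            command = flow.send(result)
    except StopIteration as stop:
        return stop.value


calls: List[str] = []  # tracks which activities really executed


def make_activities(refund_api_up: bool) -> Dict[str, Callable[..., Any]]:
    def load_order(refund_id: str) -> dict:
        calls.append("load_order")
        return {"refund_id": refund_id, "amount": 20}

    def score_risk(order: dict) -> str:
        calls.append("score_risk")
        return "low"

    def issue_refund(refund_id: str) -> str:
        if not refund_api_up:
            raise ConnectionError("refund API unavailable")
        calls.append("issue_refund")
        return "refunded"

    return {"load_order": load_order, "score_risk": score_risk,
            "issue_refund": issue_refund}


history: List[Tuple[str, Any]] = []
try:
    run(refund_flow("r-1"), make_activities(refund_api_up=False), history)
except ConnectionError:
    pass  # simulated crash; history still holds the two completed steps

# "Restart": a fresh flow object, the same history, and a working refund API.
result = run(refund_flow("r-1"), make_activities(refund_api_up=True), history)
```

After the restart, `load_order` and `score_risk` are not executed a second time; only `issue_refund` runs. This works only because `refund_flow` is deterministic: given the same recorded results, it asks for the same steps in the same order.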

3. Refactoring a Refund‑Approval Agent into Durable Execution

Typical failure modes of a naïve synchronous refund agent include lost tasks when the process restarts during approval waiting, and duplicate refunds when the task is replayed without preserved context.

After refactoring, the agent is split into five steps:

Workflow starts and invokes an Activity to load the order, after‑sale record, and risk features.

An Activity runs a model or rule check, producing a decision: auto‑approve, auto‑reject, or require human review.

If human review is needed, the Workflow sends an approval card and waits; the approver’s action is written back via a Signal or Update to the same Workflow.

Upon approval, an Activity calls the financial refund API using a business idempotency key.

Finally, Activities sync the CRM, notify customer service, and write audit records, completing the chain.

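from datetime import timedelta
from typing import Optional

from temporalio import workflow

# Assumption: the activity functions live in a local module named `activities`.
with workflow.unsafe.imports_passed_through():
    from activities import (
        issue_refund, load_order, score_refund_risk, send_approval_card, sync_crm,
    )

# Temporal requires an activity timeout; 30 seconds is an illustrative value.
TIMEOUT = timedelta(seconds=30)

@workflow.defn
class RefundWorkflow:
    def __init__(self) -> None:
        self.approval: Optional[str] = None

    @workflow.signal
    def approve(self, decision: str) -> None:
        self.approval = decision

    @workflow.run
    async def run(self, refund_id: str) -> str:
        order = await workflow.execute_activity(
            load_order, refund_id, start_to_close_timeout=TIMEOUT)
        risk = await workflow.execute_activity(
            score_refund_risk, order, start_to_close_timeout=TIMEOUT)

        if risk["needs_manual_review"]:
            # Multiple activity arguments go through `args`.
            await workflow.execute_activity(
                send_approval_card, args=[order, risk], start_to_close_timeout=TIMEOUT)
            # Durable wait: resumes when the `approve` signal sets the decision.
            await workflow.wait_condition(lambda: self.approval is not None)

        if self.approval == "reject":
            return "rejected"

        # The refund ID doubles as the business idempotency key (second argument).
        await workflow.execute_activity(
            issue_refund, args=[refund_id, refund_id], start_to_close_timeout=TIMEOUT)
        await workflow.execute_activity(
            sync_crm, refund_id, start_to_close_timeout=TIMEOUT)
        return "done"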

The wait condition allows the task to remain alive for hours or days, survive deployments, and recover after failures without losing the approval state.

4. Human Approval, Minute‑Level APIs, and Overnight Batches Belong to the Same Workflow

Once the boundary between Workflow and Activity is clear, seemingly different scenarios collapse into the same pattern: they are all "the task is not finished yet, but don’t lose it."

Human approval

Workflow: persist state, wait for Signal/Update, handle timeout and fallback.

Activity: send approval card, send messages, record approval.

Minute‑level API chain

Workflow: control ordering, failure branches, compensation, concurrency limits.

Activity: call risk service, payment gateway, logistics, CRM, etc.

Overnight batch

Workflow: persist batch context, schedule wake‑up, continue subsequent steps.

Activity: run bulk queries, generate reports, write to DB, send notifications.
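The "wait for Signal/Update, handle timeout and fallback" responsibility in the human-approval case can be illustrated with plain `asyncio`. This is an in-memory analogy only: nothing here is persisted, and `ApprovalWait`, `signal`, and the `"escalate"` fallback are hypothetical names. In the Temporal Python SDK the comparable building block is `workflow.wait_condition`, which accepts a `timeout` argument.

```python
import asyncio
from typing import Optional, Tuple


class ApprovalWait:
    """Holds one pending decision, loosely mirroring a signal handler plus a wait."""

    def __init__(self) -> None:
        self.decision: Optional[str] = None
        self._arrived = asyncio.Event()

    def signal(self, decision: str) -> None:
        """Called by the approval UI when a human acts."""
        self.decision = decision
        self._arrived.set()

    async def wait(self, timeout_s: float, fallback: str = "escalate") -> str:
        """Return the human decision, or the fallback if nobody answers in time."""
        try:
            await asyncio.wait_for(self._arrived.wait(), timeout=timeout_s)
        except asyncio.TimeoutError:
            return fallback
        assert self.decision is not None  # set before the event fires
        return self.decision


async def demo() -> Tuple[str, str]:
    answered = ApprovalWait()
    # Simulate an approver responding shortly after the card is sent.
    asyncio.get_running_loop().call_later(0.01, answered.signal, "approve")
    first = await answered.wait(timeout_s=1.0)

    ignored = ApprovalWait()
    second = await ignored.wait(timeout_s=0.05)  # nobody responds
    return first, second


outcomes = asyncio.run(demo())
```

The first wait returns the approver's decision; the second falls through to the fallback branch. In a durable runtime the same two branches exist, but the pending wait also survives a process restart.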

5. Time‑Travel Debugging with Event History

When a long‑task Agent misbehaves, ordinary logs only show isolated snapshots. Temporal records every Workflow command and state transition in an Event History, enabling full reconstruction of the execution timeline.

A practical debugging path:

Locate the problematic event node in the Web UI History view.

Inspect inputs, pending Activities, and call stack around that node to understand what the Workflow was waiting for.

Download the Event History JSON and replay it locally to verify whether nondeterminism or an external Activity caused the issue.

When fixing, send an Update, Signal, or Reset from the UI or operations endpoint to resume the task from the appropriate point without discarding the whole chain.

Replay of any historical node requires two guarantees: the runtime retains a complete event history, and the Workflow remains deterministic.
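The determinism guarantee can be shown with a small, library-free sketch. It is a toy check, not Temporal's replayer: `first_mismatch`, `flow_v1`, and `flow_v2` are hypothetical. A history recorded by the old code is fed into both versions; the version that inserts a new step emits a command sequence that no longer matches the record, which is the kind of divergence a local replay is meant to surface.

```python
from typing import Any, Callable, Generator, List, Optional, Tuple

History = List[Tuple[str, Any]]  # recorded (command, result) pairs
Flow = Generator[str, Any, str]


def flow_v1(refund_id: str) -> Flow:
    yield "load_order"
    yield "issue_refund"
    return "done"


def flow_v2(refund_id: str) -> Flow:
    yield "load_order"
    yield "check_fraud"  # new step added after the history was recorded
    yield "issue_refund"
    return "done"


def first_mismatch(flow_fn: Callable[[str], Flow], refund_id: str,
                   history: History) -> Optional[int]:
    """Return the index of the first history entry the code disagrees with, or None."""
    flow = flow_fn(refund_id)
    command = next(flow)  # assumes the flow emits at least one command
    for index, (recorded_command, recorded_result) in enumerate(history):
        if command != recorded_command:
            return index
        try:
            command = flow.send(recorded_result)
        except StopIteration:
            # The flow finished; that is only consistent at the end of the history.
            return None if index == len(history) - 1 else index + 1
    return None  # the history is a prefix of what the code would do


# A history as the old code (flow_v1) would have recorded it.
recorded: History = [("load_order", {"amount": 20}), ("issue_refund", "refunded")]

v1_mismatch = first_mismatch(flow_v1, "r-1", recorded)
v2_mismatch = first_mismatch(flow_v2, "r-1", recorded)
```

Here `flow_v1` replays cleanly, while `flow_v2` diverges at the second recorded event because it asks for `check_fraud` where the history holds `issue_refund`.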

6. Four Warning Signals That Indicate the Need for Durable Execution

Embedding the entire Agent chain in a single HTTP request or background thread. More appropriate: the entry point only creates a Workflow and returns a task ID; the persistent runtime executes the rest.

Mixing LLM calls, approval waits, and refund actions into one Activity. More appropriate: the Workflow handles orchestration and waiting; each side effect lives in its own Activity with isolated failure semantics.

Using a single approval_status column in a database and polling it to assemble the flow. More appropriate: write human actions directly to the running Workflow via Signal or Update, keeping state and execution chain unified.

Treating refund, coupon issuance, and shipping writes as ordinary retryable endpoints. More appropriate: design Activities with business idempotency keys, timeouts, compensation, and audit records to avoid duplicate side effects on recovery.
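The idempotency point in the last item can be sketched with the standard library. This assumes a hypothetical `refunds` table keyed by a business idempotency key and a stand-in `fake_payment_api` function; none of these names come from a real payment SDK. A retried or replayed call with the same key returns the stored result instead of paying out twice.

```python
import sqlite3
from typing import Callable, List, Tuple


def init_store(conn: sqlite3.Connection) -> None:
    conn.execute(
        "CREATE TABLE IF NOT EXISTS refunds ("
        "idempotency_key TEXT PRIMARY KEY, "
        "amount_cents INTEGER NOT NULL, "
        "status TEXT NOT NULL)"
    )


def issue_refund(conn: sqlite3.Connection, idempotency_key: str, amount_cents: int,
                 call_payment_api: Callable[[str, int], str]) -> str:
    """Issue a refund at most once per idempotency key."""
    row = conn.execute(
        "SELECT status FROM refunds WHERE idempotency_key = ?",
        (idempotency_key,),
    ).fetchone()
    if row is not None:
        return row[0]  # already processed: return the recorded outcome

    # The key is also forwarded to the provider. A crash between this call and
    # the INSERT below is only safe if the provider deduplicates on the key too.
    status = call_payment_api(idempotency_key, amount_cents)
    conn.execute(
        "INSERT INTO refunds (idempotency_key, amount_cents, status) VALUES (?, ?, ?)",
        (idempotency_key, amount_cents, status),
    )
    conn.commit()
    return status


payouts: List[Tuple[str, int]] = []  # records every real payout attempt


def fake_payment_api(key: str, amount_cents: int) -> str:
    payouts.append((key, amount_cents))
    return "refunded"


conn = sqlite3.connect(":memory:")
init_store(conn)
first = issue_refund(conn, "refund:r-1", 2000, fake_payment_api)
second = issue_refund(conn, "refund:r-1", 2000, fake_payment_api)  # retry after a timeout
```

Both calls report the same status, but the payment function is invoked only once. The local record alone does not close every failure window, which is why the key is passed through to the provider as well.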

7. Comprehensive System Diagram of a Persistent Refund‑Approval Agent

The architecture places the entry service, Workflow, Activities, human‑approval interface, financial system, and Event History together. The entry service only hands off a task ID; the Workflow persists the task’s life; Activities perform external actions; human approvals write back via Signals/Updates; and the Event History supports time‑travel debugging, replay, and audit.

When an Agent spans approvals, external systems, and overnight processing, it is a task chain that must be persisted. The Workflow stores the task's life, Activities execute side effects, human actions feed back into the running Workflow, and the Event History enables debugging and recovery. Durable Execution thus provides a reliability foundation that matches the business value of long-running Agents.
