When an Agent Fails: Retry, Fallback, and Human Takeover Strategies
The article classifies agent failures into transient, structural, and semantic types, compares how Claude Code, OpenAI Codex, and Google Gemini CLI agents handle errors, and shows how LangGraph implements robust retry policies, fallback routing, and human‑in‑the‑loop handoff with concrete code examples and best‑practice guidelines.
Failure Classification: Three Types
Before writing code, classify failures into three categories, each with a distinct handling strategy.
Transient – API rate‑limit (429), network timeout, service outage. Self‑healing is possible. Strategy: exponential backoff retry.
Structural – tool schema mismatch, missing endpoint, insufficient permissions. Not self‑healing. Strategy: route to a fallback node.
Semantic – low confidence, large answer deviation, sensitive operation awaiting approval. Self‑healing uncertain. Strategy: human takeover.
After catching an error inside a node, convert it to a state message with an error‑type tag (e.g., [ERROR:RATE_LIMIT], [ERROR:SCHEMA]) and let the routing function dispatch based on the tag.
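The tag-and-dispatch idea can be sketched without any framework. This is a minimal illustration, not code from the article: only [ERROR:RATE_LIMIT] and [ERROR:SCHEMA] appear in the text, so the other tags, the NodeError shape, and the choice to send unknown errors to a human are all assumptions.

```typescript
// Hypothetical tag set; only RATE_LIMIT and SCHEMA are named in the article.
type ErrorTag = "RATE_LIMIT" | "TIMEOUT" | "SCHEMA" | "PERMISSION" | "UNKNOWN";
type Route = "success" | "retry" | "fallback" | "human";

// Assumed minimal error shape: an optional HTTP status plus a message.
interface NodeError {
  status?: number;
  message: string;
}

// Map a caught error onto one tag.
function classifyError(err: NodeError): ErrorTag {
  if (err.status === 429) return "RATE_LIMIT";
  if (err.status === 401 || err.status === 403) return "PERMISSION";
  if (/timeout|ETIMEDOUT/i.test(err.message)) return "TIMEOUT";
  if (/schema|validation/i.test(err.message)) return "SCHEMA";
  return "UNKNOWN";
}

// Convert the error into the tagged state message the router will read.
function toStateMessage(err: NodeError): string {
  return `[ERROR:${classifyError(err)}] ${err.message}`;
}

// Dispatch on the tag: transient -> retry, structural -> fallback.
// Sending anything unrecognized to a human is an assumption of this sketch.
function routeByTag(message: string): Route {
  const match = /^\[ERROR:([A-Z_]+)\]/.exec(message);
  if (!match) return "success";
  switch (match[1]) {
    case "RATE_LIMIT":
    case "TIMEOUT":
      return "retry";
    case "SCHEMA":
    case "PERMISSION":
      return "fallback";
    default:
      return "human";
  }
}
```

Keeping classification and routing as separate pure functions means the router never inspects raw exceptions, only the tag written into state.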
How Mainstream CLI Agents Handle Errors
Three CLI agents were examined for how they handle errors.
Claude Code (files src/services/api/errors.ts and QueryEngine.ts) – fine‑grained error categorization via categorizeRetryableAPIError(). SDK automatically retries and exposes retryAttempt, maxRetries, retryInMs for UI feedback. Supports a configurable fallbackModel that switches to a backup model when the primary overloads.
OpenAI Codex CLI (file codex-rs/core/src/codex.rs) – core function try_run_turn implements per‑provider retry budget and exponential backoff min(2^attempt, 30s). Real‑time retry status is pushed via notify_stream_error. No built‑in fallback; exhaustion raises an error.
Google Gemini CLI (directory packages/core/src/api/) – respects the Retry-After header for 429, uses exponential backoff with jitter for 5xx, and never retries other 4xx errors. Includes a circuit‑breaker concept.
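For readers unfamiliar with the circuit-breaker concept mentioned above, here is a generic sketch of the pattern. It is not Gemini CLI's implementation; the class name, thresholds, and injectable clock are assumptions made for illustration.

```typescript
type BreakerState = "closed" | "open" | "half-open";

// Generic circuit breaker: after too many consecutive failures, stop calling
// the provider until a cooldown has passed, then let probe requests through.
class CircuitBreaker {
  private failures = 0;
  private openedAt = 0;
  private state: BreakerState = "closed";
  private readonly failureThreshold: number;
  private readonly cooldownMs: number;
  private readonly now: () => number;

  constructor(failureThreshold: number, cooldownMs: number, now: () => number = () => Date.now()) {
    this.failureThreshold = failureThreshold;
    this.cooldownMs = cooldownMs;
    this.now = now; // injectable clock so the behavior can be tested deterministically
  }

  // True if a request may be sent right now.
  canRequest(): boolean {
    if (this.state === "open" && this.now() - this.openedAt >= this.cooldownMs) {
      this.state = "half-open"; // cooldown over: allow requests until a result is recorded
    }
    return this.state !== "open";
  }

  recordSuccess(): void {
    this.failures = 0;
    this.state = "closed";
  }

  recordFailure(): void {
    this.failures += 1;
    // A failed probe, or reaching the threshold, (re)opens the breaker.
    if (this.state === "half-open" || this.failures >= this.failureThreshold) {
      this.state = "open";
      this.openedAt = this.now();
    }
  }

  current(): BreakerState {
    return this.state;
  }
}
```

A breaker complements retries: backoff spaces out attempts within one request, while the breaker stops new requests from piling onto a provider that is already failing.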
Design Conclusions – From Survey to LangGraph Implementation
Never retry 4xx errors other than 429 (rate limit).
Prefer the Retry-After header when present.
Make fallback a configuration, not hard‑coded logic.
Expose retry status to the user.
Reset per‑provider retry counters when switching providers.
LangGraph extends these ideas to graph‑level error handling, enabling complex degradation paths and human handoff.
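Several of these conclusions fit into a few small helpers. The sketch below is framework-independent and every name in it (isRetryableStatus, parseRetryAfterMs, chooseDelayMs, RetryBudget) is hypothetical. It assumes the Retry-After value arrives either as a number of seconds or as an HTTP date.

```typescript
// Retry 429 and 5xx; treat every other status as not worth retrying.
function isRetryableStatus(status: number): boolean {
  if (status === 429) return true;
  return status >= 500 && status <= 599;
}

// Parse a Retry-After header (delta-seconds or HTTP-date) into milliseconds
// measured from nowMs. Returns undefined if the header is absent or unparseable.
function parseRetryAfterMs(header: string | undefined, nowMs: number): number | undefined {
  if (header === undefined) return undefined;
  const value = header.trim();
  if (/^\d+$/.test(value)) return Number(value) * 1000;
  const when = Date.parse(value);
  return Number.isNaN(when) ? undefined : Math.max(0, when - nowMs);
}

// Prefer the server's hint; otherwise use the caller's own backoff delay.
function chooseDelayMs(header: string | undefined, backoffMs: number, nowMs: number): number {
  return parseRetryAfterMs(header, nowMs) ?? backoffMs;
}

// Per-provider attempt counters that can be reset when switching providers.
class RetryBudget {
  private readonly used = new Map<string, number>();
  private readonly maxPerProvider: number;

  constructor(maxPerProvider: number) {
    this.maxPerProvider = maxPerProvider;
  }

  // Returns false once the provider has used up its attempts.
  tryConsume(provider: string): boolean {
    const count = this.used.get(provider) ?? 0;
    if (count >= this.maxPerProvider) return false;
    this.used.set(provider, count + 1);
    return true;
  }

  reset(provider: string): void {
    this.used.delete(provider);
  }
}
```

Passing nowMs in explicitly, rather than reading the clock inside the helper, keeps the date-based branch testable.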
Retry Strategy: State Counter + Exponential Backoff + RetryPolicy
import { Annotation, StateGraph, START, END, type RetryPolicy } from "@langchain/langgraph"
import { AIMessage, HumanMessage } from "@langchain/core/messages"

// Extend state with retry fields
const RetryState = Annotation.Root({
  messages: Annotation<Array<HumanMessage | AIMessage>>({
    reducer: (a, b) => [...a, ...b],
    default: () => []
  }),
  // A config object needs a reducer for its default to apply; these keep the latest value
  retryCount: Annotation<number>({ reducer: (_prev, next) => next, default: () => 0 }),
  maxRetries: Annotation<number>({ reducer: (_prev, next) => next, default: () => 3 }),
  lastError: Annotation<string>({ reducer: (_prev, next) => next, default: () => "" })
})

// Route on the error tag in the last message; fall back once the retry budget is spent
function retryRouter(state: typeof RetryState.State): string {
  const lastMsg = state.messages[state.messages.length - 1]?.content?.toString() ?? ""
  if (!lastMsg.includes("[ERROR")) return "success"
  return state.retryCount < state.maxRetries ? "retry" : "fallback"
}

// Node-level policy: retry only transient failures
const retryPolicy: RetryPolicy = {
  maxAttempts: 3,
  initialInterval: 500,
  backoffFactor: 2,
  maxInterval: 10000,
  jitter: true,
  retryOn: (error) => ["429", "502", "503", "timeout"].some(c => String(error?.message ?? "").includes(c))
}

// unreliableNode is not shown here; it must increment retryCount whenever it
// returns an "[ERROR...]" message, otherwise the "retry" edge loops forever
const retryGraph = new StateGraph(RetryState)
  .addNode("fetch", unreliableNode, { retryPolicy })
  .addNode("fallback", async () => ({ messages: [new AIMessage("All retries exhausted. Using cached result.")] }))
  .addEdge(START, "fetch")
  .addConditionalEdges("fetch", retryRouter, { retry: "fetch", success: END, fallback: "fallback" })
  .addEdge("fallback", END)
  .compile()

Exponential backoff with jitter prevents the thundering‑herd effect. The delay before each retry is Math.min(500 * 2**retryCount + Math.random()*500, 10000) ms.
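The delay formula can be checked in isolation. The helper below reproduces the article's expression with the random source made injectable; the function and option names are hypothetical, and LangGraph's own RetryPolicy may compute its jitter differently.

```typescript
interface BackoffOptions {
  baseMs: number;   // delay before the first retry
  factor: number;   // multiplier applied per attempt
  jitterMs: number; // upper bound of the random component
  capMs: number;    // maximum delay
}

// Math.min(base * factor**attempt + random * jitter, cap), as in the article's formula.
function backoffDelayMs(
  attempt: number,
  opts: BackoffOptions,
  random: () => number = Math.random,
): number {
  const exponential = opts.baseMs * opts.factor ** attempt;
  const jitter = random() * opts.jitterMs;
  return Math.min(exponential + jitter, opts.capMs);
}

const articleDefaults: BackoffOptions = { baseMs: 500, factor: 2, jitterMs: 500, capMs: 10000 };

// With jitter switched off, the schedule for attempts 0..5 is
// 500, 1000, 2000, 4000, 8000, 10000 (the last value is capped).
const scheduleWithoutJitter = [0, 1, 2, 3, 4, 5].map((n) => backoffDelayMs(n, articleDefaults, () => 0));
```

With real jitter each client lands somewhere in a 500 ms window above the exponential value, which is what spreads concurrent retries apart.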
Fallback Path: Primary → Secondary → Cache
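Before the graph version, the chain itself can be sketched as a plain function: try each provider in order, give each one a fresh attempt counter, and stop at the first success. The Provider type and runFallbackChain name are hypothetical, and the sketch leaves out backoff delays for brevity.

```typescript
// Hypothetical provider description: a name, a call, and its own attempt limit.
interface Provider<T> {
  name: string;
  maxAttempts: number;
  call: () => Promise<T>;
}

interface ChainResult<T> {
  value: T;
  provider: string;                 // which provider finally answered
  attempts: Record<string, number>; // attempts used per provider
}

// Walk the chain in order (e.g. primary -> secondary -> cache).
async function runFallbackChain<T>(providers: Provider<T>[]): Promise<ChainResult<T>> {
  const attempts: Record<string, number> = {};
  let lastError: unknown;
  for (const provider of providers) {
    let used = 0; // fresh counter for every provider, never inherited
    while (used < provider.maxAttempts) {
      used += 1;
      attempts[provider.name] = used;
      try {
        const value = await provider.call();
        return { value, provider: provider.name, attempts };
      } catch (err) {
        lastError = err; // remember the failure and keep going
      }
    }
  }
  throw new Error(`All providers failed: ${String(lastError)}`);
}
```

Because the counter is declared inside the loop, a secondary provider cannot inherit an exhausted count from the primary.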
const FallbackState = Annotation.Root({
  messages: Annotation<Array<HumanMessage | AIMessage>>({
    reducer: (a, b) => [...a, ...b],
    default: () => []
  }),
  // A config object needs a reducer for its default to apply; these keep the latest value
  currentProvider: Annotation<"primary" | "secondary" | "cache">({
    reducer: (_prev, next) => next,
    default: () => "primary"
  }),
  retryCount: Annotation<number>({ reducer: (_prev, next) => next, default: () => 0 })
})

// Route on which provider reported the error
function fallbackRouter(state: typeof FallbackState.State): string {
  const lastMsg = state.messages[state.messages.length - 1]?.content?.toString() ?? ""
  if (lastMsg.includes("[ERROR:PRIMARY]")) return "to_secondary"
  if (lastMsg.includes("[ERROR:SECONDARY]")) return "to_cache"
  return "success"
}

// Switch providers and reset the counter so the secondary starts with a full budget
async function switchToSecondaryNode(state: typeof FallbackState.State) {
  return { currentProvider: "secondary" as const, retryCount: 0, messages: [] }
}

// primarySearchNode, secondarySearchNode, and cacheNode are not shown here
const fallbackGraph = new StateGraph(FallbackState)
  .addNode("primary", primarySearchNode)
  .addNode("switch_to_secondary", switchToSecondaryNode)
  .addNode("secondary", secondarySearchNode)
  .addNode("cache", cacheNode)
  .addEdge(START, "primary")
  .addConditionalEdges("primary", fallbackRouter, { to_secondary: "switch_to_secondary", to_cache: "cache", success: END })
  .addEdge("switch_to_secondary", "secondary")
  .addConditionalEdges("secondary", fallbackRouter, { to_secondary: "secondary", to_cache: "cache", success: END })
  .addEdge("cache", END)
  .compile()

Human Takeover: Three Trigger Scenarios + External Resume
import { interrupt, Command, MemorySaver } from "@langchain/langgraph"

// Minimal state shape used by the nodes below
type ChatState = { messages: Array<HumanMessage | AIMessage> }

// Scenario 1: High‑risk operation confirmation
async function deleteDataNode(state: ChatState) {
  const target = state.messages[state.messages.length - 1].content as string
  // Pauses the graph here; the resume value becomes the return value of interrupt()
  const approval = interrupt({ type: "approval", target, message: `⚠️ About to delete: ${target}, confirm?` })
  if (approval !== "yes") return { messages: [new AIMessage("[Cancelled]")] }
  return { messages: [new AIMessage(`[Done] Deleted: ${target}`)] }
}

// Scenario 2: Low‑confidence answer review
// generateAnswer is not shown here; it is assumed to return { text, confidence }
async function lowConfidenceNode(state: ChatState) {
  const answer = await generateAnswer(state)
  if (answer.confidence < 0.7) {
    const feedback = interrupt({ type: "review", draft: answer.text, confidence: answer.confidence })
    return { messages: [new AIMessage(feedback === "approve" ? answer.text : feedback)] }
  }
  return { messages: [new AIMessage(answer.text)] }
}

// Resume after interrupt (graph and initialState are defined elsewhere)
const compiledGraph = graph.compile({ checkpointer: new MemorySaver() }) // use PostgresSaver in production
const thread = { configurable: { thread_id: "task-456" } }
const result = await compiledGraph.invoke(initialState, thread) // pauses on interrupt
const resumed = await compiledGraph.invoke(new Command({ resume: "yes" }), thread) // continues

The interrupt() call serializes the graph state to a checkpoint; without a checkpointer the graph cannot resume. On resume the interrupted node runs again from its beginning, so any code placed before interrupt() (such as generateAnswer above) executes a second time.
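Why the checkpointer is mandatory becomes clearer with a toy model of pause and resume. The sketch below is not how LangGraph is implemented; ToyCheckpointStore, requestApproval, and resumeApproval are hypothetical names that only mimic the shape of the flow: pending work is saved under a thread id, and a later call looks it up and supplies the human's answer.

```typescript
// What gets saved when execution pauses for a human decision.
interface PendingApproval {
  threadId: string;
  target: string; // the item awaiting confirmation
}

// Toy stand-in for a checkpointer: an in-memory map keyed by thread id.
class ToyCheckpointStore {
  private readonly saved = new Map<string, PendingApproval>();

  save(pending: PendingApproval): void {
    this.saved.set(pending.threadId, pending);
  }

  // Remove and return the pending entry for a thread, if any.
  take(threadId: string): PendingApproval | undefined {
    const pending = this.saved.get(threadId);
    this.saved.delete(threadId);
    return pending;
  }
}

// Pause: without a store there is nowhere to keep the pending state.
function requestApproval(store: ToyCheckpointStore | undefined, threadId: string, target: string): "paused" {
  if (!store) throw new Error("Cannot pause for approval without a checkpoint store");
  store.save({ threadId, target });
  return "paused";
}

// Resume: look up the saved state by thread id and apply the human's answer.
function resumeApproval(store: ToyCheckpointStore, threadId: string, answer: string): string {
  const pending = store.take(threadId);
  if (!pending) throw new Error(`No checkpoint found for thread ${threadId}`);
  return answer === "yes" ? `[Done] Deleted: ${pending.target}` : "[Cancelled]";
}
```

The thread id plays the same role as thread_id in the configurable object above: it is the only link between the paused run and the later resume call, so it must be stable and stored wherever the approval request is surfaced.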
Common Pitfalls (Five Frequent Mistakes)
Routing function lacks a fallback branch, causing endless retries and token exhaustion.
Failover does not reset retryCount, so the secondary provider inherits the exhausted counter.
Human‑in‑the‑loop without a checkpoint; interrupt() fails because there is nowhere to store state.
Retrying all errors indiscriminately; include only retry‑able codes (429, 502, 503, timeout).
Missing jitter in retries; fixed‑interval retries cause thundering‑herd under concurrency.
Conclusion
Classify failures first: transient → retry, structural → fallback, semantic → human takeover.
Never retry 4xx errors other than 429 (rate limit).
Prefer the Retry-After header when available.
Design a complete fallback chain: primary → secondary → cache, resetting counters on provider switch.
Human takeover relies on checkpoints; LangGraph provides this capability while CLI agents do not.
Combine an outer RetryPolicy for transient errors with inner state routing for business‑level degradation.
James' Growth Diary
I am James, focusing on AI Agent learning and growth. I continuously update two series: “AI Agent Mastery Path,” which systematically outlines core theories and practices of agents, and “Claude Code Design Philosophy,” which deeply analyzes the design thinking behind top AI tools. Helping you build a solid foundation in the AI era.