Turning ReAct from Demo to Production: Handling Failures, Loops, and Token Budgets

This article explains how to upgrade a ReAct agent from a proof‑of‑concept to a production‑ready system by classifying tool failures, detecting repeated search loops, managing token budgets, and adding structured logging, complete with Python implementations and practical interview guidance.

Wu Shixiong's Large Model Academy

1. Tool failures aren't just exceptions

Most developers initially wrap tool calls in a simple try/except and retry on error, but this only handles explicit exceptions. Real‑world agents also encounter "silent failures" where the tool returns content that is useless or misleading, such as login walls or ad pages.

A concrete case involved a search that returned a subscription‑only page; the model then tried to bypass it, entering a loop of similar queries and ultimately producing nonsense.

We therefore categorize tool failures into three distinct types, each requiring its own handling strategy.

import asyncio
import time

class ToolResult:
    def __init__(self, content: str, status: str, error_type: str = None):
        self.content = content
        self.status = status        # "ok" / "retry" / "skip" / "degrade"
        self.error_type = error_type

async def execute_tool_safe(tool_name: str, tool_fn, **kwargs) -> ToolResult:
    """Hierarchical tool execution: distinguish timeout, empty result, and garbage content, and give different remediation suggestions."""
    # ── First class: timeout/network error → exponential back‑off retry, up to 3 times ──────
    for attempt in range(3):
        try:
            result = await asyncio.wait_for(tool_fn(**kwargs), timeout=10.0)
            break
        except asyncio.TimeoutError:
            if attempt == 2:
                return ToolResult(content="[Search timed out; try again with a shorter keyword]", status="retry", error_type="timeout")
            await asyncio.sleep(3 ** attempt)  # back off 1s, then 3s before the next attempt
        except Exception as e:
            return ToolResult(content=f"[Tool call failed: {type(e).__name__}; adjust the arguments and retry]", status="skip", error_type="hard_error")
    # ── Second class: empty result → do not retry, ask model to change query ────────
    if not result or len(result.strip()) < 50:
        return ToolResult(content="[No useful results for the current keywords; try rephrasing, or split the question into smaller sub-questions]", status="skip", error_type="empty_result")
    # ── Third class: content quality check → filter garbage before passing to model ─────
    content_flags = _check_content_quality(result)
    if content_flags["is_garbage"]:
        return ToolResult(content=f"[Page content could not be extracted ({content_flags['reason']}); try a different source]", status="skip", error_type=f"garbage_{content_flags['reason']}")
    return ToolResult(content=result, status="ok")

def _check_content_quality(content: str) -> dict:
    """Detect common garbage patterns such as login walls, short pages, or ad‑heavy pages."""
    content_lower = content.lower()
    login_signals = ["please log in", "sign in to continue", "请登录", "订阅后查看", "会员专享"]  # English and Chinese login-wall phrases
    if any(s in content_lower for s in login_signals):
        return {"is_garbage": True, "reason": "login_wall"}
    if len(content.strip()) < 100:
        return {"is_garbage": True, "reason": "too_short"}
    url_count = content.count("http")
    word_count = len(content.split())
    if word_count > 0 and url_count / word_count > 0.15:
        return {"is_garbage": True, "reason": "ad_page"}
    return {"is_garbage": False, "reason": None}

With this hierarchical handling, the proportion of steps that derail due to poor tool output dropped from 11% to under 3% on our test set.
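
The back-off shape in the timeout branch is easiest to see in isolation. Below is a scaled-down sketch with a stub tool and 10 ms sleeps instead of seconds; `flaky_search` and `call_with_backoff` are illustrative names for this sketch, not part of the article's code:

```python
import asyncio

async def flaky_search(query: str, fail_times: int, state: dict) -> str:
    """Stub tool: hangs on the first `fail_times` calls, then succeeds."""
    state["calls"] = state.get("calls", 0) + 1
    if state["calls"] <= fail_times:
        await asyncio.sleep(60)          # simulate a hung request
    return f"results for {query} " * 20  # long enough to pass the length check

async def call_with_backoff(tool_fn, retries: int = 3, timeout: float = 0.05, **kwargs):
    """Same retry shape as execute_tool_safe: timeout -> back off -> retry."""
    for attempt in range(retries):
        try:
            return await asyncio.wait_for(tool_fn(**kwargs), timeout=timeout)
        except asyncio.TimeoutError:
            if attempt == retries - 1:
                return None              # caller turns this into a "retry" hint
            await asyncio.sleep(0.01 * (3 ** attempt))  # scaled-down 1s/3s schedule

state = {}
result = asyncio.run(call_with_backoff(flaky_search, query="ReAct", fail_times=1, state=state))
```

Here the first attempt is cancelled by `asyncio.wait_for`, the caller backs off, and the second attempt succeeds; only after all retries fail does the wrapper fall back to a remediation hint for the model.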

2. Repeated searches: a hidden loop problem

Even after fixing explicit failures, agents can waste steps by repeatedly issuing near‑identical queries. Simple string matching fails because the model often varies wording while preserving meaning.

import numpy as np

class LoopDetector:
    def __init__(self, window: int = 5, threshold: float = 0.88):
        self.window = window
        self.threshold = threshold
        self.history: list[dict] = []  # {"query": str, "embedding": np.ndarray, "result_quality": str}

    def check(self, new_query: str, new_embedding: np.ndarray) -> dict:
        """Check whether the new query is a duplicate of recent searches."""
        if len(self.history) < 2:
            self._record(new_query, new_embedding, "unknown")
            return {"is_loop": False, "hint": ""}
        recent = self.history[-self.window:]
        similarities = [
            float(np.dot(new_embedding, h["embedding"])
                  / (np.linalg.norm(new_embedding) * np.linalg.norm(h["embedding"]) + 1e-9))
            for h in recent
        ]
        max_sim = max(similarities)
        most_similar_idx = similarities.index(max_sim)
        if max_sim > self.threshold:
            past_quality = recent[most_similar_idx]["result_quality"]
            hint = self._generate_hint(past_quality, recent[most_similar_idx]["query"])
            return {"is_loop": True, "hint": hint}
        self._record(new_query, new_embedding, "unknown")
        return {"is_loop": False, "hint": ""}

    def mark_result_quality(self, quality: str):
        """Record the quality of the most recent search result for future loop detection."""
        if self.history:
            self.history[-1]["result_quality"] = quality

    def _generate_hint(self, past_quality: str, past_query: str) -> str:
        if past_quality == "empty_result":
            return f"Duplicate search detected: the direction '{past_query}' was already searched and returned nothing useful. Approach the problem from a completely different angle, e.g. the opposite claim, a broader concept, or a concrete case source."
        elif past_quality == "garbage":
            return f"Duplicate search detected: the direction '{past_query}' returned invalid content (login wall / ad page). Visit a specific URL from a trusted source (official site, reputable media) instead of continuing keyword searches."
        else:
            return f"Duplicate search detected: a similar query was already issued and will likely return the same results. Re-examine the current information gaps and design search terms from sub-questions instead of repeating broad searches."

    def _record(self, query, embedding, quality):
        self.history.append({"query": query, "embedding": embedding, "result_quality": quality})
        if len(self.history) > 20:
            self.history.pop(0)

Integrating LoopDetector into the ReAct loop allows the agent to skip redundant searches and inject a helpful hint, improving the effective‑step ratio from 73% to 91% on our benchmark.
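
The core of the detector is just a windowed cosine check; it can be exercised with toy unit vectors standing in for real embeddings. `MiniLoopDetector` and `cosine` below are a condensed illustration of the class above, not the article's exact code:

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity with the same epsilon guard as LoopDetector."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

class MiniLoopDetector:
    """Condensed LoopDetector: flag a query whose embedding is too close
    to any of the last `window` recorded queries."""
    def __init__(self, window: int = 5, threshold: float = 0.88):
        self.window, self.threshold = window, threshold
        self.history: list[np.ndarray] = []

    def check(self, embedding: np.ndarray) -> bool:
        recent = self.history[-self.window:]
        is_loop = any(cosine(embedding, e) > self.threshold for e in recent)
        if not is_loop:
            self.history.append(embedding)  # only record genuinely new directions
        return is_loop

det = MiniLoopDetector()
q1 = np.array([1.0, 0.0, 0.1])   # original query
q2 = np.array([0.99, 0.02, 0.1]) # paraphrase of q1, cosine ~1.0 -> flagged
q3 = np.array([0.0, 1.0, 0.0])   # genuinely new angle -> allowed
```

With real embeddings, paraphrases like "LLM agent failure modes" and "failure modes of LLM agents" typically land well above 0.88, which is why string matching misses them but this check does not.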

[Figure: duplicate-search detection — infinite loop vs. automatic escape]

3. Token budgeting: act before overflow

ReAct’s context grows linearly; after 10-15 steps the token count can exceed 30k, causing the model to lose track of early information. Instead of reacting after overflow, we proactively trim or compress observations based on usage thresholds.

class TokenBudget:
    def __init__(self, max_tokens: int = 28000):
        # Reserve ~4k tokens for the final answer generation
        self.max_tokens = max_tokens
        self.YELLOW_THRESHOLD = 0.65  # start compact mode
        self.RED_THRESHOLD = 0.85     # trigger history compression

    def get_mode(self, current_tokens: int) -> str:
        ratio = current_tokens / self.max_tokens
        if ratio < self.YELLOW_THRESHOLD:
            return "normal"
        elif ratio < self.RED_THRESHOLD:
            return "compact"
        else:
            return "compress"

    def truncate_observation(self, observation: str, mode: str) -> str:
        if mode == "normal":
            return observation[:3000]
        elif mode == "compact":
            return observation[:1500]
        else:
            return observation[:800]
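
`get_mode` needs a `current_tokens` figure. An exact count requires the model's own tokenizer, but a rough chars/4 heuristic (a common approximation for English text; an assumption here, not part of the article's code) is enough to drive the thresholds. The `TokenBudget` class is repeated in condensed form so this sketch runs standalone:

```python
def estimate_tokens(messages: list[dict]) -> int:
    """Rough estimate: ~4 characters per token for English text.
    Swap in the model's real tokenizer when exact counts matter."""
    return sum(len(m.get("content") or "") for m in messages) // 4

class TokenBudget:
    """Same thresholds as above: 65% -> compact, 85% -> compress."""
    def __init__(self, max_tokens: int = 28000):
        self.max_tokens = max_tokens

    def get_mode(self, current_tokens: int) -> str:
        ratio = current_tokens / self.max_tokens
        if ratio < 0.65:
            return "normal"
        elif ratio < 0.85:
            return "compact"
        return "compress"

budget = TokenBudget()
messages = [{"role": "user", "content": "x" * 80000}]  # ~20k estimated tokens
mode = budget.get_mode(estimate_tokens(messages))      # 20000/28000 ≈ 0.71 -> "compact"
```

A deliberately cheap estimate is fine here because the thresholds are soft: being off by a few percent only shifts when compaction starts, not whether it happens before overflow.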

When the token usage reaches 85 %, we compress the entire history of tool observations into a structured summary, preserving the model’s reasoning (the “think” part) while discarding raw content.

async def compress_history(messages: list, llm) -> list:
    """Summarize past tool results, keep only the reasoning steps."""
    tool_messages = [m for m in messages if m["role"] == "tool"]
    if len(tool_messages) < 3:
        return messages
    all_observations = "\n\n---\n\n".join([m["content"] for m in tool_messages])
    summary_prompt = f"""Below are the raw search results from a Deep Research task.
Condense them into a structured summary in this format:
## Confirmed facts
[List the key findings, each with its source]
## Unresolved sub-questions
[List the information gaps that still lack answers]
Raw search results:
{all_observations[:8000]}
"""
    summary = await llm.chat_async([{"role": "user", "content": summary_prompt}])
    think_messages = [m for m in messages if m["role"] == "assistant"]
    new_messages = [
        messages[0],  # system prompt
        messages[1],  # user question
        *think_messages,
        {"role": "tool", "tool_call_id": "compressed", "content": f"[Compressed history of search results]\n{summary.content}"}
    ]
    return new_messages

[Figure: three-tier token budget management]

Applying this three‑tier budget raised the success rate of complex tasks (those needing >12 steps) from 61% to 78%.

4. Structured logging: know exactly where the agent failed

All the safeguards above rely on visibility into the agent’s execution. Without structured logs the system is a black box.

import json
from datetime import datetime

class AgentLogger:
    def __init__(self, task_id: str):
        self.task_id = task_id
        self.steps = []
        self.start_time = datetime.now()

    def log_step(self, step: int, tool_name: str, tool_args: dict,
                 result_status: str, token_count: int, duration_ms: int):
        """Record each step; duration helps distinguish network timeouts from bad queries."""
        entry = {
            "task_id": self.task_id,
            "step": step,
            "timestamp": datetime.now().isoformat(),
            "tool": tool_name,
            "args": tool_args,
            "result_status": result_status,   # ok/timeout/empty_result/garbage/loop_detected
            "token_count": token_count,
            "duration_ms": duration_ms
        }
        self.steps.append(entry)
        print(f"[{step:02d}] {tool_name}({tool_args.get('query', '')[:40]}) → {result_status} | {token_count} tokens | {duration_ms}ms")

    def summary(self) -> dict:
        total_steps = len(self.steps)
        loop_steps = sum(1 for s in self.steps if s["result_status"] == "loop_detected")
        failed_steps = sum(1 for s in self.steps if s["result_status"] not in ("ok", "loop_detected"))
        return {
            "task_id": self.task_id,
            "total_steps": total_steps,
            "effective_steps": total_steps - loop_steps,
            "loop_rate": loop_steps / total_steps if total_steps else 0,
            "tool_fail_rate": failed_steps / total_steps if total_steps else 0,
            "peak_tokens": max(s["token_count"] for s in self.steps) if self.steps else 0,
            "total_duration_s": (datetime.now() - self.start_time).total_seconds()
        }

With this logger you can instantly see which step triggered loop detection, returned garbage, or caused a token spike.
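
Because each step is a plain dict, the summary metrics are simple aggregations, and the same logs support offline analysis across many tasks. The step records below are fabricated for illustration; the fields mirror what `log_step` writes:

```python
from collections import Counter

steps = [
    {"tool": "search", "result_status": "ok",            "token_count": 1200},
    {"tool": "search", "result_status": "timeout",       "token_count": 1400},
    {"tool": "fetch",  "result_status": "garbage",       "token_count": 1400},
    {"tool": "search", "result_status": "loop_detected", "token_count": 4300},
    {"tool": "search", "result_status": "ok",            "token_count": 9800},
]

loop_steps = sum(1 for s in steps if s["result_status"] == "loop_detected")
failed = [s for s in steps if s["result_status"] not in ("ok", "loop_detected")]
summary = {
    "effective_steps": len(steps) - loop_steps,
    "loop_rate": loop_steps / len(steps),
    "tool_fail_rate": len(failed) / len(steps),
    "peak_tokens": max(s["token_count"] for s in steps),
}
# Which failure mode dominates, and which tool produces it:
failure_breakdown = Counter((s["tool"], s["result_status"]) for s in failed)
```

Aggregating `failure_breakdown` over a day of task logs is what tells you whether to tune the timeout, the garbage filter, or the loop threshold next.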

[Figure: production-grade ReAct executor — four layers of protection]

5. How to answer ReAct engineering questions in interviews

When interviewers ask about production issues, structure your answer in three 30‑second blocks:

1. Describe the three failure categories (timeout with exponential back-off, empty result with a prompt to change the query, and silent garbage content that must be filtered).

2. Explain the embedding-based loop detector, its 0.88 similarity threshold, and the quantitative improvement (effective steps rose from 73% to 91%).

3. Detail the token-budget system, the 65% and 85% thresholds, and the resulting success-rate lift (from 61% to 78%).

Providing concrete numbers shows that you have measured and validated your solutions.

Conclusion

Treat every tool call as potentially garbage until you have positively verified its usefulness; this engineering mindset—anticipating timeouts, empty results, and low‑quality content—turns a fragile demo into a robust production ReAct executor.

LLM · Loop Detection · Agent Engineering · Token Budgeting · Tool Failure Handling
Written by

Wu Shixiong's Large Model Academy

We continuously share large‑model know‑how, helping you master core skills—LLM, RAG, fine‑tuning, deployment—from zero to job offer, tailored for career‑switchers, autumn recruiters, and those seeking stable large‑model positions.
