Why Bigger Context Fails for Deep Research Agents and How IterResearch Fixes It

Simply enlarging the LLM's context window cannot prevent a Deep Research agent from forgetting early conclusions in long, multi-step tasks. This article explains ReAct's context problems, introduces the IterResearch framework and its evolving report, and compares its accuracy, cost, and scalability against ReAct and ReSum.


In a recent ByteDance interview, a candidate claimed that expanding the context window to 32k tokens would preserve a Deep Research agent's memory across 20 steps. The interviewer pointed out that enlarging the window does not actually solve the forgetting problem.

1. ReAct’s Context Problems

ReAct appends each Thought → Action → Observation triplet to a linear history. When the number of steps exceeds about 15, three issues appear:

Context grows, attention thins. Early tokens receive less effective attention, so a fact discovered at step 3 may be drowned out by step 20.

Redundant search results. The linear log accumulates repeated facts and procedural text that occupy space without adding value.

Truncation loss. When the context window fills, the oldest entries—often the original task definition and key conclusions—are cut off, causing the model to “loop” searching for already‑found information.
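To make the failure mode concrete, here is a minimal sketch of a ReAct-style linear history under a fixed token budget. The message format and token estimate are simplified assumptions for illustration, not a reference implementation:

def rough_token_count(text: str) -> int:
    # Crude token estimate, for illustration only
    return len(text) // 4

class LinearHistory:
    """Hypothetical ReAct-style linear history that drops its oldest entries when full"""

    def __init__(self, max_tokens: int = 32_000):
        self.max_tokens = max_tokens
        self.entries: list[str] = []

    def append_step(self, thought: str, action: str, observation: str):
        self.entries.append(
            f"Thought: {thought}\nAction: {action}\nObservation: {observation}"
        )
        # Truncate from the front once the budget is exceeded: the task definition
        # and the earliest conclusions are exactly what gets cut first
        while sum(rough_token_count(e) for e in self.entries) > self.max_tokens:
            self.entries.pop(0)

    def to_prompt(self) -> str:
        return "\n\n".join(self.entries)

However large max_tokens is set, a long enough task eventually evicts the step-3 facts, which is why the fix has to change what is stored rather than how much.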

2. IterResearch Core Idea

IterResearch replaces the ever‑growing dialogue with an evolving report, a structured memory that is updated after each step.

Confirmed Facts: verified information with source and confidence.

Open Questions: sub‑problems that remain unsolved.

Information Gaps: partially known topics that need further evidence.

Current Focus: the next concrete question to address.

After each tool call the agent extracts findings and merges them into the report instead of appending raw observations.

The next inference input consists of the latest snapshot of the report plus the current tool result, keeping the input size roughly constant regardless of the total number of steps.

class EvolvingReport:
    """IterResearch evolving report: central memory for Deep Research agents"""

    def __init__(self):
        self.confirmed_facts: list[dict] = []   # verified facts
        self.open_questions: list[str] = []     # unresolved sub-questions
        self.information_gaps: list[str] = []   # partially known topics needing evidence
        self.current_focus: str = ""            # current research focus
        self.research_steps: int = 0            # number of steps executed

    def update(self, new_findings: dict):
        """Integrate new step findings into the report"""
        for fact in new_findings.get("facts", []):
            self.confirmed_facts.append({
                "content": fact["content"],
                "source": fact["source"],
                "confidence": fact.get("confidence", "medium"),
                "step": self.research_steps
            })
        resolved = new_findings.get("resolved_questions", [])
        self.open_questions = [q for q in self.open_questions if q not in resolved]
        self.open_questions.extend(new_findings.get("new_questions", []))
        if new_findings.get("next_focus"):
            self.current_focus = new_findings["next_focus"]
        self.research_steps += 1

    def to_prompt_context(self) -> str:
        """Render a fixed-size prompt from the report"""
        facts_str = "\n".join(
            f"- {f['content']} [source: {f['source']}]"
            for f in self.confirmed_facts[-20:]  # keep the 20 most recent facts
        )
        questions_str = "\n".join(f"- {q}" for q in self.open_questions[:10])
        return f"""## Current research state (step {self.research_steps})

### Confirmed facts
{facts_str}

### Open sub-questions
{questions_str}

### Current focus
{self.current_focus}
"""

Key implementation notes: only the most recent 20 confirmed facts are retained, and solved questions are removed from open_questions, ensuring the report size stays bounded.
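A short usage example may help; the findings payload below is hypothetical, shaped to match the keys that update reads (facts, resolved_questions, new_questions, next_focus):

report = EvolvingReport()
report.open_questions = ["Which year did company X acquire company Y?"]
report.current_focus = report.open_questions[0]

# Hypothetical findings extracted from one tool call
report.update({
    "facts": [{
        "content": "Company X announced the acquisition in 2019.",
        "source": "https://example.com/press-release"
    }],
    "resolved_questions": ["Which year did company X acquire company Y?"],
    "new_questions": ["What was the acquisition price?"],
    "next_focus": "What was the acquisition price?",
})

# The rendered prompt stays roughly the same size no matter how many steps have run
print(report.to_prompt_context())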

3. Driving the Reasoning Loop

The main loop renders the latest report snapshot as the prompt, lets the model choose the next action, executes the tool, and folds the extracted findings back into the report:
async def iter_research(query: str, tools: ToolSet, max_steps: int = 30) -> str:
    """IterResearch main loop"""
    report = EvolvingReport()
    report.current_focus = query
    report.open_questions = [query]

    for step in range(max_steps):
        # 1. Render the current report snapshot as the (roughly constant-size) prompt
        context = report.to_prompt_context()
        # 2. Let the model choose the next action from the report, not the raw history
        action = await llm.decide_action(
            context=context,
            available_tools=tools.list(),
            system_prompt=ITER_RESEARCH_SYSTEM_PROMPT
        )
        if action.type == "finish":
            break
        # 3. Execute the chosen tool call
        observation = await tools.execute(action)
        # 4. Distill the raw observation into structured findings
        new_findings = await llm.extract_findings(
            action=action,
            observation=observation,
            current_report=context
        )
        # 5. Merge the findings into the evolving report
        report.update(new_findings)
        # Stop early once no open questions remain
        if not report.open_questions:
            break
    return await llm.synthesize_report(report)

The extra extract_findings call parses raw tool output into the structured slots of the report. Although it adds one LLM invocation per step, it dramatically reduces token consumption for long‑running tasks because the model no longer scans a 20k‑token history.
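The article does not show extract_findings itself. Below is a minimal sketch of what such a step might look like, assuming a generic JSON-returning completion call; llm_complete and the prompt wording are assumptions, not part of IterResearch:

import json

EXTRACT_PROMPT = """You are updating a research report.
Current report:
{report}

Last action: {action}
Tool observation:
{observation}

Return JSON with keys: facts (list of {{content, source}}), resolved_questions,
new_questions, next_focus."""

async def extract_findings(action: str, observation: str, current_report: str) -> dict:
    # llm_complete is an assumed JSON-returning chat call, not part of any specific SDK
    raw = await llm_complete(
        EXTRACT_PROMPT.format(report=current_report, action=action, observation=observation)
    )
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        # Fall back to an empty update rather than corrupting the report
        return {"facts": [], "resolved_questions": [], "new_questions": [], "next_focus": None}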

4. Quality Control of the Report

Confidence & source tagging. Facts are stored with a confidence level (high/medium/low) and their provenance.

Conflict handling. When a new fact contradicts an existing one, both are kept and the discrepancy is recorded in Information Gaps for later verification (a minimal sketch follows the compression example below).

Fact compression. If more than a threshold number of facts share the same topic, they are summarized into a single entry to prevent unbounded growth.

async def compress_confirmed_facts(facts: list[dict], topic_threshold: int = 5) -> list[dict]:
    """Summarize facts when a topic exceeds the threshold"""
    # cluster_by_topic groups facts by topic (helper not shown in this article)
    clustered = cluster_by_topic(facts)
    compressed = []
    for topic, topic_facts in clustered.items():
        if len(topic_facts) > topic_threshold:
            # Too many facts on one topic: collapse them into a single LLM-written summary
            summary = await llm.compress_facts(topic_facts)
            compressed.append({
                "content": summary,
                "source": "compressed_from_multiple",
                "confidence": "medium",
                "original_count": len(topic_facts)
            })
        else:
            compressed.extend(topic_facts)
    return compressed
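The conflict-handling rule described above is not shown in the article's code. A minimal sketch, assuming a contradicts helper (for example an LLM judgment or a rule-based check) that is not part of the original:

def record_conflicts(report: "EvolvingReport", new_fact: dict) -> None:
    """Keep both conflicting facts and flag the discrepancy as an information gap"""
    for existing in report.confirmed_facts:
        if contradicts(existing["content"], new_fact["content"]):  # assumed helper
            report.information_gaps.append(
                f"Conflict: '{existing['content']}' vs '{new_fact['content']}' "
                f"(sources: {existing['source']} / {new_fact['source']}) - needs verification"
            )
    report.confirmed_facts.append(new_fact)  # keep both; do not overwrite

Keeping both versions and flagging the gap avoids silently overwriting an earlier fact that may turn out to be correct.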

5. IterResearch vs ReSum vs ReAct

Both IterResearch and ReSum aim to curb context explosion, but their strategies differ.

ReSum (dynamic summarization). Keeps the linear history and triggers an LLM summarization pass when the context nears its limit (sketched below). It is cheap to add, but the compression is lossy and detail can degrade after several passes.

IterResearch (evolving report). Replaces the linear log entirely with a structured report, giving the model a clear view of the current research state without needing to search through old tokens.
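For contrast, the ReSum-style trigger mentioned above can be sketched roughly as follows; this is an illustrative paraphrase using an assumed llm_summarize call, not the ReSum implementation:

async def resum_step(history: list[str], new_entry: str, token_budget: int = 28_000) -> list[str]:
    """Keep a linear history, but summarize the older portion when nearing the budget"""
    history = history + [new_entry]
    if sum(len(e) // 4 for e in history) > token_budget:        # crude token estimate
        # Lossy step: everything but the last few entries is replaced by a summary
        summary = await llm_summarize("\n\n".join(history[:-5]))  # assumed summarization call
        history = [f"[Summary of earlier steps]\n{summary}"] + history[-5:]
    return history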

Empirical tests show that for tasks >15 steps, IterResearch achieves a noticeable accuracy boost (≈18 % higher) while consuming fewer total tokens than ReAct, despite the extra extract_findings call.

6. Choosing the Right Strategy

Guidelines:

Tasks < 10 steps – use ReAct for simplicity.

Tasks 10‑20 steps – ReSum offers a lightweight upgrade.

Tasks > 20 steps – IterResearch provides stable performance.

When accuracy is critical and cost is acceptable – consider Research‑Synthesis with parallel verification.
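These guidelines can be condensed into a small routing helper; the thresholds are the rough ranges above, not tuned values:

def choose_strategy(expected_steps: int, accuracy_critical: bool = False) -> str:
    """Map an estimated step count to one of the strategies discussed above"""
    if accuracy_critical:
        return "research-synthesis"   # parallel verification, highest cost
    if expected_steps < 10:
        return "react"
    if expected_steps <= 20:
        return "resum"
    return "iterresearch"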

7. How to Answer in an Interview

When asked about handling context overflow in Deep Research agents, follow this outline:

State the core problem (≈20 s): “ReAct’s linear history causes context growth and attention decay after ~15 steps, which cannot be solved by merely enlarging the window.”

Explain IterResearch’s solution (≈30 s): “We replace the history with an evolving report that records confirmed facts, open questions, information gaps, and the current focus, keeping the input size constant.”

Describe the implementation detail (≈30 s): “The extra extract_findings step parses tool results into the report; in our tests it raises accuracy by 18 % for >15‑step tasks while reducing token usage.”

Conclude with selection logic (≈15 s): “Use ReAct for short tasks, ReSum for medium, and IterResearch for deep research.”

Conclusion

The article resolves the long‑standing interview question of how to manage context when Deep Research agents run many steps. ReAct is suitable for entry‑level scenarios, but its linear history hits a ceiling. IterResearch’s structured, evolving report offers a scalable solution that maintains inference quality regardless of step count.

Tags: LLM · Prompt Engineering · ReAct · Context Management · Deep Research · IterResearch
Written by

Wu Shixiong's Large Model Academy

We continuously share large‑model know‑how, helping you master core skills (LLM, RAG, fine‑tuning, deployment) from zero to job offer, tailored for career switchers, candidates in autumn campus recruitment, and anyone seeking a stable large‑model position.
