Designing High‑Quality Tools for Deep Research Agents: From Search to Python Execution

This article explains how to turn simple API calls into robust, noise‑filtering tools—Search, Visit, Scholar, and Python—by adding domain blacklists, relevance scoring, query‑driven extraction, safety sandboxes, and a unified registry, ultimately boosting the success rate of LLM‑driven research agents.

Wu Shixiong's Large Model Academy

1. Why Tool Design Is Harder Than It Looks

Even after handling basic failures (timeouts, empty results, obvious garbage), the real challenge is defining and eliminating "garbage" content. A tool must not only retrieve information but also filter out irrelevant or noisy parts, much like a shopper who selects only the needed items from a shelf.

In earlier versions, the Visit tool returned an entire 5,000‑character page, mixing valuable text with navigation, comments, and ads. This noise caused the model to waste tokens and produce poor reports. Effective tools prioritize relevance filtering from the start.

Key insight: The most important responsibility of a tool is filtering noise, not merely fetching data. A well‑designed Visit tool that extracts a concise 200‑character relevant snippet can improve task success by over 30%.

Tool design: retrieve vs filter

2. Search Tool: More Than Forwarding API Results

The naive approach returns a raw list of links, but the list often contains low‑quality domains and short, spammy snippets. To improve quality:

import aiohttp
from dataclasses import dataclass
from typing import Optional
import re

LOW_QUALITY_DOMAINS = {
    "zhidao.baidu.com",
    "wenwen.sogou.com",
    "wenda.so.com",
    "tieba.baidu.com",
    "mp.weixin.qq.com",
}

@dataclass
class SearchResult:
    url: str
    title: str
    snippet: str
    domain: str
    quality_score: float

async def search(query: str, num_results: int = 5) -> list[SearchResult]:
    """Why default to 5 results? Tests show the top 5 cover ~90% of useful info while more results dilute attention."""
    raw_results = await _call_search_api(query, num_results=10)
    filtered = []
    for r in raw_results:
        domain = _extract_domain(r["url"])
        if domain in LOW_QUALITY_DOMAINS:
            continue
        snippet_quality = _score_snippet(r.get("snippet", ""))
        if snippet_quality < 0.3:
            continue
        filtered.append(SearchResult(
            url=r["url"],
            title=r["title"],
            snippet=r["snippet"],
            domain=domain,
            quality_score=snippet_quality,
        ))
        if len(filtered) >= num_results:
            break
    return sorted(filtered, key=lambda x: x.quality_score, reverse=True)

def _score_snippet(snippet: str) -> float:
    if len(snippet) < 50:
        return 0.1
    if len(snippet) > 800:
        snippet = snippet[:800]
    spam_patterns = ["点击查看", "立即购买", "优惠活动", "限时折扣", "免费下载"]
    for p in spam_patterns:
        if p in snippet:
            return 0.2
    score = 0.5
    if re.search(r'\d+', snippet):
        score += 0.1
    tech_signals = ["方案", "实现", "研究", "分析", "数据", "结果", "测试"]
    matches = sum(1 for s in tech_signals if s in snippet)
    score += min(0.3, matches * 0.05)
    return min(1.0, score)
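To make the heuristic concrete, here is a condensed, self‑contained copy of the scorer above (renamed `score_snippet`, and without the 800‑character truncation) exercised on a short, a spammy, and a substantive snippet:

```python
import re

def score_snippet(snippet: str) -> float:
    # Condensed copy of _score_snippet above, for illustration only.
    if len(snippet) < 50:
        return 0.1                       # too short to be informative
    spam = ["点击查看", "立即购买", "优惠活动", "限时折扣", "免费下载"]
    if any(p in snippet for p in spam):
        return 0.2                       # spam marker found
    score = 0.5
    if re.search(r"\d+", snippet):
        score += 0.1                     # contains numbers: likely factual
    signals = ["方案", "实现", "研究", "分析", "数据", "结果", "测试"]
    score += min(0.3, sum(1 for s in signals if s in snippet) * 0.05)
    return min(1.0, score)

short = score_snippet("太短")                                  # 0.1 — below the length floor
spammy = score_snippet("限时折扣!" + "内容" * 30)              # 0.2 — spam marker hit
good = score_snippet("本研究分析了 2023 年的数据," + "实验结果表明方案有效。" * 4)  # ≈ 0.85
```

A snippet with digits and several technical signal words ("研究", "分析", "数据"…) lands near the top of the range, which is exactly what pushes it ahead of thin marketing snippets after sorting.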

Results should be formatted as concise natural‑language text rather than raw JSON to save tokens:

def format_search_results(results: list[SearchResult]) -> str:
    """Why format instead of JSON? Natural language is easier for LLMs and avoids field‑name overhead."""
    lines = [f"搜索返回 {len(results)} 条结果:\n"]
    for i, r in enumerate(results, 1):
        lines.append(f"[{i}] {r.title}")
        lines.append(f"    来源:{r.domain}")
        lines.append(f"    摘要:{r.snippet[:200]}")
        lines.append("")
    return "\n".join(lines)

Takeaway: Add a domain blacklist and snippet quality filter; return only the top 5 high‑quality results in a readable format.

3. Visit Tool: Extraction, Not Full Copy

The Visit tool must know the research query to extract only relevant passages. A two‑stage process is used: first strip navigation/footer, then perform paragraph‑level relevance ranking.

from bs4 import BeautifulSoup
import httpx

async def visit(url: str, query: str, max_chars: int = 2000) -> str:
    """Query is mandatory; without it the tool would return the whole page, which is useless noise."""
    try:
        async with httpx.AsyncClient(timeout=10.0) as client:
            resp = await client.get(url, follow_redirects=True)
            resp.raise_for_status()
    except httpx.TimeoutException:
        return "[访问超时]"
    except Exception as e:
        return f"[访问失败:{e}]"
    main_text = _extract_main_content(resp.text)
    if len(main_text) > max_chars:
        main_text = _filter_by_relevance(main_text, query, max_chars)
    return main_text if main_text else "[页面内容无法提取]"

def _extract_main_content(html: str) -> str:
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup.find_all(["nav", "footer", "header", "aside", "script", "style"]):
        tag.decompose()
    for selector in ["article", "main", '[role="main"]', ".post-content", ".article-body", "#content"]:
        element = soup.select_one(selector)
        if element and len(element.get_text(strip=True)) > 200:
            return element.get_text(separator="\n", strip=True)
    paragraphs = [p.get_text(strip=True) for p in soup.find_all("p") if len(p.get_text(strip=True)) > 50]
    return "\n".join(paragraphs)

def _filter_by_relevance(text: str, query: str, max_chars: int) -> str:
    paragraphs = [p.strip() for p in text.split("\n") if len(p.strip()) > 30]
    if not paragraphs:
        return text[:max_chars]
    query_words = set(query.lower().split())
    scored = []
    for idx, para in enumerate(paragraphs):
        para_words = set(para.lower().split())
        overlap = len(query_words & para_words)
        # enumerate avoids a linear index() scan, which also miscounts duplicate paragraphs
        position_bonus = 0.3 if idx < 3 else 0
        scored.append((para, overlap + position_bonus))
    scored.sort(key=lambda x: x[1], reverse=True)
    selected = []
    total = 0
    for para, _ in scored:
        if total + len(para) > max_chars:
            break
        selected.append(para)
        total += len(para)
    selected_set = set(selected)
    result = [p for p in paragraphs if p in selected_set]
    return "\n\n".join(result)
Visit tool two‑stage processing

The tool also reuses a generic _check_content_quality() function to detect paywalls or empty pages, returning a status flag (ok/paywall/empty) for the ReAct executor.

4. Scholar Tool: Higher Quality Gate for Academic Sources

Academic search requires stricter relevance and freshness checks. The tool fetches up to 10 raw results, filters out entries without abstracts, scores relevance, and finally returns only the top 3 papers.

import aiohttp
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass
class ScholarResult:
    title: str
    abstract: str
    authors: list[str]
    year: int
    citation_count: int
    venue: str
    open_access_url: Optional[str]
    relevance_score: float

async def scholar_search(query: str, max_results: int = 3) -> list[ScholarResult]:
    """Why default to 3 papers? Their combined abstracts (~700‑1200 chars) give enough context without overwhelming the model."""
    raw = await _call_semantic_scholar(query, limit=10)
    results = []
    for paper in raw:
        abstract = paper.get("abstract", "")
        if not abstract or len(abstract) < 50:
            continue
        relevance = _compute_relevance(query, paper)
        results.append(ScholarResult(
            title=paper.get("title", ""),
            abstract=abstract[:500],
            authors=[a.get("name", "") for a in paper.get("authors", [])[:3]],
            year=paper.get("year", 0),
            citation_count=paper.get("citationCount", 0),
            venue=paper.get("venue", "未知"),
            open_access_url=paper.get("openAccessPdf", {}).get("url"),
            relevance_score=relevance,
        ))
    results.sort(key=lambda x: x.relevance_score, reverse=True)
    return results[:max_results]

def _compute_relevance(query: str, paper: dict) -> float:
    score = 0.0
    title = paper.get("title", "").lower()
    abstract = paper.get("abstract", "").lower()
    query_words = set(query.lower().split())
    title_words = set(title.split())
    title_overlap = len(query_words & title_words) / max(len(query_words), 1)
    score += title_overlap * 0.5
    abstract_words = set(abstract.split())
    abstract_overlap = len(query_words & abstract_words) / max(len(query_words), 1)
    score += abstract_overlap * 0.3
    current_year = datetime.now().year
    year = paper.get("year", 0)
    if year >= current_year - 2:
        score += 0.15
    elif year >= current_year - 5:
        score += 0.05
    if paper.get("openAccessPdf", {}).get("url"):
        score += 0.05
    return min(1.0, score)

def format_scholar_results(results: list[ScholarResult]) -> str:
    if not results:
        return "未找到相关学术文献。"
    lines = [f"找到 {len(results)} 篇相关文献:\n"]
    for i, r in enumerate(results, 1):
        author_str = "、".join(r.authors) if r.authors else "未知作者"
        lines.append(f"[{i}] {r.title}")
        lines.append(f"    {author_str}({r.year}年)| {r.venue} | 被引 {r.citation_count} 次")
        lines.append(f"    摘要:{r.abstract}")
        if r.open_access_url:
            lines.append(f"    全文链接:{r.open_access_url}")
        lines.append("")
    return "\n".join(lines)

When returning results, the tool explicitly marks each entry as an academic source, including year and venue, so the LLM can weigh freshness and authority itself.

5. Python Tool: Mandatory Secure Sandbox

Executing LLM‑generated code poses security risks. The design uses a subprocess with a strict import whitelist, disables dangerous built‑ins, enforces a short timeout, and truncates long outputs.

import subprocess
import tempfile
import os
import sys

ALLOWED_IMPORTS = {"math", "statistics", "json", "re", "datetime", "collections", "itertools", "functools", "numpy", "pandas", "scipy"}
DANGEROUS_PATTERNS = ["import os", "import sys", "import subprocess", "__import__", "eval(", "exec(", "open(", "file(", "socket", "urllib", "requests", "httpx", "shutil", "glob"]

def validate_code(code: str) -> tuple[bool, str]:
    for pattern in DANGEROUS_PATTERNS:
        if pattern in code:
            return False, f"代码包含不允许的操作:{pattern}"
    return True, ""

async def python_execute(code: str, timeout: int = 10) -> str:
    valid, reason = validate_code(code)
    if not valid:
        return f"[代码验证失败:{reason}]"
    with tempfile.NamedTemporaryFile(mode="w", suffix=".py", delete=False) as f:
        restricted_prefix = f"""
import sys
_original_import = __builtins__.__import__
def _safe_import(name, *args, **kwargs):
    if name.split('.')[0] not in {ALLOWED_IMPORTS}:
        raise ImportError(f"不允许导入 {{name}}")
    return _original_import(name, *args, **kwargs)
__builtins__.__import__ = _safe_import

__builtins__.__dict__['open'] = None
__builtins__.__dict__['eval'] = None
__builtins__.__dict__['exec'] = None
"""
        f.write(restricted_prefix + "\n" + code)
        tmp_path = f.name
    try:
        result = subprocess.run([sys.executable, tmp_path], capture_output=True, text=True, timeout=timeout)
        if result.returncode != 0:
            error_lines = result.stderr.strip().split("\n")[-5:]
            return "[执行出错]\n" + "\n".join(error_lines)
        output = result.stdout.strip()
        if len(output) > 2000:
            output = output[:2000] + f"\n[输出已截断,共 {len(output)} 字符]"
        return output if output else "[代码执行完成,无输出]"
    except subprocess.TimeoutExpired:
        return f"[执行超时:超过 {timeout} 秒]"
    finally:
        os.unlink(tmp_path)

Only the last few error lines are sent back to the model to avoid distracting it with irrelevant stack frames.
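The import‑hook trick in the sandbox prefix can be exercised in isolation. This standalone sketch installs the same style of whitelist hook in the current process (the article injects it into the sandboxed script instead) and restores the original hook afterwards:

```python
import builtins

ALLOWED = {"math", "json"}  # toy whitelist for the demo
_orig_import = builtins.__import__

def _safe_import(name, *args, **kwargs):
    # Block any import whose top-level package is not whitelisted.
    if name.split(".")[0] not in ALLOWED:
        raise ImportError(f"import of {name} is not allowed")
    return _orig_import(name, *args, **kwargs)

builtins.__import__ = _safe_import
try:
    import math                      # allowed: passes through to the real import
    allowed_ok = (math.sqrt(9) == 3.0)
    try:
        import socket                # blocked by the hook before any module code runs
        blocked = False
    except ImportError:
        blocked = True
finally:
    builtins.__import__ = _orig_import   # always restore the real hook
```

Note that the hook fires even for modules already cached in `sys.modules`, because the `import` statement always goes through `builtins.__import__` first.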

6. Unified Tool Interface: Registering the Four‑Tool Suite

All tools expose a consistent schema via a ToolRegistry. The ReAct executor calls tools by name without hard‑coding logic.

from typing import Callable, Any
from dataclasses import dataclass

@dataclass
class ToolDefinition:
    name: str
    description: str
    parameters: dict
    fn: Callable

class ToolRegistry:
    """Why a registry? It enables dynamic tool sets per scenario without changing the core loop."""
    def __init__(self):
        self._tools: dict[str, ToolDefinition] = {}
    def register(self, tool_def: ToolDefinition):
        self._tools[tool_def.name] = tool_def
    async def call(self, name: str, args: dict) -> str:
        if name not in self._tools:
            return f"[工具不存在:{name}]"
        try:
            result = await self._tools[name].fn(**args)
            return str(result)
        except Exception as e:
            return f"[工具执行异常:{e}]"
    def get_schema(self) -> list[dict]:
        return [{"name": t.name, "description": t.description, "parameters": t.parameters} for t in self._tools.values()]

def build_research_tools() -> ToolRegistry:
    registry = ToolRegistry()
    registry.register(ToolDefinition(
        name="search",
        description="Search the web for a query and return a concise list of relevant results.",
        parameters={"type": "object", "properties": {"query": {"type": "string", "description": "Search keywords"}}, "required": ["query"]},
        fn=search,
    ))
    registry.register(ToolDefinition(
        name="visit",
        description="Visit a URL and return content relevant to the current research query.",
        parameters={"type": "object", "properties": {"url": {"type": "string", "description": "Target URL"}, "query": {"type": "string", "description": "Research question"}}, "required": ["url", "query"]},
        fn=visit_with_quality_check,
    ))
    registry.register(ToolDefinition(
        name="scholar",
        description="Search academic literature and return ranked paper metadata.",
        parameters={"type": "object", "properties": {"query": {"type": "string", "description": "Research topic"}}, "required": ["query"]},
        fn=scholar_search,
    ))
    registry.register(ToolDefinition(
        name="python",
        description="Execute safe Python code for calculations or data processing.",
        parameters={"type": "object", "properties": {"code": {"type": "string", "description": "Python code, last line should print the result"}}, "required": ["code"]},
        fn=python_execute,
    ))
    return registry
Deep Research tool integration architecture
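As a usage sketch, a condensed copy of the registry wired to a toy `echo` tool (not one of the article's four; error strings translated for the demo) shows the call path the ReAct executor relies on:

```python
import asyncio
from dataclasses import dataclass
from typing import Callable

@dataclass
class ToolDefinition:
    name: str
    description: str
    parameters: dict
    fn: Callable

class ToolRegistry:
    # Condensed copy of the registry above, for illustration only.
    def __init__(self):
        self._tools: dict[str, ToolDefinition] = {}

    def register(self, tool_def: ToolDefinition):
        self._tools[tool_def.name] = tool_def

    async def call(self, name: str, args: dict) -> str:
        if name not in self._tools:
            return f"[unknown tool: {name}]"
        try:
            return str(await self._tools[name].fn(**args))
        except Exception as e:
            return f"[tool error: {e}]"

async def echo(text: str) -> str:
    """Toy tool standing in for search/visit/scholar/python."""
    return f"echo: {text}"

registry = ToolRegistry()
registry.register(ToolDefinition(
    name="echo",
    description="Echo the input back.",
    parameters={"type": "object", "properties": {"text": {"type": "string"}}, "required": ["text"]},
    fn=echo,
))

ok_result = asyncio.run(registry.call("echo", {"text": "hi"}))   # "echo: hi"
miss_result = asyncio.run(registry.call("nope", {}))             # "[unknown tool: nope]"
```

Because unknown tools and tool exceptions both come back as plain strings rather than raised errors, the executor can feed failures straight into the next reasoning step instead of crashing the loop.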

7. How to Answer Tool‑Design Questions in Interviews

When asked about tool design, start with the core problem (30 s): explain that the main difficulty is filtering noise, not just calling APIs. Then briefly cover each tool's key design decisions (1 min): blacklisted domains for Search, mandatory query‑driven extraction for Visit, relevance‑based ranking for Scholar, sandboxed execution for Python, and a unified registry. If the interviewer probes deeper (e.g., why not use containers for Python), mention that subprocess isolation with a whitelist meets over 90% of safety needs while keeping latency low; containers add cold‑start overhead and are reserved for higher‑risk environments.

Conclusion

Combining the minimal ReAct loop, a production‑grade executor, and the four quality‑controlled tools yields a fully functional Deep Research Agent capable of handling real‑world tasks. However, as the number of steps grows, token limits become a bottleneck; future work will replace the linear history with an "Evolving Report" mechanism (IterResearch framework).

Tags: AI agents, ReAct, web scraping, tool design, LLM safety, Python sandbox, search tool, Semantic Scholar
Written by Wu Shixiong's Large Model Academy

We continuously share large‑model know‑how, helping you master core skills—LLM, RAG, fine‑tuning, deployment—from zero to job offer, tailored for career‑switchers, autumn recruiters, and those seeking stable large‑model positions.
