Build a Local End‑to‑End DeepResearch Agent with Alibaba’s 30B MoE Model Using LangGraph
This guide walks through deploying Alibaba's open‑source Tongyi‑DeepResearch 30B MoE model locally, configuring FastAPI and A2A interfaces, implementing a ReAct‑style agent with LangGraph, setting up research tools, and testing the full UI‑API‑Agent pipeline via CLI and Streamlit.
Project Overview
The Tongyi‑DeepResearch open‑source project provides a 30B Mixture‑of‑Experts (MoE) language model fine‑tuned for deep‑research tasks and an accompanying agent framework. It enables a locally deployable alternative to commercial DeepResearch agents.
Model Details
The core model, Tongyi‑DeepResearch‑30B‑A3B, is a MoE model (30B total parameters, roughly 3B activated per token) with a 128K context window. Weights are available on ModelScope and Hugging Face. The model is built on Qwen and uses a custom tokenizer to improve long‑context reasoning, tool‑calling planning, and ReAct‑style interactions.
Agent Architecture
The system consists of two components:
A fine‑tuned 30B MoE LLM.
An agent implementation that supports the standard ReAct paradigm. (The proprietary IterResearch paradigm is not released yet.)
Tool Set
The agent can invoke five built‑in tools, implemented in tool_*.py (a sketch of one such tool follows the list):
search/scholar: Google web search or academic search via the Serper API.
visit: web‑page fetching via the Jina API, with LLM summarisation.
PythonInterpreter: executes Python code inside ByteDance's SandboxFusion Docker sandbox.
parse_file: parses local files using appropriate Python libraries or multimodal models.
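To illustrate the shape of a tool, here is a minimal sketch of a Serper‑backed search function. The function name, return format, and SERPER_API_KEY environment variable are assumptions for illustration, not the repository's actual tool_*.py interface:

```python
import os

import requests

SERPER_URL = "https://google.serper.dev/search"

def search(query: str, num_results: int = 5) -> str:
    """Query Google via the Serper API and return a compact text digest."""
    resp = requests.post(
        SERPER_URL,
        headers={"X-API-KEY": os.environ["SERPER_API_KEY"]},
        json={"q": query, "num": num_results},
        timeout=30,
    )
    resp.raise_for_status()
    results = resp.json().get("organic", [])
    # One "title - link / snippet" entry per result keeps the tool response compact.
    return "\n".join(
        f"{r['title']} - {r['link']}\n{r.get('snippet', '')}" for r in results
    )
```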
Prompt Tokens
The system prompt (in prompt.py) defines four special tokens recognized by the model (a parsing sketch follows the list):
<think/> – internal reasoning.
<tool_call/> – a request to invoke a tool.
<tool_response/> – the result returned from a tool.
<answer/> – the final answer once sufficient information has been gathered.
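To make the token protocol concrete, here is a minimal sketch of extracting these blocks from raw model output. The paired open/close tag form (<think>…</think>) and the regex are assumptions for illustration; the repository's own parsing logic lives alongside prompt.py:

```python
import re

# Assumed paired-tag form; the repository may delimit these blocks differently.
TOKEN_RE = re.compile(r"<(think|tool_call|tool_response|answer)>(.*?)</\1>", re.DOTALL)

def parse_blocks(text: str) -> list[tuple[str, str]]:
    """Return (token, content) pairs in the order the model emitted them."""
    return [(m.group(1), m.group(2).strip()) for m in TOKEN_RE.finditer(text)]

sample = (
    "<think>I should search first.</think>"
    '<tool_call>{"name": "search", "arguments": {"query": "MoE routing"}}</tool_call>'
)
for token, content in parse_blocks(sample):
    print(f"{token}: {content}")
```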
Model Deployment Options
Three ways to obtain an inference endpoint are described:
Download the 60 GB+ weight files from ModelScope or Hugging Face and run a local inference server (vLLM is recommended).
Use OpenRouter's API (free tier limited to 1,000 calls/day).
Use ModelScope's inference API (free, 500 calls/day per model).
For local deployment on macOS, the author used LM Studio with a quantised build (Q5 or higher); on GPU machines, vLLM is the recommended serving stack.
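All of these routes expose an OpenAI‑compatible chat endpoint, so a quick smoke test looks the same everywhere. A sketch assuming a local vLLM server on its default port (swap base_url, api_key, and the model id for LM Studio, OpenRouter, or ModelScope):

```python
from openai import OpenAI

# Assumes vLLM's default OpenAI-compatible endpoint; LM Studio usually
# listens on http://localhost:1234/v1 instead.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="local")

resp = client.chat.completions.create(
    model="Alibaba-NLP/Tongyi-DeepResearch-30B-A3B",  # HuggingFace model id
    messages=[{"role": "user", "content": "In two sentences, what is a MoE model?"}],
    temperature=0.6,
)
print(resp.choices[0].message.content)
```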
ReAct Agent Implementation
The repository provides a ReAct agent under inference/react_agent.py built with the qwen_agent framework. The author rewrote it with LangGraph for clearer workflow control, skipping the pre‑built create_react_agent helper because the model's token‑based protocol does not require native function calling.
The LangGraph workflow performs the following steps (a minimal graph sketch follows the list):
Receive a research query.
Iteratively generate <think/>, <tool_call/>, and <tool_response/> messages.
Terminate when a timeout, max‑call limit, context overflow, or an <answer/> token is produced.
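Below is a minimal sketch of such a graph with stubbed model and tool nodes. The node bodies, the paired <answer>…</answer> tag check, and the MAX_CALLS constant are assumptions for illustration (timeout and context‑overflow checks are omitted); the actual workflow lives in the rewritten agent:

```python
from typing import TypedDict

from langgraph.graph import StateGraph, START, END

MAX_CALLS = 20  # max-call limit; also curbs repetitive thinking loops

class AgentState(TypedDict):
    messages: list[dict]  # accumulated <think>/<tool_call>/<tool_response> turns
    calls: int            # tool-call counter checked at each routing step

def call_model(state: AgentState) -> dict:
    # Stand-in for prompting the 30B model for the next block.
    reply = {"role": "assistant", "content": "<answer>stub</answer>"}
    return {"messages": state["messages"] + [reply]}

def run_tool(state: AgentState) -> dict:
    # Stand-in for dispatching to search/visit/PythonInterpreter/parse_file.
    obs = {"role": "tool", "content": "<tool_response>stub</tool_response>"}
    return {"messages": state["messages"] + [obs], "calls": state["calls"] + 1}

def route(state: AgentState) -> str:
    last = state["messages"][-1]["content"]
    if "<answer>" in last or state["calls"] >= MAX_CALLS:
        return "finish"  # answer produced or max-call limit reached
    return "tools"

graph = StateGraph(AgentState)
graph.add_node("model", call_model)
graph.add_node("tools", run_tool)
graph.add_edge(START, "model")
graph.add_conditional_edges("model", route, {"tools": "tools", "finish": END})
graph.add_edge("tools", "model")
app = graph.compile()

print(app.invoke({"messages": [{"role": "user", "content": "query"}], "calls": 0}))
```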
A simplified streaming API is provided:
```python
async def stream(self, query: str, context_id: str) -> AsyncIterable[Dict[str, Any]]:
    """Stream the workflow, yielding intermediate states such as
    "working", "completed", or "failed".

    Args:
        query (str): User's deep-research task.
        context_id (str): Conversation ID for session continuity.

    Returns:
        AsyncIterable[Dict[str, Any]]: Async iterator yielding dictionaries
        with status and content.
    """
    ...
```
A2A Server and Client
To expose the agent to front‑ends, an A2A (Agent‑to‑Agent) server and matching client are implemented (a raw‑request sketch follows the list):
Server: a2a_server.py runs a FastAPI service exposing the agent via the A2A protocol.
CLI client: a2a_client.py provides an interactive command‑line interface.
Streamlit UI: a2a_streamlit_ui.py offers a web UI for starting tasks, monitoring streaming updates, and displaying final results.
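For a protocol‑level smoke test without the bundled client, you can POST a JSON‑RPC request straight at the server. Treat this as a sketch only: the port, the method name (message/send in the current A2A spec; early versions used tasks/send), and the part field names are assumptions, so check a2a_server.py and prefer a2a_client.py in practice:

```python
import json
import uuid

import requests

# Assumed endpoint and payload shape; verify both against a2a_server.py.
payload = {
    "jsonrpc": "2.0",
    "id": str(uuid.uuid4()),
    "method": "message/send",
    "params": {
        "message": {
            "role": "user",
            "parts": [{"kind": "text", "text": "Survey recent MoE routing methods"}],
            "messageId": str(uuid.uuid4()),
        }
    },
}

resp = requests.post("http://localhost:10000/", json=payload, timeout=300)
print(json.dumps(resp.json(), indent=2))
```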
Run the components with:
```bash
python a2a_server.py
python a2a_client.py
streamlit run a2a_streamlit_ui.py --server.port 8501
```

Testing and Observations
Quantised models may omit the <answer/> token, causing the agent to never signal completion.
ReAct works well for Q&A style tasks but struggles with long‑form research reports; the unreleased IterResearch paradigm aims to address this.
Repetitive thinking loops can be mitigated by setting a maximum iteration count.
OpenRouter responses differ slightly in formatting and require custom handling.
Repository
Full source code:
https://github.com/pingcy/TongyiDeepResearch-LangGraph