Build a Local End‑to‑End DeepResearch Agent with Alibaba’s 30B MoE Model Using LangGraph

This guide walks through deploying Alibaba's open‑source Tongyi‑DeepResearch 30B MoE model locally, configuring FastAPI and A2A interfaces, implementing a ReAct‑style agent with LangGraph, setting up research tools, and testing the full UI‑API‑Agent pipeline via CLI and Streamlit.

Project Overview

The Tongyi‑DeepResearch open‑source project provides a 30B Mixture‑of‑Experts (MoE) language model fine‑tuned for deep‑research tasks and an accompanying agent framework. It enables a locally deployable alternative to commercial DeepResearch agents.

Model Details

The core model, Tongyi‑DeepResearch‑30B‑A3B, is a MoE model with 30B total parameters (roughly 3B activated per inference) and a 128K context window. Weights are available on ModelScope and HuggingFace. The model is built on Qwen and uses a custom tokenizer to improve long‑context reasoning, tool‑calling planning, and ReAct‑style interactions.

Agent Architecture

The system consists of two components:

A fine‑tuned 30 B MoE LLM.

An agent implementation that supports the standard ReAct paradigm. (The proprietary IterResearch paradigm is not released yet.)

Tool Set

The agent can invoke five built‑in tools (implemented in tool_*.py):

search / scholar: Google web search or academic (Scholar) search via the Serper API.

visit: Web‑page fetching via the Jina API, with LLM summarisation.

PythonInterpreter: Executes Python code inside ByteDance's SandboxFusion Docker container.

parse_file: Parses local files using appropriate Python libraries or multimodal models.
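The dispatch from a parsed tool name to its implementation can be pictured as a simple registry. This sketch uses hypothetical stubs in place of the real tool_*.py code; the function bodies and return strings are illustrative, not the project's:

```python
from typing import Any, Callable, Dict

# Hypothetical stubs standing in for the real tool_*.py implementations.
def search(query: str) -> str:
    return f"[serper results for: {query}]"

def visit(url: str) -> str:
    return f"[jina summary of: {url}]"

TOOLS: Dict[str, Callable[..., str]] = {
    "search": search,
    "visit": visit,
}

def dispatch(name: str, **kwargs: Any) -> str:
    """Route a parsed tool call to the matching tool, or report an unknown name."""
    tool = TOOLS.get(name)
    if tool is None:
        return f"Error: unknown tool '{name}'"
    return tool(**kwargs)
```

A dictionary registry keeps adding a new tool to a one-line change, which matches the per-file layout the article describes.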

Prompt Tokens

The system prompt (in prompt.py) defines four special tokens recognized by the model:

<think/> – internal reasoning.

<tool_call/> – request to invoke a tool.

<tool_response/> – result returned from the tool.

<answer/> – final answer after sufficient information is gathered.
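Assuming the paired form of these tags (`<tool_call>…</tool_call>`, `<answer>…</answer>`, and so on), pulling a tool request or final answer out of a model reply is a small regex exercise. The `extract_tag` helper and the sample reply below are illustrative, not code from the repository:

```python
import re
from typing import Optional

def extract_tag(text: str, tag: str) -> Optional[str]:
    """Return the content of the first <tag>...</tag> block, or None if absent."""
    m = re.search(rf"<{tag}>(.*?)</{tag}>", text, re.DOTALL)
    return m.group(1).strip() if m else None

reply = '<think>Need a source.</think><tool_call>{"name": "search"}</tool_call>'
extract_tag(reply, "tool_call")  # the JSON payload of the tool request
extract_tag(reply, "answer")     # None: the model has not finished yet
```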

Model Deployment Options

Three ways to obtain an inference endpoint are described:

Download the 60 GB+ weight files from ModelScope or HuggingFace and run a local inference server (vLLM is recommended).

Use OpenRouter's API (free tier limited to 1,000 calls/day).

Use ModelScope’s inference API (free, 500 calls/day per model).

For local deployment on macOS, the author used LM Studio with a quantised version (Q5 or higher) as well as vLLM.

ReAct Agent Implementation

The repository provides a ReAct agent under inference/react_agent.py built with the qwen_agent framework. The author rewrote it using LangGraph for clearer workflow control, avoiding the pre‑built create_react_agent helper because function calling is unnecessary.

The LangGraph workflow performs the following steps:

Receive a research query.

Iteratively generate <think/>, <tool_call/>, and <tool_response/> messages.

Terminate on timeout, on reaching the max‑call limit, on context overflow, or when an <answer/> token is produced.
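The steps above can be condensed into a plain-Python loop. The limits, the `llm_step` callable, and the token handling here are illustrative stand-ins for what the LangGraph graph actually encodes:

```python
import time
from typing import Callable

MAX_TOOL_CALLS = 20      # illustrative limits, not the repository's values
TIMEOUT_SECONDS = 600

def run_react_loop(llm_step: Callable[[str], str], query: str) -> str:
    """Condensed view of the iterate-until-answer loop the workflow implements."""
    transcript = query
    start = time.monotonic()
    for _ in range(MAX_TOOL_CALLS):
        if time.monotonic() - start > TIMEOUT_SECONDS:
            return "Terminated: timeout"
        reply = llm_step(transcript)
        transcript += reply
        if "<answer>" in reply:  # the final-answer token ends the loop
            return reply.split("<answer>")[1].split("</answer>")[0]
        # Otherwise: parse <tool_call/>, run the tool, append <tool_response/>.
    return "Terminated: max tool calls reached"
```

In the actual LangGraph version these branches become conditional edges between nodes, which is what makes the control flow easier to inspect than the qwen_agent original.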

A simplified streaming API is provided:

from typing import Any, AsyncIterable, Dict

async def stream(self, query: str, context_id: str) -> AsyncIterable[Dict[str, Any]]:
    """Stream the workflow, yielding intermediate states such as "working", "completed", or "failed".

    Args:
        query (str): User's deep-research task.
        context_id (str): Conversation ID for session continuity.

    Yields:
        Dict[str, Any]: Dictionaries carrying the current status and content.
    """
    ...
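A caller consumes this interface with `async for`. The `StubAgent` below is a stand-in that only mimics the contract; the class name and the exact status strings are assumptions based on the docstring above:

```python
import asyncio
from typing import Any, AsyncIterable, Dict

class StubAgent:
    """Stand-in that mimics the stream() contract sketched above."""
    async def stream(self, query: str, context_id: str) -> AsyncIterable[Dict[str, Any]]:
        yield {"status": "working", "content": "searching..."}
        yield {"status": "completed", "content": f"Report on {query}"}

async def main() -> str:
    agent = StubAgent()
    final = ""
    async for update in agent.stream("MoE routing", context_id="demo-1"):
        if update["status"] == "completed":
            final = update["content"]
    return final

asyncio.run(main())  # → "Report on MoE routing"
```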

A2A Server and Client

To expose the agent to front‑ends, an A2A (Agent‑to‑Agent) server and matching client are implemented:

Server: a2a_server.py runs a FastAPI service exposing the agent via the A2A protocol.

CLI client: a2a_client.py provides an interactive command‑line interface.

Streamlit UI: a2a_streamlit_ui.py offers a web UI for starting tasks, monitoring streaming updates, and displaying final results.

Run the components with:

python a2a_server.py
python a2a_client.py
streamlit run a2a_streamlit_ui.py --server.port 8501

Testing and Observations

Quantised models may omit the <answer/> token, causing the agent to never signal completion.

ReAct works well for Q&A style tasks but struggles with long‑form research reports; the unreleased IterResearch paradigm aims to address this.

Repetitive thinking loops can be mitigated by setting a maximum iteration count.

OpenRouter responses differ slightly in formatting and require custom handling.
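One way to cope with a quantised model that never emits <answer/> is a best-effort fallback once the iteration budget runs out: promote the last reasoning block to an answer. This helper is a hedged sketch, not how the repository handles it:

```python
import re
from typing import List

def fallback_answer(replies: List[str]) -> str:
    """Prefer a real <answer> block; otherwise fall back to the last <think> content."""
    for reply in reversed(replies):
        m = re.search(r"<answer>(.*?)</answer>", reply, re.DOTALL)
        if m:
            return m.group(1).strip()
    for reply in reversed(replies):
        m = re.search(r"<think>(.*?)</think>", reply, re.DOTALL)
        if m:
            return m.group(1).strip() + " (best-effort: no <answer> emitted)"
    return "No answer produced."
```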

Repository

Full source code:

https://github.com/pingcy/TongyiDeepResearch-LangGraph

Tags: LLM · ReAct · deployment · A2A · LangGraph · DeepResearch
Written by

AI Large Model Application Practice

Focused on deep research and development of large-model applications. Authors of "RAG Application Development and Optimization Based on Large Models" and "MCP Principles Unveiled and Development Guide". Primarily B2B, with B2C as a supplement.
