How Tongyi DeepResearch Turns Chatty AI into a Research Powerhouse

Tongyi DeepResearch is an open‑source AI model and framework that achieves state‑of‑the‑art (SOTA) results on multiple Deep Research benchmarks. Its models, frameworks, and data pipelines are all open‑source, and it introduces agentic pre‑training, fine‑tuning, and reinforcement‑learning methods that enable complex multi‑step reasoning and real‑world applications.

DataFunTalk

Tongyi DeepResearch has been launched as a fully open‑source AI system that moves from "chatting" to "research" capabilities, achieving state‑of‑the‑art results on several Deep Research benchmarks and matching or surpassing leading overseas models.

The project releases open‑source models, frameworks, and solutions, making deep research productivity accessible to everyone.

1 Data Strategy: Synthetic Data for Scalable Pre‑training

The team designed a multi‑stage data strategy that generates high‑quality training data without costly human annotation. In agentic continual pre‑training (Agentic CPT), data synthesis forms a virtuous loop, while action synthesis produces planning, reasoning, and decision actions at scale.

Data reorganization and question construction: Collected knowledge documents, web crawls, knowledge graphs, and tool‑call traces are used to build an entity‑anchored open‑world memory and to generate diverse QA pairs.

Action synthesis: Three action types (planning, reasoning, decision) are generated from multi‑style questions and trajectory data, eliminating the need for external API calls.
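As a rough illustration of the first step, the sketch below groups facts by the entity they mention and turns each fact into a templated QA pair. The names `build_entity_memory` and `make_qa_pairs`, the fact-triple format, and the question template are all hypothetical; the article does not describe the actual data schema or generation method.

```python
from collections import defaultdict

def build_entity_memory(documents):
    """Group facts by entity (a hypothetical 'entity-anchored' memory)."""
    memory = defaultdict(list)
    for doc in documents:
        for entity, relation, value in doc["facts"]:
            memory[entity].append((relation, value, doc["source"]))
    return memory

def make_qa_pairs(memory):
    """Turn each stored fact into a templated question/answer pair."""
    pairs = []
    for entity, facts in memory.items():
        for relation, value, source in facts:
            pairs.append({
                "question": f"What is the {relation} of {entity}?",
                "answer": value,
                "source": source,
            })
    return pairs

# Toy inputs standing in for documents, crawls, and tool-call traces.
docs = [
    {"source": "doc-1", "facts": [("Hangzhou", "province", "Zhejiang")]},
    {"source": "crawl-7", "facts": [("Hangzhou", "famous lake", "West Lake"),
                                     ("Alibaba", "headquarters", "Hangzhou")]},
]
memory = build_entity_memory(docs)
qa_pairs = make_qa_pairs(memory)
```

Anchoring on entities lets facts from different sources accumulate under one key, which is one plausible way to compose questions that span several documents.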

2 Reasoning Modes

Tongyi DeepResearch supports two inference modes:

2.1 ReAct Mode

Standard ReAct (think‑act‑observe) with a 128K context window enables extensive interaction rounds without prompt engineering.
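The think‑act‑observe cycle can be sketched as a simple loop in which every thought, action, and observation is appended to one growing context. This is a minimal sketch, not the project's implementation: `react_loop`, `toy_policy`, and `toy_tools` are hypothetical stand‑ins, and a real agent would call the model and real tools instead.

```python
def react_loop(question, policy, tools, max_rounds=8):
    """Run think-act-observe rounds over a single growing context."""
    context = [("question", question)]
    for _ in range(max_rounds):
        thought, action, argument = policy(context)
        context.append(("thought", thought))
        if action == "answer":
            return argument, context
        observation = tools[action](argument)
        context.append(("action", f"{action}({argument!r})"))
        context.append(("observation", observation))
    return None, context  # round budget exhausted without an answer

def toy_policy(context):
    """Stand-in for the model: search once, then answer with the result."""
    observations = [text for kind, text in context if kind == "observation"]
    if not observations:
        return "I need to look this up.", "search", context[0][1]
    return "I have enough information.", "answer", observations[-1]

toy_tools = {"search": lambda query: f"stub result for: {query}"}

answer, trace = react_loop("capital of France", toy_policy, toy_tools)
```

Because the context only grows, the number of rounds is bounded by the context window, which is why a long window matters for this mode.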

2.2 Heavy Mode

The "Heavy Mode" (IterResearch paradigm) tackles extremely complex multi‑step research tasks by decomposing them into iterative research rounds, maintaining a focused workspace and integrating findings into a core report.

The Research‑Synthesis framework allows parallel IterResearch agents to explore the same problem and combine their conclusions for higher accuracy.
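The sketch below illustrates the idea under stated assumptions: each round sees only the question, the evolving report, and the most recent observation, rather than the full history. The function names, the workspace fields, and the majority‑vote combination rule are all hypothetical; the article does not specify how conclusions are merged, and the agents here run sequentially for simplicity.

```python
from collections import Counter

def iter_research(question, step, max_rounds=5):
    """Each round rebuilds a small workspace instead of keeping all history."""
    report, last_observation = "", None
    for round_index in range(max_rounds):
        workspace = {"question": question, "report": report,
                     "last_observation": last_observation,
                     "round": round_index}
        report, last_observation, done = step(workspace)
        if done:
            break
    return report

def synthesize(reports):
    """Combine agents' conclusions; majority vote is an assumption here."""
    return Counter(reports).most_common(1)[0][0]

def make_toy_step(final_answer):
    """Stand-in for the model: two rounds of notes, then a conclusion."""
    def step(workspace):
        if workspace["round"] < 2:
            note = f"finding {workspace['round']}"
            return workspace["report"] + note + "; ", note, False
        return final_answer, None, True
    return step

reports = [iter_research("toy question", make_toy_step(a))
           for a in ["A", "A", "B"]]
final = synthesize(reports)
```

Keeping the workspace small means the per‑round input stays roughly constant in size, so the number of rounds is not tied to the context window as it is in ReAct Mode.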

3 Training Paradigm

The end‑to‑end training pipeline links Agentic CPT → Agentic SFT → Agentic RL. Reinforcement learning uses a customized GRPO algorithm with token‑level policy gradient loss, leave‑one‑out variance reduction, and selective negative‑sample filtering.
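To make the three ingredients concrete, here is a minimal sketch in plain Python. It assumes a leave‑one‑out baseline of the form A_i = r_i minus the mean reward of the other rollouts in the group, a loss averaged over all kept tokens rather than per sequence, and a boolean `keep` mask standing in for negative‑sample filtering. The filtering criterion and function names are hypothetical, and the sketch omits any clipping or importance ratio the actual algorithm may use.

```python
def leave_one_out_advantages(rewards):
    """A_i = r_i minus the mean reward of the other rollouts in the group."""
    total, n = sum(rewards), len(rewards)
    return [r - (total - r) / (n - 1) for r in rewards]

def token_level_pg_loss(token_logprobs, advantages, keep):
    """Average -A_i * log pi over all kept tokens, not per sequence."""
    weighted_sum, token_count = 0.0, 0
    for logprobs, advantage, kept in zip(token_logprobs, advantages, keep):
        if not kept:
            continue  # filtered sample contributes no gradient
        weighted_sum += sum(-advantage * lp for lp in logprobs)
        token_count += len(logprobs)
    return weighted_sum / token_count if token_count else 0.0

# Four rollouts for one prompt: two correct (reward 1), two incorrect.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = leave_one_out_advantages(rewards)

# Per-token log-probabilities of each rollout under the current policy.
token_logprobs = [[-0.5, -0.5], [-1.0], [-2.0, -2.0, -2.0], [-0.1]]
keep = [True, True, False, True]  # hypothetical: drop the third rollout
loss = token_level_pg_loss(token_logprobs, advantages, keep)
```

Dividing by the total token count gives every token equal weight, so long trajectories are not down‑weighted the way they would be under per‑sequence averaging.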

Training dynamics show steadily rising rewards and consistently high policy entropy, indicating sustained exploration without premature convergence.

Key infrastructure includes a simulated offline training environment, a stable tool sandbox, automated data management with continuous data synthesis, and an asynchronous RL framework built on rLLM.

4 Real‑World Applications

Tongyi DeepResearch powers several Alibaba internal applications, such as the Gaode Travel Agent for complex map and local‑life queries, and Tongyi LawAI (法睿) for legal question answering, contract review, and case analysis. Both build on the agentic architecture and iterative planning.

A series of research papers details the Deep Research Agent family, including WebWalker, WebSailor, WebShaper, and more.

Over the past six months the team has released monthly technical reports, and today six new reports and the Tongyi DeepResearch‑30B‑A3B model are open‑sourced.

Homepage: https://tongyi-agent.github.io/

Blog: https://tongyi-agent.github.io/blog/introducing-tongyi-deep-research/

GitHub: https://github.com/Alibaba-NLP/DeepResearch

Hugging Face: https://huggingface.co/Alibaba-NLP/Tongyi-DeepResearch-30B-A3B

ModelScope: https://modelscope.cn/models/iic/Tongyi-DeepResearch-30B-A3B
