How Tongyi DeepResearch Turns Chatty AI into a Research Powerhouse
Tongyi DeepResearch is an open‑source model and agent framework that achieves state‑of‑the‑art results on multiple Deep Research benchmarks. It pairs fully open‑source models, frameworks, and data pipelines with novel agentic pre‑training, fine‑tuning, and reinforcement‑learning methods, enabling complex multi‑step reasoning in real‑world applications.
Tongyi DeepResearch has been launched as a fully open‑source AI system that moves models from "chatting" to "researching," achieving state‑of‑the‑art results on several Deep Research benchmarks and matching or surpassing leading overseas models.
The project releases open‑source models, frameworks, and solutions, making deep research productivity accessible to everyone.
1 Data Strategy: Synthetic Data for Scalable Pre‑training
The team designed a multi‑stage data strategy that generates high‑quality training data without costly human annotation. Incremental pre‑training (Agentic CPT) creates a virtuous loop of data synthesis, while action synthesis produces planning, reasoning, and decision actions at scale.
Data reorganization and question construction: collected knowledge documents, web crawls, knowledge graphs, and tool‑call traces are used to build an entity‑anchored open‑world memory and generate diverse QA pairs.
Action synthesis: three action types (planning, reasoning, decision) are generated from multi‑style questions and trajectory data, eliminating the need for external API calls.
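The entity‑anchored construction step can be pictured as indexing facts by entity and then templating questions over them. The sketch below is purely illustrative: the triples, templates, and function names are invented here and are not the team's pipeline.

```python
# Illustrative sketch: build an entity-anchored memory from fact triples,
# then generate simple single-hop QA pairs from it. All names and
# templates are hypothetical, not the actual Tongyi DeepResearch code.
from collections import defaultdict

def build_memory(triples):
    """Index (subject, relation, object) facts by their subject entity."""
    memory = defaultdict(list)
    for subj, rel, obj in triples:
        memory[subj].append((rel, obj))
    return memory

def generate_qa(memory):
    """Emit templated question/answer pairs from the entity-anchored memory."""
    for entity, facts in memory.items():
        for rel, obj in facts:
            yield (f"What is the {rel} of {entity}?", obj)
```

In the real system the same anchoring idea extends to multi‑hop chains and much richer question styles; this sketch only shows the single‑hop base case.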
2 Reasoning Modes
Tongyi DeepResearch supports two inference modes:
2.1 ReAct Mode
Standard ReAct (think‑act‑observe) with a 128K context window enables extensive interaction rounds without prompt engineering.
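The think‑act‑observe cycle can be sketched as a simple loop. Everything below (`llm`, `tools`, the `Action:`/`Final Answer:` parsing convention) is a hypothetical stand‑in for illustration, not the Tongyi DeepResearch interface.

```python
# Minimal ReAct (think-act-observe) loop. `llm` and `tools` are
# hypothetical callables; the text protocol shown is illustrative only.

def react_loop(question: str, llm, tools: dict, max_rounds: int = 20) -> str:
    """Run think-act-observe rounds until the model emits a final answer."""
    history = [f"Question: {question}"]
    for _ in range(max_rounds):
        # The model produces a thought plus either an action or a final answer.
        step = llm("\n".join(history))
        history.append(step)
        if "Final Answer:" in step:
            return step.split("Final Answer:", 1)[1].strip()
        # Parse "Action: tool_name[argument]" and execute the tool.
        action = step.split("Action:", 1)[1].strip()
        name, arg = action.split("[", 1)
        observation = tools[name.strip()](arg.rstrip("]"))
        history.append(f"Observation: {observation}")
    return "No answer within the round budget."
```

With a 128K context window, the appended history can span many such rounds before truncation becomes a concern.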
2.2 Heavy Mode
The "Heavy Mode" (IterResearch paradigm) tackles extremely complex multi‑step research tasks by decomposing them into iterative research rounds, maintaining a focused workspace and integrating findings into a core report.
The Research‑Synthesis framework allows parallel IterResearch agents to explore the same problem and combine their conclusions for higher accuracy.
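The parallel‑exploration idea reduces to fan‑out plus a synthesis step, as in this minimal sketch; `run_agent` and `synthesize` are hypothetical stand‑ins for the agent and the answer‑combining model.

```python
# Sketch of Research-Synthesis: several independent agents explore the
# same question in parallel, and a synthesizer merges their reports.
from concurrent.futures import ThreadPoolExecutor

def research_synthesis(question, run_agent, synthesize, n_agents: int = 3):
    """Explore the same problem with n parallel agents, then merge answers."""
    with ThreadPoolExecutor(max_workers=n_agents) as pool:
        reports = list(pool.map(run_agent, [question] * n_agents))
    return synthesize(question, reports)
```

Even a simple majority vote over agent answers as the `synthesize` step tends to improve accuracy over a single run, which is the intuition behind the framework.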
3 Training Paradigm
The end‑to‑end training pipeline links Agentic CPT → Agentic SFT → Agentic RL. Reinforcement learning uses a customized GRPO algorithm with token‑level policy gradient loss, leave‑one‑out variance reduction, and selective negative‑sample filtering.
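The leave‑one‑out variance reduction mentioned above can be written in a few lines: each trajectory's baseline is the mean reward of the *other* trajectories sampled for the same prompt, so no learned critic is needed. This is an illustrative sketch of the standard leave‑one‑out estimator, not the team's implementation.

```python
# Leave-one-out advantage estimation for a group of sampled trajectories.
# Each trajectory's baseline is the mean reward of its group-mates.

def leave_one_out_advantages(rewards: list[float]) -> list[float]:
    """Return advantage = reward - mean(rewards of the other samples)."""
    n = len(rewards)
    total = sum(rewards)
    return [r - (total - r) / (n - 1) for r in rewards]
```

A useful sanity check is that the advantages always sum to zero within a group, so the update pushes probability toward above‑average trajectories and away from below‑average ones.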
Dynamic metrics show rising rewards and high policy entropy, indicating sustained exploration without premature convergence.
Key infrastructure includes a simulated offline training environment, a stable tool sandbox, automated data management with continuous data synthesis, and an asynchronous RL framework built on rLLM.
4 Real‑World Applications
Tongyi DeepResearch powers several Alibaba internal applications, such as the Gaode Travel Agent for complex map and local‑life queries, and Tongyi LawAI (法睿) for legal question answering, contract review, and case analysis, leveraging the agentic architecture and iterative planning.
Extensive research papers detail the Deep Research Agent family, covering benchmarks like WebWalker, WebSailor, WebShaper, and more.
Over the past six months the team has released monthly technical reports, and today six new reports and the Tongyi DeepResearch‑30B‑A3B model are open‑sourced.
Homepage: https://tongyi-agent.github.io/
Blog: https://tongyi-agent.github.io/blog/introducing-tongyi-deep-research/
Github: https://github.com/Alibaba-NLP/DeepResearch
Hugging Face: https://huggingface.co/Alibaba-NLP/Tongyi-DeepResearch-30B-A3B
ModelScope: https://modelscope.cn/models/iic/Tongyi-DeepResearch-30B-A3B
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.