
How Multi‑Agent Planning Boosts Copilot 3.0 with DeepSeek R1 GRPO Training

This article examines Copilot 3.0’s planning module, explains how DeepSeek R1’s GRPO reinforcement‑learning pipeline enables flexible multi‑agent orchestration, addresses the limitations of Copilot 2.0, and presents experimental results that show a 61% reduction in reasoning length and a 9% relative gain in accuracy.

Alibaba Cloud Developer

May 26, 2025

Background

Starting from the Planning module of the Copilot 3.0 architecture, and drawing on the GRPO reinforcement-learning approach used for DeepSeek R1, this article explores how a multi-agent system can flexibly orchestrate multiple agents to solve real-world problems.

Business Scenario

Basic business support: merchant onboarding, product signing, operation tool usage, etc.

Operation services: settlement, data analysis, strategy recommendation, and more.

Intelligent optimization: keyword configuration, banner generation, product image enhancement, etc.

Users can experience the Copilot assistant by searching “Alipay merchant assistant” in the Alipay app; currently it supports only Alipay‑registered merchants.

Core Capabilities Required

Full‑network search and natural‑language answering

Business data analysis and visualization

Platform strategy intelligent matching

Image material generation and optimization

Precise user‑group selection

Problem Analysis

A single LLM architecture cannot balance business needs with general capabilities.

Sub‑modules such as intent recognition, query rewriting, and task planning are limited.

Handling of complex queries is inefficient.

These challenges motivated a major upgrade to Copilot 3.0 in calendar year 2025 (CY25). The new version adopts a Multi-Agent architecture in which a planning model schedules intelligent agents, greatly improving problem-solving ability.

Role of Planning

Planning works like solving a complex permutation problem: after fully understanding the user query (context), it distributes the problem to one or more expert agents, covering rewriting, decomposition, allocation, and execution ordering. The approach uses Chain‑of‑Thought (CoT) prompting; Copilot 2.0’s short‑CoT mode proved effective, so the same method is retained and the model is upgraded to explicitly output its reasoning process.

Challenges

Having both "the fish and the bear's paw" (a Chinese idiom for wanting two things that are hard to get together): high accuracy on complex queries and fast, low-cost reasoning on simple ones.

Long reasoning processes can stall inference and produce repetitive output.

Labeling high‑quality reasoning data is costly; many annotators lack a Chinese‑language background.

Frequent business changes (e.g., increasing the number of agents) require re‑annotation of historical data.

Effectiveness Comparison

After GRPO training, the average reasoning length dropped from 240.29 tokens to 93.28 tokens (a 61.2% reduction), and the standard deviation fell from 77.30 to 26.11, indicating much more stable outputs. Accuracy increased from 78.7% to 86.1%, an absolute gain of 7.4 points and a relative improvement of 9.4%. These results were obtained on a batch of 3,217 annotated queries.
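
The percentages above follow directly from the quoted before/after figures. As a quick check, the short script below recomputes them; the variable names are introduced here for illustration only.

```python
# Figures quoted in the text above.
len_before, len_after = 240.29, 93.28   # mean reasoning length (tokens)
acc_before, acc_after = 78.7, 86.1      # accuracy (%)

# Relative reduction in reasoning length.
length_reduction = (len_before - len_after) / len_before * 100
# Absolute gain is in percentage points; relative gain is measured against the baseline.
absolute_gain = acc_after - acc_before
relative_gain = absolute_gain / acc_before * 100

print(f"length reduction: {length_reduction:.1f}%")     # length reduction: 61.2%
print(f"absolute gain:    {absolute_gain:.1f} points")  # absolute gain:    7.4 points
print(f"relative gain:    {relative_gain:.1f}%")        # relative gain:    9.4%
```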

Solution Overview

1. Data Set Construction

Synthetic data (reasoning process plus planning result) is generated with DeepSeek R1, and the short samples are selected for the SFT stage to teach the model the required output format. Human-annotated planning results are also created for GRPO training to ensure consistency.
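
The article does not show the filtering step, so the following is only a minimal sketch of what selecting short synthetic samples could look like. The field names `think` and `answer`, the `max_think_chars` threshold, and the extra check that the answer parses as JSON are all assumptions made for this illustration.

```python
import json
from typing import Dict, List


def filter_short_samples(
    samples: List[Dict[str, str]], max_think_chars: int
) -> List[Dict[str, str]]:
    """Keep samples with short reasoning and a parseable JSON answer.

    `think`, `answer` and `max_think_chars` are hypothetical names;
    the source does not specify the schema or the length threshold.
    """
    kept = []
    for sample in samples:
        think = sample.get("think", "")
        # Drop samples with no reasoning or with reasoning above the limit.
        if not think or len(think) > max_think_chars:
            continue
        # Assumed extra check: the planning result must be valid JSON.
        try:
            json.loads(sample.get("answer", ""))
        except json.JSONDecodeError:
            continue
        kept.append(sample)
    return kept


demo = [
    {"think": "short reasoning", "answer": '{"plan": []}'},
    {"think": "x" * 500, "answer": '{"plan": []}'},
    {"think": "short reasoning", "answer": "not json"},
]
print(len(filter_short_samples(demo, max_think_chars=100)))  # 1
```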

2. Multi‑Stage Training (SFT + GRPO)

SFT is treated as “exam‑oriented” training, while GRPO provides “quality‑oriented” reinforcement, allowing the model to explore better answers within a defined reward range.
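
For readers unfamiliar with GRPO: for each prompt it samples a group of completions, scores each one with the reward functions, and normalizes the rewards within that group so that no separate value model is needed. The sketch below illustrates only that normalization step. It is not the ms-swift implementation; the function name and the `eps` constant are assumptions introduced here, and the KL penalty weighted by beta (section 2.1) is omitted.

```python
from statistics import fmean, pstdev
from typing import List, Sequence


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-6) -> List[float]:
    """Normalize rewards of one prompt's completion group to zero mean, unit scale.

    Uses the population standard deviation; implementations differ on this detail.
    `eps` (hypothetical) avoids division by zero when all rewards are equal.
    """
    mean = fmean(rewards)
    std = pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Four sampled completions for the same query, scored by the combined reward.
advantages = group_relative_advantages([0.2, 0.5, 0.9, 0.4])
# The best completion gets the largest positive advantage,
# the worst gets the most negative one.
print([round(a, 2) for a in advantages])
```

Completions scoring above their group average are reinforced and those below are discouraged, which matches the idea of exploring better answers within a defined reward range.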

2.1 Training Configuration

Base model: QwQ‑32B

GPU: 3 machines × 24 × A100

Key hyper-parameters: learning rate (lr) and KL-divergence weight (beta)

Framework: ModelScope’s ms-swift

2.2 Reward Function Design

The reward system combines seven functions with weighted averaging, e.g.:

Reward = 0.1*StrictFormatReward + 0.1*JSONValidReward + 0.1*ThinkLengthReward + 0.1*ThinkQualityReward + 0.2*CorrectnessReward + 0.3*ExpertValidationReward + 0.1*ProcessingQualityReward
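
The weights in this formula sum to 1.0, so the combined reward stays in [0, 1] whenever each component does. A minimal sketch of the combination follows; `REWARD_WEIGHTS` and `combine_rewards` are names introduced here for illustration, not taken from the original code.

```python
from typing import Dict

# Weights copied from the formula above.
REWARD_WEIGHTS: Dict[str, float] = {
    "StrictFormatReward": 0.1,
    "JSONValidReward": 0.1,
    "ThinkLengthReward": 0.1,
    "ThinkQualityReward": 0.1,
    "CorrectnessReward": 0.2,
    "ExpertValidationReward": 0.3,
    "ProcessingQualityReward": 0.1,
}


def combine_rewards(scores: Dict[str, float]) -> float:
    """Weighted sum of per-function scores for a single completion."""
    missing = set(REWARD_WEIGHTS) - set(scores)
    if missing:
        raise KeyError(f"missing reward scores: {sorted(missing)}")
    return sum(weight * scores[name] for name, weight in REWARD_WEIGHTS.items())


# A completion that is perfect except for the strict format check.
scores = {name: 1.0 for name in REWARD_WEIGHTS}
scores["StrictFormatReward"] = 0.0
print(round(combine_rewards(scores), 2))  # 0.9
```

With this weighting, choosing the right expert (0.3) and rewriting the query correctly (0.2) account for half of the signal, while the four format and reasoning checks contribute 0.1 each.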

Key rewards include:

StrictFormatReward: validates XML-like tags using a regular expression.

JSONValidReward: checks JSON structure and field compliance.

ThinkLengthReward: penalizes reasoning that is too short or too long.

ThinkQualityReward: filters sensitive keywords.

CorrectnessReward: measures rewrite accuracy via semantic similarity and coverage.

ExpertValidationReward: evaluates correct agent assignment.

ProcessingQualityReward: assesses planning accuracy across multiple dimensions.
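
The article does not list the code for JSONValidReward. As a rough illustration of "JSON structure and field compliance", the sketch below validates an answer against the field names visible in the example planning output later in this article. The function name, the all-or-nothing 0/1 scoring, and the rule that an empty plan is invalid are assumptions.

```python
import json

# Field names taken from the example planning output in this article.
REQUIRED_TOP_LEVEL = ("补全后的问题", "plan")
REQUIRED_STEP_FIELDS = ("专家", "处理问题")


def json_valid_reward(answer_text: str) -> float:
    """Return 1.0 if the answer is JSON with the expected fields, else 0.0."""
    try:
        data = json.loads(answer_text)
    except (json.JSONDecodeError, TypeError):
        return 0.0
    if not isinstance(data, dict) or any(k not in data for k in REQUIRED_TOP_LEVEL):
        return 0.0
    plan = data["plan"]
    # Assumption: a plan must be a non-empty list of steps.
    if not isinstance(plan, list) or not plan:
        return 0.0
    for step in plan:
        if not isinstance(step, dict) or any(k not in step for k in REQUIRED_STEP_FIELDS):
            return 0.0
        # Each step carries a list of sub-questions for its expert.
        if not isinstance(step["处理问题"], list):
            return 0.0
    return 1.0


good = '{"补全后的问题": "查看经营周报", "plan": [{"专家": "数据分析专家", "处理问题": ["查看经营周报"]}]}'
print(json_valid_reward(good))                          # 1.0
print(json_valid_reward('{"补全后的问题": "查看经营周报"}'))  # 0.0
print(json_valid_reward("not json"))                    # 0.0
```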

Example implementation of StrictFormatReward:

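import re
from typing import List

# BaseReward and its preprocess() helper belong to the original project and are not shown here.
class StrictFormatReward(BaseReward):
    # The whole completion must be a <think> block, a blank line, then an <answer> block.
    _pattern = re.compile(
        r"^<think>\n.*?\n</think>\n\n<answer>\n.*?\n</answer>$", re.DOTALL
    )

    def __call__(self, completions, **kwargs) -> List[float]:
        # preprocess() is expected to return objects exposing the parsed .answer text.
        processed = self.preprocess(completions)
        # 1.0 only when an answer was extracted and the raw text matches the strict layout.
        return [
            1.0 if p.answer and self._pattern.match(c) else 0.0
            for c, p in zip(completions, processed)
        ]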
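
To see how strict this check is, the sketch below exercises the same pattern, written with explicit `\n` escapes (an assumption about the intended whitespace), on a few hand-made strings. The test strings are invented for illustration.

```python
import re

STRICT_FORMAT = re.compile(
    r"^<think>\n.*?\n</think>\n\n<answer>\n.*?\n</answer>$", re.DOTALL
)

well_formed = '<think>\nroute to one expert\n</think>\n\n<answer>\n{"plan": []}\n</answer>'
missing_think = '<answer>\n{"plan": []}\n</answer>'
trailing_text = well_formed + "\nextra"
compact = "<think>\nx\n</think>\n<answer>{}</answer>"  # tags present, whitespace differs

for name, text in [
    ("well_formed", well_formed),
    ("missing_think", missing_think),
    ("trailing_text", trailing_text),
    ("compact", compact),
]:
    print(name, bool(STRICT_FORMAT.match(text)))
# well_formed True
# missing_think False
# trailing_text False
# compact False
```

Under this reading the pattern is whitespace-sensitive: a completion with both tags but different line breaks scores 0.0, which is presumably why the SFT stage is used first to teach the exact output format.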

Example implementation of ThinkLengthReward (simplified):

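import logging
from typing import List

import numpy as np

logger = logging.getLogger(__name__)

# min_length, max_length, mid and eps are configuration values that this
# simplified excerpt does not define; the source does not state them.
class ThinkLengthReward(BaseReward):
    def __call__(self, completions, **kwargs) -> List[float]:
        processed = self.preprocess(completions)
        rewards = []
        for p in processed:
            try:
                # len() counts characters if p.think is a string, not tokens.
                length = len(p.think)
                if min_length <= length <= max_length:
                    # Full reward inside the accepted length range.
                    rewards.append(1.0)
                else:
                    # Sigmoid decay as the length moves away from the midpoint.
                    deviation = abs(length - mid) / eps
                    reward = 1.0 / (1.0 + np.exp(5 * (deviation - 0.5)))
                    rewards.append(float(reward))
            except Exception as e:
                logger.error(f"Error calculating think length reward: {e}")
                rewards.append(0.0)
        return rewards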
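
The shape of this reward is easier to see with numbers. The sketch below reimplements only the formula as a standalone function; the concrete values chosen for `min_length`, `max_length`, `mid` and `eps` are hypothetical, since the article does not state them.

```python
import math


def length_reward(length: int, min_length: int, max_length: int,
                  mid: float, eps: float) -> float:
    """Same formula as the excerpt above, as a plain function."""
    if min_length <= length <= max_length:
        return 1.0
    deviation = abs(length - mid) / eps
    return 1.0 / (1.0 + math.exp(5 * (deviation - 0.5)))


# Hypothetical configuration, chosen only to illustrate the curve.
cfg = dict(min_length=50, max_length=200, mid=125.0, eps=250.0)

for n in (10, 100, 201, 400, 1000):
    print(n, round(length_reward(n, **cfg), 3))
```

Two properties hold for any configuration. Inside the range the reward is exactly 1.0, and outside it the sigmoid term can never exceed 1 / (1 + e^-2.5) ≈ 0.924, so the reward drops in a step at the range boundary before decaying smoothly. The reward equals 0.5 when the deviation from `mid` is half of `eps`.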

3. Multi‑Task Mixed Training

GRPO preserves model generalization by learning to select the most suitable agent from a list. Adding new agents only requires augmenting the dataset with new tasks; the mixed SFT + GRPO training continues to perform well without re‑training from scratch.
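
One way to picture "selecting the most suitable agent from a list" is a planning prompt that is rendered from a list of agent descriptions, so that a new agent is introduced by extending the list and adding matching training tasks. The sketch below is purely illustrative: `AgentSpec`, `build_planning_prompt`, the prompt wording, and the second agent are hypothetical and do not come from the article.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class AgentSpec:
    name: str
    description: str


def build_planning_prompt(query: str, agents: List[AgentSpec]) -> str:
    """Render the agent list and the user query into one planning prompt."""
    agent_lines = "\n".join(
        f"{i}. {agent.name}: {agent.description}"
        for i, agent in enumerate(agents, start=1)
    )
    return (
        "Available experts:\n"
        f"{agent_lines}\n\n"
        f"User question: {query}\n"
        "Reason inside <think>...</think>, then return the plan as JSON "
        "inside <answer>...</answer>."
    )


agents = [AgentSpec("数据分析专家", "queries and analyzes business data")]
prompt_before = build_planning_prompt("查看经营周报", agents)

# Adding an agent changes only the list; the prompt template stays the same.
agents_extended = agents + [AgentSpec("image expert (hypothetical)", "generates banners")]
prompt_after = build_planning_prompt("查看经营周报", agents_extended)
print(prompt_after)
```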

Example Planning Output

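<think>
Okay, I now need to handle the user's question: "查看经营周报" (view the weekly business report). First, according to the provided tool list, the user's question matches one of the listed items. This question should therefore be handled by the data analysis expert, because that expert is responsible for querying and analyzing the data in the tool list. Next, I check whether there are related sub-questions that need to be decomposed, but the user's question is clear, so no further decomposition is needed. Finally, I confirm whether other experts need to be involved, but here the data analysis expert alone is sufficient.
</think>
<answer>{
  "补全后的问题": "查看经营周报",
  "plan": [{
    "专家": "数据分析专家",
    "处理问题": ["查看经营周报"]
  }]
}</answer>

The JSON keys and values are shown as the model emits them: "补全后的问题" is the completed (rewritten) question, "专家" is the expert (agent), "处理问题" lists the questions assigned to that expert, and "数据分析专家" is the data analysis expert. The model produces the reasoning inside <think> in Chinese; it is given here in English translation.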
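
Downstream code has to turn this output into something it can dispatch. The sketch below shows one possible parser for the format above; the function name, the lenient regular expression, and the returned tuple layout are assumptions for illustration, not the production code.

```python
import json
import re
from typing import List, Tuple

# Lenient on whitespace between the two blocks, unlike the strict training-time check.
_OUTPUT = re.compile(r"<think>(.*?)</think>\s*<answer>(.*?)</answer>", re.DOTALL)


def parse_planning_output(completion: str) -> Tuple[str, str, List[Tuple[str, List[str]]]]:
    """Return (reasoning, rewritten question, [(expert, sub-questions), ...])."""
    match = _OUTPUT.search(completion)
    if match is None:
        raise ValueError("completion lacks <think>/<answer> blocks")
    think = match.group(1).strip()
    data = json.loads(match.group(2).strip())
    # Plan steps are kept in order, since the plan also encodes execution order.
    steps = [(step["专家"], list(step["处理问题"])) for step in data["plan"]]
    return think, data["补全后的问题"], steps


completion = """<think>
The question matches the data analysis tools, so one expert is enough.
</think>
<answer>{
  "补全后的问题": "查看经营周报",
  "plan": [{"专家": "数据分析专家", "处理问题": ["查看经营周报"]}]
}</answer>"""

think, rewritten, steps = parse_planning_output(completion)
print(rewritten)  # 查看经营周报
print(steps)      # [('数据分析专家', ['查看经营周报'])]
```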

Experimental Observations

Models with reasoning processes converge to reward values around 0.5‑0.6.

Running SFT before GRPO yields higher initial format compliance and answer quality.

During training, rewrite ability, expert selection, and planning rewards all improve.

Reasoning length drops to roughly 150 tokens during training, and the reward eventually stabilizes near 0.9.

Overall, the GRPO‑enhanced Copilot 3.0 demonstrates significant reductions in inference cost while achieving higher accuracy and more reliable multi‑agent planning.

