Deploy DeepSeek R1 with Prefill‑Decode Separation on Baidu Baige

This guide explains how to set up a Prefill-Decode (PD) separated deployment of the DeepSeek R1 large language model on Baidu Baige, covering resource preparation, data acquisition, Prefill and Decode service configuration, and API invocation, to achieve lower latency and higher throughput.

Large language model inference can be split into Prefill and Decode stages; deploying each stage on separate nodes lets them run and scale independently, improving both serving performance and resource efficiency.

1. Prepare Resources

Purchase two H20 GPU cloud servers and create a Baidu Baige generic resource pool. Enable parallel file storage (PFS) or file storage (CFS) and bind it to the resource pool.
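
Before downloading any weights, it is worth confirming that the bound file system is actually visible from the node you will work on. A minimal check, assuming the mount point is under /mnt (adjust to your actual mount path):

# Confirm the PFS/CFS file system is mounted; /mnt is an assumption, use your own mount point.
df -h | grep /mnt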

2. Prepare Data

DeepSeek R1 model weights are stored in Baidu Object Storage (BOS). Download the appropriate weight files for your region:

Beijing: bos:/aihc-models-bj/deepseek-ai/DeepSeek-R1

Suzhou: bos:/aihc-models-su/deepseek-ai/DeepSeek-R1

Guangzhou: bos:/aihc-models-gz/deepseek-ai/DeepSeek-R1

Install bcecmd (the BOS command-line client) on the login node and configure it with your credentials.
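
A minimal configuration sketch, assuming your bcecmd build supports the interactive -c flag (check ./bcecmd --help if yours differs); when prompted, enter your Access Key ID, Secret Access Key, and the region that hosts the bucket, then run the sync command below:

# Interactive configuration: prompts for Access Key ID, Secret Access Key, and region.
./bcecmd -c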

./bcecmd bos sync bos:/aihc-models-bj/deepseek-ai/DeepSeek-R1 /mnt/model/DeepSeek-R1
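
After the sync finishes, a quick sanity check on the same path confirms the weights landed on the shared file system:

# Check the total size and list the weight files to confirm the sync completed.
du -sh /mnt/model/DeepSeek-R1
ls -lh /mnt/model/DeepSeek-R1 | head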

3. Deploy Services

3.1 Deploy Prefill Service

Set service name, select the generic resource pool, and choose H20 as the accelerator.

Enable Prefill‑Decode separation and turn on RDMA.

Use Baidu Baige's pre‑installed AIAK inference acceleration image (no modifications allowed).

Mount storage: source path = model weight location on PFS/CFS, target path = /mnt/model.

Allocate 8 H20 chips for the Prefill stage.

Set the MODEL_NAME environment variable used for routing; it must match the Decode service exactly (see the example after these steps).

Configure service port; the PD deployment automatically enables the cloud‑native AI gateway (HTTP only).
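
For illustration, the routing variable might be set as below. The value deepseek-r1-pd is only a placeholder, but whatever name you choose must be identical across the Prefill and Decode services and must match the model field in API requests:

# Hypothetical example value; any name works as long as the Prefill service,
# the Decode service, and the "model" field in API calls all use the same string.
MODEL_NAME=deepseek-r1-pd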

3.2 Deploy Decode Service

Same basic settings as Prefill (service name, generic pool, H20 accelerator).

Enable Prefill‑Decode separation, set service type to Decode, and enable RDMA.

Use the same AIAK inference image.

Mount the same storage path (/mnt/model).

Allocate 8 H20 chips for the Decode stage.

Set MODEL_NAME to the identical value used by the Prefill service.

Configure service port; the cloud‑native AI gateway is enabled by default.
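
With both services deployed, you can optionally sanity-check the gateway before sending real requests. The sketch below assumes the AIAK backend exposes the standard OpenAI model-listing route and that the gateway forwards the path unchanged; neither is guaranteed, so treat it purely as an illustration (the endpoint and token are the ones you obtain in the next step):

# Assumes an OpenAI-compatible /v1/models route behind the gateway (unverified assumption).
curl --location 'http://192.168.12.235/auth/ap-f3450f14c/8088/v1/models' \
  --header 'Authorization: Bearer YOUR_TOKEN'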

4. Model Invocation

After both the Prefill and Decode services are running, obtain the HTTP endpoint and token from the service list, then call the service through the OpenAI-compatible API, for example:

curl --location 'http://192.168.12.235/auth/ap-f3450f14c/8088' \
  --header 'Content-Type: application/json' \
  --header 'Authorization: Bearer YOUR_TOKEN' \
  --data '{
    "model": "deepseek r1-pd",
    "prompt": "hello",
    "max_tokens": 10,
    "temperature": 0
  }'
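
Because the endpoint is OpenAI-compatible, a streaming request should also work by adding the standard stream flag; whether the Baige gateway passes streamed responses through is an assumption rather than a documented guarantee:

# Same endpoint as above; "stream": true requests tokens as they are decoded.
curl --location 'http://192.168.12.235/auth/ap-f3450f14c/8088' \
  --header 'Content-Type: application/json' \
  --header 'Authorization: Bearer YOUR_TOKEN' \
  --data '{
    "model": "deepseek-r1-pd",
    "prompt": "hello",
    "max_tokens": 64,
    "temperature": 0,
    "stream": true
  }'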

Written by Baidu Intelligent Cloud Tech Hub