Deploy DeepSeek‑V4 on Ascend NPU with Kthena in 3 Minutes (Prefill‑Decode Separation)

This guide walks through deploying the DeepSeek‑V4‑Flash model on Ascend NPU using Kthena’s ModelRoute, detailing the Prefill‑Decode (P/D) separation architecture, KV cache transfer via Mooncake, configuration of ModelServing and ModelRoute resources, and flexible scaling of Prefill and Decode replicas for optimal performance.

Huawei Cloud Developer Alliance

Background

In large‑model inference, Prefill‑Decode (P/D) separation is a widely adopted performance‑optimisation architecture. As model parameters grow, inference latency and resource consumption become critical. A monolithic deployment cannot tune time to first token (TTFT) and time per output token (TPOT, which determines decode throughput) at the same time, because both stages share one parallel configuration. P/D separation splits inference into two independent stages so that each stage can use the parallel strategy that suits it best, which yields significant performance gains.

P/D Separation Technical Principle

Two‑stage inference

Prefill stage (first token generation) processes the entire prompt, generating the first output token. It is compute‑intensive, highly parallel, requires full attention over the input sequence, and writes the KV cache.

Compute‑intensive : each prompt token attends over the input sequence (under causal attention, the tokens up to its own position).

High parallelism : all positions can be processed in parallel, suitable for large tensor‑parallel (TP) scales.

Memory‑access pattern : KV cache is created for the first time.

Latency‑sensitive : first‑token latency directly impacts user experience.

In our configuration, Prefill uses DP=2 and TP=8, with a large token budget per step (8192 batched tokens) and a small number of concurrent sequences (4), to accelerate matrix computation:

--data-parallel-size 2 \
--tensor-parallel-size 8 \
--max-num-batched-tokens 8192 \
--max-num-seqs 4 \

Decode stage (incremental generation) processes one newly generated token at a time, adding it to the sequence for the next round. Its characteristics differ sharply from Prefill:

Memory‑bound : only one token is computed per step, but the full KV cache must be read.

Low parallelism : excessive tensor parallelism would increase communication overhead.

Throughput‑sensitive : maximising the number of tokens generated per unit time is essential.

Frequent KV transfer : each Decode instance needs the KV cache produced by Prefill instances.

Decode uses DP=8 and TP=2, with a small token budget per step (144 batched tokens) and many concurrent sequences (48), to improve overall throughput:

--data-parallel-size 8 \
--tensor-parallel-size 2 \
--max-num-batched-tokens 144 \
--max-num-seqs 48 \
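
The two flag sets can be compared with a little arithmetic. The sketch below is illustrative only: it assumes that each data‑parallel rank owns a full tensor‑parallel group (so a role occupies DP × TP devices) and that --max-num-seqs applies per data‑parallel rank. The class name RoleParallelism is hypothetical and is not part of Kthena or vLLM.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RoleParallelism:
    """Parallel settings copied from the flag excerpts above (hypothetical helper)."""
    data_parallel: int
    tensor_parallel: int
    max_num_batched_tokens: int
    max_num_seqs: int

    @property
    def devices(self) -> int:
        # Assumption: every data-parallel rank owns one full tensor-parallel group.
        return self.data_parallel * self.tensor_parallel

    @property
    def max_concurrent_seqs(self) -> int:
        # Assumption: --max-num-seqs is a per-rank limit, so ranks add up.
        return self.data_parallel * self.max_num_seqs


prefill = RoleParallelism(2, 8, 8192, 4)
decode = RoleParallelism(8, 2, 144, 48)

for name, role in (("prefill", prefill), ("decode", decode)):
    print(f"{name}: devices={role.devices}, max concurrent sequences={role.max_concurrent_seqs}")
```

Under these assumptions both roles occupy the same number of devices, but Decode spreads them over more, smaller replicas and admits far more sequences at once, which matches the throughput‑oriented goal described above.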

Why P/D improves performance

Traditional monolithic deployment faces a "one‑size‑fits‑all" dilemma: to satisfy both Prefill and Decode requirements, a compromise on parallel strategy is required, preventing either stage from reaching its optimum. P/D separation decouples the stages, allowing independent configuration of tensor and data parallelism, resource allocation, and scaling.

Tensor parallelism : Prefill can use TP=8, Decode TP=2.

Data parallelism : each role can be scaled independently.

Resource allocation : per‑role specifications avoid a unified specification.

Scaling : independent scaling of Prefill and Decode instances.

Using DeepSeek‑V4 as an example, Prefill optimisation leverages larger TP and batch size, while Decode optimisation increases data parallelism and reduces TP to lower communication cost.

KV Cache Transfer Mechanism

Once the stages are separated, the KV cache written by Prefill must reach Decode efficiently, which is one of the most critical technical challenges. Our deployment uses MooncakeConnectorV1 to implement the KV transfer. On the Prefill side, kv_role is kv_producer. The engine_id value comes from the shell variable MOONCAKE_ENGINE_ID, so the single‑quoted JSON is briefly closed around it to let the shell expand it:

--kv-transfer-config \
'{"kv_connector": "MooncakeConnectorV1",
  "kv_role": "kv_producer",
  "kv_port": "9000",
  "engine_id": "'"$MOONCAKE_ENGINE_ID"'",
  ...}'

Mooncake provides an Ascend‑NPU‑optimised high‑performance communication library that enables low‑latency KV cache transmission between nodes.

Producer (Prefill) : generates KV cache and sends it to the Mooncake server.

Consumer (Decode) : pulls the KV cache from the Mooncake server for attention calculation.

Engine ID : each role instance gets a unique ID of the form ${GROUP_NAME}_${ROLE_ID}, so that KV traffic stays within the correct P/D instance group.
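
As a rough illustration of the points above, the following sketch assembles the engine ID and the JSON string passed to --kv-transfer-config. It only covers the four fields shown in the excerpt; the fields elided there are not guessed. The function names and the example group and role values are hypothetical.

```python
import json


def build_engine_id(group_name: str, role_id: str) -> str:
    # Format described in the article: ${GROUP_NAME}_${ROLE_ID}
    return f"{group_name}_{role_id}"


def build_kv_transfer_config(group_name: str, role_id: str, kv_role: str,
                             kv_port: int = 9000) -> str:
    """Return the JSON for --kv-transfer-config (only the fields shown in the article)."""
    if kv_role not in ("kv_producer", "kv_consumer"):
        raise ValueError(f"unexpected kv_role: {kv_role!r}")
    config = {
        "kv_connector": "MooncakeConnectorV1",
        "kv_role": kv_role,
        "kv_port": str(kv_port),  # the excerpt passes the port as a string
        "engine_id": build_engine_id(group_name, role_id),
        # Further fields are elided in the source and intentionally left out here.
    }
    return json.dumps(config)


# Hypothetical group and role identifiers, for illustration only.
print(build_kv_transfer_config("group-0", "prefill-0", "kv_producer"))
print(build_kv_transfer_config("group-0", "decode-0", "kv_consumer"))
```

Generating the JSON with a serializer avoids the quoting pitfalls of hand‑written shell strings, such as stray comments or unexpanded variables inside the quotes.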

Router Adaptation for P/D

In a P/D architecture, routing is no longer a simple dispatch to a backend instance; it must intelligently coordinate the Prefill and Decode stages, which differs fundamentally from traditional micro‑service routing.

Traditional micro‑service routing:

Client Request → Router → Backend Instance (process whole request)

P/D routing:

Client Request → Router → Prefill Instance (generate first token)
                          ↕ (KV transfer)
                          Decode Instance (complete generation)

Challenges

Request lifecycle management : a request passes through multiple stages (Prefill, KV transfer, multiple Decode rounds, final aggregation).

P/D instance discovery and matching : the number of Prefill and Decode replicas may differ, and the router must match the correct pair.

KV transfer coordination : KV must be sent to the correct Decode instance, engine IDs must match, and timeouts/errors must be handled.

Traffic allocation strategy : different scenarios may require different Prefill/Decode ratios (e.g., compute‑intensive vs. I/O‑intensive workloads).
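
To make the matching challenge concrete, here is a minimal sketch of how a router could pick a Prefill/Decode pair inside one group when the replica counts differ. It uses plain round‑robin per role as a stand‑in. This is not Kthena's actual selection policy, and the class and instance names are hypothetical.

```python
from itertools import cycle


class GroupPicker:
    """Round-robin over the Prefill and Decode instances of a single P/D group."""

    def __init__(self, prefill_instances, decode_instances):
        if not prefill_instances or not decode_instances:
            raise ValueError("a usable P/D group needs at least one instance per role")
        # Each role is cycled independently, so unequal replica counts are fine.
        self._prefill = cycle(list(prefill_instances))
        self._decode = cycle(list(decode_instances))

    def next_pair(self):
        # Both members always come from the same group, so engine IDs can match.
        return next(self._prefill), next(self._decode)


# A 2P1D group: two Prefill instances share one Decode instance.
picker = GroupPicker(["prefill-0", "prefill-1"], ["decode-0"])
for _ in range(3):
    print(picker.next_pair())
```

A production router would also weigh load, queue depth, and KV cache locality. The sketch only shows that pairing must stay within one group and must tolerate unequal replica counts.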

ModelRoute Design

Kthena’s ModelRoute is built to solve the above challenges. It defines routing rules, a pdGroup for automatic P/D identification, role‑specific labels, and KV connector configuration.

groupKey : label key used to group Prefill and Decode pods belonging to the same P/D instance group.

prefillLabels and decodeLabels : mark which pods play the Prefill or Decode role.

kvConnector : specifies the Mooncake connector type.

Instance Discovery Process

The Kthena controller injects the following labels into each pod created from the ModelServing manifest (deepseek-serv.yaml in this example):

metadata:
  labels:
    modelserving.volcano.sh/name: deepseekv4-pd
    modelserving.volcano.sh/group-name: <group-id>
    modelserving.volcano.sh/role: <prefill|decode>
    modelserving.volcano.sh/role-id: <role-id>

Discovery steps:

Query all pods with modelserving.volcano.sh/name=deepseekv4-pd.

Group them by modelserving.volcano.sh/group-name.

Within each group, match pods whose labels correspond to prefillLabels and decodeLabels.

When scaling occurs, new pods automatically receive labels, and the router updates its view in real time without manual configuration.
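
The discovery steps above can be sketched as a small grouping function. Pods are modelled as plain dictionaries rather than Kubernetes API objects, the pod and group names are made up, and the function is an illustration of the label logic, not the router's real code.

```python
from collections import defaultdict

LABEL_NAME = "modelserving.volcano.sh/name"
LABEL_GROUP = "modelserving.volcano.sh/group-name"
LABEL_ROLE = "modelserving.volcano.sh/role"


def matches(labels, selector):
    """True if every key/value pair in the selector is present in the labels."""
    return all(labels.get(key) == value for key, value in selector.items())


def discover(pods, serving_name, prefill_labels, decode_labels):
    """Filter pods by ModelServing name, group them, then split each group by role."""
    groups = defaultdict(lambda: {"prefill": [], "decode": []})
    for pod in pods:
        labels = pod["labels"]
        if labels.get(LABEL_NAME) != serving_name:
            continue  # step 1: keep only pods of this ModelServing
        group = labels.get(LABEL_GROUP)
        if group is None:
            continue  # step 2: a pod without a group label cannot be paired
        if matches(labels, prefill_labels):  # step 3: role matching
            groups[group]["prefill"].append(pod["name"])
        elif matches(labels, decode_labels):
            groups[group]["decode"].append(pod["name"])
    return dict(groups)


def pod(name, serving, group, role):
    return {"name": name, "labels": {LABEL_NAME: serving, LABEL_GROUP: group, LABEL_ROLE: role}}


# Hypothetical pods: one complete group, one group still waiting for Decode.
pods = [
    pod("pd-0-prefill-0", "deepseekv4-pd", "deepseekv4-pd-0", "prefill"),
    pod("pd-0-decode-0", "deepseekv4-pd", "deepseekv4-pd-0", "decode"),
    pod("pd-1-prefill-0", "deepseekv4-pd", "deepseekv4-pd-1", "prefill"),
    pod("other-0", "other-model", "other-0", "prefill"),
]
groups = discover(pods, "deepseekv4-pd", {LABEL_ROLE: "prefill"}, {LABEL_ROLE: "decode"})
# Only groups that have both roles can serve traffic.
ready = {g: members for g, members in groups.items() if members["prefill"] and members["decode"]}
print(ready)
```

Because the grouping is recomputed from labels alone, a newly scaled pod joins the right group as soon as its labels appear, which is the behaviour described above.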

KV Transfer Coordination

ModelRoute includes the KV connector configuration:

kvConnector:
  type: mooncake

Mooncake workflow:

Prefill startup reads GROUP_NAME and ROLE_ID, builds engine_id=${GROUP_NAME}_${ROLE_ID}, and starts the Mooncake server listening on the configured port.

Decode startup reads the same GROUP_NAME but a different ROLE_ID, sets kv_role=kv_consumer, and connects to the producer’s engine ID.

During request processing, Prefill generates KV cache, Mooncake transfers it, and Decode consumes it to continue generation.

Traffic Policy

trafficPolicy:
  timeout: "300s"
  retry:
    attempts: 3
    retryInterval: "150ms"

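With these settings a request may run for up to 300s, which leaves room for long generation sessions, and a failed attempt is retried (3 attempts, 150ms apart) so that transient errors in the Prefill‑to‑Decode hand‑off are absorbed rather than surfaced to the client.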
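
The retry part of the policy can be pictured with a generic fixed‑interval retry loop. This is a sketch under two assumptions that the article does not confirm: that attempts counts the total number of tries including the first, and that only connection‑level errors are retried. The 300s timeout is not modelled. The function name call_with_retry is hypothetical.

```python
import time


def call_with_retry(send, attempts=3, retry_interval=0.150, sleep=time.sleep):
    """Call send() up to `attempts` times, pausing `retry_interval` seconds between tries."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return send()
        except ConnectionError as exc:  # assumption: only transient errors are retried
            last_error = exc
            if attempt < attempts:
                sleep(retry_interval)  # no pause after the final failed attempt
    raise last_error


# Demo with a backend that fails twice and then succeeds; sleeps are recorded, not performed.
state = {"calls": 0}


def flaky_backend():
    state["calls"] += 1
    if state["calls"] < 3:
        raise ConnectionError("transient failure")
    return "ok"


pauses = []
print(call_with_retry(flaky_backend, sleep=pauses.append), state["calls"], pauses)
```

Passing the sleep function as a parameter keeps the example fast to run and easy to test; real routing code would use an actual delay.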

Kthena Orchestration Advantages

Declarative orchestration : a single ModelServing manifest replaces manual creation of multiple Deployments and Services.

Automatic label injection : the controller adds group, role, and role‑id labels to each pod.

Flexible P/D ratio adjustment : changing the replicas field for the Prefill or Decode role scales that stage independently of the other.

Automatic service discovery : the pdGroup mechanism pairs Prefill and Decode pods without manual address configuration.

KV coordination : deep integration with Mooncake guarantees correct KV routing.

Scaling Examples

Baseline 1P1D configuration:

roles:
- name: prefill
  replicas: 1
- name: decode
  replicas: 1

Increase Prefill replicas for higher input‑throughput (2P1D):

roles:
- name: prefill
  replicas: 2
- name: decode
  replicas: 1

Increase Decode replicas for higher output‑throughput (1P2D):

roles:
- name: prefill
  replicas: 1
- name: decode
  replicas: 2

Horizontal scaling with two independent 1P1D groups (2×(1P1D)) adds a second complete pipeline, which roughly doubles aggregate throughput when load is spread evenly across the groups.
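
A quick way to compare these layouts is to count the devices each one needs. The sketch assumes every role replica runs with the flags shown earlier and occupies DP × TP devices (2 × 8 for Prefill, 8 × 2 for Decode); actual placement depends on the cluster and on the manifests, so treat the numbers as estimates.

```python
# Devices per role replica, derived from the DP and TP flags shown earlier (assumption).
DEVICES_PER_REPLICA = {"prefill": 2 * 8, "decode": 8 * 2}


def total_devices(prefill_replicas: int, decode_replicas: int, groups: int = 1) -> int:
    """Estimate the device count for `groups` identical P/D groups."""
    per_group = (prefill_replicas * DEVICES_PER_REPLICA["prefill"]
                 + decode_replicas * DEVICES_PER_REPLICA["decode"])
    return groups * per_group


# (prefill replicas, decode replicas, number of groups) for each layout above.
layouts = {
    "1P1D": (1, 1, 1),
    "2P1D": (2, 1, 1),
    "1P2D": (1, 2, 1),
    "2x(1P1D)": (1, 1, 2),
}
for name, shape in layouts.items():
    print(f"{name}: {total_devices(*shape)} devices")
```

Under these assumptions 2P1D and 1P2D cost the same number of devices, so the choice between them depends on whether the workload is dominated by prompt processing or by generation.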

Deployment Practice

Model Preparation

Download the DeepSeek‑V4‑Flash weights to /models/DeepSeek-V4-Flash-w8a8-mtp on every compute node. The directory must contain the model weights, configuration file, and chat_template.jinja.

# Install ModelScope
pip install modelscope

# Download model
modelscope download --model Eco-Tech/DeepSeek-V4-Flash-w8a8-mtp --local_dir /models/DeepSeek-V4-Flash-w8a8-mtp

# Or using git‑lfs
git lfs install
GIT_LFS_SKIP_SMUDGE=1 git clone https://www.modelscope.cn/Eco-Tech/DeepSeek-V4-Flash-w8a8-mtp.git /models/DeepSeek-V4-Flash-w8a8-mtp
cd /models/DeepSeek-V4-Flash-w8a8-mtp
git lfs pull

Ensure the path is identical on all nodes (e.g., via shared NFS) and verify the files exist.

Full Deployment Workflow

Step 1: Create the ConfigMap containing the startup scripts and environment variables.

kubectl apply -f config.yaml

Step 2: Deploy the ModelServing resource, which creates the Prefill and Decode pods with the injected labels.

kubectl apply -f deepseek-serv.yaml

Step 3: Apply the ModelRoute to define the routing rules and the KV connector.

kubectl apply -f modelRoute.yaml

Verification

# Check ModelServing status
kubectl get modelserving deepseekv4-pd

# List all related pods
kubectl get pods -l modelserving.volcano.sh/name=deepseekv4-pd

# Inspect pod labels (--show-labels prints them; -o wide adds node and IP columns)
kubectl get pods -l modelserving.volcano.sh/name=deepseekv4-pd --show-labels -o wide

# View logs
kubectl logs -l modelserving.volcano.sh/role=prefill
kubectl logs -l modelserving.volcano.sh/role=decode

Scaling Operations

Increase Prefill replicas:

kubectl patch modelserving deepseekv4-pd --type='json' \
  -p='[{"op":"replace","path":"/spec/template/roles/0/replicas","value":2}]'

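Increase Decode replicas with the same patch applied to /spec/template/roles/1/replicas. The index follows the order of the roles list, with prefill first and decode second as in the examples above.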
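
When scaling is scripted, the JSON Patch document can be generated instead of typed by hand. The sketch below builds the same patch as the command above for any role index and prints the matching kubectl command line. The helper names are hypothetical, and the role index must match the order of roles in your own manifest.

```python
import json
import shlex


def replicas_patch(role_index: int, replicas: int) -> str:
    """Build the JSON Patch that sets the replica count of one role."""
    if role_index < 0 or replicas < 0:
        raise ValueError("role_index and replicas must be non-negative")
    operations = [{
        "op": "replace",
        "path": f"/spec/template/roles/{role_index}/replicas",
        "value": replicas,
    }]
    return json.dumps(operations, separators=(",", ":"))


def patch_command(name: str, role_index: int, replicas: int) -> str:
    """Return a shell-quoted kubectl command; it is printed here, not executed."""
    args = ["kubectl", "patch", "modelserving", name,
            "--type=json", "-p", replicas_patch(role_index, replicas)]
    return shlex.join(args)


# Role index 1 is Decode in the manifests shown in this article.
print(patch_command("deepseekv4-pd", 1, 2))
```

Building the patch from a dictionary keeps the JSON valid even when the values come from variables, and shlex.join takes care of the shell quoting.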

Conclusion

This walkthrough shows that Kthena can deploy the DeepSeek‑V4‑Flash model on Ascend NPU with full Prefill‑Decode separation, KV cache coordination via Mooncake, and independent scaling of Prefill and Decode replicas. Its declarative configuration, automatic service discovery, and built‑in KV handling provide a repeatable, high‑performance way to serve large models on Ascend hardware.

Deployment templates: https://github.com/volcano-sh/kthena/tree/main/examples/models/deepseek-v4-flash

Kthena repository: https://github.com/volcano-sh/kthena

Official site: https://kthena.volcano.sh/
