
Deploying a Production‑Ready DeepSeek‑R1 Inference Service on Alibaba Cloud ACK with KServe

This guide explains how to deploy a production‑ready DeepSeek‑R1 inference service on Alibaba Cloud ACK using KServe, covering model preparation, storage configuration, service deployment, observability, autoscaling, model acceleration, gray release, and GPU‑shared inference.


Background

DeepSeek‑R1 is the first‑generation reasoning model released by DeepSeek, achieving strong performance on mathematical reasoning, programming contests, creative writing, and general question‑answering tasks. Its capabilities can also be distilled into smaller variants that outperform many open‑source alternatives of similar size.

Key Components

KServe is an open‑source cloud‑native model serving platform that simplifies deployment of machine‑learning models on Kubernetes. Arena provides a lightweight workflow for data preparation, model development, training, and prediction, and integrates tightly with Alibaba Cloud services such as GPU sharing and CPFS.

Prerequisites

A GPU‑enabled Kubernetes cluster on Alibaba Cloud ACK.

ack‑kserve component installed.

Arena client installed.

An A10 GPU instance type (e.g., ecs.gn7i-c8g1.2xlarge) is recommended.

Step 1: Prepare the DeepSeek‑R1‑Distill‑Qwen‑7B Model

# Confirm git‑lfs is installed
# If not, install via yum or apt
git lfs install
# Clone the model repository
GIT_LFS_SKIP_SMUDGE=1 git clone https://www.modelscope.cn/deepseek-ai/DeepSeek-R1-Distill-Qwen-7B.git
# Download model files
cd DeepSeek-R1-Distill-Qwen-7B/
git lfs pull

Upload the model directory to an OSS bucket:

# Create the OSS directory (replace <YOUR-BUCKET> with your bucket name)
ossutil mkdir oss://<YOUR-BUCKET>/models/DeepSeek-R1-Distill-Qwen-7B
# Upload the model files
ossutil cp -r ./DeepSeek-R1-Distill-Qwen-7B oss://<YOUR-BUCKET>/models/DeepSeek-R1-Distill-Qwen-7B

Create a PersistentVolume (PV) and PersistentVolumeClaim (PVC) named llm-model that points to the OSS path (see the “Using OSS static storage volume” documentation).
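For reference, a static OSS volume pair typically looks like the sketch below. The CSI driver name and attribute keys follow the ACK OSS plugin, but the bucket, endpoint, Secret name, and storage size here are placeholders and assumptions; adapt them from the "Using OSS static storage volume" documentation.

```shell
# Illustrative static PV/PVC pair named llm-model backed by the OSS model path.
# <YOUR-BUCKET>, the endpoint URL, and the access-key Secret are placeholders.
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: PersistentVolume
metadata:
  name: llm-model
spec:
  capacity:
    storage: 30Gi
  accessModes: ["ReadOnlyMany"]
  persistentVolumeReclaimPolicy: Retain
  csi:
    driver: ossplugin.csi.alibabacloud.com
    volumeHandle: llm-model
    nodePublishSecretRef:
      name: oss-secret          # Secret holding the OSS AccessKey pair
      namespace: default
    volumeAttributes:
      bucket: "<YOUR-BUCKET>"
      url: "oss-cn-shanghai-internal.aliyuncs.com"   # use your region's endpoint
      path: "/models/DeepSeek-R1-Distill-Qwen-7B"
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: llm-model
spec:
  accessModes: ["ReadOnlyMany"]
  resources:
    requests:
      storage: 30Gi
  volumeName: llm-model
EOF
```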

Step 2: Deploy the Inference Service

arena serve kserve \
    --name=deepseek \
    --image=kube-ai-registry.cn-shanghai.cr.aliyuncs.com/kube-ai/vllm:v0.6.6 \
    --gpus=1 \
    --cpu=4 \
    --memory=12Gi \
    --data=llm-model:/model/DeepSeek-R1-Distill-Qwen-7B \
    "vllm serve /model/DeepSeek-R1-Distill-Qwen-7B --port 8080 --trust-remote-code --served-model-name deepseek-r1 --max-model-len 32768 --gpu-memory-utilization 0.95 --enforce-eager"

The command creates an InferenceService named deepseek. Expected output includes a confirmation message and job‑submission details.

Step 3: Verify the Service

# Check service status
arena serve get deepseek

The output should show the service is Running with one replica and provide the address (e.g., http://deepseek-default.example.com).

# Get Nginx Ingress IP and service hostname
NGINX_INGRESS_IP=$(kubectl -n kube-system get svc nginx-ingress-lb -ojsonpath='{.status.loadBalancer.ingress[0].ip}')
SERVICE_HOSTNAME=$(kubectl get inferenceservice deepseek -o jsonpath='{.status.url}' | cut -d '/' -f3)
# Send a test request
curl -H "Host: $SERVICE_HOSTNAME" \
     -H "Content-Type: application/json" \
     http://$NGINX_INGRESS_IP:80/v1/chat/completions \
     -d '{"model": "deepseek-r1", "messages": [{"role": "user", "content": "Say this is a test!"}], "max_tokens": 512, "temperature": 0.7, "top_p": 0.9, "seed": 10}'

The response should contain a valid chat completion JSON.
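As an alternative to raw curl, the same request can be issued from an application. The Python sketch below builds an equivalent OpenAI‑compatible payload and extracts the assistant reply from a response body; the helper names are ours for illustration and are not part of vLLM or KServe.

```python
import json

def build_chat_request(prompt, model="deepseek-r1", max_tokens=512):
    """Build an OpenAI-compatible /v1/chat/completions payload for the vLLM server."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": 0.7,
    }

def extract_reply(response_body):
    """Pull the assistant message text out of a chat-completion response body."""
    data = json.loads(response_body)
    return data["choices"][0]["message"]["content"]

payload = build_chat_request("Say this is a test!")
# POST json.dumps(payload) to http://$NGINX_INGRESS_IP/v1/chat/completions,
# setting the Host header to $SERVICE_HOSTNAME (e.g., with the requests library),
# then pass the response text to extract_reply().
```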

Observability

Enable Prometheus metrics by adding --enable-prometheus=true to the arena serve kserve command. vLLM and KServe expose a range of inference metrics that can be visualized in the Prometheus dashboard.
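For a quick look at the raw metrics without a dashboard, you can scrape the serving pod directly. The sketch below is an assumption‑laden example: the predictor Service name follows KServe's usual `<name>-predictor` convention, and vLLM metric names are prefixed with `vllm:`; verify both in your cluster.

```shell
# Port-forward the predictor Service locally and scrape the /metrics endpoint.
# "deepseek-predictor" and port 80 are assumptions based on KServe naming defaults.
kubectl port-forward svc/deepseek-predictor 8080:80 &
sleep 2
curl -s http://localhost:8080/metrics | grep '^vllm:' | head
```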

Elastic Autoscaling

KServe integrates with Kubernetes HPA. Example command to scale based on GPU utilization:

arena serve kserve \
    --name=deepseek \
    --image=kube-ai-registry.cn-shanghai.cr.aliyuncs.com/kube-ai/vllm:v0.6.6 \
    --gpus=1 \
    --cpu=4 \
    --memory=12Gi \
    --scale-metric=DCGM_CUSTOM_PROCESS_SM_UTIL \
    --scale-target=80 \
    --min-replicas=1 \
    --max-replicas=2 \
    --data=llm-model:/model/DeepSeek-R1-Distill-Qwen-7B \
    "vllm serve /model/DeepSeek-R1-Distill-Qwen-7B --port 8080 --trust-remote-code --served-model-name deepseek-r1 --max-model-len 32768 --gpu-memory-utilization 0.95 --enforce-eager"

This configuration scales the service between one and two replicas, adding a replica when GPU SM utilization exceeds 80%.

Model Acceleration

Using Fluid to cache model files can reduce cold‑start latency by over 50% on A10 GPUs compared with direct OSS access.
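A minimal sketch of such a cache is shown below, assuming the Fluid (ack-fluid) component is installed in the cluster. The bucket path, replica count, and memory quota are illustrative placeholders, not values from the ACK documentation.

```shell
# Illustrative Fluid Dataset + JindoRuntime pair that caches the OSS model path.
# <YOUR-BUCKET> and the tiered-store sizing are placeholders to adapt.
kubectl apply -f - <<'EOF'
apiVersion: data.fluid.io/v1alpha1
kind: Dataset
metadata:
  name: deepseek-model
spec:
  mounts:
    - mountPoint: oss://<YOUR-BUCKET>/models/DeepSeek-R1-Distill-Qwen-7B
      name: deepseek
---
apiVersion: data.fluid.io/v1alpha1
kind: JindoRuntime
metadata:
  name: deepseek-model
spec:
  replicas: 1
  tieredstore:
    levels:
      - mediumtype: MEM   # cache in memory; adjust quota to the model size
        quota: 20Gi
EOF
```

Once the Dataset is bound, Fluid exposes a PVC of the same name that the inference service can mount in place of the direct OSS volume.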

Gray Release

ACK supports traffic‑percentage or header‑based gray‑release strategies for inference services. Refer to the official gray‑release documentation for configuration details.

GPU‑Shared Inference

Since DeepSeek‑R1‑Distill‑Qwen‑7B requires only ~14 GB VRAM, multiple services can share a single high‑end GPU (e.g., A100) using GPU‑sharing techniques to improve utilization.
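A sketch of a GPU‑shared deployment is shown below. It assumes the ACK cGPU sharing component is installed, and that `arena serve kserve` accepts a `--gpumemory` flag mirroring other `arena serve` subcommands; verify the exact flag against your Arena version before relying on it.

```shell
# Sketch: request a ~15 GiB slice of a shared GPU instead of a whole card.
# --gpumemory (instead of --gpus=1) is an assumption to verify for this subcommand.
arena serve kserve \
    --name=deepseek-shared \
    --image=kube-ai-registry.cn-shanghai.cr.aliyuncs.com/kube-ai/vllm:v0.6.6 \
    --gpumemory=15 \
    --cpu=4 \
    --memory=12Gi \
    --data=llm-model:/model/DeepSeek-R1-Distill-Qwen-7B \
    "vllm serve /model/DeepSeek-R1-Distill-Qwen-7B --port 8080 --trust-remote-code --served-model-name deepseek-r1 --max-model-len 32768 --gpu-memory-utilization 0.95 --enforce-eager"
```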

Conclusion

DeepSeek‑R1 delivers strong performance on a variety of tasks. This article demonstrated how to deploy a production‑grade DeepSeek inference service on Alibaba Cloud ACK with KServe, covering model deployment, observability, autoscaling, model acceleration, gray‑release, and GPU‑shared inference.

Written by

Alibaba Cloud Infrastructure
