Deploy Kimi 2.5 LLM on Alibaba Cloud with SGLang, RBG, and Openclaw
This guide walks through preparing the Kimi 2.5 model files, uploading them to OSS, configuring persistent storage, and then using SGLang, RoleBasedGroup, and Openclaw to stand up a production‑grade inference service on Alibaba Cloud Kubernetes, with step‑by‑step commands and YAML examples throughout.
Background
Kimi 2.5 was open‑sourced on 2026‑01‑27. Built on the Kimi‑K2‑Base foundation, it was pre‑trained on ~15 trillion mixed visual‑text tokens, providing strong visual perception, logical reasoning, code generation, and agentic execution. The model supports two modes: an "instant" low‑latency chat mode and a "thinking" mode for deep planning.
SGLang Inference Framework
SGLang is a high‑performance inference framework for large language models. Key capabilities relevant to Kimi 2.5 include:
Native PD separation: decouples the Prefill and Decode stages so each can be scheduled and scaled independently.
Efficient MoE kernels: integrates DeepEP and supports Expert Parallelism with All‑to‑All communication.
Advanced scheduling: Continuous Batching and an Overlap Schedule maximize GPU utilization.
Distributed optimization: Tensor Parallelism and Expert Parallelism scale serving to thousands of GPUs (see the launch sketch below).
Repository: https://github.com/sgl-project/sglang
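These capabilities surface as flags on the launch_server CLI. As a point of reference before the Kubernetes deployment below, here is a minimal single‑node launch sketch; it assumes the sglang package is installed and a checkpoint exists at the given path, and the multi‑node flags mentioned in the comments are assumptions based on common SGLang builds (check launch_server --help for your version):
# Serve a local checkpoint with Tensor Parallelism across 8 GPUs.
# (--ep-size would additionally shard MoE experts; multi-node setups
# combine it with --nnodes and --node-rank.)
python3 -m sglang.launch_server \
  --model-path /models/Kimi-K2.5 \
  --tp-size 8 \
  --trust-remote-code \
  --host 0.0.0.0 --port 8080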
Openclaw Communication Gateway
Openclaw is an open‑source cross‑platform gateway that connects AI agents to mainstream IM platforms (WhatsApp via Baileys, Telegram, Discord, iMessage, etc.) and can be extended via plugins. It provides a local control panel, media transfer, speech transcription, group management, multi‑agent routing, and streaming response capabilities.
Repository: https://github.com/openclaw/openclaw
RoleBasedGroup (RBG) Orchestration Engine
RBG is a cloud‑native orchestration engine originating from the SGLang community. It models an inference service as a group of cooperating roles, managing lifecycle, cross‑instance communication, and Prefill/Decode role scheduling through a declarative API.
Project address: https://github.com/sgl-project/rbg
Model File Preparation
git lfs install
GIT_LFS_SKIP_SMUDGE=1 git clone https://www.modelscope.cn/moonshotai/Kimi-K2.5.git
cd Kimi-K2.5/
git lfs pull
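Because the clone above skips the LFS smudge step, it is worth confirming the weight files actually downloaded before uploading anything. A quick sanity check, assuming the usual safetensors shard layout:
# Real weight shards are gigabytes each; un-pulled LFS pointers are ~130 bytes.
ls -lh *.safetensors | head
# The total checkout size should be orders of magnitude larger than a pointer-only clone.
du -sh .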
Upload Model to OSS
After creating an OSS bucket, upload the model directory:
ossutil mkdir oss://YOUR_BUCKET_NAME/models/Kimi-K2.5
ossutil cp -r ./Kimi-K2.5 oss://YOUR_BUCKET_NAME/models/Kimi-K2.5
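Optionally confirm the upload completed before wiring it into Kubernetes; the object count and total size reported here should match the local directory:
ossutil ls oss://YOUR_BUCKET_NAME/models/Kimi-K2.5/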
PersistentVolume (PV) and PersistentVolumeClaim (PVC) Configuration
Create a secret with OSS credentials and define the PV/PVC. Replace placeholders with your actual values.
apiVersion: v1
kind: Secret
metadata:
  name: oss-secret
stringData:
  akId: YOUR_OSS_AK # AccessKey ID
  akSecret: YOUR_OSS_SK # AccessKey Secret
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: kimi-k2-5
  labels:
    alicloud-pvname: kimi-k2-5
spec:
  capacity:
    storage: 5Gi
  accessModes:
    - ReadOnlyMany
  persistentVolumeReclaimPolicy: Retain
  csi:
    driver: ossplugin.csi.alibabacloud.com
    volumeHandle: kimi-k2-5
    nodePublishSecretRef:
      name: oss-secret
      namespace: default
    volumeAttributes:
      fuseType: ossfs2
      bucket: YOUR_BUCKET_NAME
      url: YOUR_BUCKET_ENDPOINT
      path: /models/Kimi-K2.5
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: kimi-k2-5
spec:
  accessModes:
    - ReadOnlyMany
  resources:
    requests:
      storage: 5Gi
  selector:
    matchLabels:
      alicloud-pvname: kimi-k2-5
Apply the configuration:
kubectl create -f kimi-k2-5-pv-pvc.yaml
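The PVC should bind immediately since the PV is statically provisioned. A quick check before moving on (names match the manifest above):
kubectl get pv kimi-k2-5
kubectl get pvc kimi-k2-5
# Both should show STATUS Bound before the inference pod is created.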
Deploy Inference Service with RBG
apiVersion: workloads.x-k8s.io/v1alpha1
kind: RoleBasedGroup
metadata:
  name: kimi-k2-5
spec:
  roles:
    - name: server
      replicas: 1
      template:
        spec:
          containers:
            - name: sglang
              image: ac2-mirror-registry.cn-hangzhou.cr.aliyuncs.com/evaluate/sglang:nightly-dev-20260129-0998de08
              command:
                - sh
                - -c
                - "python3 -m sglang.launch_server --model-path /models/Kimi-K2.5 --host 0.0.0.0 --port 8080 --mem-fraction-static 0.9 --tp-size 8 --trust-remote-code --tool-call-parser kimi_k2 --reasoning-parser kimi_k2"
              ports:
                - containerPort: 8080
                  name: http
                  protocol: TCP
              readinessProbe:
                failureThreshold: 3
                initialDelaySeconds: 15
                periodSeconds: 10
                successThreshold: 1
                tcpSocket:
                  port: 8080
                timeoutSeconds: 1
              resources:
                limits:
                  nvidia.com/gpu: "8"
                requests:
                  nvidia.com/gpu: "8"
              volumeMounts:
                - mountPath: /models/Kimi-K2.5
                  name: model
                - mountPath: /dev/shm
                  name: dshm
          volumes:
            - name: model
              persistentVolumeClaim:
                claimName: kimi-k2-5
            - name: dshm
              emptyDir:
                medium: Memory
---
apiVersion: v1
kind: Service
metadata:
  labels:
    app: kimi-k2-5
  name: kimi-k2-5
  namespace: default
spec:
  ports:
    - name: http
      port: 8080
      protocol: TCP
      targetPort: 8080
  selector:
    rolebasedgroup.workloads.x-k8s.io/name: kimi-k2-5
  type: ClusterIP
Apply the service definition:
kubectl create -f kimi-k2-5-rbg.yaml
Verify the pod is running:
kubectl get po -l rolebasedgroup.workloads.x-k8s.io/name=kimi-k2-5
# Expected output example
# NAME                 READY   STATUS    RESTARTS   AGE
# kimi-k2-5-server-0   1/1     Running   0          8h
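Loading a checkpoint of this size from OSS can take a while, so the pod may stay unready at first. Tailing the server logs is an easy way to watch startup progress (a sketch; the exact log lines vary by SGLang version):
kubectl logs -f -l rolebasedgroup.workloads.x-k8s.io/name=kimi-k2-5 --tail=50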
Validate Inference Service
# Port‑forward the service
kubectl port-forward svc/kimi-k2-5 8080:8080
# Send a test request
curl http://127.0.0.1:8080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "/models/Kimi-K2.5",
    "prompt": "What is cloud native?",
    "max_tokens": 10
  }'
The response is a JSON object containing a short completion about cloud‑native architecture.
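Since the deployment enables Kimi's tool‑call and reasoning parsers, the chat endpoint is worth exercising too. A minimal sketch against SGLang's OpenAI‑compatible /v1/chat/completions route:
curl http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "/models/Kimi-K2.5",
    "messages": [{"role": "user", "content": "What is cloud native?"}],
    "max_tokens": 64
  }'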
Install Openclaw
# macOS
curl -fsSL https://openclaw.bot/install.sh | bash
# Windows (PowerShell)
iwr -useb https://openclaw.ai/install.ps1 | iex
After installation the gateway runs at http://127.0.0.1:18789/.
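A quick check that the gateway is actually listening on that address (a sketch using the URL above; the exact response body depends on the control panel):
curl -sf http://127.0.0.1:18789/ >/dev/null && echo "gateway is up"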
Configure Openclaw to Use RBG Provider
Edit ~/.openclaw/openclaw.json and add the following configuration (replace any existing "models" section):
{
  "models": {
    "mode": "merge",
    "providers": {
      "rbg": {
        "baseUrl": "http://localhost:8080/v1",
        "apiKey": "rbg",
        "api": "openai-completions",
        "models": [
          {"id": "Kimi-K2.5", "name": "Kimi K2.5"}
        ]
      }
    }
  },
  "agents": {
    "defaults": {
      "model": {"primary": "rbg/Kimi-K2.5"},
      "models": {"rbg/Kimi-K2.5": {"alias": "Kimi K2.5"}}
    }
  }
}
Restart the gateway:
openclaw gateway restart
List loaded models to confirm the RBG provider is available:
openclaw models list
# Expected output includes rbg/Kimi-K2.5
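One detail to watch: the server was launched with --model-path /models/Kimi-K2.5, so depending on the SGLang version it may advertise the full path as its model id rather than Kimi-K2.5. If requests routed through the provider fail with an unknown‑model error, one option is to pin the served name at launch; this sketch assumes your SGLang build supports --served-model-name (verify with launch_server --help):
# Add --served-model-name to the command in the RBG manifest so the
# backend answers to the id used in the Openclaw config:
python3 -m sglang.launch_server --model-path /models/Kimi-K2.5 \
  --served-model-name Kimi-K2.5 \
  --host 0.0.0.0 --port 8080 --tp-size 8 --trust-remote-code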
Summary
Using RoleBasedGroup, Kimi 2.5 can be deployed on Alibaba Cloud ACK with zero‑downtime in‑place upgrades, intelligent pre‑warming, and seamless version updates, making it suitable for long‑running production inference services that require frequent tuning.