Deploy Kimi 2.5 LLM on Alibaba Cloud with SGLang, RBG, and Openclaw

This guide walks through preparing the Kimi 2.5 model, uploading it to OSS, configuring persistent storage, and using SGLang, RoleBasedGroup, and Openclaw to deploy a production‑grade inference service on Alibaba Cloud Kubernetes with step‑by‑step commands and YAML examples.


Background

Kimi 2.5 was open‑sourced on 2026‑01‑27. Built on the Kimi‑K2‑Base foundation, it was pre‑trained on ~15 trillion mixed visual‑text tokens, providing strong visual perception, logical reasoning, code generation, and agentic execution. The model supports two modes: an "instant" low‑latency chat mode and a "thinking" mode for deep planning.

SGLang Inference Framework

SGLang is a high‑performance inference framework for large language models. Key capabilities relevant to Kimi 2.5 include:

Native PD separation: decouples the Prefill and Decode stages.

Efficient MoE kernels: integrates DeepEP and supports Expert Parallelism with All‑to‑All communication.

Advanced scheduling: Continuous Batching and Overlap Scheduling to maximize GPU utilization.

Distributed optimization: Tensor Parallelism and Expert Parallelism for scaling to thousands of GPUs.

Repository: https://github.com/sgl-project/sglang
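
These capabilities map directly onto launcher flags. A minimal single‑node sketch with illustrative values (the full production command appears in the RBG manifest below):

python3 -m sglang.launch_server \
  --model-path /models/Kimi-K2.5 \
  --tp-size 8 \
  --trust-remote-code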

Openclaw Communication Gateway

Openclaw is an open‑source cross‑platform gateway that connects AI agents to mainstream IM platforms (WhatsApp via Baileys, Telegram, Discord, iMessage, etc.) and can be extended via plugins. It provides a local control panel, media transfer, speech transcription, group management, multi‑agent routing, and streaming response capabilities.

Repository: https://github.com/openclaw/openclaw

RoleBasedGroup (RBG) Orchestration Engine

RBG is a cloud‑native orchestration engine originating from the SGLang community. It models an inference service as a group of cooperating roles, managing their lifecycle, cross‑instance communication, and Prefill/Decode role scheduling through a declarative API.

Repository: https://github.com/sgl-project/rbg

Model File Preparation

Download the model weights from ModelScope using Git LFS. Skipping smudge keeps the initial clone fast; git lfs pull then fetches the large files:

git lfs install
GIT_LFS_SKIP_SMUDGE=1 git clone https://www.modelscope.cn/moonshotai/Kimi-K2.5.git
cd Kimi-K2.5/
git lfs pull
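
Before uploading, it is worth confirming that every LFS object was actually materialized (a minimal check; the exact file count and total size depend on the model revision you cloned):

# Entries prefixed with "*" are fully downloaded; "-" marks files still stored as pointers
git lfs ls-files --size

# Rough total size on disk
du -sh .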

Upload Model to OSS

After creating an OSS bucket, upload the model directory:

ossutil mkdir oss://YOUR_BUCKET_NAME/models/Kimi-K2.5
ossutil cp -r ./Kimi-K2.5 oss://YOUR_BUCKET_NAME/models/Kimi-K2.5
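
Before wiring the bucket into Kubernetes, verify the upload; the listing should match the local directory contents:

ossutil ls oss://YOUR_BUCKET_NAME/models/Kimi-K2.5/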

PersistentVolume (PV) and PersistentVolumeClaim (PVC) Configuration

Create a secret with OSS credentials and define the PV/PVC. Replace placeholders with your actual values.

apiVersion: v1
kind: Secret
metadata:
  name: oss-secret
stringData:
  akId: YOUR_OSS_AK   # AccessKey ID
  akSecret: YOUR_OSS_SK   # AccessKey Secret
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: kimi-k2-5
  labels:
    alicloud-pvname: kimi-k2-5
spec:
  capacity:
    storage: 5Gi
  accessModes:
    - ReadOnlyMany
  persistentVolumeReclaimPolicy: Retain
  csi:
    driver: ossplugin.csi.alibabacloud.com
    volumeHandle: kimi-k2-5
    nodePublishSecretRef:
      name: oss-secret
      namespace: default
    volumeAttributes:
      fuseType: ossfs2
      bucket: YOUR_BUCKET_NAME
      url: YOUR_BUCKET_ENDPOINT
      path: /models/Kimi-K2.5
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: kimi-k2-5
spec:
  accessModes:
    - ReadOnlyMany
  resources:
    requests:
      storage: 5Gi
  selector:
    matchLabels:
      alicloud-pvname: kimi-k2-5

Apply the configuration:

kubectl create -f kimi-k2-5-pv-pvc.yaml
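
Both objects should reach the Bound status before any workload references the claim:

kubectl get pv kimi-k2-5
kubectl get pvc kimi-k2-5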

Deploy Inference Service with RBG

apiVersion: workloads.x-k8s.io/v1alpha1
kind: RoleBasedGroup
metadata:
  name: kimi-k2-5
spec:
  roles:
  - name: server
    replicas: 1
    template:
      spec:
        containers:
        - name: sglang
          image: ac2-mirror-registry.cn-hangzhou.cr.aliyuncs.com/evaluate/sglang:nightly-dev-20260129-0998de08
          command:
          - sh
          - -c
          - "python3 -m sglang.launch_server --model-path /models/Kimi-K2.5 --host 0.0.0.0 --port 8080 --mem-fraction-static 0.9 --tp-size 8 --trust-remote-code --tool-call-parser kimi_k2 --reasoning-parser kimi_k2"
          ports:
          - containerPort: 8080
            name: http
            protocol: TCP
          readinessProbe:
            failureThreshold: 3
            initialDelaySeconds: 15
            periodSeconds: 10
            successThreshold: 1
            tcpSocket:
              port: 8080
            timeoutSeconds: 1
          resources:
            limits:
              nvidia.com/gpu: "8"
            requests:
              nvidia.com/gpu: "8"
          volumeMounts:
          - mountPath: /models/Kimi-K2.5
            name: model
          - mountPath: /dev/shm
            name: dshm
        volumes:
        - name: model
          persistentVolumeClaim:
            claimName: kimi-k2-5
        - name: dshm
          emptyDir:
            medium: Memory
---
apiVersion: v1
kind: Service
metadata:
  labels:
    app: kimi-k2-5
  name: kimi-k2-5
  namespace: default
spec:
  ports:
  - name: http
    port: 8080
    protocol: TCP
    targetPort: 8080
  selector:
    rolebasedgroup.workloads.x-k8s.io/name: kimi-k2-5
  type: ClusterIP

Apply the manifest:

kubectl create -f kimi-k2-5-rbg.yaml

Verify the pod is running:

kubectl get po -l rolebasedgroup.workloads.x-k8s.io/name=kimi-k2-5
# Expected output example
# NAME                     READY   STATUS    RESTARTS   AGE
# kimi-k2-5-server-0      1/1     Running   0          8h
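
If the pod lingers in ContainerCreating or the readiness probe keeps failing, the SGLang startup log is the first place to look (the exact log lines vary by SGLang version, and loading weights of this size from OSS can take a while):

kubectl logs kimi-k2-5-server-0 -c sglang --tail=50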

Validate Inference Service

# Port‑forward the service
kubectl port-forward svc/kimi-k2-5 8080:8080

# Send a test request
curl http://127.0.0.1:8080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "/models/Kimi-K2.5",
    "prompt": "云原生是什么",
    "max_tokens": 10
  }'

The response is a JSON object containing a short completion about cloud‑native architecture.
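
SGLang also serves the OpenAI‑compatible chat completions endpoint; a quick sanity check along the same lines (the model field mirrors the --model-path value from the launch command):

curl http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "/models/Kimi-K2.5",
    "messages": [{"role": "user", "content": "What is cloud native?"}],
    "max_tokens": 64
  }'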

Install Openclaw

# macOS
curl -fsSL https://openclaw.bot/install.sh | bash

# Windows (PowerShell)
iwr -useb https://openclaw.ai/install.ps1 | iex

After installation the gateway runs at http://127.0.0.1:18789/.

Configure Openclaw to Use RBG Provider

Edit ~/.openclaw/openclaw.json and add the following configuration (replace any existing "models" section):

{
  "models": {
    "mode": "merge",
    "providers": {
      "rbg": {
        "baseUrl": "http://localhost:8080/v1",
        "apiKey": "rbg",
        "api": "openai-completions",
        "models": [
          {"id": "Kimi-K2.5", "name": "Kimi K2.5"}
        ]
      }
    }
  },
  "agents": {
    "defaults": {
      "model": {"primary": "rbg/Kimi-K2.5"},
      "models": {"rbg/Kimi-K2.5": {"alias": "Kimi K2.5"}}
    }
  }
}
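
Note that baseUrl points at http://localhost:8080/v1, which assumes the kubectl port-forward from the validation step is still running on the machine hosting the gateway; in a longer‑lived setup you would point it at a routable address for the kimi-k2-5 Service instead. To confirm the endpoint is reachable from the gateway host:

curl http://localhost:8080/v1/models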

Restart the gateway:

openclaw gateway restart

List loaded models to confirm the RBG provider is available:

openclaw models list
# Expected output includes rbg/Kimi-K2.5

Summary

Using RoleBasedGroup, Kimi 2.5 can be deployed on Alibaba Cloud ACK with zero‑downtime in‑place upgrades, intelligent pre‑warming, and seamless version updates, making it suitable for long‑running production inference services that require frequent tuning.
