Boost LLM Performance: Deploy Qwen3‑235B with PD‑Separation, MoE, SGLang & RBG

This article details how to deploy the 235‑billion‑parameter Qwen3‑235B model using PD‑separation and MoE techniques, explains the associated challenges, and demonstrates a production‑grade solution built on the high‑performance SGLang inference engine and the RoleBasedGroup (RBG) orchestration framework, complete with benchmark results and best‑practice YAML examples.


Qwen3‑235B Overview

Qwen3‑235B‑A22B‑Instruct‑2507‑FP8 is Alibaba Cloud's flagship large language model, a Mixture‑of‑Experts (MoE) design with 235 billion total parameters. Architecture upgrades and training optimizations improve general capability, multilingual coverage (119 languages, CSimpleQA 84.3), human‑preference alignment (Arena‑Hard v2 79.2) and native 256K‑token context length.

General capability boost: gains across instruction following, logical reasoning, text understanding, math reasoning, scientific computation, code generation and tool use.

Multilingual coverage: supports 119 languages; CSimpleQA score of 84.3 (vs. 71.1 for DeepSeek‑V3‑0324).

Human‑preference alignment: Arena‑Hard v2 score of 79.2, a 52% increase over the previous version.

Long‑context capacity: native 256K‑token context window (262,144 tokens).

Deployment Challenges

Massive compute cost per forward pass.

Memory pressure from the KV‑Cache, which grows linearly with sequence length (see the sizing sketch after this list).

Latency requirements (time‑to‑first‑token, TTFT) in long‑context scenarios.
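To make the KV‑Cache pressure concrete, here is a back‑of‑the‑envelope estimate; the layer/head figures below are illustrative assumptions, not confirmed Qwen3‑235B internals:

\[
M_{\mathrm{KV}} \approx 2 \cdot n_{\mathrm{layers}} \cdot n_{\mathrm{kv\,heads}} \cdot d_{\mathrm{head}} \cdot b \cdot L
\]

where the factor 2 covers keys and values, b is bytes per element and L is the sequence length. With, say, 94 layers, 4 KV heads of dimension 128 and FP8 storage (b = 1), each token costs roughly 96 KB, so a single full 262,144‑token context would occupy on the order of 25 GB before any batching.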

PD‑Separation and MoE Solutions

PD‑Separation: splits the prefilling (compute‑intensive) and decoding (memory‑intensive) stages, allowing independent scaling and reducing both TTFT and per‑token latency (TPOT).

MoE architecture: activates only 22 B of the 235 B parameters per forward pass, dramatically cutting compute and memory while preserving model capacity.
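For a rough sense of the compute savings, using the common estimate of ≈2 FLOPs per active parameter per token (a rule of thumb, not a measured figure):

\[
\frac{C_{\mathrm{MoE}}}{C_{\mathrm{dense}}} \approx \frac{2 \times 22 \times 10^{9}}{2 \times 235 \times 10^{9}} \approx 0.094
\]

i.e., each forward pass costs on the order of 9% of an equally sized dense model, while the full 235 B parameters must still reside in (distributed) memory.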

New Challenges Introduced by PD‑Separation + MoE

Latency sensitivity: non‑linear TTFT growth and All‑to‑All communication overhead in expert parallelism.

Role coordination complexity: Prefill and Decode must be proportionally paired; partial upgrades can cause protocol incompatibility; fault recovery lacks automatic end‑to‑end handling.

Topology stability: pod placement across NVLink domains or NUMA boundaries can increase latency; rolling updates cause topology drift and performance jitter.

Service discovery & orchestration: requires dynamic role awareness; external registries (etcd/Consul) add coupling and failure surface.

Successful deployment hinges on two capabilities:

An inference engine that efficiently executes PD‑separation and MoE: SGLang.

An orchestration platform that reliably coordinates multiple roles: RoleBasedGroup (RBG).

SGLang – High‑Performance LLM Inference Engine

Native support for PD‑separation architecture.

Integrated high‑efficiency MoE kernel (DeepEP) with expert parallelism and optimized All‑to‑All communication.

Advanced scheduling (continuous and overlap batching) to maximize GPU utilization.

Distributed optimizations including Tensor Parallelism and Expert Parallelism for scaling to thousand‑GPU clusters.

RoleBasedGroup (RBG) – Elastic Role‑Based Orchestration

Project repository: https://github.com/sgl-project/rbg

RBG treats an LLM inference service as a topology‑aware, stateful organism composed of cooperating roles (Prefill, Decode, Router, etc.). It provides the SCOPE framework:

S – Stable: deterministic operations that respect hardware topology.

C – Coordination: declarative role‑dependency engine for deployment, upgrade and fault coordination.

O – Orchestration: built‑in service discovery and ordered startup.

P – Performance: topology‑aware scheduling, GPU‑NVLink priority, affinity constraints, and short‑circuit reads.

E – Extensible: declarative API and plugin mechanism that accommodates future architectures without core code changes.

Deploying a PD‑Separated Service with RBG + SGLang

The example YAML below (truncated) demonstrates role definitions, dependencies, resource requests and launch commands. The router role depends on both the prefill and decode roles, ensuring the correct start order.

apiVersion: workloads.x-k8s.io/v1alpha1
kind: RoleBasedGroup
metadata:
  name: sglang-pd-demo
spec:
  roles:
  - name: router
    replicas: 1
    dependencies: ["decode", "prefill"]
    template:
      spec:
        containers:
        - name: scheduler
          image: ac2-mirror-registry.cn-hangzhou.aliyuncs.com/evaluate/mooncake:0.3.7.post2-sglang0.5.5.post3-deepep
          command:
          - python
          - -m
          - sglang_router.launch_router
          - --pd-disaggregation
          - --prefill
          - http://sglang-pd-demo-prefill-0.s-sglang-pd-demo-prefill:30001
          - "8991"  # bootstrap port; matches --disaggregation-bootstrap-port on the prefill server
          - --decode
          - http://sglang-pd-demo-decode-0.s-sglang-pd-demo-decode:30001
          - --host
          - 0.0.0.0
          - --port
          - "8000"
          - --policy
          - cache_aware
  - name: prefill
    replicas: 1
    template:
      spec:
        volumes:
        - name: model
          persistentVolumeClaim:
            claimName: qwen3-235b
        - name: dshm
          emptyDir:
            medium: Memory
            sizeLimit: 30Gi
        containers:
        - name: sglang-prefill
          image: ac2-mirror-registry.cn-hangzhou.aliyuncs.com/evaluate/mooncake:0.3.7.post2-sglang0.5.5.post3-deepep
          command:
          - python3
          - -m
          - sglang.launch_server
          - --model-path
          - /models/Qwen3-235B-A22B-Instruct-2507-FP8
          - --port
          - "30001"
          - --base-gpu-id
          - "0"
          - --disaggregation-mode
          - prefill
          - --disable-radix-cache
          - --disaggregation-bootstrap-port
          - "8991"
          - --host
          - $(POD_IP)
          - --mem-fraction-static
          - "0.75"
          - --tp-size
          - "4"
          - --ep-size
          - "4"
          - --enable-dp-attention
          - --dp-size
          - "4"
          - --moe-a2a-backend
          - deepep
          - --cuda-graph-max-bs
          - "128"
          - --chunked-prefill-size
          - "16000"
          - --load-balance-method
          - round_robin
  - name: decode
    replicas: 1
    template:
      spec:
        volumes:
        - name: model
          persistentVolumeClaim:
            claimName: qwen3-235b
        - name: dshm
          emptyDir:
            medium: Memory
            sizeLimit: 30Gi
        containers:
        - name: sglang-decode
          image: ac2-mirror-registry.cn-hangzhou.aliyuncs.com/evaluate/mooncake:0.3.7.post2-sglang0.5.5.post3-deepep
          command:
          - python3
          - -m
          - sglang.launch_server
          - --model-path
          - /models/Qwen3-235B-A22B-Instruct-2507-FP8
          - --port
          - "30001"
          - --base-gpu-id
          - "0"
          - --disaggregation-mode
          - decode
          - --disable-radix-cache
          - --host
          - $(POD_IP)
          - --mem-fraction-static
          - "0.75"
          - --tp-size
          - "8"
          - --ep-size
          - "8"
          - --enable-dp-attention
          - --dp-size
          - "8"
          - --moe-a2a-backend
          - deepep
          - --attention-backend
          - flashinfer
          - --cuda-graph-max-bs
          - "32"
          - --load-balance-method
          - shortest_queue
          - --prefill-round-robin-balance
          - --max-running-requests
          - "300"
          - --decode-log-interval
          - "10"
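With the manifest saved locally (the filename below is illustrative), create the RoleBasedGroup:

kubectl apply -f sglang-pd-qwen3-235b.yaml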

After applying the YAML, the following pods become ready:

NAME                              READY   STATUS    RESTARTS   AGE
sglang-pd-demo-decode-0           1/1     Running   0          62m
sglang-pd-demo-prefill-0          1/1     Running   0          62m
sglang-pd-demo-router-0           1/1     Running   0          21m

Local testing can be performed with kubectl port-forward and a curl request to the model endpoint, for example:
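A minimal smoke test, using the router pod name from the listing above and assuming the served model name equals the model path (adjust both to your cluster):

kubectl port-forward pod/sglang-pd-demo-router-0 8000:8000

# in a second terminal:
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "/models/Qwen3-235B-A22B-Instruct-2507-FP8",
        "prompt": "Introduce yourself in one sentence.",
        "max_tokens": 64
      }'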

Benchmark Results

On an Alibaba Cloud H20 cluster with 12 GPUs (4 for Prefill, 8 for Decode), the PD‑separated service achieved a 1.93× throughput improvement over a TP4 baseline (4 GPUs) under SLO constraints of TTFT ≤ 5 s and ITL ≤ 40 ms, with 3,500‑token inputs and 1,500‑token outputs.

Performance Optimizations

Parallelism and load balancing: Expert Parallelism (EP) boosts resource utilization.

Efficient compute kernels (DeepGEMM, FlashInfer) for matrix multiplication and attention.

Overlap scheduling reduces average TTFT by 24.6%.

Data‑parallel load balancing improves token throughput by 7.5%.

Enhanced observability in SGLang with request‑level and pipeline‑level tracing.

Best‑Practice Deployment Steps

Install RBG (see https://github.com/sgl-project/rbg/blob/main/doc/install.md).

Prepare container images (the example uses the Mooncake image with SGLang 0.5.5 and DeepEP).

Apply the example YAML (full file at https://github.com/AliyunContainerService/ai-models-on-ack/blob/main/llm/sglang/sglang-pd-qwen3-235b.yaml) to create the RoleBasedGroup.

Verify pod readiness, then run a local kubectl port-forward followed by a curl request to the /v1/completions endpoint (see the smoke test above).

Future Outlook

External KV‑Cache storage: offload the Decode stage's KV‑Cache to a distributed cache pool, breaking single‑GPU memory limits for high‑concurrency agent sessions.

Dynamic cache orchestration: combine RBG’s topology‑aware scheduling with cross‑node KV‑Cache sharing to reduce memory duplication.

Agent‑aware scheduling: extend SGLang’s scheduler to allocate KV‑Cache based on agent request hotness, improving tool‑calling latency.
