
Deploying Qwen 3.5 Multimodal Model on Alibaba Cloud ACK with RoleBasedGroup

This guide details how to deploy the open‑source Qwen 3.5‑397B‑A17B multimodal LLM on Alibaba Cloud ACK using the RoleBasedGroup (RBG) engine, covering model preparation, Kubernetes resources, role‑based orchestration, performance tuning, and benchmark testing.


On February 16, 2026, Alibaba released the open‑source Qwen 3.5‑397B‑A17B model, a generational leap that introduces a gated‑delta architecture with sparse MoE, activating only about 17 B parameters per forward pass while the total parameter count reaches 397 B. This design reduces GPU memory usage by 60% and boosts throughput up to 19× for ultra‑long contexts (256 K tokens, extendable to 1 M tokens).

Key architectural innovations include:

Memory‑friendly activation: only ~5% of the parameters (roughly 17 B of 397 B) are active, allowing deployment on a broader range of GPU instances.

Throughput gains: an 8.6× increase for 32 K contexts and up to 19× for 256 K contexts, breaking long‑text inference bottlenecks.

However, the massive model size introduces infrastructure challenges such as All‑to‑All MoE communication, high bandwidth demands for the 17 B active parameters, and extreme sensitivity to GPU topology and NUMA placement, making cross‑node scheduling jitter a potential performance killer.

The model also adds native multimodal capabilities, processing visual and textual tokens end‑to‑end without a separate vision encoder. Benchmarks show superior scores on MMLU‑Pro (87.8), GPQA (88.4), and IFBench (76.5), as well as leading performance on MathVision, RealWorldQA, CC_OCR, RefCOCO, and MLVU video tasks. Visual programming is also enabled: hand‑drawn UI sketches can be turned directly into front‑end code.

Agent workloads benefit from the model's visual intelligence, supporting system‑level agents that scale to millions of asynchronous RL instances with a 3–5× speedup, as well as cross‑device operations on mobile apps and PCs. These agents require strict Prefill/Decode (PD) role ratios (e.g., 2:1) and deterministic, low‑latency KV‑Cache handling (a declarative sketch of such a ratio follows the list below), exposing three core infrastructure challenges:

State sensitivity – KV‑Cache must survive pod restarts and rolling updates.

Topology stability – PD workloads need hardware‑aware scheduling to avoid performance loss across NUMA or rack boundaries.

Coordinated scaling – Prefill and Decode roles must be scaled proportionally to prevent routing imbalance.
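
To make the role‑ratio requirement concrete, the minimal sketch below declares a Prefill role and a Decode role at the 2:1 ratio mentioned above. The apiVersion, kind, and field names are illustrative assumptions modeled on typical role‑based CRDs, not the verified RoleBasedGroup schema; consult the RBG documentation for the authoritative API.

```yaml
# Illustrative sketch only: apiVersion, kind, and field names are assumptions,
# not the verified RBG schema. The replica counts encode the 2:1 PD ratio.
apiVersion: workloads.x-k8s.io/v1alpha1
kind: RoleBasedGroup
metadata:
  name: qwen35-pd
spec:
  roles:
    - name: prefill
      replicas: 2        # two Prefill replicas...
      template: {}       # pod template for the prefill engine (omitted here)
    - name: decode
      replicas: 1        # ...for every one Decode replica
      template: {}       # pod template for the decode engine (omitted here)
```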

To address these challenges, the article introduces RoleBasedGroup (RBG), a cloud‑native AI orchestration engine contributed by Alibaba Cloud, Xiaohongshu, SquirrelFuture, iFlytek, and Nanjing University. RBG treats an LLM inference service as a set of stateful "roles" rather than a stateless pod collection, providing five core capabilities (SCOPE):

Stable: topology‑aware role IDs ensure zero‑drift upgrades and KV‑Cache preservation.

Coordination: declarative policies enforce role‑pair ratios, synchronized rollouts, and fault‑tolerant failover.

Orchestration: explicit role dependencies and startup ordering with built‑in service discovery.

Performance: GPU‑topology‑aware packing, affinity/anti‑affinity constraints, and intra‑role load balancing.

Extensible: plug‑in architecture allows new role templates without changing core code.

The guide walks through a complete deployment on an ACK cluster (Kubernetes v1.28+ with GPU nodes):

Prepare the model by cloning the ModelScope repository with git lfs and uploading the files to an OSS bucket.
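
A minimal sketch of this step, assuming the ossutil CLI is installed and configured; the ModelScope repository path and OSS bucket name are placeholders:

```bash
# Pull the model weights from ModelScope (requires git-lfs); repo path is a placeholder.
git lfs install
git clone https://www.modelscope.cn/Qwen/Qwen3.5-397B-A17B.git

# Sync the weights into the OSS bucket that the PV in the next step will mount.
ossutil cp -r Qwen3.5-397B-A17B/ oss://<your-bucket>/models/qwen3.5/ --update
```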

Create a PersistentVolume (PV) and PersistentVolumeClaim (PVC) that bind the OSS bucket to /models/qwen3.5.
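
A hedged PV/PVC sketch using ACK's OSS CSI driver (ossplugin.csi.alibabacloud.com); the bucket name, region endpoint, credential Secret, and capacity are placeholders, and the serving pod later mounts the PVC at /models/qwen3.5:

```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: qwen35-model-pv
spec:
  capacity:
    storage: 800Gi                  # placeholder; size to the checkpoint you upload
  accessModes: ["ReadOnlyMany"]
  persistentVolumeReclaimPolicy: Retain
  csi:
    driver: ossplugin.csi.alibabacloud.com
    volumeHandle: qwen35-model-pv
    nodePublishSecretRef:
      name: oss-secret              # placeholder Secret holding the OSS AccessKey
      namespace: default
    volumeAttributes:
      bucket: "<your-bucket>"                       # placeholder bucket
      url: "oss-cn-hangzhou-internal.aliyuncs.com"  # placeholder region endpoint
      otherOpts: "-o umask=022 -o allow_other"
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: qwen35-model-pvc
spec:
  accessModes: ["ReadOnlyMany"]
  storageClassName: ""
  volumeName: qwen35-model-pv
  resources:
    requests:
      storage: 800Gi
```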

Install RBG (>= v0.6.0) and configure a RoleBasedGroup resource that defines a server role with 8‑GPU tensor parallelism (TP) for the SGLang or vLLM engine, as sketched below.
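
A hedged sketch of such a RoleBasedGroup with a single server role running SGLang at tensor parallelism 8. As in the earlier sketch, the apiVersion and field names are assumptions rather than the verified RBG v0.6.0 schema, and the image is a placeholder; --model-path, --tp-size, and --port are standard sglang.launch_server options:

```yaml
apiVersion: workloads.x-k8s.io/v1alpha1   # assumption; check the CRD installed with RBG
kind: RoleBasedGroup
metadata:
  name: qwen35-server
spec:
  roles:
    - name: server
      replicas: 1
      template:
        spec:
          containers:
            - name: sglang
              image: <sglang-or-vllm-image>          # placeholder serving image
              command:
                - python3
                - -m
                - sglang.launch_server
                - --model-path=/models/qwen3.5
                - --tp-size=8                         # 8-GPU tensor parallelism
                - --port=8000
              resources:
                limits:
                  nvidia.com/gpu: 8
              volumeMounts:
                - name: model
                  mountPath: /models/qwen3.5
          volumes:
            - name: model
              persistentVolumeClaim:
                claimName: qwen35-model-pvc           # PVC from the previous step
```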

Apply the YAML to ACK; RBG handles pod naming, service discovery, and in‑place upgrades, preserving KV‑Cache across updates.
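
Applying and inspecting the resources might look like the following; the file names follow the sketches above, and querying the CRD by its kind assumes it is registered as rolebasedgroup:

```bash
# Create the storage objects and the RoleBasedGroup (file names are placeholders).
kubectl apply -f qwen35-model-storage.yaml
kubectl apply -f qwen35-rolebasedgroup.yaml

# Check the group and its role pods; RBG assigns stable, role-scoped pod names.
kubectl get rolebasedgroup qwen35-server
kubectl get pods -w
```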

Verification steps include port‑forwarding the service and sending multimodal inference requests via curl, with example responses for image and video understanding.
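
A minimal verification sketch, assuming the engine exposes an OpenAI‑compatible /v1/chat/completions endpoint on port 8000; the Service name, model id, and image URL are placeholders:

```bash
# Forward the serving port to the local machine (Service name is a placeholder).
kubectl port-forward svc/qwen35-server 8000:8000 &

# Send a simple image-understanding request using the OpenAI-compatible payload format.
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3.5",
    "messages": [{
      "role": "user",
      "content": [
        {"type": "image_url", "image_url": {"url": "https://example.com/ui-sketch.png"}},
        {"type": "text", "text": "Describe this UI sketch and generate the front-end code for it."}
      ]
    }]
  }'
```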

Advanced usage demonstrates RBG’s benchmark tool:

Deploy a benchmark job that runs traffic scenarios (e.g., D(100,1000), D(500,500)) with varying concurrency.

Collect logs, configuration, and results through kubectl rbg llm benchmark commands (a hedged command sketch follows this list).

Visualize performance metrics (throughput, latency, tokens per second) via the built‑in dashboard.
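
A hedged sketch of driving such a run from the command line; only the base kubectl rbg llm benchmark command comes from the text above, while the subcommands and flags shown here are assumptions, so consult the RBG benchmark documentation for the real interface:

```bash
# Launch a traffic scenario against the deployed group (flags are illustrative assumptions).
kubectl rbg llm benchmark run \
  --target qwen35-server \
  --scenario "D(100,1000)" \
  --concurrency 32

# Collect the run's logs, configuration, and results (subcommand name is an assumption).
kubectl rbg llm benchmark results --target qwen35-server
```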

In summary, deploying Qwen 3.5 with RBG on ACK yields three major benefits:

Higher efficiency: topology‑aware scheduling reduces time to first token (TTFT) by over 30%.

Simpler operations: declarative role definitions replace complex StatefulSet setups.

Greater stability: in‑place upgrades and coordinated PD scaling eliminate service interruptions and performance jitter.

The article concludes that role‑based, topology‑aware orchestration will become the de facto standard for cloud‑native AI infrastructure as multimodal and mixed‑precision models continue to evolve.

Tags: Kubernetes, benchmarking, multimodal LLM, Cloud Native AI, RoleBasedGroup, Qwen3.5
Written by Alibaba Cloud Infrastructure