How to Deploy Distributed LLM Inference with DeepSpeed on Alibaba Cloud ACK

This guide walks through deploying the Bloom 7B1 large language model for distributed inference on Alibaba Cloud Container Service for Kubernetes (ACK) using DeepSpeed, Arena, and Kubernetes, covering environment setup, model configuration, service launch, verification, and Ingress exposure.


Overview

This guide shows how to deploy the open‑source bigscience/bloom-7b1 model on Alibaba Cloud Container Service for Kubernetes (ACK) using DeepSpeed Inference for tensor‑parallel distributed inference.

Components

Arena: a Kubernetes-based MLOps tool for managing GPU resources, data, training, and serving.

Ingress: the Kubernetes resource that exposes Services to external traffic.

DeepSpeed Inference: Microsoft's inference engine that provides tensor-parallel model parallelism and optimized kernels for transformer models.

DJLServing: an HTTP model-serving framework that can run DeepSpeed-accelerated models.

Step‑by‑Step Procedure

1. Environment preparation

Create a GPU‑enabled Kubernetes cluster on ACK.

Install the Cloud‑Native AI suite (Arena, Ingress controller, etc.).
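
Before moving on, it helps to confirm that the GPU nodes are visible and that the Arena client can reach the cluster. A minimal sanity check, assuming the standard ACK GPU node label aliyun.accelerator/nvidia_name and a locally installed Arena client:

# List the nodes and the GPU model each one exposes (the label name is ACK's convention).
kubectl get nodes -L aliyun.accelerator/nvidia_name

# Confirm the Arena client is installed and show per-node GPU allocation.
arena version
arena top node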

2. Model configuration

Two configuration files are required:

serving.properties: specifies the inference engine, the tensor parallel degree (set to 2 to shard the model across two GPUs), the model identifier (an OSS path or a Hugging Face repository), the data type (fp16), and loading time-outs.

model.py: defines get_model, which loads the model and tokenizer, converts it with deepspeed.init_inference, and builds a Hugging Face text-generation pipeline, and handle, which processes incoming requests and returns JSON output.

The serving.properties used in this example:

engine=DeepSpeed
option.parallel_loading=true
option.tensor_parallel_degree=2
option.model_loading_timeout=600
option.model_id=model/LLM/bloom-7b1/deepspeed/bloom-7b1
option.data_type=fp16
option.max_new_tokens=100
The corresponding model.py:

import os, torch, logging, deepspeed
from djl_python.inputs import Input
from djl_python.outputs import Output
from transformers import pipeline, AutoModelForCausalLM, AutoTokenizer

logging.basicConfig(level=logging.DEBUG)

predictor = None

def get_model(properties):
    # Read the model location and tensor parallel degree from serving.properties.
    model_id = properties.get("model_id")
    mp_size = int(properties.get("tensor_parallel_degree", "2"))
    # Each worker process is pinned to one GPU via its local MPI rank.
    local_rank = int(os.getenv('OMPI_COMM_WORLD_LOCAL_RANK', '0'))
    model = AutoModelForCausalLM.from_pretrained(model_id)
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    # Shard the model across mp_size GPUs and inject DeepSpeed's optimized inference kernels.
    model = deepspeed.init_inference(model, mp_size=mp_size, dtype=torch.float16,
                                     replace_method='auto', replace_with_kernel_inject=True)
    return pipeline(task='text-generation', model=model, tokenizer=tokenizer, device=local_rank)

def handle(inputs: Input):
    global predictor
    if not predictor:
        # Build the pipeline lazily on the first call.
        predictor = get_model(inputs.get_properties())
    if inputs.is_empty():
        # An empty request is the warm-up call sent by DJLServing at start-up.
        return None
    data = inputs.get_as_string()
    output = Output()
    output.add_property("content-type", "application/json")
    result = predictor(data, do_sample=True, max_new_tokens=50)
    return output.add(result)

Upload these files (and optionally the model weights) to an OSS bucket, then create a PersistentVolume (PV) and PersistentVolumeClaim (PVC) that mount the OSS path into the container.
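
A minimal sketch of this step, assuming the ossutil client is configured; the bucket name, the code/ subdirectory, and the local file layout are placeholders to adapt, while bloom7b1-pvc and the default-group namespace match the names used later in this guide:

# Upload the configuration files (and, if needed, the model weights) to OSS.
ossutil cp serving.properties oss://<your-bucket>/model/LLM/bloom-7b1/deepspeed/bloom-7b1/code/
ossutil cp model.py oss://<your-bucket>/model/LLM/bloom-7b1/deepspeed/bloom-7b1/code/

# After applying the PV/PVC manifests for the OSS path, verify that the claim is bound.
kubectl -n default-group get pvc bloom7b1-pvc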

3. Launch the service

arena serve custom \
    --name=bloom7b1-deepspeed \
    --gpus=2 \
    --version=alpha \
    --replicas=1 \
    --restful-port=8080 \
    --data=bloom7b1-pvc:/model \
    --image=ai-studio-registry-vpc.cn-beijing.cr.aliyuncs.com/kube-ai/djl-serving:2023-05-19 \
    "djl-serving -m"

Check the pod status with kubectl get pod and inspect the logs to confirm that two processes (rank 0 and rank 1) have started, each loading the model and replacing its transformer layers with DeepSpeed inference kernels.
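
For example (the generated pod name is a placeholder; use the one reported by kubectl or arena serve list):

# Locate the serving pod created by Arena.
arena serve list
kubectl -n default-group get pod | grep bloom7b1-deepspeed

# Follow the logs; both rank 0 and rank 1 should report loading the checkpoint and injecting kernels.
kubectl -n default-group logs -f <bloom7b1-deepspeed-pod-name>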

4. Service verification

kubectl -n default-group port-forward svc/bloom7b1-deepspeed-alpha 9090:8080
curl -X POST http://127.0.0.1:9090/predictions/deepspeed \
    -H "Content-Type: text/plain" -d "I'm very thirsty, I need"

The response contains generated text from Bloom‑7B1, confirming successful inference.

5. Expose via Ingress

Create an Ingress resource (via console or YAML) that maps a domain name to the service, enabling external access and load‑balancing.
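
As a sketch, the same mapping can be created with kubectl's built-in ingress generator; the host, Service name, and port come from the earlier examples, while the nginx Ingress class (and any TLS configuration) is an assumption to adjust for your cluster:

# Route the domain to the Arena-created Service (names are illustrative).
kubectl -n default-group create ingress bloom7b1-deepspeed \
    --class=nginx \
    --rule="deepspeed-bloom7b1.example.com/*=bloom7b1-deepspeed-alpha:8080"

Once DNS for the domain points at the Ingress controller's address, the model can be queried externally: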

curl -X POST https://deepspeed-bloom7b1.example.com/predictions/deepspeed \
    -H "Content-Type: text/plain" -d "I'm very thirsty, I need"

Conclusion

The example demonstrates that with Arena, DeepSpeed Inference, and ACK, a Bloom‑7B1 model can be served across multiple GPUs with low latency, high throughput, and elastic scaling. Alternative distributed inference stacks such as FasterTransformer + Triton Inference Server are also viable for future cost‑effective, high‑performance deployments.
