How to Deploy Distributed LLM Inference with DeepSpeed on Alibaba Cloud ACK
This guide walks through deploying the BLOOM-7B1 large language model for distributed inference on Alibaba Cloud Container Service for Kubernetes (ACK) using DeepSpeed, Arena, and Kubernetes, covering environment setup, model configuration, service launch, verification, and Ingress exposure.
Overview
This guide shows how to deploy the open‑source bigscience/bloom-7b1 model on Alibaba Cloud Container Service for Kubernetes (ACK) using DeepSpeed Inference for tensor‑parallel distributed inference.
Components
Arena: a Kubernetes-based MLOps platform for managing GPU resources, data, training, and serving.
Ingress: a Kubernetes resource that exposes services to external traffic.
DeepSpeed Inference: Microsoft's inference engine, providing tensor-parallel model parallelism and optimized kernels for transformer models.
DJLServing: an HTTP model-serving framework that can run DeepSpeed-accelerated models.
Step‑by‑Step Procedure
1. Environment preparation
Create a GPU‑enabled Kubernetes cluster on ACK.
Install the Cloud‑Native AI suite (Arena, Ingress controller, etc.).
2. Model configuration
Two files are required:

- `serving.properties` — specifies the inference engine, the tensor parallel degree (set to 2 for two GPUs), the model identifier (an OSS path or Hugging Face repository), the data type (fp16), and loading time-outs.
- `model.py` — defines `get_model`, which loads the model and tokenizer, converts the model with `deepspeed.init_inference`, and builds a Hugging Face pipeline; and `handle`, which processes incoming requests and returns JSON output.

`serving.properties`:

```properties
engine=DeepSpeed
option.parallel_loading=true
option.tensor_parallel_degree=2
option.model_loading_timeout=600
option.model_id=model/LLM/bloom-7b1/deepspeed/bloom-7b1
option.data_type=fp16
option.max_new_tokens=100
```

`model.py`:

```python
import os
import logging

import torch
import deepspeed
from djl_python.inputs import Input
from djl_python.outputs import Output
from transformers import pipeline, AutoModelForCausalLM, AutoTokenizer

logging.basicConfig(level=logging.DEBUG)

predictor = None


def get_model(properties):
    model_id = properties.get("model_id")
    mp_size = int(properties.get("tensor_parallel_degree", "2"))
    local_rank = int(os.getenv("OMPI_COMM_WORLD_LOCAL_RANK", "0"))
    model = AutoModelForCausalLM.from_pretrained(model_id)
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    # Shard the model across GPUs and inject optimized transformer kernels.
    model = deepspeed.init_inference(
        model,
        mp_size=mp_size,
        dtype=torch.float16,
        replace_method="auto",
        replace_with_kernel_inject=True,
    )
    return pipeline(
        task="text-generation", model=model, tokenizer=tokenizer, device=local_rank
    )


def handle(inputs: Input):
    global predictor
    if not predictor:
        predictor = get_model(inputs.get_properties())
    if inputs.is_empty():
        # Warm-up request issued by DJLServing at start-up.
        return None
    data = inputs.get_as_string()
    output = Output()
    output.add_property("content-type", "application/json")
    result = predictor(data, do_sample=True, max_new_tokens=50)
    return output.add(result)
```

Upload these files (and, optionally, the model weights) to an OSS bucket, then create a PersistentVolume (PV) and PersistentVolumeClaim (PVC) that mount the OSS path into the container.
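The PV/PVC pair might look like the following sketch, assuming the OSS CSI plugin is installed in the cluster; the bucket name, region endpoint, secret name, and storage size here are placeholders, not values from this guide:

```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: bloom7b1-pv
spec:
  capacity:
    storage: 30Gi                          # placeholder size
  accessModes:
    - ReadOnlyMany
  persistentVolumeReclaimPolicy: Retain
  csi:
    driver: ossplugin.csi.alibabacloud.com
    volumeHandle: bloom7b1-pv
    nodePublishSecretRef:
      name: oss-secret                     # placeholder: secret holding the OSS AccessKey
      namespace: default
    volumeAttributes:
      bucket: my-model-bucket              # placeholder bucket name
      url: oss-cn-beijing.aliyuncs.com     # placeholder region endpoint
      path: /model/LLM/bloom-7b1/deepspeed
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: bloom7b1-pvc
spec:
  accessModes:
    - ReadOnlyMany
  resources:
    requests:
      storage: 30Gi
  volumeName: bloom7b1-pv
```

The PVC name (`bloom7b1-pvc`) is what the `--data` flag of `arena serve` references in the next step.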
3. Launch the service
```shell
arena serve custom \
  --name=bloom7b1-deepspeed \
  --gpus=2 \
  --version=alpha \
  --replicas=1 \
  --restful-port=8080 \
  --data=bloom7b1-pvc:/model \
  --image=ai-studio-registry-vpc.cn-beijing.cr.aliyuncs.com/kube-ai/djl-serving:2023-05-19 \
  "djl-serving -m"
```

Check pod status with `kubectl get pod` and view the logs to confirm that two processes (rank 0 and rank 1) start, each loading the model and converting it to DeepSpeed kernels.
4. Service verification
```shell
kubectl -n default-group port-forward svc/bloom7b1-deepspeed-alpha 9090:8080
```

```shell
curl -X POST http://127.0.0.1:9090/predictions/deepspeed \
  -H "Content-Type: text/plain" \
  -d "I'm very thirsty, I need"
```

The response contains text generated by BLOOM-7B1, confirming successful inference.
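The same request can also be issued from Python. A minimal sketch using only the standard library, assuming the port-forward above is active (the request is built explicitly; the line that sends it is commented out):

```python
import urllib.request

# Build the same POST request that the curl command sends.
url = "http://127.0.0.1:9090/predictions/deepspeed"
req = urllib.request.Request(
    url,
    data="I'm very thirsty, I need".encode("utf-8"),
    headers={"Content-Type": "text/plain"},
    method="POST",
)

# Sending it requires the port-forward to be running:
# with urllib.request.urlopen(req) as resp:
#     print(resp.read().decode("utf-8"))  # JSON containing the generated text
```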
5. Expose via Ingress
Create an Ingress resource (via console or YAML) that maps a domain name to the service, enabling external access and load‑balancing.
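A YAML sketch of such an Ingress is shown below. The host name mirrors the example used in this guide; the ingress class is an assumption (an NGINX controller), and the backend service name and port should be confirmed with `kubectl get svc`:

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: bloom7b1-deepspeed
spec:
  ingressClassName: nginx                      # assumes an NGINX ingress controller
  rules:
    - host: deepspeed-bloom7b1.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: bloom7b1-deepspeed-alpha # service created by arena serve
                port:
                  number: 8080
```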
```shell
curl -X POST https://deepspeed-bloom7b1.example.com/predictions/deepspeed \
  -H "Content-Type: text/plain" \
  -d "I'm very thirsty, I need"
```

Conclusion
This example demonstrates that with Arena, DeepSpeed Inference, and ACK, a BLOOM-7B1 model can be served across multiple GPUs with low latency, high throughput, and elastic scaling. Alternative distributed inference stacks, such as FasterTransformer with Triton Inference Server, are also viable for cost-effective, high-performance deployments.
Alibaba Cloud Native
