How to Fine‑Tune and Deploy Qwen‑72B‑Chat on Alibaba Cloud PAI
This guide walks you through preparing the environment, downloading Qwen‑72B‑Chat, performing LoRA fine‑tuning on PAI‑DSW, merging weights, and deploying the model for offline inference, web UI, API, and PAI SDK services on Alibaba Cloud.
Introduction
Qwen‑72B (Tongyi Qianwen 72B) is a 72‑billion‑parameter large language model developed by Alibaba Cloud. Qwen‑72B‑Chat is an AI assistant built on top of it using alignment techniques. PAI, Alibaba Cloud's AI platform, provides end‑to‑end services for data labeling, model building, training, deployment, and inference.
Environment Requirements
Recommended GPU: GU108 (80 GB). Inference requires at least two GPUs; LoRA fine‑tuning requires at least four. Region: Wulanchabu. Cluster: Lingjun.
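These minimums follow from simple arithmetic: in bf16, the 72B parameters alone occupy roughly 72 × 2 ≈ 144 GB, more than a single 80 GB GPU can hold, so inference must shard the weights across at least two GPUs with tensor parallelism; fine‑tuning additionally keeps gradients and optimizer state in memory, hence the larger GPU count.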
Docker image:
pai-image-manage-registry.cn-wulanchabu.cr.aliyuncs.com/pai/llm-inference:vllm-0.2.1-v6
Preparation
Download the model files either via script or from ModelScope and extract them.
# Download helper: aria2c with 16 parallel connections (Jupyter shell magic)
def aria2(url, filename, d):
    !aria2c --console-log-level=error -c -x 16 -s 16 {url} -o {filename} -d {d}
qwen72b_url = "http://pai-vision-data-inner-wulanchabu.oss-cn-wulanchabu-internal.aliyuncs.com/qwen72b/Qwen-72B-Chat-sharded.tar"
aria2(qwen72b_url, qwen72b_url.split("/")[-1], "/root/")
! cd /root && tar -xvf Qwen-72B-Chat-sharded.tar
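As an alternative to the download script, the weights can be fetched from ModelScope. A minimal sketch, assuming the public model id qwen/Qwen-72B-Chat (the id is an assumption, not taken from the source):

from modelscope import snapshot_download

# Download Qwen-72B-Chat from ModelScope (model id assumed); returns the local path
model_dir = snapshot_download('qwen/Qwen-72B-Chat', cache_dir='/root')
print(model_dir)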
LoRA Fine‑Tuning
Download a sample dataset and run the fine‑tuning script with LoRA enabled; the example below runs on a single node with 8 GPUs:
! wget -c http://pai-vision-data-inner-wulanchabu.oss-cn-wulanchabu.aliyuncs.com/qwen72b/sharegpt_zh_1K.json -P /workspace/Qwen
! cd /workspace/Qwen && CUDA_DEVICE_MAX_CONNECTIONS=1 torchrun --nproc_per_node 8 \
--nnodes 1 \
--node_rank 0 \
--master_addr localhost \
--master_port 6001 \
finetune.py \
--model_name_or_path /root/Qwen-72B-Chat-sharded \
--data_path sharegpt_zh_1K.json \
--bf16 True \
--output_dir /root/output_qwen \
--num_train_epochs 1 \
--per_device_train_batch_size 1 \
--per_device_eval_batch_size 1 \
--gradient_accumulation_steps 8 \
--evaluation_strategy "no" \
--save_strategy "steps" \
--save_steps 1000 \
--save_total_limit 1 \
--learning_rate 3e-4 \
--weight_decay 0.1 \
--adam_beta2 0.95 \
--warmup_ratio 0.01 \
--lr_scheduler_type "cosine" \
--logging_steps 1 \
--report_to "none" \
--model_max_length 2048 \
--lazy_preprocess True \
--use_lora \
--gradient_checkpointing \
--deepspeed finetune/ds_config_zero3.json
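The training command reads its DeepSpeed settings from finetune/ds_config_zero3.json, which ships with the Qwen repository. For orientation, a minimal ZeRO‑3 configuration typically looks like the sketch below (standard DeepSpeed keys, not the verbatim contents of that file):

{
  "bf16": { "enabled": "auto" },
  "zero_optimization": {
    "stage": 3,
    "overlap_comm": true,
    "contiguous_gradients": true,
    "stage3_gather_16bit_weights_on_model_save": true
  },
  "gradient_accumulation_steps": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "train_batch_size": "auto"
}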
Merge LoRA Weights
After training, merge the LoRA adapter into the base model weights with PEFT:
from peft import AutoPeftModelForCausalLM

# Load the base model with the LoRA adapter produced by fine-tuning
model = AutoPeftModelForCausalLM.from_pretrained(
    '/root/output_qwen',
    device_map="auto",
    trust_remote_code=True
).eval()

# Fold the adapter into the base weights and save sharded checkpoints
merged_model = model.merge_and_unload()
merged_model.save_pretrained('/root/qwen72b_sft', max_shard_size="2048MB", safe_serialization=True)
Copy the tokenizer files into the merged model directory (save_pretrained writes only the model weights and config, not Qwen's custom tokenizer files):
! cp /root/Qwen-72B-Chat-sharded/qwen.tiktoken /root/qwen72b_sft/
! cp /root/Qwen-72B-Chat-sharded/tokenization_qwen.py /root/qwen72b_sft/
! cp /root/Qwen-72B-Chat-sharded/tokenizer_config.json /root/qwen72b_sft/
Offline Inference
Run offline batch inference on the merged model with vLLM:
from vllm import LLM
from vllm.sampling_params import SamplingParams

# Load the merged model across two GPUs with tensor parallelism
qwen72b = LLM("/root/qwen72b_sft/", tensor_parallel_size=2, trust_remote_code=True, gpu_memory_utilization=0.99)
sampling_params = SamplingParams(temperature=0.0, max_tokens=512, stop=["<|im_end|>"])
prompt = """<|im_start|>system
<|im_end|>
<|im_start|>user
<|im_end|>
Hello! What is your name?<|im_end|>
<|im_start|>assistant
"""
output = qwen72b.generate(prompt, sampling_params)
# generate() returns a list of RequestOutput objects; print the generated text
print(output[0].outputs[0].text)
# Release the model before starting any serving processes
del qwen72b
Web UI Deployment
Start the controller:
python -m fastchat.serve.controller
Start the vLLM worker:
python -m fastchat.serve.vllm_worker --model-path /root/qwen72b_sft --tensor-parallel-size 2 --trust-remote-code --gpu-memory-utilization 0.98
Launch the Gradio web server:
python -m fastchat.serve.gradio_web_server_pai --model-list-mode reload
Kill all FastChat services when done:
kill -s 9 `ps aux | grep fastchat | awk '{print $2}'`
API Deployment
Start controller (same as above).
Start VLLM worker (same as above).
Start OpenAI‑compatible API server:
python -m fastchat.serve.openai_api_server --host localhost --port 8000
Example Python client:
import openai

# Requires the legacy openai SDK (<1.0), which exposes ChatCompletion
openai.api_key = "EMPTY"  # the FastChat server does not check the key
openai.api_base = "http://0.0.0.0:8000/v1"

model = "qwen72b_sft"
completion = openai.ChatCompletion.create(
    model=model,
    temperature=0.0,
    top_p=0.8,
    frequency_penalty=0.0,
    messages=[{"role": "user", "content": "你好"}]
)
print(completion.choices[0].message.content)
Stop the services with the same kill command as before.
PAI SDK Deployment (EAS Service)
Install SDK:
! python -m pip install alipai==0.4.4.post0 -i https://pypi.org/simple
Configure your AccessKey, workspace, and OSS bucket with python -m pai.toolkit.config, then upload the merged model to OSS:
import pai
from pai.session import get_default_session
from pai.common.oss_utils import upload

# Use the session configured by `python -m pai.toolkit.config`
sess = get_default_session()

# Upload the merged model directory to the session's OSS bucket
model_uri = upload(source_path='/root/qwen72b_sft', oss_path='qwen72b_sft', bucket=sess.oss_bucket)
print(model_uri)
Deploy with a service configuration (a hedged sketch follows) using Model().deploy(service_name='qwen72b_sdk_blade_example', options=config), then call the service over WebSocket with the appropriate headers and prompt.
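The exact service configuration depends on your resource group and service type; the sketch below only shows the general shape of the options dict and the deploy call. The instance type, mount path, and the wiring of model_uri are assumptions, not values from the source:

from pai.model import Model

# Hypothetical EAS service options; keys follow the EAS service JSON schema,
# but every value here is a placeholder.
config = {
    "metadata": {"instance": 1},                                       # number of service instances
    "cloud": {"computing": {"instance_type": "<gpu-instance-type>"}},  # placeholder
    "storage": [{
        "oss": {"path": model_uri},      # mount the uploaded model (assumed wiring)
        "mount_path": "/qwen72b_sft",
    }],
}

m = Model().deploy(service_name='qwen72b_sdk_blade_example', options=config)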
import json, time
from websockets.sync.client import connect

# The EAS service token goes in the Authorization header (redacted here)
headers = {"Authorization": "*******"}
# Service endpoint (host elided); note the streaming generate_stream route
url = 'ws://.../api/predict/qwen72b_sdk_blade_example/generate_stream'
prompt = """<|im_start|>system
<|im_end|>
<|im_start|>user
<|im_end|>
Hello! What is your name?<|im_end|>
<|im_start|>assistant
"""
with connect(url, additional_headers=headers) as websocket:
    websocket.send(json.dumps({
        "prompt": prompt,
        "sampling_params": {"temperature": 0.0, "top_p": 0.9, "top_k": 50},
        # 151643, 151644, 151645 are Qwen's <|endoftext|>, <|im_start|>, <|im_end|> token ids
        "stopping_criterial": {"max_new_tokens": 512, "stop_tokens": [151645, 151644, 151643]}
    }))
    tic = time.time()
    # Stream tokens until the service reports completion
    while True:
        msg = json.loads(websocket.recv())
        if msg['is_ok']:
            print(msg['tokens'][0]['text'], end="", flush=True)
        if msg['is_finished']:
            break
    print(time.time() - tic)
    print("-" * 40)
Delete the service after testing with m.delete_service().
FastChat Deployment (Web UI)
This mirrors the earlier Web UI steps, but the service runs in a different container image and mounts the model from OSS instead of local disk. Deploy it with a JSON service configuration and remove it with m.delete_service() when finished; a hedged sketch follows.
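A sketch of such a configuration, assuming a FastChat-capable container image and an OSS model mount; the image URI, instance type, port, launch script, and service name are all placeholders rather than values from the source:

config = {
    "metadata": {"instance": 1},
    "cloud": {"computing": {"instance_type": "<gpu-instance-type>"}},  # placeholder
    "containers": [{
        "image": "<fastchat-container-image-uri>",                     # placeholder
        # Launch the controller, vLLM worker, and Gradio server inside the container
        "script": "python -m fastchat.serve.controller & "
                  "python -m fastchat.serve.vllm_worker --model-path /qwen72b_sft "
                  "--tensor-parallel-size 2 --trust-remote-code & "
                  "python -m fastchat.serve.gradio_web_server_pai --model-list-mode reload",
        "port": 7860,                                                  # Gradio default
    }],
    "storage": [{
        "oss": {"path": "oss://<bucket>/qwen72b_sft/"},                # placeholder path
        "mount_path": "/qwen72b_sft",
    }],
}
m = Model().deploy(service_name='qwen72b_webui_example', options=config)  # hypothetical name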