How to Fine‑Tune and Deploy Qwen‑72B‑Chat on Alibaba Cloud PAI

This guide walks you through preparing the environment, downloading Qwen‑72B‑Chat, performing LoRA fine‑tuning on PAI‑DSW, merging weights, and deploying the model for offline inference, web UI, API, and PAI SDK services on Alibaba Cloud.

Introduction

Qwen‑72B is a 72‑billion‑parameter large language model developed by Alibaba Cloud, and Qwen‑72B‑Chat is the AI assistant built on top of it with alignment techniques. Alibaba Cloud's Platform for AI (PAI) provides end‑to‑end services covering data labeling, model building, training, deployment, and inference.

Environment Requirements

Recommended GPU: GU108 (80 GB). Inference requires at least two GPUs; LoRA fine‑tuning requires at least four (the fine‑tuning example below uses eight). Region: Wulanchabu. Cluster: Lingjun. Docker image:

pai-image-manage-registry.cn-wulanchabu.cr.aliyuncs.com/pai/llm-inference:vllm-0.2.1-v6
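
Before downloading anything, it is worth confirming that the DSW instance actually exposes enough GPUs and memory. A minimal sanity check with PyTorch (assuming torch is preinstalled in the image above):

import torch

# LoRA fine-tuning needs at least 4 x 80 GB GPUs; inference needs at least 2
n = torch.cuda.device_count()
print(f"visible GPUs: {n}")
for i in range(n):
    p = torch.cuda.get_device_properties(i)
    print(f"  cuda:{i} {p.name}, {p.total_memory / 1024**3:.0f} GiB")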

Preparation

Download the model files either via script or from ModelScope and extract them.

# Helper: download a file with aria2 (16 parallel connections) into directory d
def aria2(url, filename, d):
    !aria2c --console-log-level=error -c -x 16 -s 16 {url} -o {filename} -d {d}

# Sharded Qwen-72B-Chat weights on the Wulanchabu-internal OSS endpoint
qwen72b_url = "http://pai-vision-data-inner-wulanchabu.oss-cn-wulanchabu-internal.aliyuncs.com/qwen72b/Qwen-72B-Chat-sharded.tar"
aria2(qwen72b_url, qwen72b_url.split("/")[-1], "/root/")
! cd /root && tar -xvf Qwen-72B-Chat-sharded.tar
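
After extraction, a quick check that the sharded checkpoint landed intact; a minimal sketch (the exact file names inside the tarball are an assumption):

import os

model_dir = "/root/Qwen-72B-Chat-sharded"
files = sorted(os.listdir(model_dir))
print(f"{len(files)} files in {model_dir}")
# Print the largest entries; the weight shards should dominate
for name in files:
    size_gb = os.path.getsize(os.path.join(model_dir, name)) / 1024**3
    if size_gb > 1:
        print(f"  {name}: {size_gb:.1f} GiB")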

LoRA Fine‑Tuning

Download a sample dataset and run the fine‑tuning script with LoRA enabled.

! wget -c http://pai-vision-data-inner-wulanchabu.oss-cn-wulanchabu.aliyuncs.com/qwen72b/sharegpt_zh_1K.json -P /workspace/Qwen
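
The file is a ShareGPT‑style conversation dataset. A quick look at one record before launching training; the "conversations"/"from"/"value" layout is the usual ShareGPT schema and is an assumption here:

import json

with open("/workspace/Qwen/sharegpt_zh_1K.json") as f:
    data = json.load(f)

print(f"{len(data)} conversations")
# Each record holds alternating "human" and "gpt" turns
for turn in data[0]["conversations"][:2]:
    print(turn["from"], ":", turn["value"][:80])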

! cd /workspace/Qwen && CUDA_DEVICE_MAX_CONNECTIONS=1 torchrun --nproc_per_node 8 \
--nnodes 1 \
--node_rank 0 \
--master_addr localhost \
--master_port 6001 \
finetune.py \
--model_name_or_path /root/Qwen-72B-Chat-sharded \
--data_path sharegpt_zh_1K.json \
--bf16 True \
--output_dir /root/output_qwen \
--num_train_epochs 1 \
--per_device_train_batch_size 1 \
--per_device_eval_batch_size 1 \
--gradient_accumulation_steps 8 \
--evaluation_strategy "no" \
--save_strategy "steps" \
--save_steps 1000 \
--save_total_limit 1 \
--learning_rate 3e-4 \
--weight_decay 0.1 \
--adam_beta2 0.95 \
--warmup_ratio 0.01 \
--lr_scheduler_type "cosine" \
--logging_steps 1 \
--report_to "none" \
--model_max_length 2048 \
--lazy_preprocess True \
--use_lora \
--gradient_checkpointing \
--deepspeed finetune/ds_config_zero3.json
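
When the run finishes, /root/output_qwen holds only the LoRA adapter, not full 72B weights. A quick inspection sketch (adapter_config.json and the adapter weight file are the usual PEFT artifacts, assumed here):

import json, os

out_dir = "/root/output_qwen"
print(sorted(os.listdir(out_dir)))

# adapter_config.json records the LoRA rank, alpha, and target modules
with open(os.path.join(out_dir, "adapter_config.json")) as f:
    cfg = json.load(f)
print(cfg.get("r"), cfg.get("lora_alpha"), cfg.get("target_modules"))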

Merge LoRA Weights

# Load the base model together with the LoRA adapter saved in /root/output_qwen
from peft import AutoPeftModelForCausalLM

model = AutoPeftModelForCausalLM.from_pretrained(
    '/root/output_qwen',
    device_map="auto",
    trust_remote_code=True
).eval()

# Fold the LoRA weights into the base model and save a standalone checkpoint
merged_model = model.merge_and_unload()
merged_model.save_pretrained('/root/qwen72b_sft', max_shard_size="2048MB", safe_serialization=True)

Copy the tokenizer files from the base model into the merged checkpoint directory (save_pretrained writes only the model weights and config):

! cp /root/Qwen-72B-Chat-sharded/qwen.tiktoken /root/qwen72b_sft/
! cp /root/Qwen-72B-Chat-sharded/tokenization_qwen.py /root/qwen72b_sft/
! cp /root/Qwen-72B-Chat-sharded/tokenizer_config.json /root/qwen72b_sft/
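
A quick smoke test that the merged directory is now self‑contained; loading just the tokenizer keeps the check cheap:

from transformers import AutoTokenizer

# trust_remote_code is needed because Qwen ships a custom tokenizer implementation
tokenizer = AutoTokenizer.from_pretrained("/root/qwen72b_sft", trust_remote_code=True)
ids = tokenizer("你好, Qwen!")["input_ids"]
print(ids)
print(tokenizer.decode(ids))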

Offline Inference

from vllm import LLM, SamplingParams

# Load the merged checkpoint across 2 GPUs with tensor parallelism
qwen72b = LLM("/root/qwen72b_sft/", tensor_parallel_size=2, trust_remote_code=True, gpu_memory_utilization=0.99)
sampling_params = SamplingParams(temperature=0.0, max_tokens=512, stop=["<|im_end|>"])

# Qwen expects the ChatML prompt format: system turn, user turn, open assistant turn
prompt = """<|im_start|>system
<|im_end|>
<|im_start|>user
Hello! What is your name?<|im_end|>
<|im_start|>assistant
"""

outputs = qwen72b.generate(prompt, sampling_params)
print(outputs[0].outputs[0].text)
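
vLLM also accepts a list of prompts, which makes batched offline evaluation straightforward. A small sketch with a helper that assembles the ChatML prompt shown above (the helper name is ours, not part of vLLM or Qwen):

def build_chatml(user_msg, system_msg=""):
    # ChatML: system turn, user turn, then an open assistant turn for generation
    return (f"<|im_start|>system\n{system_msg}<|im_end|>\n"
            f"<|im_start|>user\n{user_msg}<|im_end|>\n"
            f"<|im_start|>assistant\n")

prompts = [build_chatml(q) for q in ["What is your name?", "写一首关于秋天的诗"]]
outputs = qwen72b.generate(prompts, sampling_params)
for out in outputs:
    print(out.outputs[0].text.strip())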
# Release model
del qwen72b

Web UI Deployment

Start the controller:

python -m fastchat.serve.controller

Start the vLLM worker:

python -m fastchat.serve.vllm_worker --model-path /root/qwen72b_sft --tensor-parallel-size 2 --trust-remote-code --gpu-memory-utilization 0.98
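
Before opening the web UI you can confirm the worker registered with the controller; the controller's HTTP API listens on port 21001 by default (endpoint and port are FastChat defaults, assumed here):

import requests

# The controller returns the model names of all registered workers
resp = requests.post("http://localhost:21001/list_models")
print(resp.json())  # expect something like {"models": ["qwen72b_sft"]}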

Launch Gradio web server:

python -m fastchat.serve.gradio_web_server_pai --model-list-mode reload

Kill all fastchat services when done:

kill -s 9 `ps -aux | grep fastchat | awk '{print $2}'`

API Deployment

Start controller (same as above).

Start VLLM worker (same as above).

Start OpenAI‑compatible API server:

python -m fastchat.serve.openai_api_server --host localhost --port 8000

Example Python client:

import openai

# Works with the legacy openai<1.0 SDK; the key is unused but must be set
openai.api_key = "EMPTY"
openai.api_base = "http://0.0.0.0:8000/v1"
model = "qwen72b_sft"
completion = openai.ChatCompletion.create(
    model=model,
    temperature=0.0,
    top_p=0.8,
    frequency_penalty=0.0,
    messages=[{"role": "user", "content": "你好"}]
)
print(completion.choices[0].message.content)
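
The same endpoint also supports streaming through the legacy openai<1.0 client used above; a minimal sketch:

# stream=True yields incremental deltas instead of one final message
stream = openai.ChatCompletion.create(
    model=model,
    temperature=0.0,
    stream=True,
    messages=[{"role": "user", "content": "讲一个简短的笑话"}],
)
for chunk in stream:
    print(chunk.choices[0].delta.get("content", ""), end="", flush=True)
print()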

Kill services with the same kill command.

PAI SDK Deployment (EAS Service)

Install SDK:

! python -m pip install alipai==0.4.4.post0 -i https://pypi.org/simple

Configure your AccessKey, workspace, and OSS bucket via python -m pai.toolkit.config, then upload the merged model to OSS:

from pai.session import get_default_session
from pai.common.oss_utils import upload

# Upload the merged checkpoint to the workspace's default OSS bucket
sess = get_default_session()
model_uri = upload(source_path='/root/qwen72b_sft', oss_path='qwen72b_sft', bucket=sess.oss_bucket)
print(model_uri)

Deploy the model with a JSON service configuration (the full config is omitted here) using Model().deploy().
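
A hedged sketch of how the pieces fit together with the PAI SDK; config stands for the JSON service configuration mentioned above, and the exact Model arguments may vary across alipai versions:

from pai.model import Model

# model_data points at the OSS URI returned by upload() earlier;
# config is the JSON service configuration (not reproduced here)
m = Model(model_data=model_uri).deploy(
    service_name="qwen72b_sdk_blade_example",
    options=config,
)

Then call the deployed service over WebSocket with the service token and the ChatML prompt: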

import json, time
from websockets.sync.client import connect

# The service token goes in the Authorization header (redacted here)
headers = {"Authorization": "*******"}
# Streaming endpoint of the deployed service: /api/predict/<service_name>/generate_stream
url = 'ws://.../api/predict/qwen72b_sdk_blade_example/generate_stream'
prompt = """<|im_start|>system
<|im_end|>
<|im_start|>user
<|im_end|>
Hello! What is your name?<|im_end|>
<|im_start|>assistant
"""
with connect(url, additional_headers=headers) as websocket:
    # One JSON request carries the prompt plus sampling and stopping settings;
    # 151645/151644/151643 are Qwen's <|im_end|>/<|im_start|>/<|endoftext|> token ids
    websocket.send(json.dumps({
        "prompt": prompt,
        "sampling_params": {"temperature": 0.0, "top_p": 0.9, "top_k": 50},
        "stopping_criterial": {"max_new_tokens": 512, "stop_tokens": [151645, 151644, 151643]}
    }))
    tic = time.time()
    # Read streamed tokens until the server marks the generation finished
    while True:
        msg = json.loads(websocket.recv())
        if msg['is_ok']:
            print(msg['tokens'][0]['text'], end="", flush=True)
            if msg['is_finished']:
                break
    print(time.time() - tic)
    print("-" * 40)

Delete the service after testing with m.delete_service().

FastChat Deployment (Web UI)

Deployment mirrors the earlier Web UI steps but uses a different container image and mounts the model from OSS. As before, deploy with a JSON config and delete the service with m.delete_service() when finished.
