Deploying Qwen3 on Kunlun P800: Full‑Parameter DPO Training and Inference Guide
This guide walks through setting up a Kunlun P800 XPU host, preparing Docker containers, deploying Qwen3‑8B/‑32B/‑VL models with vLLM‑Kunlun, benchmarking performance, and running full‑parameter DPO training with LLaMA‑Factory. It provides scripts, configuration files, and troubleshooting tips for AI engineers.
Overview
The article presents a step‑by‑step practical guide for deploying Qwen3‑8B, Qwen3‑32B and Qwen3‑VL‑8B‑Instruct on a physical Kunlun P800 XPU server, running inference with the vLLM‑Kunlun plugin and performing full‑parameter Direct Preference Optimization (DPO) training via LLaMA‑Factory.
Test Environment
Hardware: 8 × Kunlun P800 cards (driver 5.0.21.21), Ubuntu 22.04, Docker image xpu_dev_20251202_172933-with-lf
Performance: Qwen3‑32B (TP=8) 1184 tok/s, TTFT 1.8 s; Qwen3‑VL (TP=4) 1942 tok/s; Qwen3‑8B (TP=1) 1667 tok/s
Technical Stack : vLLM‑Kunlun for OpenAI‑compatible inference, LLaMA‑Factory for DPO training
Physical Machine Specifications
root@h3c:/mnt/nvme# lscpu | grep "Model name"
Model name: Hygon_7470
root@h3c:/mnt/nvme# lscpu | grep Architecture
Architecture: x86_64
root@h3c:/mnt/nvme# free -g
              total        used        free      shared  buff/cache   available
Mem:           1507          24         579           0         903        1475
Swap:             7           0           7
root@h3c:/mnt/nvme# lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINTS
sda 8:0 0 447.1G 0 disk
├─sda1 8:1 0 1G 0 part /boot/efi
├─sda2 8:2 0 2G 0 part /boot
└─sda3 8:3 0 444G 0 part
└─ubuntu--vg-ubuntu--lv 253:1 0 444G 0 lvm /
nvme0n1 259:0 0 3.5T 0 disk
└─nvme0n1p1 259:2 0 3.5T 0 part /mnt/nvme
nvme1n1 259:1 0 3.5T 0 disk
└─nvme_vg-docker_lv 253:0 0 3.5T 0 lvm /mnt/nvmelvm
Docker Image Overview
Key images (size in GB):
h3c-repack-qwen3_32b-ubuntu22.04:20250725-v1.0 b12133610dc3 66.5GB
aiak-inference-llm:vllm_kunlun_20250917_214552 feb894951c53 105GB
aiak-inference-llm:xpu_dev_20251105_151131-vllm01011-awq 2b09210d3113 114GB
aiak-inference-llm:xpu_dev_20251202_172933 e98a5cf79b1b 132GB
aiak-inference-llm:xpu_dev_20251202_172933-with-lf 30d22a3c953e 135GB
The xpu_dev_20251202_172933-with-lf image includes Qwen3/Qwen3‑VL and a pre‑installed LLaMA‑Factory; it is the recommended image.
Container and Directory Preparation
The host directory /mnt/nvme/runtimes is mounted inside the container at /home/workspace.
Inside the container, /workspace contains the vllm, vllm‑kunlun, and LLaMA‑Factory repositories.
Start Inference Container
#!/bin/bash
docker run -itd \
--net=host \
--pid=host \
--cap-add=SYS_PTRACE --security-opt=seccomp=unconfined \
--ulimit memlock=-1 --ulimit nofile=120000 --ulimit stack=67108864 \
--shm-size=128G \
--privileged \
-v /usr/local/bin/:/usr/local/bin/ \
--name=xpu_dev_20251202_172933-with-lf \
-v /mnt/nvme/runtimes:/home/workspace \
-w /workspace \
-v /lib/x86_64-linux-gnu/libxpunvidia-ml.so.1:/lib/x86_64-linux-gnu/libxpunvidia-ml.so.1 \
aiak-inference-llm:xpu_dev_20251202_172933-with-lf bash
docker exec -it xpu_dev_20251202_172933-with-lf /bin/bash
Recommendation: Use the xpu_dev_20251202_172933-with-lf image for a quick start.
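Before launching a server, it is worth confirming that the container actually sees the cards. A quick sanity check, reusing xpu-smi (the same tool the troubleshooting section below relies on), which is available inside the container thanks to the /usr/local/bin mount in the run command:

```bash
# Sanity check: list the P800 cards visible from inside the container.
docker exec xpu_dev_20251202_172933-with-lf xpu-smi
```

If the expected eight cards do not appear, recheck the driver installation on the host and the --privileged flag before debugging vLLM itself.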
Qwen3‑8B Inference Script
unset XPU_DUMMY_EVENT
export XPU_VISIBLE_DEVICES=0
export XPU_USE_MOE_SORTED_THRES=1
export XFT_USE_FAST_SWIGLU=1
export XMLIR_CUDNN_ENABLED=1
export XPU_USE_DEFAULT_CTX=1
export XMLIR_FORCE_USE_XPU_GRAPH=1
export XPU_USE_FAST_SWIGLU=1
export XMLIR_ENABLE_MOCK_TORCH_COMPILE=false
export USE_ORI_ROPE=1
VLLM_USE_V1=1 python -m vllm.entrypoints.openai.api_server \
--host 0.0.0.0 \
--port 8999 \
--model /home/workspace/Qwen3-8B/ \
--gpu-memory-utilization 0.95 \
--trust-remote-code \
--max-model-len 32768 \
--tensor-parallel-size 1 \
--dtype float16 \
--max_num_seqs 128 \
--max_num_batched_tokens 32768 \
--block-size 128 \
--no-enable-prefix-caching \
--no-enable-chunked-prefill \
--distributed-executor-backend mp \
    --compilation-config '{"splitting_ops": ["vllm.unified_attention","vllm.unified_attention_with_output","vllm.unified_attention_with_output_kunlun","vllm.mamba_mixer2","vllm.mamba_mixer","vllm.short_conv","vllm.linear_attention","vllm.plamo2_mamba_mixer","vllm.gdn_attention","vllm.sparse_attn_indexer"]}'
XPU_VISIBLE_DEVICES=0: single‑card inference.
--tensor-parallel-size 1: disables tensor parallelism.
--gpu-memory-utilization 0.95: caps vLLM at ~95 % of card memory to avoid OOM.
Qwen3‑32B Inference Script (TP=8)
unset XPU_DUMMY_EVENT
export XPU_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
export XPU_USE_MOE_SORTED_THRES=1
export XFT_USE_FAST_SWIGLU=1
export XMLIR_CUDNN_ENABLED=1
export XPU_USE_DEFAULT_CTX=1
export XMLIR_FORCE_USE_XPU_GRAPH=1
export XPU_USE_FAST_SWIGLU=1
export XMLIR_ENABLE_MOCK_TORCH_COMPILE=false
export USE_ORI_ROPE=1
VLLM_USE_V1=1 python -m vllm.entrypoints.openai.api_server \
--host 0.0.0.0 \
--port 8999 \
--model /home/workspace/Qwen3-32B/ \
--gpu-memory-utilization 0.95 \
--trust-remote-code \
--max-model-len 32768 \
--tensor-parallel-size 8 \
--dtype float16 \
--max_num_seqs 128 \
--max_num_batched_tokens 32768 \
--block-size 128 \
--no-enable-prefix-caching \
--no-enable-chunked-prefill \
--distributed-executor-backend mp \
    --compilation-config '{"splitting_ops": ["vllm.unified_attention","vllm.unified_attention_with_output","vllm.unified_attention_with_output_kunlun","vllm.mamba_mixer2","vllm.mamba_mixer","vllm.short_conv","vllm.linear_attention","vllm.plamo2_mamba_mixer","vllm.gdn_attention","vllm.sparse_attn_indexer"]}'
Qwen3‑VL‑8B‑Instruct Inference Script (TP=4)
unset XPU_DUMMY_EVENT
export XPU_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
export XFT_USE_FAST_SWIGLU=1
export XPU_USE_FAST_SWIGLU=1
export XMLIR_CUDNN_ENABLED=1
export USE_FAST_BF16_FC=true
export XPU_USE_DEFAULT_CTX=1
export XMLIR_FORCE_USE_XPU_GRAPH=1
export XPU_USE_MOE_SORTED_THRES=128
export VLLM_HOST_IP=$(hostname -i)
export XMLIR_ENABLE_MOCK_TORCH_COMPILE=false
export USE_ORI_ROPE=1
export VLLM_USE_V1=1
python -m vllm.entrypoints.openai.api_server \
--host 0.0.0.0 \
--port 8999 \
--model /home/workspace/Qwen3-VL-8B-Instruct/ \
--gpu-memory-utilization 0.9 \
--trust-remote-code \
--max-model-len 32768 \
--tensor-parallel-size 4 \
--dtype float16 \
--max_num_seqs 128 \
--max_num_batched_tokens 32768 \
--block-size 128 \
--no-enable-prefix-caching \
--no-enable-chunked-prefill \
--distributed-executor-backend mp \
    --compilation-config '{"splitting_ops": ["vllm.unified_attention","vllm.unified_attention_with_output","vllm.unified_attention_with_output_kunlun","vllm.mamba_mixer2","vllm.mamba_mixer","vllm.short_conv","vllm.linear_attention","vllm.plamo2_mamba_mixer","vllm.gdn_attention","vllm.sparse_attn_indexer"]}'
Tip: Do not modify --compilation-config or the XPU environment variables unless you understand Kunlun graph compilation.
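Once any of the servers above is up, it can be exercised through its OpenAI‑compatible API. A minimal stdlib‑only client sketch; the port and model path follow the Qwen3‑8B launch script (adjust both for the other deployments):

```python
import json
import urllib.request

def build_chat_request(model: str, prompt: str, max_tokens: int = 256) -> bytes:
    """Build an OpenAI-compatible chat-completion payload for the vLLM server."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": 0.7,
    }
    return json.dumps(payload).encode("utf-8")

def chat(base_url: str, model: str, prompt: str) -> str:
    """POST the request and return the assistant's reply text."""
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=build_chat_request(model, prompt),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.loads(resp.read())
    return body["choices"][0]["message"]["content"]

# Example, with a server running on this host (the model field should echo
# the --model path the server was started with):
#   print(chat("http://127.0.0.1:8999", "/home/workspace/Qwen3-8B/", "Hello!"))
```

The same endpoint accepts any OpenAI SDK or gateway that can be pointed at a custom base URL.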
Performance Benchmark (vLLM Bench)
Benchmarks were run with vllm bench serve on the same container.
Qwen3‑32B (TP=2)
vllm bench serve --host 0.0.0.0 --port 8999 --backend vllm \
--model /home/workspace/Qwen3-32B/ \
--dataset-name random --num-prompts 128 \
--random-input-len 1024 --random-output-len 1024 \
  --max-concurrency 48
============ Serving Benchmark Result ============
Successful requests: 128
Maximum request concurrency: 48
Benchmark duration (s): 143.52
Total input tokens: 130513
Total generated tokens: 118159
Request throughput (req/s): 0.89
Output token throughput (tok/s): 823.29
Peak output token throughput (tok/s):1152.00
Total Token throughput (tok/s): 1732.65
Mean TTFT (ms): 4233.80
Median TTFT (ms): 4543.36
P99 TTFT (ms): 7821.77
Mean TPOT (ms): 70.18
Median TPOT (ms): 47.09
P99 TPOT (ms): 923.11
Mean ITL (ms): 45.04
Median ITL (ms): 42.86
P99 ITL (ms): 60.17
==================================================
Qwen3‑32B (TP=8)
vllm bench serve --host 0.0.0.0 --port 8999 --backend vllm \
--model /home/workspace/Qwen3-32B/ \
--dataset-name random --num-prompts 128 \
--random-input-len 1024 --random-output-len 1024 \
  --max-concurrency 48
============ Serving Benchmark Result ============
Successful requests: 128
Maximum request concurrency: 48
Benchmark duration (s): 98.70
Total input tokens: 130513
Total generated tokens: 116917
Request throughput (req/s): 1.30
Output token throughput (tok/s): 1184.52
Peak output token throughput (tok/s):1642.00
Total Token throughput (tok/s): 2506.79
Mean TTFT (ms): 1801.71
Median TTFT (ms): 1865.50
P99 TTFT (ms): 3376.12
Mean TPOT (ms): 42.35
Median TPOT (ms): 34.03
P99 TPOT (ms): 421.33
Mean ITL (ms): 32.48
Median ITL (ms): 29.55
P99 ITL (ms): 220.38
==================================================
Increasing tensor parallelism from 2 to 8 improves both throughput and TTFT, demonstrating the multi‑card scaling capability of the P800.
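The gains can be quantified directly from the numbers reported in the two tables above:

```python
# Reported figures for Qwen3-32B from the TP=2 and TP=8 benchmark tables.
tp2 = {"output_tok_s": 823.29, "mean_ttft_ms": 4233.80, "duration_s": 143.52}
tp8 = {"output_tok_s": 1184.52, "mean_ttft_ms": 1801.71, "duration_s": 98.70}

# Going from 2 to 8 cards: decode throughput rises ~1.44x and mean TTFT
# drops ~2.35x (sub-linear in card count, as expected for tensor parallelism).
throughput_gain = tp8["output_tok_s"] / tp2["output_tok_s"]
ttft_speedup = tp2["mean_ttft_ms"] / tp8["mean_ttft_ms"]

print(f"throughput gain: {throughput_gain:.2f}x, TTFT speedup: {ttft_speedup:.2f}x")
```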
Qwen3‑VL‑8B‑Instruct (TP=4)
============ Serving Benchmark Result ============
Successful requests: 128
Maximum request concurrency: 48
Benchmark duration (s): 55.70
Total input tokens: 130513
Total generated tokens: 108218
Request throughput (req/s): 2.30
Output token throughput (tok/s): 1942.78
Peak output token throughput (tok/s):2736.00
Total Token throughput (tok/s): 4285.82
Mean TTFT (ms): 877.81
Median TTFT (ms): 906.67
P99 TTFT (ms): 1645.36
Mean TPOT (ms): 40.42
Median TPOT (ms): 19.96
P99 TPOT (ms): 464.13
Mean ITL (ms): 19.50
Median ITL (ms): 18.26
P99 ITL (ms): 123.94
==================================================
Qwen3‑8B (TP=1)
============ Serving Benchmark Result ============
Successful requests: 128
Maximum request concurrency: 48
Benchmark duration (s): 74.09
Total input tokens: 130513
Total generated tokens: 123513
Request throughput (req/s): 1.73
Output token throughput (tok/s): 1667.00
Peak output token throughput (tok/s):2208.00
Total Token throughput (tok/s): 3428.48
Mean TTFT (ms): 1558.48
Median TTFT (ms): 1389.65
P99 TTFT (ms): 2619.69
Mean TPOT (ms): 35.93
Median TPOT (ms): 24.24
P99 TPOT (ms): 39.07
Mean ITL (ms): 23.48
Median ITL (ms): 23.26
P99 ITL (ms): 26.09
==================================================
These results provide a reliable reference for capacity planning and further parameter tuning.
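As a consistency check useful for capacity planning, the headline Qwen3‑8B (TP=1) numbers can be re‑derived from the raw counts in the table:

```python
# Raw counts from the Qwen3-8B (TP=1) benchmark table above.
generated_tokens = 123513
duration_s = 74.09
requests = 128

tok_per_s = generated_tokens / duration_s          # ~1667 tok/s, matches the table
req_per_s = requests / duration_s                  # ~1.73 req/s, matches the table
tokens_per_request = generated_tokens / requests   # ~965 output tokens per request
```

The per-request token count (~965 of the requested 1024) reflects sequences that stopped early; this kind of derivation is how the throughput figures translate into expected requests per second for a given workload shape.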
Full‑Parameter DPO Training with LLaMA‑Factory
The training uses the same P800 cluster and the xpu_dev_20251202_172933-with-lf image.
Training Configuration (qwen3_full_dpo.yaml)
### model
model_name_or_path: /home/workspace/Qwen3-8B/
trust_remote_code: true
### method
stage: dpo
do_train: true
finetuning_type: full
deepspeed: examples/deepspeed/ds_z3_config.json
### dataset
dataset: dpo_en_demo
template: qwen3_nothink
cutoff_len: 2048
max_samples: 1000
preprocessing_num_workers: 16
dataloader_num_workers: 4
### output
output_dir: /home/workspace/saves/qwen3-8b/full/dpo
logging_steps: 10
save_steps: 500
plot_loss: true
overwrite_output_dir: true
save_only_model: false
report_to: none
### train
per_device_train_batch_size: 1
gradient_accumulation_steps: 8
learning_rate: 5.0e-6
num_train_epochs: 3.0
lr_scheduler_type: cosine
warmup_ratio: 0.1
bf16: true
ddp_timeout: 180000000
resume_from_checkpoint: null
Key points:
stage: dpo – selects Direct Preference Optimization.
finetuning_type: full – full‑parameter fine‑tuning (≈8.19 B trainable parameters).
bf16: true – uses BF16 for stability and performance.
Batch size is derived from
per_device_train_batch_size × gradient_accumulation_steps × number of cards.
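The arithmetic can be checked directly; with the 8‑card host used here:

```python
# Effective global batch size for the DPO run configured above.
per_device_train_batch_size = 1
gradient_accumulation_steps = 8
num_cards = 8  # one 8-card P800 host

global_batch = (per_device_train_batch_size
                * gradient_accumulation_steps
                * num_cards)
# Matches "Total train batch size (w. parallel, distributed & accumulation) = 64"
# in the training log.
print(global_batch)
```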
Start Training
(python310_torch25_cuda) root@h3c:/workspace/LLaMA-Factory# llamafactory-cli train /home/workspace/qwen3_full_dpo.yaml
Sample log excerpt:
[INFO|trainer.py:2519] 2026-01-06 17:57:54,377 >> ***** Running training *****
[INFO|trainer.py:2520] 2026-01-06 17:57:54,377 >> Num examples = 300
[INFO|trainer.py:2521] 2026-01-06 17:57:54,377 >> Num Epochs = 3
[INFO|trainer.py:2522] 2026-01-06 17:57:54,377 >> Instantaneous batch size per device = 1
[INFO|trainer.py:2525] 2026-01-06 17:57:54,377 >> Total train batch size (w. parallel, distributed & accumulation) = 64
[INFO|trainer.py:2526] 2026-01-06 17:57:54,377 >> Gradient Accumulation steps = 8
[INFO|trainer.py:2527] 2026-01-06 17:57:54,377 >> Total optimization steps = 15
[INFO|trainer.py:2528] 2026-01-06 17:57:54,379 >> Number of trainable parameters = 8,190,735,360
{ 'loss': 0.3777, 'grad_norm': 5.1667, 'learning_rate': 2.20e-06, 'rewards/chosen': 0.0152, 'rewards/rejected': -2.6984, 'rewards/accuracies': 0.6743, 'rewards/margins': 2.7136, 'logps/chosen': -442.07, 'logps/rejected': -538.82, 'logits/chosen': -0.4599, 'logits/rejected': -0.4149, 'epoch': 2.0 }
...
{ 'train_runtime': 492.6162, 'train_samples_per_second': 1.827, 'train_steps_per_second': 0.03, 'train_loss': 0.2571, 'epoch': 3.0 }
Observations:
Total train batch size = 64 (effective global batch).
The 8,190,735,360 (≈8.19 B) trainable parameters match full‑parameter training of Qwen3‑8B.
Training loss steadily decreases, indicating stable convergence.
Checkpoint and Export
Checkpoints are saved every save_steps (500) and a final checkpoint is written at the end.
[INFO|trainer.py:4309] 2026-01-06 18:04:45,789 >> Saving model checkpoint to /home/workspace/saves/qwen3-8b/full/dpo/checkpoint-15
...
[2026-01-06 18:06:05,793] [INFO] [engine.py:3478:_save_zero_checkpoint] zero checkpoint saved /home/workspace/saves/qwen3-8b/full/dpo/checkpoint-15/global_step15/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt
...
[INFO|trainer.py:4309] 2026-01-06 18:06:19,884 >> Saving model checkpoint to /home/workspace/saves/qwen3-8b/full/dpo
Resulting directory layout (selected files):
added_tokens.json
config.json
model-00001-of-00004.safetensors
model-00002-of-00004.safetensors
model-00003-of-00004.safetensors
model-00004-of-00004.safetensors
model.safetensors.index.json
tokenizer.json
tokenizer_config.json
chat_template.jinja
training_loss.png
training_rewards_accuracies.png
trainer_state.json
trainer_log.jsonl
Next step: Mount this directory into a new vLLM container and point --model to it to run inference with the DPO‑fine‑tuned weights.
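As a sketch, serving the fine‑tuned weights can reuse the same image and mount; the container name below is illustrative and the flags are abbreviated from the full run command shown earlier:

```bash
#!/bin/bash
# Serve the DPO checkpoint: the saves/ directory already lives under the
# /mnt/nvme/runtimes -> /home/workspace mount used throughout this guide.
docker run -itd \
  --net=host --privileged --shm-size=128G \
  -v /mnt/nvme/runtimes:/home/workspace \
  --name=qwen3-8b-dpo-serve \
  aiak-inference-llm:xpu_dev_20251202_172933-with-lf bash
# Inside the container, reuse the Qwen3-8B launch script, changing only:
#   --model /home/workspace/saves/qwen3-8b/full/dpo/
```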
Common Issues and Troubleshooting
vLLM Service Starts Slowly
Symptom : First launch takes several minutes, logs show “Graph capturing / warmup”.
Cause : vLLM‑Kunlun compiles the computation graph on first run.
Solution : Wait 5–15 minutes for the initial start; subsequent restarts are much faster.
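When scripts depend on the endpoint, polling is more robust than sleeping for a fixed time. A minimal sketch, assuming the server exposes vLLM's /health route on the port used above; the probe is injectable so the loop can be tested offline:

```python
import time
import urllib.error
import urllib.request

def wait_for_ready(url: str, timeout_s: float = 900.0, interval_s: float = 10.0,
                   probe=None) -> bool:
    """Poll until the server answers, allowing for the long first-start warmup.

    The default probe GETs `url` and reports success; pass a callable as
    `probe` to test the loop without a live server.
    """
    def default_probe():
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                return resp.status == 200
        except (urllib.error.URLError, OSError):
            return False

    probe = probe or default_probe
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if probe():
            return True
        time.sleep(interval_s)
    return False

# Example: wait up to 15 minutes for graph capture/warmup to finish.
#   wait_for_ready("http://127.0.0.1:8999/health", timeout_s=900)
```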
Out‑of‑Memory Errors
Reduce --max-model-len, --max_num_seqs or --max_num_batched_tokens.
Lower --gpu-memory-utilization (e.g., 0.9).
Ensure no other processes occupy the same P800 memory.
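A rough way to see why these flags matter: the worst‑case KV‑cache footprint grows with max_model_len × max_num_seqs. A sketch with illustrative Qwen3‑8B‑shaped values; the layer/head numbers are assumptions and should be checked against the model's config.json:

```python
def kv_cache_gib(num_layers: int, num_kv_heads: int, head_dim: int,
                 max_model_len: int, max_num_seqs: int,
                 bytes_per_elem: int = 2) -> float:
    """Upper bound on KV-cache memory: K and V per layer, per token."""
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem
    return per_token * max_model_len * max_num_seqs / 1024**3

# Illustrative Qwen3-8B-like shape (verify against config.json); float16 = 2 B.
worst_case = kv_cache_gib(36, 8, 128, 32768, 128)
print(f"{worst_case:.0f} GiB worst case")
```

At the script defaults (128 sequences × 32 K tokens) the theoretical worst case far exceeds a single card, which is why vLLM allocates KV blocks on demand under --gpu-memory-utilization, and why lowering these limits is the first lever against OOM.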
Performance Below Expectations
Verify that XPU_VISIBLE_DEVICES and --tensor-parallel-size match the intended card count.
Use xpu-smi to monitor utilization and temperature for possible throttling.
Compare TP=1 vs TP>1 benchmark results to confirm multi‑card scaling.
Conclusion and Future Directions
The guide demonstrates a complete workflow for running Qwen3 series models on Kunlun P800 XPU hardware, from environment validation and container setup to inference deployment, performance benchmarking, and full‑parameter DPO training. Engineers can now extend the workflow to multi‑node clusters, replace the demo DPO dataset with domain‑specific preference data, and integrate the vLLM OpenAI endpoint into internal model‑service gateways.
Baidu Intelligent Cloud Tech Hub