Deploying Qwen3 on Kunlun P800: Full‑Parameter DPO Training and Inference Guide

This guide walks through setting up a Kunlun P800 XPU host, preparing Docker containers, deploying Qwen3‑8B/‑32B/‑VL models with vLLM‑Kunlun, and benchmarking performance, then runs full‑parameter DPO training with LLaMA‑Factory. Scripts, configuration files, and troubleshooting tips are included for AI engineers.

Baidu Intelligent Cloud Tech Hub

Overview

The article presents a step‑by‑step practical guide for deploying Qwen3‑8B, Qwen3‑32B and Qwen3‑VL‑8B‑Instruct on a physical Kunlun P800 XPU server, running inference with the vLLM‑Kunlun plugin and performing full‑parameter Direct Preference Optimization (DPO) training via LLaMA‑Factory.

Test Environment

Hardware: 8 × Kunlun P800 cards (driver 5.0.21.21), Ubuntu 22.04, Docker image xpu_dev_20251202_172933-with-lf

Performance: Qwen3‑32B (TP=8) 1184 tok/s output throughput, TTFT 1.8 s; Qwen3‑VL‑8B (TP=4) 1942 tok/s; Qwen3‑8B (TP=1) 1667 tok/s

Technical Stack : vLLM‑Kunlun for OpenAI‑compatible inference, LLaMA‑Factory for DPO training

Physical Machine Specifications

root@h3c:/mnt/nvme# lscpu | grep "Model name"
Model name:               Hygon_7470
root@h3c:/mnt/nvme# lscpu | grep Architecture
Architecture:             x86_64
root@h3c:/mnt/nvme# free -g
              total   used   free  shared  buff/cache  available
Mem:          1507     24    579      0        903        1475
Swap:            7      0      7
root@h3c:/mnt/nvme# lsblk
NAME   MAJ:MIN RM   SIZE RO TYPE MOUNTPOINTS
sda    8:0    0 447.1G  0 disk
├─sda1 8:1    0   1G   0 part /boot/efi
├─sda2 8:2    0   2G   0 part /boot
└─sda3 8:3    0 444G   0 part
  └─ubuntu--vg-ubuntu--lv 253:1 0 444G 0 lvm /
nvme0n1 259:0 0 3.5T 0 disk
└─nvme0n1p1 259:2 0 3.5T 0 part /mnt/nvme
nvme1n1 259:1 0 3.5T 0 disk
└─nvme_vg-docker_lv 253:0 0 3.5T 0 lvm /mnt/nvmelvm

Docker Image Overview

Key images (size in GB):

h3c-repack-qwen3_32b-ubuntu22.04:20250725-v1.0   b12133610dc3   66.5GB
aiak-inference-llm:vllm_kunlun_20250917_214552   feb894951c53   105GB
aiak-inference-llm:xpu_dev_20251105_151131-vllm01011-awq   2b09210d3113   114GB
aiak-inference-llm:xpu_dev_20251202_172933   e98a5cf79b1b   132GB
aiak-inference-llm:xpu_dev_20251202_172933-with-lf   30d22a3c953e   135GB

The xpu_dev_20251202_172933-with-lf image supports Qwen3/Qwen3‑VL and ships with LLaMA‑Factory pre‑installed, making it the recommended image.

Container and Directory Preparation

The host directory /mnt/nvme/runtimes is mounted inside the container at /home/workspace. The container's /workspace directory contains the vllm, vllm‑kunlun, and LLaMA‑Factory repositories.

Start Inference Container

#!/bin/bash

docker run -itd \
  --net=host \
  --pid=host \
  --cap-add=SYS_PTRACE --security-opt=seccomp=unconfined \
  --ulimit memlock=-1 --ulimit nofile=120000 --ulimit stack=67108864 \
  --shm-size=128G \
  --privileged \
  -v /usr/local/bin/:/usr/local/bin/ \
  --name=xpu_dev_20251202_172933-with-lf \
  -v /mnt/nvme/runtimes:/home/workspace \
  -w /workspace \
  -v /lib/x86_64-linux-gnu/libxpunvidia-ml.so.1:/lib/x86_64-linux-gnu/libxpunvidia-ml.so.1 \
  aiak-inference-llm:xpu_dev_20251202_172933-with-lf bash

docker exec -it xpu_dev_20251202_172933-with-lf /bin/bash

Recommendation: Use the xpu_dev_20251202_172933-with-lf image for a quick start.

Qwen3‑8B Inference Script

unset XPU_DUMMY_EVENT
export XPU_VISIBLE_DEVICES=0
export XPU_USE_MOE_SORTED_THRES=1
export XFT_USE_FAST_SWIGLU=1
export XMLIR_CUDNN_ENABLED=1
export XPU_USE_DEFAULT_CTX=1
export XMLIR_FORCE_USE_XPU_GRAPH=1
export XPU_USE_FAST_SWIGLU=1
export XMLIR_ENABLE_MOCK_TORCH_COMPILE=false
export USE_ORI_ROPE=1

VLLM_USE_V1=1 python -m vllm.entrypoints.openai.api_server \
  --host 0.0.0.0 \
  --port 8999 \
  --model /home/workspace/Qwen3-8B/ \
  --gpu-memory-utilization 0.95 \
  --trust-remote-code \
  --max-model-len 32768 \
  --tensor-parallel-size 1 \
  --dtype float16 \
  --max_num_seqs 128 \
  --max_num_batched_tokens 32768 \
  --block-size 128 \
  --no-enable-prefix-caching \
  --no-enable-chunked-prefill \
  --distributed-executor-backend mp \
  --compilation-config '{"splitting_ops": ["vllm.unified_attention","vllm.unified_attention_with_output","vllm.unified_attention_with_output_kunlun","vllm.mamba_mixer2","vllm.mamba_mixer","vllm.short_conv","vllm.linear_attention","vllm.plamo2_mamba_mixer","vllm.gdn_attention","vllm.sparse_attn_indexer"]}'
XPU_VISIBLE_DEVICES=0 restricts the server to a single card, and --tensor-parallel-size 1 disables tensor parallelism accordingly. --gpu-memory-utilization 0.95 lets vLLM use ~95 % of VRAM while keeping enough headroom to avoid OOM.
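Once the server finishes warmup, the OpenAI‑compatible endpoint can be exercised with a plain HTTP request. A minimal sketch using only the standard library (the model field must match the --model path passed at launch; host/port are taken from the script above):

```python
import json
import urllib.request

# OpenAI-compatible chat completion request for the vLLM server above.
payload = {
    "model": "/home/workspace/Qwen3-8B/",
    "messages": [{"role": "user", "content": "Hello"}],
    "max_tokens": 64,
    "temperature": 0.7,
}
req = urllib.request.Request(
    "http://0.0.0.0:8999/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
# Send once the server has finished graph warmup:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```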

Qwen3‑32B Inference Script (TP=8)

unset XPU_DUMMY_EVENT
export XPU_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
export XPU_USE_MOE_SORTED_THRES=1
export XFT_USE_FAST_SWIGLU=1
export XMLIR_CUDNN_ENABLED=1
export XPU_USE_DEFAULT_CTX=1
export XMLIR_FORCE_USE_XPU_GRAPH=1
export XPU_USE_FAST_SWIGLU=1
export XMLIR_ENABLE_MOCK_TORCH_COMPILE=false
export USE_ORI_ROPE=1

VLLM_USE_V1=1 python -m vllm.entrypoints.openai.api_server \
  --host 0.0.0.0 \
  --port 8999 \
  --model /home/workspace/Qwen3-32B/ \
  --gpu-memory-utilization 0.95 \
  --trust-remote-code \
  --max-model-len 32768 \
  --tensor-parallel-size 8 \
  --dtype float16 \
  --max_num_seqs 128 \
  --max_num_batched_tokens 32768 \
  --block-size 128 \
  --no-enable-prefix-caching \
  --no-enable-chunked-prefill \
  --distributed-executor-backend mp \
  --compilation-config '{"splitting_ops": ["vllm.unified_attention","vllm.unified_attention_with_output","vllm.unified_attention_with_output_kunlun","vllm.mamba_mixer2","vllm.mamba_mixer","vllm.short_conv","vllm.linear_attention","vllm.plamo2_mamba_mixer","vllm.gdn_attention","vllm.sparse_attn_indexer"]}'

Qwen3‑VL‑8B‑Instruct Inference Script (TP=4)

unset XPU_DUMMY_EVENT
export XPU_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
export XFT_USE_FAST_SWIGLU=1
export XPU_USE_FAST_SWIGLU=1
export XMLIR_CUDNN_ENABLED=1
export USE_FAST_BF16_FC=true
export XPU_USE_DEFAULT_CTX=1
export XMLIR_FORCE_USE_XPU_GRAPH=1
export XPU_USE_MOE_SORTED_THRES=128
export VLLM_HOST_IP=$(hostname -i)
export XMLIR_ENABLE_MOCK_TORCH_COMPILE=false
export USE_ORI_ROPE=1
export VLLM_USE_V1=1

python -m vllm.entrypoints.openai.api_server \
  --host 0.0.0.0 \
  --port 8999 \
  --model /home/workspace/Qwen3-VL-8B-Instruct/ \
  --gpu-memory-utilization 0.9 \
  --trust-remote-code \
  --max-model-len 32768 \
  --tensor-parallel-size 4 \
  --dtype float16 \
  --max_num_seqs 128 \
  --max_num_batched_tokens 32768 \
  --block-size 128 \
  --no-enable-prefix-caching \
  --no-enable-chunked-prefill \
  --distributed-executor-backend mp \
  --compilation-config '{"splitting_ops": ["vllm.unified_attention","vllm.unified_attention_with_output","vllm.unified_attention_with_output_kunlun","vllm.mamba_mixer2","vllm.mamba_mixer","vllm.short_conv","vllm.linear_attention","vllm.plamo2_mamba_mixer","vllm.gdn_attention","vllm.sparse_attn_indexer"]}'

Tip: Do not modify --compilation-config or XPU environment variables unless you understand Kunlun graph compilation.

Performance Benchmark (vLLM Bench)

Benchmarks were run with vllm bench serve on the same container.

Qwen3‑32B (TP=2)

vllm bench serve --host 0.0.0.0 --port 8999 --backend vllm \
  --model /home/workspace/Qwen3-32B/ \
  --dataset-name random --num-prompts 128 \
  --random-input-len 1024 --random-output-len 1024 \
  --max-concurrency 48
============ Serving Benchmark Result ============
Successful requests:                 128
Maximum request concurrency:         48
Benchmark duration (s):              143.52
Total input tokens:                  130513
Total generated tokens:              118159
Request throughput (req/s):          0.89
Output token throughput (tok/s):     823.29
Peak output token throughput (tok/s):1152.00
Total Token throughput (tok/s):       1732.65
Mean TTFT (ms):                       4233.80
Median TTFT (ms):                     4543.36
P99 TTFT (ms):                       7821.77
Mean TPOT (ms):                       70.18
Median TPOT (ms):                     47.09
P99 TPOT (ms):                        923.11
Mean ITL (ms):                        45.04
Median ITL (ms):                      42.86
P99 ITL (ms):                         60.17
==================================================

Qwen3‑32B (TP=8)

vllm bench serve --host 0.0.0.0 --port 8999 --backend vllm \
  --model /home/workspace/Qwen3-32B/ \
  --dataset-name random --num-prompts 128 \
  --random-input-len 1024 --random-output-len 1024 \
  --max-concurrency 48
============ Serving Benchmark Result ============
Successful requests:                 128
Maximum request concurrency:         48
Benchmark duration (s):              98.70
Total input tokens:                  130513
Total generated tokens:              116917
Request throughput (req/s):          1.30
Output token throughput (tok/s):     1184.52
Peak output token throughput (tok/s):1642.00
Total Token throughput (tok/s):       2506.79
Mean TTFT (ms):                       1801.71
Median TTFT (ms):                     1865.50
P99 TTFT (ms):                         3376.12
Mean TPOT (ms):                       42.35
Median TPOT (ms):                     34.03
P99 TPOT (ms):                        421.33
Mean ITL (ms):                        32.48
Median ITL (ms):                      29.55
P99 ITL (ms):                         220.38
==================================================
Increasing tensor parallelism from 2 to 8 raises throughput and cuts TTFT by more than half, demonstrating the multi‑card scaling capability of the P800.
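The reported figures can be cross‑checked from the raw counts in the benchmark output; a quick sanity check of the TP=8 run:

```python
# Re-derive the TP=8 throughput figures from the raw benchmark counts above.
duration_s = 98.70
input_tokens = 130513
generated_tokens = 116917

# Output and total token throughput; small deviations from the reported
# values come from the benchmark using an unrounded duration internally.
output_tput = generated_tokens / duration_s                   # ≈ 1184.6 tok/s
total_tput = (input_tokens + generated_tokens) / duration_s   # ≈ 2506.9 tok/s
speedup_vs_tp2 = output_tput / 823.29                         # ≈ 1.44× over TP=2
```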

Qwen3‑VL‑8B‑Instruct (TP=4)

============ Serving Benchmark Result ============
Successful requests:                 128
Maximum request concurrency:         48
Benchmark duration (s):              55.70
Total input tokens:                  130513
Total generated tokens:              108218
Request throughput (req/s):          2.30
Output token throughput (tok/s):     1942.78
Peak output token throughput (tok/s):2736.00
Total Token throughput (tok/s):       4285.82
Mean TTFT (ms):                       877.81
Median TTFT (ms):                     906.67
P99 TTFT (ms):                       1645.36
Mean TPOT (ms):                       40.42
Median TPOT (ms):                     19.96
P99 TPOT (ms):                        464.13
Mean ITL (ms):                        19.50
Median ITL (ms):                      18.26
P99 ITL (ms):                         123.94
==================================================

Qwen3‑8B (TP=1)

============ Serving Benchmark Result ============
Successful requests:                 128
Maximum request concurrency:         48
Benchmark duration (s):              74.09
Total input tokens:                  130513
Total generated tokens:              123513
Request throughput (req/s):          1.73
Output token throughput (tok/s):     1667.00
Peak output token throughput (tok/s):2208.00
Total Token throughput (tok/s):       3428.48
Mean TTFT (ms):                       1558.48
Median TTFT (ms):                     1389.65
P99 TTFT (ms):                       2619.69
Mean TPOT (ms):                       35.93
Median TPOT (ms):                     24.24
P99 TPOT (ms):                        39.07
Mean ITL (ms):                        23.48
Median ITL (ms):                      23.26
P99 ITL (ms):                         26.09
==================================================
These results provide reliable references for capacity planning and further parameter tuning.

Full‑Parameter DPO Training with LLaMA‑Factory

The training uses the same P800 cluster and the xpu_dev_20251202_172933-with-lf image.

Training Configuration (qwen3_full_dpo.yaml)

### model
model_name_or_path: /home/workspace/Qwen3-8B/
trust_remote_code: true

### method
stage: dpo
do_train: true
finetuning_type: full
deepspeed: examples/deepspeed/ds_z3_config.json

### dataset
dataset: dpo_en_demo
template: qwen3_nothink
cutoff_len: 2048
max_samples: 1000
preprocessing_num_workers: 16
dataloader_num_workers: 4

### output
output_dir: /home/workspace/saves/qwen3-8b/full/dpo
logging_steps: 10
save_steps: 500
plot_loss: true
overwrite_output_dir: true
save_only_model: false
report_to: none

### train
per_device_train_batch_size: 1
gradient_accumulation_steps: 8
learning_rate: 5.0e-6
num_train_epochs: 3.0
lr_scheduler_type: cosine
warmup_ratio: 0.1
bf16: true
ddp_timeout: 180000000
resume_from_checkpoint: null

Key points:

stage: dpo – selects Direct Preference Optimization.

finetuning_type: full – full‑parameter fine‑tuning (≈8.19 B trainable parameters).

bf16: true – uses BF16 for stability and performance.

The effective global batch size is per_device_train_batch_size × gradient_accumulation_steps × number of cards.
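For the 8‑card host used here, this works out as follows:

```python
# Effective global batch size for the DPO run on one 8-card P800 host.
per_device_train_batch_size = 1   # from qwen3_full_dpo.yaml
gradient_accumulation_steps = 8   # from qwen3_full_dpo.yaml
num_cards = 8                     # one P800 host

global_batch = (per_device_train_batch_size
                * gradient_accumulation_steps
                * num_cards)
print(global_batch)  # 64, matching "Total train batch size" in the training log
```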

Start Training

(python310_torch25_cuda) root@h3c:/workspace/LLaMA-Factory# llamafactory-cli train /home/workspace/qwen3_full_dpo.yaml

Sample log excerpt:

[INFO|trainer.py:2519] 2026-01-06 17:57:54,377 >> ***** Running training *****
[INFO|trainer.py:2520] 2026-01-06 17:57:54,377 >>   Num examples = 300
[INFO|trainer.py:2521] 2026-01-06 17:57:54,377 >>   Num Epochs = 3
[INFO|trainer.py:2522] 2026-01-06 17:57:54,377 >>   Instantaneous batch size per device = 1
[INFO|trainer.py:2525] 2026-01-06 17:57:54,377 >>   Total train batch size (w. parallel, distributed & accumulation) = 64
[INFO|trainer.py:2526] 2026-01-06 17:57:54,377 >>   Gradient Accumulation steps = 8
[INFO|trainer.py:2527] 2026-01-06 17:57:54,377 >>   Total optimization steps = 15
[INFO|trainer.py:2528] 2026-01-06 17:57:54,379 >>   Number of trainable parameters = 8,190,735,360
{ 'loss': 0.3777, 'grad_norm': 5.1667, 'learning_rate': 2.20e-06, 'rewards/chosen': 0.0152, 'rewards/rejected': -2.6984, 'rewards/accuracies': 0.6743, 'rewards/margins': 2.7136, 'logps/chosen': -442.07, 'logps/rejected': -538.82, 'logits/chosen': -0.4599, 'logits/rejected': -0.4149, 'epoch': 2.0 }
...
{ 'train_runtime': 492.6162, 'train_samples_per_second': 1.827, 'train_steps_per_second': 0.03, 'train_loss': 0.2571, 'epoch': 3.0 }

Observations:

Total train batch size = 64 (effective global batch).

≈8.19 B trainable parameters (8,190,735,360 in the log) match the full Qwen3‑8B parameter count.

Training loss steadily decreases, indicating stable convergence.
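The rewards/* fields in the log follow directly from the DPO objective: the implicit reward for each response is the β‑scaled log‑probability ratio between the policy and the frozen reference model, and the loss is -log σ of the chosen/rejected margin. A minimal sketch, assuming the common default β = 0.1:

```python
import math

def dpo_stats(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected,
              beta=0.1):
    """Per-pair DPO loss and the implicit rewards logged as rewards/*."""
    r_chosen = beta * (logp_chosen - ref_logp_chosen)      # rewards/chosen
    r_rejected = beta * (logp_rejected - ref_logp_rejected)  # rewards/rejected
    margin = r_chosen - r_rejected                          # rewards/margins
    loss = -math.log(1.0 / (1.0 + math.exp(-margin)))       # -log(sigmoid(margin))
    return loss, r_chosen, r_rejected, margin

# With no policy movement the margin is 0 and the loss is ln 2 ≈ 0.693;
# training pushes the margin positive, driving the loss toward 0.
```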

Checkpoint and Export

Checkpoints are saved every save_steps (500) and a final checkpoint is written at the end.

[INFO|trainer.py:4309] 2026-01-06 18:04:45,789 >> Saving model checkpoint to /home/workspace/saves/qwen3-8b/full/dpo/checkpoint-15
...
[2026-01-06 18:06:05,793] [INFO] [engine.py:3478:_save_zero_checkpoint] zero checkpoint saved /home/workspace/saves/qwen3-8b/full/dpo/checkpoint-15/global_step15/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt
...
[INFO|trainer.py:4309] 2026-01-06 18:06:19,884 >> Saving model checkpoint to /home/workspace/saves/qwen3-8b/full/dpo

Resulting directory layout (selected files):

added_tokens.json
config.json
model-00001-of-00004.safetensors
model-00002-of-00004.safetensors
model-00003-of-00004.safetensors
model-00004-of-00004.safetensors
model.safetensors.index.json
tokenizer.json
tokenizer_config.json
chat_template.jinja
training_loss.png
training_rewards_accuracies.png
trainer_state.json
trainer_log.jsonl
Next step: Mount this directory into a new vLLM container and point --model to it for inference with the DPO‑fine‑tuned weights.
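Before pointing --model at the export directory, it helps to confirm the core files vLLM loads at startup are all present. is_servable below is an illustrative helper (not part of vLLM or LLaMA‑Factory), checking for the file names shown in the listing above:

```python
from pathlib import Path

# Core files vLLM reads when loading a safetensors checkpoint directory.
REQUIRED = {"config.json", "model.safetensors.index.json", "tokenizer_config.json"}

def is_servable(model_dir: str) -> bool:
    """Return True if the export directory contains the required files."""
    d = Path(model_dir)
    if not d.is_dir():
        return False
    names = {p.name for p in d.iterdir()}
    return REQUIRED <= names

# Example: is_servable("/home/workspace/saves/qwen3-8b/full/dpo")
```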

Common Issues and Troubleshooting

vLLM Service Starts Slowly

Symptom : First launch takes several minutes, logs show “Graph capturing / warmup”.

Cause : vLLM‑Kunlun compiles the computation graph on first run.

Solution : Wait 5–15 minutes for the initial start; subsequent restarts are much faster.

Out‑of‑Memory Errors

Reduce --max-model-len, --max_num_seqs or --max_num_batched_tokens.

Lower --gpu-memory-utilization (e.g., 0.9).

Ensure no other processes occupy the same P800 memory.

Performance Below Expectations

Verify that XPU_VISIBLE_DEVICES and --tensor-parallel-size match the intended card count.

Use xpu-smi to monitor utilization and temperature for possible throttling.

Compare TP=1 vs TP>1 benchmark results to confirm multi‑card scaling.

Conclusion and Future Directions

The guide demonstrates a complete workflow for running Qwen3 series models on Kunlun P800 XPU hardware, from environment validation and container setup to inference deployment, performance benchmarking, and full‑parameter DPO training. Engineers can now extend the workflow to multi‑node clusters, replace the demo DPO dataset with domain‑specific preference data, and integrate the vLLM OpenAI endpoint into internal model‑service gateways.
