Deploy Qwen3 8B Model with vLLM: Step‑by‑Step Guide for Remote Inference

This guide walks you through deploying Alibaba's open-source Qwen3 8B model on the SumW platform using vLLM: activating the Python environment, launching an OpenAI-compatible server, tunneling the service port over SSH for remote access, and calling the model with the Python OpenAI SDK, with configuration tips and common pitfalls noted along the way.


Overview

Alibaba's Qwen3 series is among the strongest open-source large language models currently available. The 8B variant offers performance comparable to larger Qwen2.5 models, especially on STEM, coding, and reasoning tasks, and supports 119 languages, enhanced agent capabilities, and a 128K-token context window.

Prerequisites

You need a SumW account, access to the GPU-enabled compute market, and the vLLM inference service installed on the remote node.

Step 1 – Activate the Python Environment

source /torch/venv3/pytorch_infer/bin/activate
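
Before launching the server, it is worth confirming that the activated environment actually provides vLLM. A minimal sanity check (the exact versions depend on the platform image):

# Run inside the activated environment; versions depend on the platform image.
import torch
import vllm

print("vLLM:", vllm.__version__)
print("PyTorch:", torch.__version__)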

Step 2 – Launch the vLLM Server

Run the following command to start an OpenAI‑compatible inference server. Adjust the memory‑related flags according to your hardware.

python3 -m vllm.entrypoints.openai.api_server \
    --model ./Qwen3-8B \
    --served-model-name Qwen3-8B \
    --device mlu \
    --dtype float16 \
    --host 0.0.0.0 \
    --port 6006 \
    --api-key hahahaha \
    --trust-remote-code \
    --max-model-len 10000 \
    --block-size 10000 \
    --max-seq-len-to-capture 10000 \
    --gpu-memory-utilization 0.95 \
    --disable-log-requests

Key parameters:

--port 6006: Listening port for the service.

--api-key: Authentication token that clients must present on every request.

--max-model-len 10000: Maximum context length, in tokens, that the server will accept per request.

--gpu-memory-utilization 0.95: Use up to 95% of device memory for model weights and KV cache.
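
Before wiring up a full client, you can confirm the server is reachable and the model is registered. A minimal sketch with the OpenAI SDK, run on the remote node itself (or locally once the tunnel from Step 3 is in place), assuming the port and API key from the launch command above:

from openai import OpenAI

# Same port and key as the launch command; adjust if you changed them.
client = OpenAI(base_url="http://127.0.0.1:6006/v1", api_key="hahahaha")

# /v1/models lists the models being served; the id should equal the
# --served-model-name from the launch command ("Qwen3-8B").
for model in client.models.list():
    print(model.id)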

Step 3 – Set Up Remote Access (SSH Tunnel)

Because the model runs on a remote server, forward the service port to your local machine.

ssh -L 6006:127.0.0.1:6006 -o ProxyCommand="ssh -p [jump_port] [jump_user]@[jump_ip] -W %h:%p" [target_user]@[target_ip]

After entering the passwords for the jump host and the target host, you can reach the model at http://127.0.0.1:6006.
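
To confirm the tunnel is forwarding before moving on, you can hit the server's health endpoint through the local end. A quick sketch using the requests library (assumed to be installed locally); recent vLLM versions expose /health and return HTTP 200 once the model has loaded:

import requests

# Goes through the local end of the SSH tunnel to the remote vLLM server.
resp = requests.get("http://127.0.0.1:6006/health", timeout=5)
print("Tunnel and server OK" if resp.status_code == 200
      else f"Unexpected status: {resp.status_code}")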

Step 4 – Call the Model from Local Python

Use the official OpenAI SDK to send requests to the remote service.

from openai import OpenAI

# Initialize client (port must match the server configuration)
client = OpenAI(
    base_url="http://127.0.0.1:6006/v1",
    api_key="hahahaha"
)

# Create a chat completion request
completion = client.chat.completions.create(
    model="Qwen3-8B",
    messages=[{"role": "user", "content": "你好"}]
)

print(completion.choices[0].message.content)
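
For longer generations you may prefer streaming, so output prints as it is generated instead of all at once. This is standard OpenAI SDK behavior that vLLM's server supports; a sketch reusing the client from above:

# Streaming variant: print tokens as they arrive.
stream = client.chat.completions.create(
    model="Qwen3-8B",
    messages=[{"role": "user", "content": "Introduce yourself in one sentence."}],
    stream=True,
)

for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:  # the first chunk carries the role and no content
        print(delta, end="", flush=True)
print()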

Important Tips

Ensure the port number in the server launch command (--port 6006) matches the base_url in the Python client.

The model name in the client (model="Qwen3-8B") must be identical to the --served-model-name used when starting the server.

If the server fails to start due to memory pressure, reduce --max-model-len or lower --gpu-memory-utilization.

Illustrative Screenshots

[Screenshots: selecting the Qwen3 image, confirming the rental, and accessing the JupyterLab console on the SumW platform.]

Tags: model deployment, vLLM, Python SDK, OpenAI API, Qwen3, SSH tunneling
Written by SuanNi, a community for AI developers that aggregates large-model development services, models, and compute power.