How to Deploy and Run Large Language Models with Xinference: A Step‑by‑Step Guide

Xinference is a distributed inference framework for quickly deploying and efficiently serving open-source large language models via Docker or a source installation. It offers Web UI, CLI, and API interfaces; this guide walks through setup, model launching, and Chatbox integration.

Xinference is a fast‑to‑deploy, easy‑to‑use, high‑performance distributed inference framework that supports multiple open‑source large language models and provides both a Web GUI and API for model serving.

1. Xinference Overview

Xinference enables one‑click deployment of your own models or built‑in cutting‑edge open‑source models, suitable for researchers, developers, and data scientists.

2. Installation

2.1 Docker Installation

On Linux or Windows servers with Docker and CUDA installed, run:

docker pull xprobe/xinference:latest
docker run -p 9997:9997 --gpus all xprobe/xinference:latest xinference-local -H 0.0.0.0

After starting, the service is reachable at port 9997.
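
If you want to confirm the container is actually serving requests before moving on, here is a minimal Python sketch that queries the OpenAI-compatible /v1/models route; the host, port, and route are assumptions based on the command above.

# Quick reachability check for the Xinference service (assumed to be on localhost:9997).
import requests

resp = requests.get("http://localhost:9997/v1/models", timeout=5)
resp.raise_for_status()
# Lists the models currently running; the list is empty until a model is launched.
print(resp.json())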

2.2 Local Source Installation

Create a Python 3.10 environment:

conda create --name xinference python=3.10
conda activate xinference

Install Xinference with desired inference engines:

pip install "xinference[transformers]" -i https://pypi.tuna.tsinghua.edu.cn/simple
pip install "xinference[vllm]" -i https://pypi.tuna.tsinghua.edu.cn/simple
# or install all backends
pip install "xinference[all]" -i https://pypi.tuna.tsinghua.edu.cn/simple

Start the service:

xinference-local -H 0.0.0.0

The default port is 9997; with -H 0.0.0.0 the service is accessible from other machines.

3. Deploy a Local Model (example: Qwen‑14B)

3.1 Web UI Launch

Open http://localhost:9997, find “qwen-chat” on the “Launch Model” tab, configure the parameters (format, size, quantization), and click the launch button. By default the model UID is the same as the model name, qwen-chat.

3.2 Command‑Line Launch

Run:

xinference launch -n qwen-chat -s 14 -f pytorch

Here -n is the model name, -s the model size in billions of parameters, and -f the model format.
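
The same launch can also be done from Python. Below is a minimal sketch using Xinference's Python client; the endpoint is an assumption, and keyword names such as model_size_in_billions follow the client API but may differ slightly between versions.

from xinference.client import Client

# Connect to the running Xinference service (adjust the address to your deployment).
client = Client("http://localhost:9997")

# Launch Qwen-14B chat in PyTorch format; the returned UID identifies the model in later calls.
model_uid = client.launch_model(
    model_name="qwen-chat",
    model_size_in_billions=14,
    model_format="pytorch",
)
print(model_uid)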

3.3 Model Source Settings

By default, models are downloaded from Hugging Face. In China you can speed this up with environment variables: HF_ENDPOINT points Hugging Face downloads at a mirror, XINFERENCE_MODEL_SRC=modelscope switches the download source to ModelScope, and XINFERENCE_HOME changes the directory where Xinference stores downloaded models and logs:

export HF_ENDPOINT=https://hf-mirror.com
export XINFERENCE_MODEL_SRC=modelscope
export XINFERENCE_HOME=/path/to/xinference

4. Model Application

4.1 Prepare

After the model is started, Xinference provides a simple built-in web chat page for quick testing.

Copy the model ID (UID) displayed under the title; you will need it when calling the API.
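
If you prefer not to copy it from the page, the running models and their UIDs can also be listed programmatically. A short sketch with the same Python client as above (the server address is an assumption):

from xinference.client import Client

client = Client("http://localhost:9997")  # adjust to your server address

# The model UIDs appear as the top-level keys of the returned mapping.
print(client.list_models())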

4.2 Curl Example

curl --location --request POST 'http://192.2.22.55:9997/v1/chat/completions' \
--header 'Authorization: Bearer YOUR_KEY' \
--header 'Content-Type: application/json' \
--data-raw '{
  "model": "qwen1.5-chat",
  "messages": [
    {"role": "user", "content": "..."}
  ]
}'
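
Because the endpoint is OpenAI-compatible, the official openai Python package can call it as well. A minimal sketch, assuming the server address and model UID from this guide and that authentication is not enabled (in that case the api_key value is not actually checked):

from openai import OpenAI

# Point the OpenAI client at the Xinference server; substitute your own address.
client = OpenAI(base_url="http://192.2.22.55:9997/v1", api_key="YOUR_KEY")

resp = client.chat.completions.create(
    model="qwen1.5-chat",  # the model ID copied earlier
    messages=[{"role": "user", "content": "Hello, who are you?"}],
)
print(resp.choices[0].message.content)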

4.3 Chatbox Integration

Install Chatbox v1.0.0 or later.

In the settings, choose OpenAI as the provider, set the API domain to the Xinference address (e.g., http://192.2.22.55:9997/), and enter the copied model ID as a custom model name.

4.4 Start Conversing

After these steps you can chat with the locally hosted LLM. Response speed depends on your GPU capacity, but everything runs locally and free of charge, preserving data privacy.
