How to Deploy and Run Large Language Models with Xinference: A Step‑by‑Step Guide
Xinference is a fast-to-deploy, easy-to-use, high-performance distributed inference framework for serving open-source large language models. It can be installed via Docker or from source, and it exposes a Web UI, a command-line interface, and an OpenAI-compatible API. This guide walks through setup, model launching, and Chatbox integration.
1. Xinference Overview
Xinference enables one-click deployment of your own models or of built-in, state-of-the-art open-source models, making it suitable for researchers, developers, and data scientists.
2. Installation
2.1 Docker Installation
On Linux or Windows servers with Docker and CUDA installed, run:
docker pull xprobe/xinference:latest
docker run -p 9997:9997 --gpus all xprobe/xinference:latest xinference-local -H 0.0.0.0
After starting, the service is reachable on port 9997.
2.2 Local Source Installation
Create a Python 3.10 environment:
conda create --name xinference python=3.10
conda activate xinference
Install Xinference with the desired inference engines:
pip install "xinference[transformers]" -i https://pypi.tuna.tsinghua.edu.cn/simple
pip install "xinference[vllm]" -i https://pypi.tuna.tsinghua.edu.cn/simple
# or install all backends
pip install "xinference[all]" -i https://pypi.tuna.tsinghua.edu.cn/simple
Start the service:
xinference-local -H 0.0.0.0
The default port is 9997; with -H 0.0.0.0 the service is accessible from other machines.
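Once the service is running, it is worth verifying that it is reachable before launching a model. A minimal stdlib-only sketch in Python, assuming the default address http://127.0.0.1:9997 (adjust host and port to your deployment):

```python
import urllib.error
import urllib.request


def is_xinference_up(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if the Xinference endpoint answers on /v1/models."""
    url = f"http://{host}:{port}/v1/models"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False


if __name__ == "__main__":
    print("service up:", is_xinference_up("127.0.0.1", 9997))
```

If this returns False, check that the container or process is running and that the port is not blocked by a firewall.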
3. Deploy a Local Model (example: Qwen‑14B)
3.1 Web UI Launch
Open http://localhost:9997, select "qwen-chat" in the "Launch Model" tab, configure parameters, and click the launch button. The model UID is qwen-chat.
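The same launch can also be performed programmatically against Xinference's RESTful API (POST /v1/models). The field names below mirror the CLI flags but are assumptions to verify against your version's API docs; a sketch:

```python
import json
import urllib.request

XINFERENCE_URL = "http://127.0.0.1:9997"  # adjust to your deployment


def build_launch_payload(model_name: str, size_b: int, model_format: str) -> bytes:
    """JSON body mirroring `xinference launch -n ... -s ... -f ...`."""
    return json.dumps({
        "model_name": model_name,
        "model_size_in_billions": size_b,
        "model_format": model_format,
    }).encode("utf-8")


def launch_model(model_name: str, size_b: int, model_format: str) -> str:
    """POST the launch request; return the model UID assigned by the server."""
    req = urllib.request.Request(
        f"{XINFERENCE_URL}/v1/models",
        data=build_launch_payload(model_name, size_b, model_format),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["model_uid"]
```

With a running server, `launch_model("qwen-chat", 14, "pytorch")` should return the UID to use in later API calls.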
3.2 Command‑Line Launch
Run:
xinference launch -n qwen-chat -s 14 -f pytorch
3.3 Model Source Settings
By default models are downloaded from HuggingFace. In China you can set environment variables to use a mirror:
export HF_ENDPOINT=https://hf-mirror.com
export XINFERENCE_MODEL_SRC=modelscope
export XINFERENCE_HOME=/path/to/xinference
4. Model Application
4.1 Prepare
After the model is started, a local web page appears for simple chat testing.
Copy the Model ID displayed under the title for later use.
4.2 Curl Example
curl --location --request POST 'http://192.2.22.55:9997/v1/chat/completions' \
--header 'Authorization: Bearer YOUR_KEY' \
--header 'Content-Type: application/json' \
--data-raw '{"model":"qwen1.5-chat","messages":[{"role":"user","content":"..."}]}'
4.3 Chatbox Integration
Install Chatbox v1.0.0 or later.
In the settings, choose OpenAI as the provider, set the API domain to the Xinference address (e.g., http://192.2.22.55:9997/), and enter the copied Model ID as a custom model name.
4.4 Start Conversing
After these steps you can chat with the locally hosted LLM. Response speed depends on GPU capacity, but the setup runs fully offline and free of charge, preserving data privacy.
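The curl call above maps directly onto a few lines of standard-library Python, which can be handy for scripting against the OpenAI-compatible endpoint. A sketch, assuming the same example address and using the Model ID copied earlier (substitute your own host, key, and model UID):

```python
import json
import urllib.request

API_URL = "http://192.2.22.55:9997/v1/chat/completions"  # your Xinference address
API_KEY = "YOUR_KEY"  # a local Xinference typically accepts any placeholder key


def build_chat_payload(model_uid: str, user_message: str) -> bytes:
    """OpenAI-compatible chat completion request body."""
    return json.dumps({
        "model": model_uid,
        "messages": [{"role": "user", "content": user_message}],
    }).encode("utf-8")


def chat(model_uid: str, user_message: str) -> str:
    """POST to /v1/chat/completions and return the assistant's reply text."""
    req = urllib.request.Request(
        API_URL,
        data=build_chat_payload(model_uid, user_message),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {API_KEY}",
        },
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        body = json.loads(resp.read())
    return body["choices"][0]["message"]["content"]
```

Because the endpoint is OpenAI-compatible, any OpenAI client library can also be pointed at the same base URL instead of hand-rolling the request.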
