How to Set Up Xinference with NVIDIA RTX 4090 on Oracle Linux: A Step‑by‑Step Guide
This guide walks you through configuring a high‑performance AI inference server on Oracle Linux: hardware specs, NVIDIA driver and CUDA installation, Conda environment setup, Xinference deployment, service startup, and example model-loading commands, with code snippets throughout.
1. Server Information
CPU: Intel(R) Xeon(R) Gold 5416S
Memory: 128GB
Disk: 2 x 1TB SSD
GPU: 2 x NVIDIA RTX 4090 24GB
OS: Oracle Linux 8.10
2. Install NVIDIA Driver and CUDA
2.1 Install kernel headers and dependencies
yum install -y kernel-uek-devel gcc make dkms elfutils-libelf-devel
After installation, ensure that the kernel and kernel-devel versions match.
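For example, you can compare the running kernel against the installed headers:
uname -r                           # running kernel version
rpm -qa | grep kernel-uek-devel    # installed headers; versions should match the line above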
2.2 Download and install the driver
Driver download URL: https://www.nvidia.com/en-us/drivers/results/
Example driver file:
NVIDIA-Linux-x86_64-570.133.07.run
chmod +x NVIDIA-Linux-x86_64-570.133.07.run
./NVIDIA-Linux-x86_64-570.133.07.run
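If the installer aborts because the in-kernel nouveau driver is still loaded, a common fix on RHEL-family systems (this step is an assumption about your setup, not one the original walkthrough required) is to blacklist nouveau, rebuild the initramfs, and reboot before re-running the installer:
cat <<'EOF' > /etc/modprobe.d/blacklist-nouveau.conf
blacklist nouveau
options nouveau modeset=0
EOF
dracut --force   # rebuild the initramfs, then reboot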
2.3 Download and install CUDA
CUDA 12.4 runfile (local) download URL: https://developer.nvidia.com/cuda-12-4-0-download-archive?target_os=Linux&target_arch=x86_64&Distribution=RHEL&target_version=8&target_type=runfile_local
chmod +x cuda_12.4.0_550.54.14_linux.run
./cuda_12.4.0_550.54.14_linux.run
2.4 Verify driver installation
nvidia-smi
If the GPU information is displayed correctly, the driver is installed successfully.
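To verify the CUDA toolkit as well, add it to your shell's PATH first. A minimal sketch, assuming the runfile installed to the default prefix /usr/local/cuda-12.4:
echo 'export PATH=/usr/local/cuda-12.4/bin:$PATH' >> ~/.bashrc
echo 'export LD_LIBRARY_PATH=/usr/local/cuda-12.4/lib64:$LD_LIBRARY_PATH' >> ~/.bashrc
source ~/.bashrc
nvcc --version    # should report release 12.4 if the toolkit installed correctly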
3. Install Conda
3.1 Install Miniconda
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh
After the interactive installation, activate Conda:
source ~/.bashrc
3.2 Create and activate a virtual environment
conda create -n xinference python=3.10 -y
conda activate xinference
4. Install Xinference and Related Dependencies
4.1 Install Xinference (vllm version)
pip install "xinference[vllm]"4.2 Install flashinfer
pip install flashinfer -i https://flashinfer.ai/whl/cu124/torch2.4
4.3 Install sentence-transformers
pip install sentence-transformers
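Before starting the service, it is worth confirming that the key packages resolved and that PyTorch can see the GPUs, for example:
pip show xinference vllm flashinfer sentence-transformers | grep -E '^(Name|Version)'
python -c "import torch; print(torch.cuda.is_available(), torch.cuda.device_count())"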
5. Start Xinference Service
Ensure the Xinference environment is activated:
conda activate xinference
Start the service with the following command (adjust GPU IDs as needed):
nohup bash -c "HF_ENDPOINT=https://hf-mirror.com XINFERENCE_HOME=/data/xinference CUDA_VISIBLE_DEVICES=0,1 xinference-local --host 0.0.0.0 --port 9997" &
Command-line parameters:
nohup: run the command in the background so it survives logout.
bash -c "...": execute the environment settings and the launch command as one unit.
HF_ENDPOINT=https://hf-mirror.com: use a domestic Hugging Face mirror to speed up model downloads.
XINFERENCE_HOME=/data/xinference: set the Xinference working directory for model files and cache.
CUDA_VISIBLE_DEVICES=0,1: specify which GPUs to use (both GPUs in this case).
xinference-local: start Xinference in local mode.
--host 0.0.0.0: listen on all interfaces, allowing external access.
--port 9997: web service listening port.
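Once the process is up, you can confirm the server is reachable before loading any models, for example:
tail -f nohup.out                      # watch the startup log written by nohup
curl http://127.0.0.1:9997/v1/models   # list running models; empty until you launch one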
6. Model Loading Examples
Access the UI at http://10.0.110.20:9997/
6.1 Start a reranker model
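Models can be started from the web UI, or from the command line with the xinference client. A minimal sketch; bge-reranker-v2-m3 is an assumed example model name, substitute whichever reranker you actually use:
xinference launch \
  --endpoint http://10.0.110.20:9997 \
  --model-name bge-reranker-v2-m3 \
  --model-type rerank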
6.2 Start an embedding model
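Likewise for an embedding model; bge-m3 here is an assumed example:
xinference launch \
  --endpoint http://10.0.110.20:9997 \
  --model-name bge-m3 \
  --model-type embedding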
6.3 Start a language model
Key parameters for language model inference (see the launch sketch after this list):
gpu_memory_utilization: fraction of each GPU's memory the engine may use, between 0 and 1 (0.8 here, i.e. at most 80%).
max_num_seqs: maximum number of concurrent inference sequences (set to 30 in this example).
max_model_len: maximum context length in tokens (prompt plus generated output), up to 40960 here.
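These vLLM parameters can be passed through xinference launch as extra flags. A sketch, assuming the built-in qwen2.5-instruct model in its 7B pytorch format; adjust the name, size, and format to whatever you actually serve:
xinference launch \
  --endpoint http://10.0.110.20:9997 \
  --model-name qwen2.5-instruct \
  --model-type LLM \
  --model-engine vllm \
  --size-in-billions 7 \
  --model-format pytorch \
  --gpu_memory_utilization 0.8 \
  --max_num_seqs 30 \
  --max_model_len 40960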
7. Common Conda Commands
7.1 List all environments
conda env list
7.2 Create a new environment
conda create --name myenv python=3.10
7.3 Activate an environment
conda activate myenv
7.4 Deactivate the current environment
conda deactivate
7.5 Remove an environment
conda remove --name myenv --all
7.6 Install packages
conda install numpy pandas matplotlib
7.7 Update a package
conda update numpy
