How to Set Up Xinference with NVIDIA RTX 4090 on Oracle Linux: A Step‑by‑Step Guide

This guide walks through configuring a high‑performance AI inference server on Oracle Linux: hardware specs, NVIDIA driver and CUDA installation, Conda environment setup, Xinference deployment, service startup, and example model‑loading commands, with code snippets throughout.

1. Server Information

CPU: Intel(R) Xeon(R) Gold 5416S

Memory: 128GB

Disk: 2 x 1TB SSD

GPU: 2 x NVIDIA RTX 4090 24GB

OS: Oracle Linux 8.10

2. Install NVIDIA Driver and CUDA

2.1 Install kernel headers and dependencies

yum install -y kernel-uek-devel gcc make dkms elfutils-libelf-devel

After installation, ensure that the running kernel and the kernel-uek-devel package versions match; otherwise the NVIDIA kernel module will fail to build.
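One way to check, sketched as a small script (the kernel_matches helper and the rpm query format are illustrative and may need adjusting for your repositories):

```shell
# Hypothetical helper: report whether the running kernel matches the
# installed kernel-uek-devel version, so the NVIDIA module can build.
kernel_matches() {
  [ "$1" = "$2" ]
}

running="$(uname -r)"
# Newest installed kernel-uek-devel, formatted to look like `uname -r` output.
devel="$(rpm -q --qf '%{VERSION}-%{RELEASE}.%{ARCH}\n' kernel-uek-devel 2>/dev/null | tail -n 1)"

if kernel_matches "$running" "$devel"; then
  echo "OK: kernel and kernel-uek-devel match ($running)"
else
  echo "WARNING: running kernel ($running) != kernel-uek-devel ($devel)"
fi
```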

2.2 Download and install the driver

Driver download URL: https://www.nvidia.com/en-us/drivers/results/

Example driver file:

NVIDIA-Linux-x86_64-570.133.07.run
chmod +x NVIDIA-Linux-x86_64-570.133.07.run
./NVIDIA-Linux-x86_64-570.133.07.run

2.3 Download and install CUDA

CUDA 12.4 runfile (local) download URL: https://developer.nvidia.com/cuda-12-4-0-download-archive?target_os=Linux&target_arch=x86_64&Distribution=RHEL&target_version=8&target_type=runfile_local

chmod +x cuda_12.4.0_550.54.14_linux.run
./cuda_12.4.0_550.54.14_linux.run

In the installer menu, deselect the bundled driver (550.54.14), since the newer 570 driver from step 2.2 is already installed; install only the CUDA toolkit.
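The runfile installs to /usr/local/cuda-12.4 by default but does not touch the shell environment; a typical addition to ~/.bashrc (paths assume the default install prefix):

```shell
# Make CUDA 12.4 visible to the shell (default runfile install prefix).
export PATH=/usr/local/cuda-12.4/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda-12.4/lib64:$LD_LIBRARY_PATH
```

After `source ~/.bashrc`, `nvcc --version` should report CUDA 12.4.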

2.4 Verify driver installation

nvidia-smi

If the GPU information is displayed correctly, the driver is installed successfully.

3. Install Conda

3.1 Install Miniconda

wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh

After the interactive installation, activate Conda:

source ~/.bashrc

3.2 Create and activate a virtual environment

conda create -n xinference python=3.10 -y
conda activate xinference

4. Install Xinference and Related Dependencies

4.1 Install Xinference (vllm version)

pip install "xinference[vllm]"

4.2 Install flashinfer

pip install flashinfer -i https://flashinfer.ai/whl/cu124/torch2.4

4.3 Install sentence‑transformers

pip install sentence-transformers

5. Start Xinference Service

Ensure the Xinference environment is activated:

conda activate xinference

Start the service with the following command (adjust GPU IDs as needed):

nohup bash -c "HF_ENDPOINT=https://hf-mirror.com XINFERENCE_HOME=/data/xinferenceNew CUDA_VISIBLE_DEVICES=0,1 xinference-local --host 0.0.0.0 --port 9997" &

Command‑line parameters:

nohup ... &: run the command in the background so it keeps running after logout.

bash -c "...": execute the environment settings and the command as a single unit.

HF_ENDPOINT=https://hf-mirror.com: use a domestic Hugging Face mirror to speed up model downloads.

XINFERENCE_HOME=/data/xinferenceNew: set the Xinference working directory for model files and cache.

CUDA_VISIBLE_DEVICES=0,1: specify which GPUs to use (both GPUs in this case).

xinference-local: start Xinference in local mode.

--host 0.0.0.0: listen on all interfaces, allowing external access.

--port 9997: web service listening port.

6. Model Loading Examples

Access the web UI at http://10.0.110.20:9997/
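Before opening the UI, it can help to confirm the API is answering. A minimal sketch using the OpenAI-compatible model-list endpoint that Xinference exposes (host and port taken from this guide; adjust for your server):

```python
import json
import urllib.error
import urllib.request


def models_url(host: str, port: int) -> str:
    """Build the OpenAI-compatible model-list endpoint URL."""
    return f"http://{host}:{port}/v1/models"


def list_models(host: str, port: int, timeout: float = 5.0):
    """Return the decoded model list, or None if the server is unreachable."""
    try:
        with urllib.request.urlopen(models_url(host, port), timeout=timeout) as resp:
            return json.load(resp)
    except (urllib.error.URLError, OSError):
        return None


# Example (host/port from this guide's UI URL):
#   print(list_models("10.0.110.20", 9997))
```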

6.1 Start a reranker model
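Reranker models can be started from the web UI or from the CLI. A sketch, assuming the server from section 5 is running locally and the bge-reranker-v2-m3 model (substitute any rerank model listed in the UI):

```shell
# Launch a reranker model on the running Xinference server.
xinference launch \
  --endpoint http://127.0.0.1:9997 \
  --model-name bge-reranker-v2-m3 \
  --model-type rerank
```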

6.2 Start an embedding model
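Similarly for embeddings; bge-m3 is an example name here, and this pairs with the sentence-transformers dependency installed in section 4.3:

```shell
# Launch an embedding model on the running Xinference server.
xinference launch \
  --endpoint http://127.0.0.1:9997 \
  --model-name bge-m3 \
  --model-type embedding
```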

6.3 Start a language model

Key parameters for language model inference:

gpu_memory_utilization: fraction of each GPU's memory the engine may use (value between 0 and 1); 0.8 here limits each GPU to 80%.

max_num_seqs: maximum number of concurrent inference sequences (set to 30 in this example).

max_model_len: maximum token length of the model input (set to 40960 tokens in this example).
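Put together, a hedged CLI equivalent of the UI settings above (model name, size, and format are placeholders; with the vLLM engine, extra arguments such as the three parameters above are passed through to vLLM):

```shell
# Launch an LLM with the vLLM engine and the limits described above.
xinference launch \
  --endpoint http://127.0.0.1:9997 \
  --model-name qwen2.5-instruct \
  --model-engine vllm \
  --model-format pytorch \
  --size-in-billions 7 \
  --gpu_memory_utilization 0.8 \
  --max_num_seqs 30 \
  --max_model_len 40960
```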

7. Common Conda Commands

7.1 List all environments

conda env list

7.2 Create a new environment

conda create --name myenv python=3.10

7.3 Activate an environment

conda activate myenv

7.4 Deactivate the current environment

conda deactivate

7.5 Remove an environment

conda remove --name myenv --all

7.6 Install packages

conda install numpy pandas matplotlib

7.7 Update a package

conda update numpy
Written by Architect's Alchemy Furnace

A comprehensive platform that combines Java development and architecture design, guaranteeing 100% original content. We explore the essence and philosophy of architecture and provide professional technical articles for aspiring architects.