Deploying and Evaluating the Vicuna Open‑Source Large Language Model on a Single Machine

This article provides a step‑by‑step guide to deploying the Vicuna open‑source LLM on a single server, covering model preparation, environment setup, dependency installation, GPU and CUDA configuration, inference, performance evaluation, and an attempted fine‑tuning run, along with practical observations and results.

Vicuna is a leading open‑source large language model that excels in semantic understanding, multilingual support, and inference quality. This guide walks through a complete single‑machine deployment, explores practical details, and evaluates inference performance.

Background

Previous experiments with Alpaca‑LoRA showed limited Chinese support and slow inference. Vicuna‑13B reportedly reaches over 90% of ChatGPT’s capability, outperforming LLaMA‑13B and Alpaca‑13B, with a low training cost (~$300).

Environment Preparation

Download the base LLaMA‑7B model and Vicuna delta weights using Git LFS:

git lfs clone https://huggingface.co/decapoda-research/llama-7b-hf
git lfs clone https://huggingface.co/lmsys/vicuna-7b-delta-v1.1

Merge the delta into the base model:

python -m fastchat.model.apply_delta \
    --base ./model/llama-7b-hf \
    --delta ./model/vicuna-7b-delta-v1.1 \
    --target ./model/vicuna-7b-all-v1.1

The merged model comes to ~13 GB on disk.
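A quick sanity check of the merge output:

du -sh ./model/vicuna-7b-all-v1.1   # expect roughly 13 GB of merged fp16 weights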

Dependency Installation

Install the required Python packages:

pip install fschat tensorboardX flash-attn

Building flash-attn fails with older GCC versions; upgrade GCC to 13.1 from source:

tar -xzf gcc-13.1.0.tar.gz
cd gcc-13.1.0
./contrib/download_prerequisites   # fetches GMP, MPFR, MPC, and ISL sources
mkdir build
cd build
../configure --enable-checking=release --enable-languages=c,c++ --disable-multilib
make -j8       # parallel build; this takes a while
make install

Update symbolic links and libraries so the new compiler is used system‑wide.
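The exact steps vary by distribution; a minimal sketch for CentOS 7, assuming GCC installed to the default /usr/local prefix (the libstdc++ version suffix may differ on your system):

# Point the system compiler at the new build
ln -sf /usr/local/bin/gcc /usr/bin/gcc
ln -sf /usr/local/bin/g++ /usr/bin/g++
# Make the matching libstdc++ runtime the system default
cp /usr/local/lib64/libstdc++.so.6.0.31 /usr/lib64/
ln -sf /usr/lib64/libstdc++.so.6.0.31 /usr/lib64/libstdc++.so.6
gcc --version   # should now report 13.1.0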

CUDA and cuDNN Installation

Download the CUDA 11.7 runfile for CentOS 7, install it, and add the following to .bash_profile:

export PATH=/usr/local/cuda-11.7/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda-11.7/lib64:$LD_LIBRARY_PATH
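For reference, a sketch of the runfile installation; the exact runfile name varies with the bundled driver version, so treat the filename below as a placeholder:

# Run the downloaded CUDA 11.7 runfile in toolkit-only, non-interactive mode
sh cuda_11.7.0_515.43.04_linux.run --toolkit --silent
# Reload the profile and confirm the toolkit is on PATH
source ~/.bash_profile
nvcc --version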

Install cuDNN and NCCL RPM packages from NVIDIA after registering an account.
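The packages install with a standard rpm transaction; the filenames below are placeholders for whichever cuDNN and NCCL builds match CUDA 11.7:

# Install cuDNN runtime and development packages (filenames are placeholders)
rpm -ivh libcudnn8-*.cuda11.7.x86_64.rpm libcudnn8-devel-*.cuda11.7.x86_64.rpm
# Install NCCL for multi-GPU communication
rpm -ivh libnccl-*+cuda11.7*.rpm libnccl-devel-*+cuda11.7*.rpm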

Model Inference

Run the model with FastChat:

python -m fastchat.serve.cli --model-path ./model/vicuna-7b-all-v1.1 --style rich

Optional flags: --load-8bit for reduced memory, --device cpu (slow), --num-gpus 3 for multi‑GPU.
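For example, 8‑bit loading roughly halves the weight memory at a small quality cost:

python -m fastchat.serve.cli --model-path ./model/vicuna-7b-all-v1.1 --load-8bit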

Test cases include recipe recommendation, multilingual queries, code generation, math calculations, and casual conversation. GPU memory usage stays around 60% (≈13 GB on a 24 GB card), with near‑full compute utilization.
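Utilization can be watched live while queries run:

watch -n 1 nvidia-smi   # memory should sit near 13 GB during inference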

Fine‑tuning Attempt

Attempted fine‑tuning with torchrun using a dummy dataset:

torchrun --nproc_per_node=3 --master_port=40001 ./FastChat/fastchat/train/train_mem.py \
    --model_name_or_path ./model/llama-7b-hf \
    --data_path dummy.json \
    --bf16 False \
    --output_dir ./model/vicuna-dummy \
    --num_train_epochs 2 \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 8 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type cosine \
    --fsdp "full_shard auto_wrap" \
    --fsdp_transformer_layer_cls_to_wrap 'LlamaDecoderLayer' \
    --model_max_length 2048 \
    --gradient_checkpointing True \
    --lazy_preprocess True

The run failed because the Tesla P40 (compute capability 6.1) does not meet the 7.5+ (Turing or newer) requirement of Vicuna’s memory‑efficient fine‑tuning path, which relies on flash-attn.
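Before launching a long training job, the card’s compute capability can be confirmed up front with PyTorch’s public API:

# Prints a tuple such as (6, 1) for a Tesla P40; flash-attn needs (7, 5) or newer
python -c "import torch; print(torch.cuda.get_device_capability(0))"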

Results and Observations

Inference quality is good for casual dialogue and multilingual support, but recipe recommendations can be nonsensical.

Code generation works for simple snippets but may need manual refinement.

Basic arithmetic is still weak.

GPU memory usage is modest; single‑GPU inference returns responses in roughly a second.

Future Work

Fine‑tune on newer GPUs (e.g., RTX 4090, A100) to unlock full potential.

Identify concrete application scenarios for the model.

Wrap Vicuna into a production‑ready service.

Continue exploring large‑model capabilities.

Overall, Vicuna‑7B provides strong inference performance and efficiency, making it a solid choice for open‑source LLM projects.

