Deploying and Evaluating the Vicuna Open‑Source Large Language Model on a Single Machine

This article provides a step‑by‑step guide to deploying the Vicuna open‑source LLM on a single server, covering model preparation, environment setup, dependency installation, GPU and CUDA configuration, inference, performance evaluation, and an attempted fine‑tuning run, along with practical observations and results.

Vicuna is a leading open‑source large language model that excels in semantic understanding, multilingual support, and inference quality. This guide walks through a complete single‑machine deployment, explores practical details, and evaluates inference performance.

Background

Previous experiments with Alpaca‑LoRA showed limited Chinese support and slow inference. Vicuna‑13B reportedly reaches over 90% of ChatGPT’s capability, outperforming LLaMA‑13B and Alpaca‑13B, with a low training cost (~$300).

Environment Preparation

Download the base LLaMA‑7B model and Vicuna delta weights using Git LFS:

git lfs clone https://huggingface.co/decapoda-research/llama-7b-hf
git lfs clone https://huggingface.co/lmsys/vicuna-7b-delta-v1.1

Merge the delta into the base model:

python -m fastchat.model.apply_delta \
    --base ./model/llama-7b-hf \
    --delta ./model/vicuna-7b-delta-v1.1 \
    --target ./model/vicuna-7b-all-v1.1

The merged model comes to ~13 GB on disk.
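A quick sanity check of the merge output:

du -sh ./model/vicuna-7b-all-v1.1   # expect roughly 13 GB of merged fp16 weights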

Dependency Installation

Install the required Python packages:

pip install fschat tensorboardX flash-attn

Building flash-attn fails with older GCC versions; upgrade GCC to 13.1 from source:

tar -xzf gcc-13.1.0.tar.gz
cd gcc-13.1.0
./contrib/download_prerequisites   # fetches GMP, MPFR, MPC, and ISL sources
mkdir build
cd build
../configure --enable-checking=release --enable-languages=c,c++ --disable-multilib
make -j8       # parallel build; this takes a while
make install

Update symbolic links and libraries so the new compiler is used system‑wide.
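The exact steps vary by distribution; a minimal sketch for CentOS 7, assuming GCC installed to the default /usr/local prefix (the libstdc++ version suffix may differ on your system):

# Point the system compiler at the new build
ln -sf /usr/local/bin/gcc /usr/bin/gcc
ln -sf /usr/local/bin/g++ /usr/bin/g++
# Make the matching libstdc++ runtime the system default
cp /usr/local/lib64/libstdc++.so.6.0.31 /usr/lib64/
ln -sf /usr/lib64/libstdc++.so.6.0.31 /usr/lib64/libstdc++.so.6
gcc --version   # should now report 13.1.0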

CUDA and cuDNN Installation

Download the CUDA 11.7 runfile for CentOS 7, install it, and add the following to .bash_profile:

export PATH=/usr/local/cuda-11.7/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda-11.7/lib64:$LD_LIBRARY_PATH
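For reference, a sketch of the runfile installation; the exact runfile name varies with the bundled driver version, so treat the filename below as a placeholder:

# Run the downloaded CUDA 11.7 runfile in toolkit-only, non-interactive mode
sh cuda_11.7.0_515.43.04_linux.run --toolkit --silent
# Reload the profile and confirm the toolkit is on PATH
source ~/.bash_profile
nvcc --version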

Install cuDNN and NCCL RPM packages from NVIDIA after registering an account.
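The packages install with a standard rpm transaction; the filenames below are placeholders for whichever cuDNN and NCCL builds match CUDA 11.7:

# Install cuDNN runtime and development packages (filenames are placeholders)
rpm -ivh libcudnn8-*.cuda11.7.x86_64.rpm libcudnn8-devel-*.cuda11.7.x86_64.rpm
# Install NCCL for multi-GPU communication
rpm -ivh libnccl-*+cuda11.7*.rpm libnccl-devel-*+cuda11.7*.rpm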

Model Inference

Run the model with FastChat:

python -m fastchat.serve.cli --model-path ./model/vicuna-7b-all-v1.1 --style rich

Optional flags: --load-8bit for reduced memory, --device cpu (slow), --num-gpus 3 for multi‑GPU.
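For example, 8‑bit loading roughly halves the weight memory at a small quality cost:

python -m fastchat.serve.cli --model-path ./model/vicuna-7b-all-v1.1 --load-8bit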

Test cases include recipe recommendation, multilingual queries, code generation, math calculations, and casual conversation. GPU memory usage stays around 60% (≈13 GB on a 24 GB card), with near‑full compute utilization.
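Utilization can be watched live while queries run:

watch -n 1 nvidia-smi   # memory should sit near 13 GB during inference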

Fine‑tuning Attempt

Attempted fine‑tuning with torchrun using a dummy dataset:

torchrun --nproc_per_node=3 --master_port=40001 ./FastChat/fastchat/train/train_mem.py \
    --model_name_or_path ./model/llama-7b-hf \
    --data_path dummy.json \
    --bf16 False \
    --output_dir ./model/vicuna-dummy \
    --num_train_epochs 2 \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 8 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type cosine \
    --fsdp "full_shard auto_wrap" \
    --fsdp_transformer_layer_cls_to_wrap 'LlamaDecoderLayer' \
    --model_max_length 2048 \
    --gradient_checkpointing True \
    --lazy_preprocess True

The run failed because the Tesla P40 (compute capability 6.1) does not meet the 7.5+ (Turing or newer) requirement of Vicuna’s memory‑efficient fine‑tuning path, which relies on flash-attn.
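Before launching a long training job, the card’s compute capability can be confirmed up front with PyTorch’s public API:

# Prints a tuple such as (6, 1) for a Tesla P40; flash-attn needs (7, 5) or newer
python -c "import torch; print(torch.cuda.get_device_capability(0))"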

Results and Observations

Inference quality is good for casual dialogue and multilingual support, but recipe recommendations can be nonsensical.

Code generation works for simple snippets but may need manual refinement.

Basic arithmetic is still weak.

GPU memory usage is modest; single‑GPU inference returns responses in roughly a second.

Future Work

Fine‑tune on newer GPUs (e.g., RTX 4090, A100) to unlock full potential.

Identify concrete application scenarios for the model.

Wrap Vicuna into a production‑ready service.

Continue exploring large‑model capabilities.

Overall, Vicuna‑7B provides strong inference performance and efficiency, making it a solid choice for open‑source LLM projects.

