Deploying and Evaluating the Vicuna Open‑Source Large Language Model on a Single Machine
This article is a step-by-step guide to deploying the Vicuna open-source LLM on a single server, covering model preparation, environment setup, dependency installation, GPU and CUDA configuration, inference commands, performance evaluation, and an attempted fine-tuning run, along with practical observations and results.
Vicuna is a leading open‑source large language model that excels in semantic understanding, multilingual support, and inference quality. This guide walks through a complete single‑machine deployment, explores practical details, and evaluates inference performance.
Background – Previous experiments with Alpaca‑LoRA showed limited Chinese support and slow inference. Vicuna‑13B reportedly reaches over 90% of ChatGPT’s capabilities, outperforming LLaMA‑13B and Alpaca‑13B, with low training cost (~$300).
Environment Preparation
Download the base LLaMA‑7B model and Vicuna delta weights using Git LFS:
git lfs clone https://huggingface.co/decapoda-research/llama-7b-hf
git lfs clone https://huggingface.co/lmsys/vicuna-7b-delta-v1.1
Merge the delta into the base model:
python -m fastchat.model.apply_delta \
--base ./model/llama-7b-hf \
--delta ./model/vicuna-7b-delta-v1.1 \
--target ./model/vicuna-7b-all-v1.1
The merged model is about 13 GB.
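A quick sanity check on the merged output:
du -sh ./model/vicuna-7b-all-v1.1   # should report roughly 13 GB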
Dependency Installation
Install the required Python packages:
pip install fschat tensorboardX
Building flash-attn fails with older GCC versions; upgrade GCC to 13.1:
tar -xzf gcc-13.1.0.tar.gz
cd gcc-13.1.0
./contrib/download_prerequisites
mkdir build
cd build
../configure --enable-checking=release --enable-languages=c,c++ --disable-multilib
make -j8
make install
Update the symbolic links and libraries so the new compiler is used system-wide.
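The exact commands depend on where GCC was installed; a minimal sketch assuming the default /usr/local prefix and a lib64 layout:
ln -sf /usr/local/bin/gcc /usr/bin/gcc
ln -sf /usr/local/bin/g++ /usr/bin/g++
ln -sf /usr/local/lib64/libstdc++.so.6 /usr/lib64/libstdc++.so.6
Without the libstdc++ link, programs built with the new compiler may fail at runtime with GLIBCXX version errors.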
CUDA and cuDNN Installation
Download the CUDA 11.7 runfile for CentOS 7, install it, and add the following to .bash_profile:
export PATH=/usr/local/cuda-11.7/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda-11.7/lib64:$LD_LIBRARY_PATH
Then install the cuDNN and NCCL RPM packages, downloaded from NVIDIA after registering an account.
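A quick way to confirm the toolchain is wired up correctly (assuming PyTorch is already installed in the active environment):
source ~/.bash_profile
nvcc --version        # should report CUDA 11.7
nvidia-smi            # the driver's view of the GPUs
python -c "import torch; print(torch.cuda.is_available(), torch.version.cuda)"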
Model Inference
Run the model with FastChat:
python -m fastchat.serve.cli --model-path ./model/vicuna-7b-all-v1.1 --style rich
Optional flags: --load-8bit to reduce memory usage, --device cpu for CPU-only inference (slow), and --num-gpus 3 for multi-GPU inference.
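Beyond the terminal CLI, FastChat can also serve the model behind a browser UI; a sketch using its standard three-process layout (run each command in a separate terminal; module names as of FastChat 0.2):
python -m fastchat.serve.controller
python -m fastchat.serve.model_worker --model-path ./model/vicuna-7b-all-v1.1
python -m fastchat.serve.gradio_web_server
The web UI typically comes up on Gradio's default port 7860.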
Test cases include recipe recommendation, multilingual queries, code generation, math calculations, and casual conversation. GPU usage stays around 60% of memory (≈13 GB) with near‑full compute utilization.
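Utilization figures like these can be watched live during a chat session:
nvidia-smi --query-gpu=memory.used,utilization.gpu --format=csv -l 1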
Fine‑tuning Attempt
Attempted fine‑tuning with torchrun using a dummy dataset:
torchrun --nproc_per_node=3 --master_port=40001 ./FastChat/fastchat/train/train_mem.py \
--model_name_or_path ./model/llama-7b-hf \
--data_path dummy.json \
--bf16 False \
--output_dir ./model/vicuna-dummy \
--num_train_epochs 2 \
--per_device_train_batch_size 1 \
--gradient_accumulation_steps 8 \
--learning_rate 2e-5 \
--weight_decay 0. \
--warmup_ratio 0.03 \
--lr_scheduler_type cosine \
--fsdp "full_shard auto_wrap" \
--fsdp_transformer_layer_cls_to_wrap 'LlamaDecoderLayer' \
--model_max_length 2048 \
--gradient_checkpointing True \
--lazy_preprocess True
The run failed: the Tesla P40 (SM 61, i.e., compute capability 6.1) does not meet the SM 75+ requirement for Vicuna fine-tuning.
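A card's compute capability can be checked up front before launching a long training command:
python -c "import torch; print(torch.cuda.get_device_capability(0))"   # a P40 prints (6, 1)
The SM 75 threshold corresponds to Turing-generation GPUs; the failure here likely stems from the flash-attention kernels that the train_mem.py path relies on, which the Pascal-era P40 predates.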
Results and Observations
Inference quality is good for casual dialogue and multilingual support, but recipe recommendations can be nonsensical.
Code generation works for simple snippets but may need manual refinement.
Basic arithmetic is still weak.
GPU memory usage is modest, and single-GPU inference returns responses in roughly a second.
Future Work
Fine‑tune on newer GPUs (e.g., RTX 4090, A100) to unlock full potential.
Identify concrete application scenarios for the model.
Wrap Vicuna into a production‑ready service.
Continue exploring large‑model capabilities.
Overall, Vicuna‑7B provides strong inference performance and efficiency, making it a solid choice for open‑source LLM projects.
Recommended Reading
How to Perform Test Analysis and Design: the HTSM Heuristic Test Strategy Model
The Vanishing Deadlock: From a Full JSF Thread Pool to JVM Initialization Internals
Local Deployment Practice of the Alpaca-LoRA GPT Large Language Model
Analysis of a Thread-Blocking Problem Caused by MyBatis parameterType