Local Deployment, Inference, and Fine‑tuning of the Vicuna‑7B Large Language Model

This article details the step‑by‑step process of preparing the environment, merging weights, installing dependencies, running inference, evaluating Vicuna‑7B against other models, and attempting fine‑tuning, while highlighting performance results, encountered issues, and future work for large language model deployment.


The author begins by reviewing the unsatisfactory performance of a previously deployed Alpaca‑LoRA model on Chinese instructions and motivates the exploration of Vicuna‑7B, which reportedly reaches over 90% of ChatGPT’s capability while being inexpensive to train.

Environment preparation includes upgrading GCC to version 13.1, installing CUDA 11.7 via the runfile installer, and adding the cuDNN and NCCL RPM packages. The GCC upgrade follows the standard out-of-tree build sequence:

tar -xzf gcc-13.1.0.tar.gz
cd gcc-13.1.0
./contrib/download_prerequisites
mkdir build
cd build
../configure --enable-checking=release --enable-languages=c,c++ --disable-multilib
make
make install
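
A quick sanity check that the new toolchain is active (assuming the default /usr/local install prefix precedes the old compiler on PATH) confirms both the compiler and CUDA are picked up:

gcc --version
nvcc --version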

To obtain Vicuna, the original LLaMA‑7B model (26 GB) is cloned via

git lfs clone https://huggingface.co/decapoda-research/llama-7b-hf

and the Vicuna delta weights are fetched with

git lfs clone https://huggingface.co/lmsys/vicuna-7b-delta-v1.1

The two are merged using the FastChat utility:

python -m fastchat.model.apply_delta \
--base ./model/llama-7b-hf \
--delta ./model/vicuna-7b-delta-v1.1 \
--target ./model/vicuna-7b-all-v1.1

which results in a 13 GB combined checkpoint.
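
A size check on the target directory (the same path passed as --target above) verifies the merge produced the expected ~13 GB checkpoint:

du -sh ./model/vicuna-7b-all-v1.1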

Required Python dependencies (fschat, tensorboardX, flash-attn) are installed via pip install fschat tensorboardX, with flash-attn installed separately. Because flash-attn compiles its CUDA kernels at install time, that build fails under the original GCC, which is what prompted the compiler upgrade described earlier.
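
The flash-attn install itself is presumably the standard PyPI source build (an assumption; the source does not show the exact command), which needs nvcc and the upgraded GCC on PATH:

pip install flash-attn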

Model inference is performed with FastChat’s CLI:

python -m fastchat.serve.cli --model-path ./model/vicuna-7b-all-v1.1 --style rich

Alternative flags for 8‑bit loading, CPU execution, or multi‑GPU usage are also shown.
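
For reference, the corresponding fastchat.serve.cli invocations look like the following (flag names are FastChat's own, but worth re-checking against the installed version's --help):

python -m fastchat.serve.cli --model-path ./model/vicuna-7b-all-v1.1 --load-8bit
python -m fastchat.serve.cli --model-path ./model/vicuna-7b-all-v1.1 --device cpu
python -m fastchat.serve.cli --model-path ./model/vicuna-7b-all-v1.1 --num-gpus 2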

Evaluation compares Vicuna’s responses on recipe recommendation, multilingual queries, code generation, and math problems. Results indicate strong multilingual ability and decent coding assistance, but occasional hallucinations in recipes and inaccurate simple arithmetic.

The fine-tuning attempt uses a torchrun command across multiple GPUs, specifying the dataset, training hyper-parameters, and FSDP settings. The run fails because the Tesla P40 (compute capability 6.1, i.e. SM_61) falls short of the SM_75 (Turing) minimum required by the fine-tuning kernels.
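
For orientation, a sketch of what such an invocation typically looks like with FastChat's training script; the paths, GPU count, and hyper-parameter values here are illustrative assumptions, not the author's exact command:

torchrun --nproc_per_node=4 fastchat/train/train_mem.py \
    --model_name_or_path ./model/vicuna-7b-all-v1.1 \
    --data_path ./data/train.json \
    --bf16 True \
    --output_dir ./output \
    --num_train_epochs 3 \
    --per_device_train_batch_size 2 \
    --gradient_accumulation_steps 16 \
    --learning_rate 2e-5 \
    --model_max_length 2048 \
    --gradient_checkpointing True \
    --lazy_preprocess True \
    --fsdp "full_shard auto_wrap" \
    --fsdp_transformer_layer_cls_to_wrap LlamaDecoderLayer

Note that --bf16 assumes Ampere-class hardware, which is another reason the run cannot proceed as-is on a P40.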

In conclusion, Vicuna‑7B delivers superior inference speed (≈1 s per query on a single GPU) and better multilingual performance than Alpaca, making it a solid open‑source LLM candidate. Future work includes fine‑tuning on newer GPUs, integrating the model into specific applications, and building a production‑ready service.
