Step‑by‑Step Guide to Efficient LLM Fine‑Tuning with LoRA, QLoRA, and Llama‑Factory

This tutorial explains the concepts, methods, and practical commands for fine‑tuning large language models using efficient techniques like LoRA and QLoRA, covering model selection, resource considerations, Docker deployment, dataset preparation, training configuration, evaluation metrics, model merging, and deployment with GGUF and Ollama.


Fine‑tuning adapts a pre‑trained large language model (LLM) to specific tasks or domains by further training on targeted data, improving performance while requiring far less data and compute than training from scratch.

Why Fine‑Tune and Common Use Cases

Change dialogue style: Adjust tone, politeness, or response style for chatbots, virtual assistants, etc.

Inject private knowledge: Add domain-specific terminology and rules for fields such as law, medicine, or IT.

Boost reasoning ability: Enhance long-text comprehension and logical inference for complex QA.

Support agents: Teach the model function-calling strategies for API integration.

Most open-source models are released as a Base version (pre-trained only) and an Instruction-tuned version (the base model after full instruction fine-tuning).

Fine‑Tuning Techniques

Efficient fine‑tuning (LoRA / QLoRA) : Update only a small set of low‑rank adapter parameters, drastically reducing GPU memory and compute.

Full fine‑tuning : Update all model weights; suitable for high‑capacity GPUs and when extensive model changes are needed.

Efficient methods typically need on the order of 100–1,000 prompt-response pairs, while full fine-tuning generally requires more than 1,000.

Hardware Considerations

Fine-tuning is GPU-intensive; even a 7B model may need ~100 GB of VRAM for full fine-tuning. LoRA/QLoRA reduce memory demand, allowing training on consumer GPUs (e.g., 32 GB) for models up to ~30B parameters.
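As a back-of-the-envelope check on that ~100 GB figure: full fine-tuning with AdamW must hold weights, gradients, and two optimizer moments for every parameter. A minimal sketch, assuming bf16 weights and gradients with fp32 optimizer states:

# Rough VRAM estimate for full fine-tuning a 7B model with AdamW.
params = 7e9
weight_bytes = 2   # bf16 weights
grad_bytes = 2     # bf16 gradients
optim_bytes = 8    # AdamW keeps two fp32 moments (4 bytes each)
total_gb = params * (weight_bytes + grad_bytes + optim_bytes) / 1024**3
print(f"~{total_gb:.0f} GB before activations")  # ~78 GB; activations push this toward ~100 GB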

Performance Benchmarks (NVIDIA DGX Spark)

Llama 3.2 3B Full‑FT: 82,739 tokens/s

Llama 3.1 8B LoRA: 53,658 tokens/s

Llama 3.3 70B QLoRA: 5,079 tokens/s

LoRA Overview

LoRA inserts low‑rank adapter layers into selected model layers, keeping the original weights frozen. Only the adapters are trained, cutting memory usage and enabling fine‑tuning on limited hardware.

Advantages: Memory-efficient, faster training, easy integration with existing models, applicable to generation, classification, QA, etc.
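To make the mechanism concrete, here is a minimal illustrative PyTorch sketch (the class name and initialization choices are ours, not Llama-Factory's). The frozen weight is augmented with a trainable low-rank product scaled by alpha/rank, mirroring the lora_rank and lora_alpha arguments used in the training example below.

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base linear layer plus a trainable low-rank update."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False            # original weights stay frozen
        self.lora_a = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scaling = alpha / rank            # lora_alpha / lora_rank

    def forward(self, x):
        # y = W x + (alpha/r) * B A x  -- only A and B receive gradients
        return self.base(x) + (x @ self.lora_a.T @ self.lora_b.T) * self.scaling

Because B starts at zero, the wrapped layer initially behaves exactly like the original model; training only moves the small A and B matrices.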

QLoRA Overview

QLoRA extends LoRA by quantizing the frozen base-model weights to 4-bit (typically NF4) while keeping the LoRA adapters themselves in 16-bit precision. This further lowers memory requirements, making fine-tuning of larger models feasible on very modest hardware.

Advantages: Works with very limited VRAM, supports much larger models on a single GPU, and pairs naturally with quantized deployment formats.
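In practice QLoRA is usually configured through bitsandbytes and PEFT rather than by hand. A minimal sketch, assuming those libraries are installed (the target_modules choice is illustrative):

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# Quantize the frozen base model to 4-bit NF4 -- the core of QLoRA.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-1.7B-Base", quantization_config=bnb_config
)
# Attach 16-bit LoRA adapters on top of the quantized weights.
model = get_peft_model(
    model,
    LoraConfig(r=8, lora_alpha=16, target_modules=["q_proj", "v_proj"],
               task_type="CAUSAL_LM"),
)
model.print_trainable_parameters()  # typically well under 1% of all parameters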

Llama‑Factory Platform

Llama‑Factory, built on the transformers library, provides an all‑in‑one interface for pre‑training, instruction fine‑tuning, reward‑model training, PPO, DPO, KTO, ORPO, and supports Accelerate or DeepSpeed back‑ends.

Installation & Docker Deployment

git clone https://github.com/hiyouga/LlamaFactory.git
cd LlamaFactory/docker/docker-cuda/
# Build image
docker build -f Dockerfile \
  --build-arg PIP_INDEX=https://pypi.org/simple \
  --build-arg EXTRAS=metrics \
  -t llamafactory:latest .
# Run container
docker run -dit --ipc=host --gpus=all \
  -p 7860:7860 -p 8000:8000 \
  --name llamafactory llamafactory:latest
docker exec -it llamafactory bash

Dataset Preparation

Use ModelScope or other sources to obtain an Alpaca‑ or ShareGPT‑format dataset (e.g., huanhuan‑chat). Place the JSON file in data/ and add an entry to dataset_info.json so Llama‑Factory can recognize it.
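For reference, an Alpaca-format record is a JSON object with instruction/input/output fields (the record below is illustrative):

{
  "instruction": "Who is your father?",
  "input": "",
  "output": "My father is Zhen Yuandao, Shaoqing of the Dali Temple."
}

The matching entry added alongside the existing ones in data/dataset_info.json can be as small as a file name when the default Alpaca column names are used:

{
  "huanhuan": {
    "file_name": "huanhuan.json"
  }
}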

Training Example (Qwen 3‑1.7B‑Base + LoRA)

llamafactory-cli train \
  --stage sft \
  --do_train True \
  --model_name_or_path Qwen/Qwen3-1.7B-Base \
  --finetuning_type lora \
  --template qwen3 \
  --dataset_dir data \
  --dataset huanhuan \
  --cutoff_len 1024 \
  --learning_rate 5e-5 \
  --num_train_epochs 4.0 \
  --per_device_train_batch_size 2 \
  --gradient_accumulation_steps 4 \
  --lr_scheduler_type cosine \
  --lora_rank 8 \
  --lora_alpha 256 \
  --output_dir saves/Qwen3-1.7B-Base/lora/train_2026-01-02-... \
  --bf16 True \
  --trust_remote_code True

Key arguments explained:

stage sft: Supervised fine-tuning.

do_train True: Enable training mode.

cutoff_len: Truncate inputs to 1024 tokens (saves memory).

lora_rank and lora_alpha: Control adapter capacity and scaling.

bf16 True: Use bfloat16 precision for faster computation.

Evaluation

After training, run batch inference with metrics such as BLEU‑4 and ROUGE‑1/2/L to assess quality, and record performance metrics (preparation time, runtime, samples‑per‑second, steps‑per‑second).

llamafactory-cli train \
  ...
  --do_predict True \
  --predict_with_generate True \
  --eval_dataset huanhuan \
  --max_new_tokens 512 \
  --temperature 0.95 \
  --top_p 0.7 \
  --output_dir saves/Qwen3-1.7B-Base/lora/eval_2026-01-02-...

Typical results: BLEU‑4 ≈ 0.85, ROUGE‑1 ≈ 10.4, ROUGE‑2 ≈ 1.7, ROUGE‑L ≈ 4.1, preparation time 0.002 s, inference throughput ~0.24 samples/s.

Model Merging & Export

Merge LoRA adapters into the base model to produce a standalone checkpoint:

llamafactory-cli export \
  --model_name_or_path Qwen/Qwen3-1.7B-Base \
  --adapter_name_or_path saves/.../lora/train_... \
  --template qwen3 \
  --finetuning_type lora \
  --export_dir output/Qwen3-1.7B-huanhuan

The exported directory contains model.safetensors, tokenizer files, and a Modelfile for Ollama.
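As a quick sanity check, the merged checkpoint loads like any ordinary Hugging Face model (an illustrative snippet, assuming transformers is installed and using the export path above):

from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("output/Qwen3-1.7B-huanhuan")
model = AutoModelForCausalLM.from_pretrained(
    "output/Qwen3-1.7B-huanhuan", torch_dtype="auto"
)
inputs = tok("Hello, who are you?", return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=64)
print(tok.decode(out[0], skip_special_tokens=True))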

Deployment with GGUF & Ollama

# Convert to GGUF
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp/gguf-py
pip install --editable .
cd ..
python convert_hf_to_gguf.py /workspace/LlamaFactory/output/Qwen3-1.7B-huanhuan/
# Install Ollama and serve model
curl -fsSL https://ollama.com/install.sh | sh
ollama serve &   # start the Ollama server in the background (or use a separate terminal)
ollama create qwen3-huanhuan -f /workspace/LlamaFactory/output/Qwen3-1.7B-huanhuan/Modelfile
ollama run qwen3-huanhuan
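Once ollama run works interactively, the model is also reachable through Ollama's local REST API (default port 11434). An illustrative Python client:

import requests

# Non-streaming generation request against the locally served model.
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "qwen3-huanhuan", "prompt": "Hello!", "stream": False},
)
print(resp.json()["response"])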

Key Takeaways

Efficient fine‑tuning (LoRA/QLoRA) enables adaptation of large models on modest hardware.

Llama‑Factory streamlines dataset handling, training, evaluation, and export.

Quantization (GGUF) and lightweight serving (Ollama) make the resulting model practical for personal or edge deployment.
