Unlocking LLM Fine‑Tuning: From Architecture to LoRA, DPO and Deployment

This article provides a comprehensive guide to large language model fine‑tuning, covering model architecture, parameter and memory calculations, prompt engineering, data construction, LoRA and PEFT techniques, reinforcement learning methods such as DPO, and practical deployment workflows on internal platforms.

Alibaba Cloud Developer

1. Fine‑Tuning Knowledge Overview

1.1 Understanding Large Models

Before diving into LLM fine‑tuning, we first visualize what a large model looks like, including its Transformer‑based structure, parameter count, precision, and GPU memory consumption.

1.1.1 Model Architecture

The seminal paper Attention Is All You Need introduced the Transformer, which dispenses with recurrence and convolutions in favor of multi‑head self‑attention. Each encoder layer consists of a multi‑head self‑attention block and a position‑wise feed‑forward network, each followed by an Add & Norm step; each decoder layer adds a second multi‑head attention block that attends over the encoder output. LLMs typically stack N such layers.

(Figure: the Transformer encoder–decoder structure. LLMs typically use a decoder‑only stack of N layers.)
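For intuition, here is a minimal sketch of one such layer in PyTorch (post‑norm, as in the original paper; real LLMs differ in normalization placement, activation, and positional encoding):

import torch.nn as nn

class DecoderLayer(nn.Module):
    def __init__(self, h, n_heads):
        super().__init__()
        self.attn = nn.MultiheadAttention(h, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(h)
        self.mlp = nn.Sequential(nn.Linear(h, 4 * h), nn.GELU(), nn.Linear(4 * h, h))
        self.norm2 = nn.LayerNorm(h)

    def forward(self, x, causal_mask=None):
        # multi-head self-attention followed by Add & Norm
        a, _ = self.attn(x, x, x, attn_mask=causal_mask)
        x = self.norm1(x + a)
        # position-wise MLP followed by Add & Norm
        return self.norm2(x + self.mlp(x))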

1.1.2 Model Parameters

For example, the LLaMA series offers 7B, 13B, 33B and 65B variants (B = billion parameters). For an L‑layer Transformer with hidden size h and vocabulary size V, the total parameter count is L(12h² + 13h) + Vh, which is well approximated by 12Lh² when h is large. The breakdown:

A Transformer consists of L identical layers, each split into a self‑attention block and an MLP block.
The self‑attention block has 4h² + 4h parameters (four h×h projections Wq, Wk, Wv, Wo, plus biases).
The MLP block has 8h² + 5h parameters (an h→4h up‑projection and a 4h→h down‑projection, plus biases).
Each layer normalization has 2h trainable parameters (a scale and a bias vector of size h); with two layer norms per layer, this adds 4h.
Per‑layer total: 12h² + 13h.
Token embeddings add Vh parameters; trainable positional encodings add N·h, where N is the maximum sequence length.
Total: L(12h² + 13h) + Vh ≈ 12Lh².
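As a quick sanity check of these formulas, a short Python sketch with LLaMA‑7B‑like dimensions (the real model uses a gated MLP and no biases, so exact counts differ slightly):

def transformer_params(L, h, V):
    # per layer: self-attention (4h^2 + 4h) + MLP (8h^2 + 5h) + layer norms (4h)
    return L * (12 * h**2 + 13 * h) + V * h

print(transformer_params(L=32, h=4096, V=32000))  # ~6.6e9, near LLaMA-7B's 6.7B
print(12 * 32 * 4096**2)                          # ~6.4e9, the 12Lh^2 approximation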

1.1.3 Model Memory

GPU memory for holding the weights is estimated by multiplying the parameter count by the bytes per value (4 bytes for fp32, 2 bytes for fp16). For LLaMA‑70B at fp16, the weights alone require roughly 140 GB; the KV cache and activations add more on top.
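The same rule of thumb as a one‑liner (weights only; KV cache, activations, and any optimizer states come on top):

def weight_memory_gb(n_params, bytes_per_param):
    # parameter count x bytes per value, in GB
    return n_params * bytes_per_param / 1e9

print(weight_memory_gb(70e9, 2))  # LLaMA-70B in fp16: ~140 GB
print(weight_memory_gb(7e9, 2))   # LLaMA-7B in fp16: ~14 GB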

1.1.4 Model Storage

A 7B LLaMA model stored in fp16 occupies about 13.5 GB on disk (slightly more than 2 bytes per parameter, since layer‑norm values are kept in fp32).

1.2 Model Fine‑Tuning

1.2.1 Prompt Engineering

Prompt engineering (structured, specific, clear prompts) is a low‑cost way to improve performance before resorting to full fine‑tuning.

OpenAI collects best practices in its Prompt Engineering Guide.
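A made‑up illustration of what "structured, specific, clear" can look like in practice (not taken from the guide):

You are a customer-service assistant for an e-commerce site.
Task: classify the user message below into exactly one category: refund, shipping, product, other.
Output format: the category name only, in lowercase.
User message: "Where is my package? It was due last week."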

1.2.2 Data Construction

High‑quality data sets the ceiling for model performance. Techniques such as self‑instruct (bootstrapping new instructions from a small seed set with an LLM) and ROUGE‑L similarity filtering are used to create diverse instruction data.
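A minimal sketch of the ROUGE‑L diversity filter used in self‑instruct‑style pipelines, assuming the rouge_score package (the 0.7 threshold follows the Self‑Instruct paper; a candidate instruction is kept only if it is not too similar to anything already in the pool):

from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=False)

def keep_instruction(candidate, pool, threshold=0.7):
    # discard candidates whose ROUGE-L F1 against any pooled instruction is too high
    return all(
        scorer.score(existing, candidate)["rougeL"].fmeasure <= threshold
        for existing in pool
    )

pool = ["Write a poem about autumn."]
print(keep_instruction("Summarize the following article in two sentences.", pool))  # True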

1.2.3 LoRA Fine‑Tuning

LoRA (Low‑Rank Adaptation) injects trainable low‑rank matrices into the attention weights (Wq, Wk, Wv, Wo) and MLP layers, reducing trainable parameters dramatically while achieving near‑full‑parameter fine‑tuning performance.

LoRA modifies the weight update as:

W' = W + (α/r)·BA, where B ∈ ℝ^{d×r}, A ∈ ℝ^{r×d}, r ≪ d, and α is a constant scaling factor
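A minimal PyTorch sketch of the idea (not the peft library's implementation): the frozen base weight W is wrapped with a trainable low‑rank update scaled by α/r, matching the lora_rank and lora_alpha options used in the training script later in this article:

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r=64, alpha=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                          # freeze W
        d_out, d_in = base.weight.shape
        self.A = nn.Parameter(torch.randn(r, d_in) * 0.01)   # small random init
        self.B = nn.Parameter(torch.zeros(d_out, r))         # zero init, so W' = W at start
        self.scale = alpha / r

    def forward(self, x):
        # W'x = Wx + (alpha/r) * B(Ax)
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)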

1.3 Reinforcement Learning (RLHF / DPO)

After SFT, reinforcement learning from human feedback (e.g., PPO) can further align model behavior. Direct Preference Optimization (DPO) simplifies this: it drops the separate reward model and PPO loop, training the policy directly on preference pairs against a frozen reference model.

Objective: maximize the implicit reward margin between the chosen (pos) and rejected (neg) responses while keeping the reference model fixed:

L_DPO = −E[ log σ( β·( log(π_θ(y₊|x)/π_ref(y₊|x)) − log(π_θ(y₋|x)/π_ref(y₋|x)) ) ) ]
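A sketch of this loss in PyTorch, given the summed log‑probabilities of the chosen and rejected responses under the policy and the frozen reference model (β is the usual DPO temperature):

import torch.nn.functional as F

def dpo_loss(policy_pos_logp, policy_neg_logp, ref_pos_logp, ref_neg_logp, beta=0.1):
    # implicit rewards are the policy/reference log-ratios
    pos_reward = policy_pos_logp - ref_pos_logp
    neg_reward = policy_neg_logp - ref_neg_logp
    # logistic loss on the reward margin: minimized when pos_reward >> neg_reward
    return -F.logsigmoid(beta * (pos_reward - neg_reward)).mean()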

2. Practical Fine‑Tuning Workflow & Tools

Internal platforms:

星云 (Nebula) – internal model‑training infrastructure

TuningFactory – LLaMA‑Factory‑based fine‑tuning framework on 星云

Whale – model deployment service

idealab – unified API gateway for open‑source and closed‑source LLMs

External tools:

LLaMA‑Factory – open‑source fine‑tuning framework (GitHub)

2.1 Data Construction with idealab

idealab supports ~50 models (Azure OpenAI, DALL·E, Alibaba Tongyi Qianwen, Google Vertex AI, etc.) via HTTP JSON calls:

-H "Content-Type:application/json"
-H 'X-AK: xxxx'
-d '{"model":"gpt-3.5-turbo","prompt":"你是谁"}'

2.2 Training Platform Selection & Job Submission

Small tasks can be fine‑tuned locally with LLaMA‑Factory; larger jobs run through TuningFactory on 星云. An example DPO job configuration:

# excerpt of a DPO job; MODEL_NAME and INPUT are defined elsewhere in the job config
WORLD_SIZE=8    # number of data-parallel workers
LR=1e-5
# LoRA adapter from the earlier SFT stage, consumed by a later step of this job
LORA_CKPT="digital_live_chat.sft_model_whale/version=v20.26/ckpt_id=checkpoint-210"
# --ranking marks pairwise preference data; --system/--prompt/--chosen/--rejected map dataset columns
args="--stage dpo \
    --model_name_or_path=$MODEL_NAME \
    --do_train \
    --do_eval \
    --val_size 0.05 \
    --file_name=${INPUT} \
    --ranking \
    --system=system \
    --prompt=input \
    --chosen=pos \
    --rejected=neg \
    --finetuning_type lora \
    --lora_rank=64 \
    --lora_alpha=16 \
    --output_dir=local/tmp/ckpt_save_path/ \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 4 \
    --num_train_epochs 20 \
    --learning_rate=$LR \
    --bf16"

2.3 Model Deployment on Whale

Deploy the selected checkpoint via Whale UI or SDK. Example SDK call:

from whale import TextGeneration, VipServerLocator
from whale.util import Timeout

# msgs follows the usual chat-message format; the content here is illustrative
msgs = [{"role": "user", "content": "Hello"}]
extend_fields = {}  # platform-specific extras, if any

response = TextGeneration.chat(
    model="Qwen-72B-Chat-Pro",
    messages=msgs,
    stream=True,              # stream tokens back incrementally
    temperature=1.0,
    max_tokens=2000,
    timeout=Timeout(60, 20),  # request timeout settings
    top_p=0.8,
    extend_fields=extend_fields)

2.4 Inference Acceleration

Load the model in fp16 or int8 to reduce memory (see the sketch below).

Cache long, frequently reused system prompts so they are not reprocessed on every request.

Apply speculative decoding with a small distilled draft model.

Benchmarks on Qwen2.5‑7B show these techniques significantly reduce latency.
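For the first point, a sketch using Hugging Face transformers (the model name is illustrative; int8 loading additionally requires the bitsandbytes package):

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# fp16: half the memory of fp32
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-7B-Instruct", torch_dtype=torch.float16, device_map="auto")

# int8: roughly half again
model_int8 = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-7B-Instruct",
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto")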

2.5 Evaluation & Iteration

Use a realistic test set and fine‑grained metrics (human rating, automated scoring). Analyze good and bad cases, improve data quality, explore stronger base models, or apply RL techniques such as DPO.
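As a minimal illustration of fine‑grained scoring (the rating format is hypothetical, not a platform API), per‑category win rates against a baseline can be computed from side‑by‑side human judgments:

from collections import defaultdict

# each record: (category, verdict) where verdict is "win", "tie", or "loss" vs. the baseline
ratings = [("refund", "win"), ("refund", "loss"), ("shipping", "win"), ("shipping", "tie")]

wins, totals = defaultdict(int), defaultdict(int)
for category, verdict in ratings:
    totals[category] += 1
    wins[category] += verdict == "win"

for category, n in totals.items():
    print(f"{category}: win rate {wins[category] / n:.0%}")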

References

Attention Is All You Need, Vaswani et al., 2017 – arXiv:1706.03762

Training language models to follow instructions with human feedback (InstructGPT), Ouyang et al., 2022 – arXiv:2203.02155

Self‑Instruct: Aligning Language Models with Self‑Generated Instructions, Wang et al., 2022 – arXiv:2212.10560

LoRA: Low‑Rank Adaptation of Large Language Models, Hu et al., 2021 – arXiv:2106.09685
