How PEFT Transforms Large Model Fine‑Tuning: Additive, Prompt & LoRA Methods Explained

This article introduces parameter-efficient fine-tuning (PEFT) techniques, including additive adapters, soft-prompt methods, selection-based BitFit, and re-parameterization approaches such as LoRA and AdaLoRA. It explains their architectures and reported results, and provides end-to-end code for fine-tuning ChatGLM2-6B on a Chinese medical QA dataset.


Parameter‑Efficient Fine‑Tuning (PEFT) Overview

PEFT fine‑tunes a pre‑trained model by freezing most of its parameters and updating only a small subset, achieving performance close to full fine‑tuning with far fewer trainable weights.

1. Additive Methods

1.1 Adapter Tuning

Adapter modules are inserted after each sub‑layer of every Transformer block. During fine‑tuning, only the adapter layers are trained while the original model parameters remain fixed.

Adapter characteristics:

The adapter consists of two feed‑forward sub‑layers.

The first sub-layer projects the original dimension d down to a smaller dimension m and applies a non-linearity; the second sub-layer projects back up to d.

Total parameters per adapter: 2md + d + m (weights plus biases). Setting m ≪ d keeps the number of added parameters small.

When the projection layers are initialized near zero, the skip connection makes the adapter behave approximately like an identity function at the start of training, which keeps fine-tuning stable.
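
To make the structure concrete, here is a minimal PyTorch sketch of such a bottleneck adapter. The class name AdapterLayer, the GELU non-linearity, and the example dimensions are illustrative choices, not taken from the original paper:

import torch
import torch.nn as nn

class AdapterLayer(nn.Module):
    """Bottleneck adapter: project d -> m, non-linearity, project m -> d, residual connection."""
    def __init__(self, d_model: int, bottleneck_dim: int):
        super().__init__()
        self.down_proj = nn.Linear(d_model, bottleneck_dim)   # d -> m
        self.up_proj = nn.Linear(bottleneck_dim, d_model)     # m -> d
        self.act = nn.GELU()
        # Near-zero initialization so the adapter starts close to an identity mapping.
        nn.init.normal_(self.down_proj.weight, std=1e-3)
        nn.init.zeros_(self.down_proj.bias)
        nn.init.zeros_(self.up_proj.weight)
        nn.init.zeros_(self.up_proj.bias)

    def forward(self, hidden_states):
        # Skip connection: output = input + adapter(input)
        return hidden_states + self.up_proj(self.act(self.down_proj(hidden_states)))

# Parameter count matches 2md + d + m for d = 768, m = 64:
adapter = AdapterLayer(d_model=768, bottleneck_dim=64)
print(sum(p.numel() for p in adapter.parameters()))  # 2*64*768 + 768 + 64 = 99136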

Adapter experimental results: Using BERT as the base model, adapter fine‑tuning achieves comparable performance to full fine‑tuning while using only 0.5‑5% of the original parameters.

1.2 Soft Prompts

Soft‑prompt methods replace discrete hard prompts with continuous prompt embeddings that are learned via back‑propagation.

Prefix Tuning

Only a continuous prefix is optimized, and it is added at every Transformer layer rather than only at the input.

Prefix Tuning characteristics:

Pre‑trained model parameters are frozen; a task‑specific continuous prefix is stored for each task, saving space.

During training the prefix is reparameterized through an additional MLP for stability; after training, the MLP can be discarded and only the prefix vectors are kept.

The prefix is constructed differently for different model architectures (e.g., autoregressive versus encoder-decoder models).
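
For illustration, prefix tuning can be applied with the Hugging Face peft library roughly as follows. This is a sketch; the GPT-2 checkpoint and the prefix length of 20 virtual tokens are arbitrary choices:

from transformers import AutoModelForCausalLM
from peft import PrefixTuningConfig, TaskType, get_peft_model

# Prefix tuning: trainable prefix key/value vectors are injected into every layer.
config = PrefixTuningConfig(
    task_type=TaskType.CAUSAL_LM,
    num_virtual_tokens=20,        # length of the continuous prefix
    prefix_projection=True,       # reparameterize the prefix through an MLP during training
)
model = AutoModelForCausalLM.from_pretrained("gpt2-medium")
model = get_peft_model(model, config)
model.print_trainable_parameters()  # only the prefix (and its projection MLP) is trainable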

Prefix Tuning experimental results: On table‑to‑text tasks, Prefix Tuning outperforms full fine‑tuning and Adapter‑Tuning with GPT‑2‑Medium and GPT‑2‑Large; on summarisation tasks, it lags behind full fine‑tuning.

P‑Tuning

P‑Tuning inserts learnable virtual tokens into the input embedding layer.

P‑Tuning characteristics:

Only the input layer receives trainable virtual tokens, which are automatically inserted into the token sequence.

The virtual tokens need not be a prefix; their insertion position is flexible.
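
In peft, P-Tuning corresponds to PromptEncoderConfig, which learns the virtual-token embeddings through a small prompt encoder. A minimal sketch, with an illustrative checkpoint and encoder size:

from transformers import AutoModelForCausalLM
from peft import PromptEncoderConfig, TaskType, get_peft_model

# P-Tuning: virtual tokens are inserted at the input embedding layer only,
# and their embeddings are produced by a small trainable prompt encoder.
config = PromptEncoderConfig(
    task_type=TaskType.CAUSAL_LM,
    num_virtual_tokens=20,
    encoder_hidden_size=128,      # hidden size of the prompt encoder (illustrative)
)
model = AutoModelForCausalLM.from_pretrained("gpt2")
model = get_peft_model(model, config)
model.print_trainable_parameters()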

P‑Tuning experimental results: Using GPT and BERT families, P‑Tuning matches full‑parameter performance and can even surpass it on some tasks.

Prompt Tuning

Prompt Tuning adds a prompt only at the input layer without extra MLPs.

Prompt Tuning characteristics:

Only the input layer is modified; no additional MLP is required.

Prompt Ensembling trains multiple prompts in parallel, effectively creating several independent “models” that share the core language model.
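
A corresponding sketch for Prompt Tuning, which trains only a handful of soft-prompt embeddings at the input layer. The initialization text and checkpoint are illustrative:

from transformers import AutoModelForCausalLM
from peft import PromptTuningConfig, PromptTuningInit, TaskType, get_peft_model

# Prompt tuning: only num_virtual_tokens embedding vectors at the input are trained.
config = PromptTuningConfig(
    task_type=TaskType.CAUSAL_LM,
    num_virtual_tokens=20,
    prompt_tuning_init=PromptTuningInit.TEXT,          # initialize from a text phrase
    prompt_tuning_init_text="Answer the question:",    # illustrative initialization text
    tokenizer_name_or_path="gpt2",
)
model = AutoModelForCausalLM.from_pretrained("gpt2")
model = get_peft_model(model, config)
model.print_trainable_parameters()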

Prompt Tuning experimental results: On the SuperGLUE benchmark, performance is comparable to traditional fine‑tuning and improves with larger model scales; it also enhances zero‑shot transfer.

P‑Tuning v2

P‑Tuning v2 adds trainable tokens to every layer, increasing task capacity while keeping parameter efficiency.

P‑Tuning v2 characteristics: Each layer receives tokens, allowing deeper influence on predictions.

P‑Tuning v2 experimental results: Using BERT and GLM families, performance is close to full fine‑tuning across various NLU tasks.

2. Selection Methods

2.1 BitFit

BitFit is a sparse fine‑tuning approach that updates only bias terms (or a subset thereof) in the model.

BitFit characteristics:

Most Transformer encoder parameters are frozen; only bias terms and the task‑specific classification head are trained.

Biases updated include those in the Query/Key/Value projections of attention modules, MLP layers, and LayerNorm bias parameters.

Each new task stores only the bias vector (less than 0.1% of total parameters) plus the final linear classifier.
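
Because BitFit adds no new modules, it can be expressed in a few lines of PyTorch that simply freeze every non-bias parameter. This is a minimal sketch on a BERT encoder; a task-specific classification head would be added and trained separately:

from transformers import AutoModel

model = AutoModel.from_pretrained("bert-base-uncased")

# BitFit: train only parameters whose name ends in "bias"; freeze everything else.
for name, param in model.named_parameters():
    param.requires_grad = name.endswith("bias")

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable bias parameters: {trainable} / {total} ({100 * trainable / total:.3f}%)")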

BitFit experimental results: Using BERT‑BASE, BERT‑LARGE, and RoBERTa‑BASE, BitFit underperforms full fine‑tuning but far exceeds a completely frozen model.

3. Re‑parameterization Methods

3.1 LoRA

LoRA assumes that the weight updates learned during fine-tuning have a low intrinsic rank. Each target weight matrix W is therefore kept frozen and augmented with a trainable low-rank update ΔW = BA.

LoRA characteristics:

After training, the low-rank update BA can be merged into W, so inference incurs no additional latency.

The modules are plug‑and‑play, enabling easy switching between tasks.
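
A minimal peft sketch of LoRA. The GPT-2 checkpoint and the target module name "c_attn" are illustrative; ChatGLM2 uses "query_key_value", as in the experiment below:

from transformers import AutoModelForCausalLM
from peft import LoraConfig, TaskType, get_peft_model

# LoRA: W is frozen; a low-rank update B @ A of rank r is trained alongside it.
config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,                          # rank of the update
    lora_alpha=32,                # scaling factor
    lora_dropout=0.05,
    target_modules=["c_attn"],    # attention projection of GPT-2 (model dependent)
)
model = AutoModelForCausalLM.from_pretrained("gpt2")
model = get_peft_model(model, config)
model.print_trainable_parameters()

# After training, the update can be merged into W so inference adds no latency:
merged_model = model.merge_and_unload()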

LoRA experimental results: Tested on RoBERTa, DeBERTa, GPT‑2, and GPT‑3‑175B, LoRA achieves performance comparable to full fine‑tuning and sometimes surpasses it.

3.2 AdaLoRA

AdaLoRA adaptively allocates the rank of each module based on importance scores for the weights, making it effectively an upgraded LoRA.

AdaLoRA characteristics: It adds a penalty term to the loss to keep the factorized matrices close to orthogonal, ensuring stable training.

AdaLoRA experimental results: Using DeBERTaV3‑BASE and BART‑LARGE, AdaLoRA often outperforms higher‑parameter methods; on the CoLA dataset it achieves an MCC of 70.04 with only 0.32 M trainable parameters.

Summary of PEFT Methods

Adapter methods are simple but add inference latency; soft-prompt methods avoid manual hard prompts but can be slower or harder to optimize. Prefix Tuning introduces trainable vectors at every layer, while P-Tuning and Prompt Tuning insert continuous prompts only at the input. BitFit updates only biases, yielding minimal parameter growth but lower performance than LoRA or Adapter methods. LoRA introduces no inference delay, and AdaLoRA improves upon it by adaptively allocating rank.

AdaLoRA Experiment on ChatGLM2‑6B

The experiment fine-tunes ChatGLM2-6B on a Chinese medical QA dataset (pediatrics and surgery, 10,000 examples each). The data is stored in JSON Lines format, one object per line with an "ask" (question) field and an "answer" field, which the dataset class below reads.

# dataset.py: builds (input_ids, labels) pairs for ChatGLM2-6B supervised fine-tuning.
import json
import torch
from torch.utils.data import Dataset

class my_dataset(Dataset):
    def __init__(self, data_path, tokenizer, max_source_length, max_target_length, is_train=True):
        super().__init__()
        self.tokenizer = tokenizer
        self.max_source_length = max_source_length
        self.max_target_length = max_target_length
        self.max_seq_length = self.max_source_length + self.max_target_length
        self.data_path = data_path
        self.data = self._load_data()
        self.is_train = is_train

    def __len__(self):
        return len(self.data)

    def __getitem__(self, index):
        item_data = self.data[index]
        if self.is_train:  # only the training path is implemented here
            model_inputs = self._preprocess(**item_data)
            return model_inputs

    def _load_data(self):
        # Each line of the file is a JSON object with "ask" and "answer" fields.
        data = []
        with open(self.data_path, "r", encoding="utf-8") as f:
            for line in f:
                if not line.strip():
                    continue
                json_line = json.loads(line)
                ask = json_line.get("ask")
                answer = json_line.get("answer")
                if ask and answer:
                    data.append({"question": ask, "answer": answer})
        return data

    def _preprocess(self, question, answer):
        model_inputs = {"input_ids": None, "labels": None}
        # ChatGLM2's tokenizer provides build_prompt() to wrap the query in its chat template.
        prompt = self.tokenizer.build_prompt(question, history=None)
        a_ids = self.tokenizer.encode(text=prompt, add_special_tokens=True,
                                      truncation=True, max_length=self.max_source_length)
        b_ids = self.tokenizer.encode(text=answer, add_special_tokens=False,
                                      truncation=True, max_length=self.max_target_length)
        context_length = len(a_ids)
        input_ids = a_ids + b_ids + [self.tokenizer.eos_token_id]
        # Mask the prompt tokens in the labels so the loss is computed only on the answer.
        labels = [self.tokenizer.pad_token_id] * context_length + b_ids + [self.tokenizer.eos_token_id]
        # Pad both sequences to the fixed maximum length; padded label positions become -100.
        pad_len = self.max_seq_length - len(input_ids)
        input_ids = input_ids + [self.tokenizer.pad_token_id] * pad_len
        labels = labels + [self.tokenizer.pad_token_id] * pad_len
        labels = [(l if l != self.tokenizer.pad_token_id else -100) for l in labels]
        model_inputs["input_ids"] = torch.tensor(input_ids, dtype=torch.long)
        model_inputs["labels"] = torch.tensor(labels, dtype=torch.long)
        return model_inputs
# Training script: AdaLoRA fine-tuning of ChatGLM2-6B with peft and Accelerate.
from transformers import AutoTokenizer, AutoModel
from peft import AdaLoraConfig, get_peft_model, TaskType
from dataset import my_dataset
from tqdm import tqdm
import torch
from torch.utils.data import DataLoader
import os
import argparse
import shutil
from accelerate import Accelerator

parser = argparse.ArgumentParser()
parser.add_argument("--model_name", type=str, default="/data/chatglm2-6b")
parser.add_argument("--r", type=int, default=8)
parser.add_argument("--lora_alpha", type=int, default=32)
parser.add_argument("--lora_dropout", type=float, default=0.01)
parser.add_argument("--epochs", type=int, default=5)
parser.add_argument("--batch_size", type=int, default=1)
parser.add_argument("--max_source_length", type=int, default=128)
parser.add_argument("--max_target_length", type=int, default=256)
parser.add_argument("--train_json_path", type=str, default="./test_data/train.json")
parser.add_argument("--lr", type=float, default=1e-4)
parser.add_argument("--model_output_dir", type=str, default="output")
args = parser.parse_args()

accelerator = Accelerator()
device = accelerator.device
accelerator.print(f"device {str(accelerator.device)} is used!")

def main():
    # AdaLoRA configuration: low-rank updates on ChatGLM2's fused query_key_value projection.
    adaLoRA_config = AdaLoraConfig(
        peft_type="ADALORA", task_type=TaskType.CAUSAL_LM,
        r=args.r, lora_alpha=args.lora_alpha,
        target_modules=["query_key_value"],
        lora_dropout=args.lora_dropout,
    )
    tokenizer = AutoTokenizer.from_pretrained(args.model_name, trust_remote_code=True)
    model = AutoModel.from_pretrained(args.model_name, trust_remote_code=True)
    model = get_peft_model(model, adaLoRA_config)
    print(model)
    model.print_trainable_parameters()
    model = model.half()  # fp16 weights to fit the 6B model in GPU memory
    train_set = my_dataset(args.train_json_path, tokenizer, args.max_source_length, args.max_target_length)
    train_loader = DataLoader(train_set, batch_size=args.batch_size, shuffle=True)
    optimizer = torch.optim.AdamW(params=model.parameters(), lr=args.lr)
    if os.path.exists(args.model_output_dir):
        shutil.rmtree(args.model_output_dir)
    os.makedirs(args.model_output_dir)
    # Accelerate wraps the model, optimizer, and dataloader; DeepSpeed is enabled via the config file.
    model, optimizer, train_loader = accelerator.prepare(model, optimizer, train_loader)
    for epoch in range(args.epochs):
        total_loss = 0
        for step, batch in enumerate(tqdm(train_loader)):
            with accelerator.accumulate(model):
                outputs = model(**batch)
                total_loss += outputs.loss.detach().cpu().float()
                accelerator.backward(outputs.loss)
                optimizer.step()
                optimizer.zero_grad()
        accelerator.print(f"epoch {epoch}: mean loss {total_loss / len(train_loader):.4f}")
        # Save only the adapter weights after each epoch.
        unwrapped_model = accelerator.unwrap_model(model)
        unwrapped_model.save_pretrained(os.path.join(args.model_output_dir, f"{epoch}_epoch"),
                                        save_function=accelerator.save,
                                        state_dict=accelerator.get_state_dict(model))

if __name__ == "__main__":
    main()
# Inference: load the frozen base model and attach the trained AdaLoRA adapter weights.
import torch
from transformers import AutoTokenizer, AutoModel
from peft import PeftModel

def ans():
    tokenizer = AutoTokenizer.from_pretrained("/data/chatglm2-6b", trust_remote_code=True)
    model = AutoModel.from_pretrained("/data/chatglm2-6b", trust_remote_code=True)
    # Path to the adapter checkpoint saved during training.
    model = PeftModel.from_pretrained(model, "output_20000_all5/y_epoch")
    model = model.half().to(torch.device("cuda:0"))
    model.eval()
    ask = input("Enter your question: ")
    response, history = model.chat(tokenizer, ask, history=[])
    print("Answer:", response)

if __name__ == "__main__":
    ans()

The training uses an Accelerate configuration ( config_accelerate.yml) that enables DeepSpeed with gradient clipping of 1.0, gradient accumulation steps of 16, and distributed type DEEPSPEED.
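
For reference, a config_accelerate.yml matching that description might look roughly like the following; the ZeRO stage, offload settings, mixed-precision mode, and process counts are assumptions not stated above:

# config_accelerate.yml (sketch)
compute_environment: LOCAL_MACHINE
distributed_type: DEEPSPEED
deepspeed_config:
  gradient_accumulation_steps: 16
  gradient_clipping: 1.0
  zero_stage: 2                  # assumption: ZeRO stage not specified in the article
  offload_optimizer_device: none # assumption
  offload_param_device: none     # assumption
mixed_precision: fp16            # assumption: matches model.half() in the training script
num_machines: 1
num_processes: 1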

Experimental results show that AdaLoRA fine‑tunes only 0.0468% of the total parameters (≈2.9 M trainable out of 6.2 B) while achieving competitive performance on the medical QA task.

Conclusion

Beyond the three main PEFT categories, hybrid methods such as MAM‑Adapter and UniPELT combine ideas from adapters, prompts, and LoRA, often yielding better results at the cost of increased parameter count and inference latency. Future articles will explore acceleration and parallelism frameworks for large models.

Tags: LoRA, adapter, PEFT, AdaLoRA, parameter-efficient fine-tuning, soft prompts
Written by

UCloud Tech

UCloud is a leading neutral cloud provider in China, developing its own IaaS, PaaS, AI service platform, and big data exchange platform, and delivering comprehensive industry solutions for public, private, hybrid, and dedicated clouds.
