Generating Custom QA Datasets with Large Language Models and Fine‑Tuning via LoRA
This article explains how to use a large language model to automatically convert long‑form texts into Alpaca‑style question‑answer pairs, build a LangChain processing chain, and then fine‑tune a model such as Phi‑3‑mini‑4k‑instruct with LoRA, providing full Python code examples.
Currently many fine‑tuning methods for large models can run on consumer‑grade GPUs, but preparing a suitable dataset is often the bottleneck; existing open datasets may not match the desired domain, and manually creating thousands of QA pairs is time‑consuming.
This guide shows how to let a large model generate its own QA dataset from arbitrary long texts (e.g., books, author works, chat logs) using prompt engineering, and then fine‑tune the model with the generated data.
1. Dataset Construction
Typical QA datasets follow the Alpaca format:
```json
{
  "instruction": "保持健康的三个提示。",
  "input": "",
  "output": "以下是保持健康的三个提示:\n\n1. 保持身体活动。每天做适当的身体运动,如散步、跑步或游泳,能促进心血管健康,增强肌肉力量,并有助于减少体重。\n\n2. 均衡饮食。每天食用新鲜的蔬菜水果、全谷物和低脂蛋白质食物,避免高糖、高脂肪和加工食品,以保持健康的饮食习惯。\n\n3. 睡眠充足。睡眠对人体健康至关重要,成年人每天应保证 7-8 小时的睡眠。良好的睡眠有助于减轻压力,促进身体恢复,并提高注意力和记忆力。"
}
```

In practice, source material is often a long paragraph of text. The goal is to transform such passages into this Alpaca JSON structure automatically.
2. Prompt Design
The system prompt instructs the model to study the provided text and extract question-answer pairs from it, keeping answers detailed and close to the original wording, with at most 30 questions:
```python
QA_PAIRS_SYSTEM_PROMPT = """
标记中是一段文本,学习和分析它,并整理学习成果:
- 提出问题并给出每个问题的答案。
- 答案需详细完整,尽可能保留原文描述。
- 答案可以包含普通文字、链接、代码、表格、公式、媒体链接等 Markdown 元素。
- 最多提出 30 个问题。
"""
```

The human prompt specifies the desired output format, a list of question/answer dictionaries:
```python
# Literal braces in the JSON example are doubled ({{ }}) so that
# ChatPromptTemplate does not mistake them for template variables;
# {text} is the real input variable.
QA_PAIRS_HUMAN_PROMPT = """
请按以下格式整理学习成果:
文本
[
{{"question": "问题1", "answer": "答案1"}},
{{"question": "问题2", "answer": "答案2"}},
]
------
我们开始吧!
{text}
"""
```

3. Document Processing
Python code loads a text file and splits it into manageable chunks using LangChain utilities:
```python
import json
from typing import List
from tqdm import tqdm
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.pydantic_v1 import BaseModel, Field
from langchain_core.output_parsers import JsonOutputParser
from langchain_openai import AzureChatOpenAI
from langchain_community.document_loaders import UnstructuredFileLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

def split_document(filepath):
    loader = UnstructuredFileLoader(filepath)
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=2048,   # characters per chunk
        chunk_overlap=128  # overlap preserves context across chunk boundaries
    )
    documents = loader.load_and_split(text_splitter)
    return documents
```

4. Building the Chain
The chain combines the system/human prompts, the LLM, and a JSON output parser. The `QaPairs` schema the parser references is not defined in the original article; a plausible definition, consistent with the `BaseModel`/`Field` imports above, is included here:

```python
class QaPair(BaseModel):
    question: str = Field(description="问题")
    answer: str = Field(description="答案")

class QaPairs(BaseModel):
    qas: List[QaPair] = Field(description="问答对列表")

def create_chain():
    prompt = ChatPromptTemplate.from_messages([
        ("system", QA_PAIRS_SYSTEM_PROMPT),
        ("human", QA_PAIRS_HUMAN_PROMPT)
    ])
    # endpoint, deployment_name, and api_key come from your Azure OpenAI configuration
    llm = AzureChatOpenAI(
        azure_endpoint=endpoint,
        deployment_name=deployment_name,
        openai_api_key=api_key,
        openai_api_version="2024-02-01",
    )
    parser = JsonOutputParser(pydantic_object=QaPairs)
    chain = prompt | llm | parser
    return chain
```

5. Fine-tuning the Model
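Sections 3 and 4 produce the chunks and the chain but leave the driving loop implicit. A minimal sketch of that loop, decoupled from the LangChain objects so it is easy to follow (`chunks` are the `page_content` strings from `split_document`, `chain` comes from `create_chain`; the function name is mine), writes records with the `question`/`answer` keys that the tokenization step later expects:

```python
import json

def generate_dataset(chunks, chain, outpath="dataset/train.json"):
    """Run the QA chain over each text chunk and collect question/answer
    records. Chunks whose output fails JSON parsing are simply skipped;
    wrap the loop in tqdm for progress reporting on large inputs."""
    records = []
    for text in chunks:
        try:
            qa_pairs = chain.invoke({"text": text})  # parsed list of dicts
        except Exception:
            continue  # malformed model output: skip this chunk
        for qa in qa_pairs:
            records.append({"question": qa["question"], "answer": qa["answer"]})
    with open(outpath, "w", encoding="utf-8") as f:
        json.dump(records, f, ensure_ascii=False, indent=2)
    return records
```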
After generating the dataset (example shown with "The Little Prince"), the article uses the peft library to apply LoRA to the Phi‑3‑mini‑4k‑instruct model.
```python
from peft import LoraConfig, TaskType, get_peft_model
from transformers import Trainer, TrainingArguments, AutoModelForCausalLM, AutoTokenizer
from datasets import load_dataset

# Phi-3 is a decoder-only model, so the LoRA task type is CAUSAL_LM
peft_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    inference_mode=False,
    r=8,
    lora_alpha=32,
    lora_dropout=0.1,
)
model = AutoModelForCausalLM.from_pretrained("microsoft/Phi-3-mini-4k-instruct", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-4k-instruct")
model = get_peft_model(model, peft_config)
model.print_trainable_parameters()
```

Dataset tokenization and training configuration:
```python
def tokenize_function(example):
    encoded = tokenizer(example['question'], truncation=True, padding='max_length', max_length=128)
    encoded["labels"] = tokenizer(example["answer"], truncation=True, padding="max_length", max_length=128)["input_ids"]
    return encoded
```
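Tokenizing the question as inputs and the answer as labels, as above, fits an encoder-decoder setup. For a decoder-only model such as Phi-3, a common alternative (sketched here as an assumption, not the article's method) concatenates prompt and answer into one sequence and masks the prompt positions in `labels` with -100 so only answer tokens contribute to the loss. Working at the token-id level:

```python
def build_causal_example(prompt_ids, answer_ids, max_length=128, pad_id=0):
    """Build a causal-LM training example: input is prompt + answer in one
    sequence; labels mask prompt and padding positions with -100 so the
    loss is computed only on the answer tokens."""
    input_ids = (prompt_ids + answer_ids)[:max_length]
    labels = ([-100] * len(prompt_ids) + answer_ids)[:max_length]
    pad = max_length - len(input_ids)          # right-pad to fixed length
    attention_mask = [1] * len(input_ids) + [0] * pad
    input_ids = input_ids + [pad_id] * pad
    labels = labels + [-100] * pad
    return {"input_ids": input_ids, "attention_mask": attention_mask, "labels": labels}
```

In that scheme, `tokenize_function` would call this helper with the `input_ids` the tokenizer produces for the question and answer strings.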
```python
# train.json holds the generated question/answer pairs; the article reuses
# the same file as the validation split
data_files = {"train": "./dataset/train.json", "validation": "./dataset/train.json"}
dataset = load_dataset("json", data_files=data_files)
tokenized_dataset = dataset.map(tokenize_function, batched=True)

training_args = TrainingArguments(
    output_dir="outputs",
    learning_rate=1e-3,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    num_train_epochs=2,
    weight_decay=0.01,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["validation"],
    tokenizer=tokenizer,
)
trainer.train()
model.save_pretrained("outputs")
```

6. Inference with the LoRA-adapted Model
Loading the LoRA adapter and generating answers:
```python
model.load_adapter('outputs', adapter_name='lora01')
model.set_adapter("lora01")
model.eval()

# "What kind of book did the author read as a child?"
inputs = tokenizer("作者小时候看了一本关于什么的书?", return_tensors="pt").to(model.device)
outputs = model.generate(input_ids=inputs["input_ids"], max_new_tokens=50)
print(tokenizer.batch_decode(outputs.detach().cpu().numpy(), skip_special_tokens=True)[0])
```

The article concludes that the fine-tuned model can be used just like the original AutoModelForCausalLM for downstream tasks.
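For serving without the peft runtime, the adapter can also be folded into the base weights with peft's `merge_and_unload()`. Conceptually, merging adds `(lora_alpha / r) * B @ A` to each adapted weight matrix, after which no adapter machinery is needed at inference. A toy pure-Python illustration of that arithmetic (not the peft implementation):

```python
def merge_lora(W, A, B, lora_alpha=32, r=8):
    """Fold a LoRA update into a dense weight: W' = W + (alpha / r) * B @ A.
    W is (out, in), A is (r, in), B is (out, r); plain nested lists here."""
    scale = lora_alpha / r
    rows, cols = len(W), len(W[0])
    merged = [row[:] for row in W]
    for i in range(rows):
        for j in range(cols):
            delta = sum(B[i][k] * A[k][j] for k in range(len(A)))
            merged[i][j] += scale * delta
    return merged
```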
Rare Earth Juejin Tech Community