How to Efficiently Fine‑Tune Llama 3 on a Free Colab T4 GPU with Unsloth
This article is a step-by-step, code-rich tutorial for fine-tuning the open-source Llama 3.2 1B and 3B models on Google Colab using the Unsloth library and LoRA. It covers environment setup, model loading, adapter insertion, dataset preparation, training configuration, inference, and model saving, all while keeping GPU memory usage low.
What is a Large Language Model (LLM)?
LLMs are massive neural networks trained on vast text corpora; they can answer questions, write articles, translate languages, and hold conversations, much like a highly knowledgeable parrot.
Answer questions: act as an expert assistant.
Write articles: generate news, stories, code, etc.
Translate languages: convert text between languages.
Hold dialogues: function as a chatbot.
Why Fine‑Tune?
Although Llama 3 is powerful out‑of‑the‑box, it may not perform optimally on specialized tasks such as customer‑service chats or poetry generation. Fine‑tuning re‑educates the model on domain‑specific examples, improving performance for the target task.
Unsloth: The Speed‑Boosting Library
Unsloth accelerates LLM training, with the project advertising speedups of up to 30× in its most optimized configurations, while drastically reducing GPU memory consumption and simplifying the training code. Its speed gains stem from optimizations such as Flash Attention-2 and 4-bit quantization.
Faster: training can be many times quicker than using standard Hugging Face Transformers.
Memory‑efficient: GPU memory requirements are dramatically lowered.
Easy to use: concise, clean API.
LoRA: Lightweight Fine‑Tuning
LoRA (Low-Rank Adaptation) updates only a small subset of model parameters, making fine-tuning fast and resource-light while often matching full-parameter performance; a small numeric sketch follows the list below.
Efficient: training is quick and uses little memory.
High performance: LoRA can achieve results comparable to full fine‑tuning.
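To make the idea concrete, here is a minimal sketch (illustration only, with hypothetical layer sizes, not code from the tutorial) of how a LoRA update works: the pretrained weight W stays frozen, and only two small matrices A and B are trained, shrinking the trainable parameter count from d·k to r·(d+k).
import torch

d, k, r, alpha = 4096, 4096, 16, 16   # hypothetical layer size, LoRA rank and scaling
W = torch.randn(d, k)                 # frozen pretrained weight
A = torch.randn(r, k) * 0.01          # trainable low-rank factor (small random init)
B = torch.zeros(d, r)                 # trainable low-rank factor (zero init, so the update starts at 0)

def lora_forward(x):
    # Equivalent to multiplying by W + (alpha / r) * B @ A, without ever modifying W.
    return x @ W.T + (alpha / r) * (x @ A.T) @ B.T

full = d * k            # parameters a full fine-tune would update for this layer
lora = r * (d + k)      # parameters LoRA actually trains
print(f"full: {full:,}  lora: {lora:,}  ({full / lora:.0f}x fewer)")
At rank 16 this example layer trains roughly 128× fewer parameters than a full fine-tune, which is why the adapter fits comfortably alongside a 4-bit base model.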
Hands‑On: Fine‑Tuning Llama 3 with Unsloth
A complete, runnable guide for Google Colab.
Environment Setup
%%capture
# Colab special install to avoid PyTorch issues
!pip install --no-deps bitsandbytes accelerate xformers==0.0.29 peft "trl<0.15.0" triton
!pip install --no-deps cut_cross_entropy unsloth_zoo
!pip install sentencepiece protobuf datasets huggingface_hub hf_transfer
!pip install --no-deps unsloth
Important note: use this special install sequence instead of a plain pip install unsloth to avoid PyTorch conflicts on Colab.
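Before loading the model, it is worth confirming that Colab actually gave you a GPU runtime; this small check is my addition (not part of the original install cell) and should report a Tesla T4 on the free tier.
import torch

# The free Colab tier typically provides a Tesla T4 with about 15 GB of VRAM.
assert torch.cuda.is_available(), "No GPU found - set Runtime > Change runtime type > GPU."
props = torch.cuda.get_device_properties(0)
print(f"GPU: {props.name}, VRAM: {props.total_memory / 1024**3:.1f} GB")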
Load Model and Tokenizer
from unsloth import FastLanguageModel
import torch
max_seq_length = 2048 # maximum sequence length
dtype = None # auto‑detect dtype
load_in_4bit = True # enable 4‑bit quantization
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Llama-3.2-3B-Instruct",  # or "unsloth/Llama-3.2-1B-Instruct"
    max_seq_length=max_seq_length,
    dtype=dtype,
    load_in_4bit=load_in_4bit,
)
The FastLanguageModel.from_pretrained function loads the chosen Llama 3 variant and its tokenizer; with load_in_4bit=True the weights are stored in 4-bit precision, which is what keeps the model within the T4's memory budget.
Add LoRA Adapter
model = FastLanguageModel.get_peft_model(
    model,
    r=16,  # LoRA rank
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    lora_alpha=16,
    lora_dropout=0,
    bias="none",
    use_gradient_checkpointing="unsloth",
    random_state=3407,
    use_rslora=False,
    loftq_config=None,
)
This adds a lightweight LoRA adapter; only the small low-rank matrices injected into the listed attention and MLP projection layers are trained, while the base weights stay frozen.
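A quick way to verify how small the trainable footprint is (an added check, independent of whatever summary method the PEFT wrapper exposes) is to count the parameters that require gradients; with r=16 this should be on the order of tens of millions against roughly 3 billion base weights.
# Only the LoRA adapter matrices should require gradients; the 4-bit base weights stay frozen.
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Trainable (LoRA) parameters: {trainable:,}")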
Prepare Dataset
from datasets import load_dataset
from unsloth.chat_templates import standardize_sharegpt, get_chat_template
tokenizer = get_chat_template(tokenizer, chat_template="llama-3.1")  # attach the Llama 3 chat template
dataset = load_dataset("mlabonne/FineTome-100k", split="train")
dataset = standardize_sharegpt(dataset)
def formatting_prompts_func(examples):
    convos = examples["conversations"]
    texts = [
        tokenizer.apply_chat_template(
            convo, tokenize=False, add_generation_prompt=False
        )
        for convo in convos
    ]
    return {"text": texts}
dataset = dataset.map(formatting_prompts_func, batched=True)
The ShareGPT-style conversations are converted to Hugging Face's role/content format by standardize_sharegpt and then rendered into plain text with the Llama 3 chat template; tokenization itself happens later inside the trainer.
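Printing one formatted example is a cheap sanity check that the chat template was applied; the rendered text should contain the Llama 3 header tokens around each turn.
# Expect <|start_header_id|>user<|end_header_id|> and assistant blocks in the rendered text.
print(dataset[0]["text"][:500])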
Train the Model
from trl import SFTTrainer
from transformers import TrainingArguments, DataCollatorForSeq2Seq
from unsloth import is_bfloat16_supported
from unsloth.chat_templates import train_on_responses_only
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    data_collator=DataCollatorForSeq2Seq(tokenizer=tokenizer),
    dataset_num_proc=2,
    packing=False,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        warmup_steps=5,
        max_steps=60,  # demonstration only
        learning_rate=2e-4,
        fp16=not is_bfloat16_supported(),
        bf16=is_bfloat16_supported(),
        logging_steps=1,
        optim="adamw_8bit",
        weight_decay=0.01,
        lr_scheduler_type="linear",
        seed=3407,
        output_dir="outputs",
        report_to="none",
    ),
)
trainer = train_on_responses_only(
    trainer,
    instruction_part="<|start_header_id|>user<|end_header_id|>\n\n",
    response_part="<|start_header_id|>assistant<|end_header_id|>\n\n",
)
trainer_stats = trainer.train()
Key hyper-parameters such as batch size, gradient accumulation, learning rate, and the 8-bit AdamW optimizer are set above; train_on_responses_only masks the user turns so the loss is computed only on assistant responses, and only 60 steps are run for a quick demo.
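Once training finishes, trainer_stats holds the standard Hugging Face Trainer metrics, and PyTorch can report peak GPU memory, a useful check that the run stayed well inside the T4's roughly 15 GB.
import torch

print(f"Training took {trainer_stats.metrics['train_runtime']:.0f} seconds")            # wall-clock time
print(f"Peak reserved GPU memory: {torch.cuda.max_memory_reserved() / 1024**3:.2f} GB")  # should be a few GB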
Inference
from unsloth.chat_templates import get_chat_template
from transformers import TextStreamer
tokenizer = get_chat_template(tokenizer, chat_template="llama-3.1")
FastLanguageModel.for_inference(model) # enable 2× faster inference
messages = [{"role":"user","content":"Continue the fibonacci sequence: 1, 1, 2, 3, 5, 8,"}]
inputs = tokenizer.apply_chat_template(
    messages, tokenize=True, add_generation_prompt=True, return_tensors="pt"
).to("cuda")
text_streamer = TextStreamer(tokenizer, skip_prompt=True)  # stream tokens as they are generated
outputs = model.generate(
    input_ids=inputs,
    streamer=text_streamer,
    max_new_tokens=64,
    use_cache=True,
    temperature=1.5,
    min_p=0.1,
)
print(tokenizer.batch_decode(outputs))
The TextStreamer prints tokens as they are generated, and FastLanguageModel.for_inference enables Unsloth's native 2× faster inference mode.
Save and Reload
model.save_pretrained("lora_model")
tokenizer.save_pretrained("lora_model")
# model.push_to_hub("your_name/lora_model", token="...")
# tokenizer.push_to_hub("your_name/lora_model", token="...")
The fine-tuned LoRA weights (the adapter, not the full base model) can be stored locally or pushed to the Hugging Face Hub.
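To use the adapter later, the same FastLanguageModel.from_pretrained call can point at the saved folder; the sketch below assumes the local "lora_model" directory created above (for a Hub upload, use your repo name instead).
from unsloth import FastLanguageModel

# Reload the base model together with the saved LoRA adapter.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="lora_model",      # local folder (or "your_name/lora_model" on the Hub)
    max_seq_length=2048,
    dtype=None,
    load_in_4bit=True,
)
FastLanguageModel.for_inference(model)  # switch back to the faster inference mode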
Conclusion
This guide demonstrates that, by combining Unsloth's memory-efficient kernels with LoRA's lightweight adaptation, even the 1B- or 3B-parameter Llama 3 models can be fine-tuned on a free Google Colab T4 GPU, producing a responsive conversational AI without requiring expensive hardware.