How to Efficiently Fine‑Tune Llama 3 on a Free Colab T4 GPU with Unsloth

This article provides a step‑by‑step, code‑rich tutorial for fine‑tuning the open‑source Llama 3 1B and 3B models on Google Colab using the Unsloth library and LoRA, covering environment setup, model loading, adapter insertion, dataset preparation, training configuration, inference, and model saving, all while keeping GPU memory usage low.

Ops Development & AI Practice

What is a Large Language Model (LLM)?

LLMs are massive neural networks trained on vast text corpora; they can answer questions, write articles, translate languages, and hold conversations, much like a highly knowledgeable parrot.

Answer questions: act as an expert assistant.

Write articles: generate news, stories, code, etc.

Translate languages: convert text between languages.

Hold dialogues: function as a chatbot.

Why Fine‑Tune?

Although Llama 3 is powerful out‑of‑the‑box, it may not perform optimally on specialized tasks such as customer‑service chats or poetry generation. Fine‑tuning re‑educates the model on domain‑specific examples, improving performance for the target task.

Unsloth: The Speed‑Boosting Library

Unsloth accelerates LLM fine‑tuning: the open‑source version advertises roughly 2× faster training with around 70% lower GPU memory use, while keeping the code base simple. The gains come from hand‑written Triton kernels, optimized attention, and 4‑bit quantization.

Faster: training can be many times quicker than using standard Hugging Face Transformers.

Memory‑efficient: GPU memory requirements are dramatically lowered.

Easy to use: concise, clean API.

LoRA: Lightweight Fine‑Tuning

LoRA (Low‑Rank Adaptation) updates only a small subset of model parameters, making fine‑tuning fast and resource‑light while often matching full‑parameter performance.

Efficient: training is quick and uses little memory.

High performance: LoRA can achieve results comparable to full fine‑tuning.
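The core idea can be shown numerically. Instead of updating a full weight matrix W, LoRA trains two small factors B and A whose product forms a low‑rank update; a minimal NumPy sketch (toy dimensions, not the actual Llama layer sizes):

```python
import numpy as np

# Minimal numerical sketch of a LoRA update: the pretrained weight W stays
# frozen; only the small factors B (d_out x r) and A (r x d_in) are trained.
rng = np.random.default_rng(0)
d_out, d_in, r, alpha = 64, 64, 4, 8

W = rng.normal(size=(d_out, d_in))     # frozen pretrained weight
A = rng.normal(size=(r, d_in)) * 0.01  # trainable, small random init
B = np.zeros((d_out, r))               # trainable, zero init (update starts at 0)

W_adapted = W + (alpha / r) * B @ A    # effective weight at inference time

# Trainable parameters shrink from d_out*d_in to r*(d_in + d_out):
full, lora = d_out * d_in, r * (d_in + d_out)
print(full, lora)  # 4096 512
```

Because B starts at zero, the adapted model is initially identical to the base model; training moves only the 512 adapter parameters instead of all 4096.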

Hands‑On: Fine‑Tuning Llama 3 with Unsloth

A complete, runnable guide for Google Colab.

Environment Setup

%%capture
# Colab special install to avoid PyTorch issues
!pip install --no-deps bitsandbytes accelerate xformers==0.0.29 peft "trl<0.15.0" triton
!pip install --no-deps cut_cross_entropy unsloth_zoo
!pip install sentencepiece protobuf datasets huggingface_hub hf_transfer
!pip install --no-deps unsloth

Important note: use the special install command instead of a plain pip install unsloth to avoid PyTorch conflicts.

Load Model and Tokenizer

from unsloth import FastLanguageModel
import torch

max_seq_length = 2048  # maximum sequence length
dtype = None            # auto‑detect dtype
load_in_4bit = True     # enable 4‑bit quantization

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Llama-3.2-3B-Instruct",  # or "unsloth/Llama-3.2-1B-Instruct"
    max_seq_length=max_seq_length,
    dtype=dtype,
    load_in_4bit=load_in_4bit,
)

The FastLanguageModel.from_pretrained function loads the chosen Llama 3 variant and its tokenizer.
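A rough back‑of‑the‑envelope estimate shows why 4‑bit loading matters on a 16 GB T4. Assuming about 0.5 bytes per parameter for 4‑bit weights (a simplification that ignores quantization overhead, activations, and the KV cache):

```python
# Rough VRAM estimate for quantized model weights.
# Assumption: bits/8 bytes per parameter; real usage is higher because of
# quantization metadata, activations, optimizer state, and the KV cache.
def weight_memory_gb(n_params, bits=4):
    return n_params * bits / 8 / 1024**3

print(f"{weight_memory_gb(3e9):.2f} GB")   # ~1.40 GB for 3B params at 4-bit
print(f"{weight_memory_gb(3e9, 16):.2f} GB")  # ~5.59 GB at fp16, for comparison
```

Even with overhead, the 4‑bit 3B model fits comfortably alongside LoRA gradients and optimizer state on a free T4.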

Add LoRA Adapter

model = FastLanguageModel.get_peft_model(
    model,
    r=16,  # LoRA rank
    target_modules=[
        "q_proj","k_proj","v_proj","o_proj",
        "gate_proj","up_proj","down_proj",
    ],
    lora_alpha=16,
    lora_dropout=0,
    bias="none",
    use_gradient_checkpointing="unsloth",
    random_state=3407,
    use_rslora=False,
    loftq_config=None,
)

This adds a lightweight LoRA adapter; only the specified projection layers are trained.
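How lightweight is "lightweight"? A back‑of‑the‑envelope count, assuming typical Llama 3.2 3B layer shapes (hidden size 3072, intermediate size 8192, 28 layers, grouped‑query attention with 1024‑dim k/v projections; these dimensions are assumptions, not values read from the checkpoint):

```python
# Estimate LoRA parameter count for r=16 across the targeted projections.
# Each adapted linear layer of shape (d_in, d_out) adds r * (d_in + d_out)
# trainable parameters. Layer shapes below are assumed Llama 3.2 3B dims.
r = 16
layer_shapes = {
    "q_proj": (3072, 3072), "k_proj": (3072, 1024), "v_proj": (3072, 1024),
    "o_proj": (3072, 3072), "gate_proj": (3072, 8192),
    "up_proj": (3072, 8192), "down_proj": (8192, 3072),
}
per_layer = sum(r * (d_in + d_out) for d_in, d_out in layer_shapes.values())
total = per_layer * 28  # 28 transformer layers
print(f"{total/1e6:.1f}M trainable LoRA params")  # ≈ 24.3M, under 1% of 3B
```

In practice you can confirm the exact number by calling `model.print_trainable_parameters()` on the PEFT model.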

Prepare Dataset

from datasets import load_dataset
from unsloth.chat_templates import standardize_sharegpt, get_chat_template

# Attach the Llama 3.1 chat template before formatting, so
# apply_chat_template below renders prompts correctly
tokenizer = get_chat_template(tokenizer, chat_template="llama-3.1")

dataset = load_dataset("mlabonne/FineTome-100k", split="train")
dataset = standardize_sharegpt(dataset)

def formatting_prompts_func(examples):
    convos = examples["conversations"]
    texts = [
        tokenizer.apply_chat_template(
            convo, tokenize=False, add_generation_prompt=False
        )
        for convo in convos
    ]
    return {"text": texts}

dataset = dataset.map(formatting_prompts_func, batched=True)

standardize_sharegpt converts the ShareGPT‑style conversations into role/content messages, and apply_chat_template renders each conversation into a plain‑text prompt using the Llama 3 chat template; actual tokenization happens later, inside the trainer.
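To make the first step concrete, here is a minimal sketch of the kind of conversion standardize_sharegpt performs; the role mapping is an illustrative assumption, and Unsloth's real implementation handles more edge cases:

```python
# Illustrative only: ShareGPT turns use {"from": ..., "value": ...};
# chat templates expect {"role": ..., "content": ...}.
ROLE_MAP = {"human": "user", "gpt": "assistant", "system": "system"}

def to_messages(sharegpt_convo):
    return [
        {"role": ROLE_MAP[turn["from"]], "content": turn["value"]}
        for turn in sharegpt_convo
    ]

convo = [
    {"from": "human", "value": "Hi!"},
    {"from": "gpt", "value": "Hello, how can I help?"},
]
print(to_messages(convo))
```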

Train the Model

from trl import SFTTrainer
from transformers import TrainingArguments, DataCollatorForSeq2Seq
from unsloth import is_bfloat16_supported
from unsloth.chat_templates import train_on_responses_only

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    data_collator=DataCollatorForSeq2Seq(tokenizer=tokenizer),
    dataset_num_proc=2,
    packing=False,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        warmup_steps=5,
        max_steps=60,               # demonstration only
        learning_rate=2e-4,
        fp16=not is_bfloat16_supported(),
        bf16=is_bfloat16_supported(),
        logging_steps=1,
        optim="adamw_8bit",
        weight_decay=0.01,
        lr_scheduler_type="linear",
        seed=3407,
        output_dir="outputs",
        report_to="none",
    ),
)

trainer = train_on_responses_only(
    trainer,
    instruction_part="<|start_header_id|>user<|end_header_id|>\n\n",
    response_part="<|start_header_id|>assistant<|end_header_id|>\n\n",
)
trainer_stats = trainer.train()

Key hyper‑parameters (batch size, gradient accumulation, learning rate, the 8‑bit AdamW optimizer) are set explicitly; train_on_responses_only masks the user turns so loss is computed only on the assistant's responses, and only 60 steps are run for a quick demo.
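Note that gradient accumulation multiplies the effective batch size without increasing peak memory; a quick sanity check of the numbers above:

```python
# Effective batch size = per-device batch * accumulation steps; each
# optimizer step therefore sees 8 examples, and 60 steps see 480 total.
per_device_train_batch_size = 2
gradient_accumulation_steps = 4
max_steps = 60

effective_batch = per_device_train_batch_size * gradient_accumulation_steps
examples_seen = effective_batch * max_steps
print(effective_batch, examples_seen)  # 8 480
```

So this demo touches only 480 of the 100k examples; for a real run, raise max_steps or switch to num_train_epochs.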

Inference

from unsloth.chat_templates import get_chat_template
from transformers import TextStreamer

tokenizer = get_chat_template(tokenizer, chat_template="llama-3.1")
FastLanguageModel.for_inference(model)  # enable 2× faster inference

messages = [{"role":"user","content":"Continue the fibonacci sequence: 1, 1, 2, 3, 5, 8,"}]
inputs = tokenizer.apply_chat_template(
    messages, tokenize=True, add_generation_prompt=True, return_tensors="pt"
).to("cuda")

text_streamer = TextStreamer(tokenizer, skip_prompt=True)
_ = model.generate(
    input_ids=inputs,
    streamer=text_streamer,  # stream decoded tokens to stdout as they arrive
    max_new_tokens=64,
    use_cache=True,
    temperature=1.5,
    min_p=0.1,
)

TextStreamer prints tokens as they are generated, and FastLanguageModel.for_inference enables Unsloth's native 2× faster inference mode.
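The min_p=0.1 setting pairs with the high temperature: min‑p sampling keeps only tokens whose probability is at least min_p times the top token's probability, pruning the long tail that a high temperature would otherwise inflate. A toy sketch of the idea (not the Transformers implementation):

```python
# Toy min-p filter: drop tokens below min_p * max(probs), then renormalize.
def min_p_filter(probs, min_p=0.1):
    threshold = min_p * max(probs)
    kept = [p if p >= threshold else 0.0 for p in probs]
    total = sum(kept)
    return [p / total for p in kept]

# With top prob 0.6 the threshold is 0.06: the two 0.05 tokens are dropped
# and the surviving mass is renormalized over the top two tokens.
print(min_p_filter([0.6, 0.3, 0.05, 0.05], min_p=0.1))
```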

Save and Reload

model.save_pretrained("lora_model")
tokenizer.save_pretrained("lora_model")
# model.push_to_hub("your_name/lora_model", token="...")
# tokenizer.push_to_hub("your_name/lora_model", token="...")

The fine‑tuned LoRA weights can be stored locally or pushed to the Hugging Face Hub.

Conclusion

This guide demonstrates that, by combining Unsloth's memory‑efficient kernels with LoRA's lightweight adaptation, even the 1B‑ and 3B‑parameter Llama 3.2 models can be fine‑tuned on a free Google Colab T4 GPU, producing a responsive conversational model without expensive hardware.

Tags: AI, Fine-tuning, LoRA, GPU, Llama 3, Colab, Unsloth
Written by

Ops Development & AI Practice

DevSecOps engineer sharing experiences and insights on AI, Web3, and Claude code development. Aims to help solve technical challenges, improve development efficiency, and grow through community interaction. Feel free to comment and discuss.
