How to Start Training Your Own AI Model: A Complete Roadmap

This guide maps the end-to-end process for building a small AI model—from leveraging open-source base models and applying SFT with LoRA/QLoRA, through alignment techniques like DPO or ORPO, to low-cost distillation and final quantization for local deployment, while recommending free GPU resources and essential tooling.


Pre-training – Stage 0

Open‑source large models such as Llama, Qwen and DeepSeek are released after a massive pre‑training (PT) phase that consumes millions of dollars and trillions of tokens, so you should start from an existing base model rather than training from scratch.

Supervised Fine‑tuning (SFT) – Stage 1

Base models can only continue a prompt; they do not understand questions. SFT (also called Instruction Fine‑Tuning, IFT) uses paired (instruction, desired answer) data so the model learns to answer directly. For domain‑specific knowledge or style, SFT alone is often sufficient.
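For illustration, instruction-tuning data is usually stored as simple (instruction, answer) records; the sketch below uses the common Alpaca-style field names, and the example content is made up rather than taken from any specific dataset.

```python
# Illustrative SFT records in an Alpaca-style layout (field names and content are
# examples only, not from a real dataset).
sft_examples = [
    {
        "instruction": "Summarise the following contract clause in plain language.",
        "input": "The lessee shall indemnify and hold harmless the lessor against all claims...",
        "output": "The tenant agrees to cover the landlord's losses arising from such claims.",
    },
    {
        "instruction": "What does SFT stand for?",
        "input": "",
        "output": "Supervised Fine-Tuning: training a base model on instruction-answer pairs.",
    },
]
```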

When is SFT enough? If the goal is to embed specialized knowledge (e.g., legal Q&A, customer service) and the response quality is stable, no further alignment is required.

Data requirements vary: a few hundred examples may suffice for simple tasks, while thousands are needed for complex ones.

Parameter‑efficient Fine‑tuning – LoRA & QLoRA

Full-parameter fine-tuning of a 7B model needs at least 80 GB of VRAM, which is expensive. LoRA keeps the pretrained weights W frozen and inserts a pair of small low-rank matrices A and B beside them, so only the adapter update ΔW = B·A is trained.
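In symbols (the standard LoRA formulation, stated here for reference rather than quoted from the original article):

```latex
% LoRA: the frozen weight matrix W is augmented by a trainable low-rank update
W' = W + \frac{\alpha}{r}\, B A, \qquad
W \in \mathbb{R}^{d \times k},\;
B \in \mathbb{R}^{d \times r},\;
A \in \mathbb{R}^{r \times k},\;
r \ll \min(d, k)
```

With rank r much smaller than the weight dimensions, the trainable parameter count per layer drops from d·k to r·(d + k), which is what makes the memory figures below possible.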

LoRA: keeps the original weights in fp16/bf16; requires 14-16 GB of VRAM for a 7B model.

QLoRA: quantises the base weights to 4-bit and then adds LoRA adapters; with Unsloth it runs in 6-8 GB of VRAM, while native HuggingFace needs 10-12 GB.

QLoRA + Unsloth reduces memory by 40-70% and speeds training 2-5×; a T4 (16 GB) can handle a 7B model.
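As a concrete sketch, the standard HuggingFace recipe for QLoRA SFT combines a 4-bit-quantised base model with LoRA adapters via transformers, peft, and trl. Recent library versions are assumed; the model name, hyperparameters, and toy dataset are illustrative, and Unsloth's notebooks follow the same pattern with their own faster loader.

```python
# Minimal QLoRA SFT sketch (illustrative settings; recent transformers/peft/trl assumed).
import torch
from datasets import Dataset
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig
from trl import SFTTrainer, SFTConfig

base = "Qwen/Qwen2.5-7B"  # any open-source base model

# 4-bit NF4 quantisation of the frozen base weights (the "Q" in QLoRA).
bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(base, quantization_config=bnb, device_map="auto")

# LoRA adapters: only these small matrices receive gradients.
lora = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

# Toy instruction data; in practice, load hundreds to thousands of real examples.
train = Dataset.from_list([
    {"text": "### Instruction:\nWhat is LoRA?\n### Response:\nLoRA trains small low-rank adapters beside frozen weights."},
])

trainer = SFTTrainer(
    model=model,
    train_dataset=train,
    peft_config=lora,
    args=SFTConfig(
        output_dir="sft-qlora",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=8,
        num_train_epochs=1,
    ),
)
trainer.train()
```

A setup of this shape is what keeps a 7B run within the single-card budgets quoted above, with the exact footprint depending on sequence length, batch size, and whether Unsloth's optimised kernels are used.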

Alignment Training – Stage 2

After SFT the model may produce inconsistent answers. Alignment aims to make outputs consistently follow human preferences by presenting “good vs. bad” answer pairs.

RLHF (three-stage): SFT → Reward Model (RM) → PPO. Requires separate policy, reference, reward, and critic models; memory-intensive and mainly used by large companies.

Alternative methods with lower overhead:

DPO: no separate reward model; optimises preferences directly using the policy and a frozen reference model. Medium VRAM; suitable for individuals or small teams (a minimal sketch appears below).

ORPO: merges SFT and preference optimisation into one step, with no reference model needed. Low VRAM; prioritises workflow simplicity.

GRPO: replaces the critic with a group-average score over sampled outputs, reducing memory to medium-low. Best for tasks whose results can be automatically verified (e.g., math, code).

KTO: only needs binary good/bad labels per answer, with no paired data. Low VRAM; useful when annotation resources are limited.

GRPO is effective for verification‑ready tasks because it avoids a separate reward network, but it is not a drop‑in replacement for DPO.
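Below is a minimal DPO sketch using TRL's DPOTrainer, assuming a recent TRL version; the checkpoint name and the tiny preference dataset are placeholders, and a real run needs many prompt/chosen/rejected examples.

```python
# Minimal DPO sketch with TRL (recent version assumed; names and data are placeholders).
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOTrainer, DPOConfig

sft_checkpoint = "my-org/my-sft-model"  # hypothetical Stage-1 SFT checkpoint
model = AutoModelForCausalLM.from_pretrained(sft_checkpoint)
tokenizer = AutoTokenizer.from_pretrained(sft_checkpoint)

# Preference data: each record pairs a preferred and a dispreferred answer.
prefs = Dataset.from_list([{
    "prompt": "Explain LoRA in one sentence.",
    "chosen": "LoRA freezes the base weights and trains small low-rank adapter matrices.",
    "rejected": "LoRA is a brand of GPU.",
}])

trainer = DPOTrainer(
    model=model,
    ref_model=None,  # if None, TRL keeps a frozen copy of the policy as the reference
    args=DPOConfig(output_dir="dpo-out", beta=0.1, per_device_train_batch_size=1),
    train_dataset=prefs,
    processing_class=tokenizer,
)
trainer.train()
```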

Distillation – Stage 3 (Optional)

Distillation trains a smaller model by using high-quality outputs from a large model as SFT data, eliminating any RL step. DeepSeek-R1 generated roughly 800k samples and fine-tuned 1.5B, 7B and 14B student models, which then outperformed same-size baselines on reasoning benchmarks.
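A minimal sketch of the data-generation half of distillation, assuming a teacher model runnable through transformers (the teacher name and prompts are illustrative; in practice the teacher is often queried via an API and the outputs are filtered for quality before reuse):

```python
# Generate (prompt, teacher_answer) pairs to reuse as SFT data for a small student model.
from transformers import pipeline

# Illustrative teacher; any strong open model (or an API-served one) works the same way.
teacher = pipeline("text-generation", model="Qwen/Qwen2.5-72B-Instruct", device_map="auto")

prompts = [
    "Solve step by step: a train travels 180 km in 2.5 hours; what is its average speed?",
    "Write a Python function that checks whether a string is a palindrome.",
]

sft_pairs = []
for p in prompts:
    out = teacher(p, max_new_tokens=512, do_sample=False)[0]["generated_text"]
    answer = out[len(p):].strip()  # drop the echoed prompt, keep only the teacher's answer
    sft_pairs.append({"instruction": p, "output": answer})

# sft_pairs can now be fed into the same Stage-1 SFT + QLoRA pipeline.
```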

When the downstream task is automatically verifiable, distillation is the preferred low‑cost way to gain reasoning ability; GRPO can be added later for further enhancement.
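Because GRPO only needs an automatic scorer rather than a learned reward model, the reward for a verifiable task can be as simple as the toy function below. This is only a sketch of the kind of reward function TRL's GRPOTrainer accepts; the signature details and the answer-matching rule are illustrative.

```python
# Toy verifiable reward: score each sampled completion by whether its final number
# matches the reference answer (illustrative; real verifiers are task-specific).
import re

def math_reward(completions, answer, **kwargs):
    rewards = []
    for completion, reference in zip(completions, answer):
        numbers = re.findall(r"-?\d+(?:\.\d+)?", completion)
        rewards.append(1.0 if numbers and numbers[-1] == str(reference) else 0.0)
    return rewards
```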

Free GPU Resources

Kaggle (2 × T4, ~30 h/week): run SFT + QLoRA; the two cards together can handle a 13B model.

Google Colab (T4, 15 GB): run DPO or ORPO; Unsloth supports DPO within 15 GB for a 7B model.

Local: merge the LoRA adapters and quantise with llama.cpp to GGUF (Q4_K_M); a 7B model occupies ~4.5 GB and runs on a standard laptop.
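The local step above usually looks like the following sketch: merge the trained adapters back into the base weights with peft, then hand the merged checkpoint to llama.cpp. Paths, model names, and the exact llama.cpp script names are illustrative and vary between versions.

```python
# Merge LoRA adapters into the base model before GGUF conversion (names are illustrative).
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-7B", torch_dtype=torch.bfloat16)
merged = PeftModel.from_pretrained(base, "sft-qlora").merge_and_unload()  # fold B·A back into W
merged.save_pretrained("merged-model")
AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B").save_pretrained("merged-model")

# Then, inside a llama.cpp checkout (command names differ across versions):
#   python convert_hf_to_gguf.py merged-model --outfile model-f16.gguf
#   ./llama-quantize model-f16.gguf model-Q4_K_M.gguf Q4_K_M
```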

Tool Landscape (2026)

LLaMA-Factory: the most comprehensive UI; supports SFT/DPO/ORPO/GRPO/PPO for 100+ models; recommended for beginners.

Unsloth: fastest training, 40-70% VRAM saving and a 2-5× speedup; provides ready-to-run notebooks for free GPU environments.

TRL (HuggingFace): minimal code, offers SFTTrainer/DPOTrainer/GRPOTrainer, integrates tightly with the Transformers ecosystem.

Overall Training Path

Stage 0 – Skip pre-training: use an open-source base model (e.g., Llama-3-8B, Qwen2.5-7B).

Stage 1 – SFT + QLoRA: collect instruction data and fine-tune with Unsloth or LLaMA-Factory.

Stage 2 – Alignment (optional): add DPO for most cases, ORPO for a simpler workflow, or KTO when only binary labels are available.

Stage 2+ – Reasoning boost (optional): distill large-model outputs; consider GRPO if the task's results can be automatically verified.

Stage 3 – Quantisation: convert with llama.cpp to GGUF (Q4_K_M) for a ~4.5 GB footprint and run locally.

This field evolves rapidly; new methods appear frequently. The current recommended pipeline is: SFT → Alignment → Distillation → Quantisation.