How to Ace LLM Interview Questions: Deep Dive into Pre‑training, SFT, DPO & RLHF

This guide breaks down the four major large‑model training paradigms—pre‑training, supervised fine‑tuning, preference alignment, and RLHF—explaining which parameters are updated, how attention is reshaped, and what capabilities are gained, so you can deliver a structured, interview‑ready answer.


Introduction

When interviewing for large‑model positions, candidates often face questions such as “What is the difference between pre‑training and fine‑tuning?” or “What does RLHF actually train?” This article provides a clear, three‑layer framework and a step‑by‑step analysis of four training methods to help you answer confidently.

"Pre‑training and fine‑tuning? RLHF? DPO vs. RLHF?"

Core Insight: Three‑Layer Pyramid

The author visualizes LLM training as a pyramid:

Knowledge/Capability – stored as statistical patterns in parameters (the top of the pyramid).

Attention Mechanism – a routing mechanism controlled by parameters (the middle layer).

Parameters – the only entity directly updated by gradient descent (the base).

Key insight: regardless of the training style, the sole object of gradient descent is the parameters; attention is merely a mechanism shaped by those parameters, and knowledge is the distribution of patterns encoded in them.
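To make the "attention is shaped by parameters" point concrete, here is a minimal PyTorch sketch (toy dimensions, not any particular model): the attention map is recomputed every forward pass from the learned Q/K/V projections, so gradient descent only ever touches those projection weights, never the attention weights themselves.

```python
import torch
import torch.nn.functional as F

d_model = 16
W_Q = torch.nn.Linear(d_model, d_model, bias=False)  # trained parameters
W_K = torch.nn.Linear(d_model, d_model, bias=False)  # trained parameters
W_V = torch.nn.Linear(d_model, d_model, bias=False)  # trained parameters

x = torch.randn(1, 5, d_model)                        # 5 tokens, batch of 1
Q, K, V = W_Q(x), W_K(x), W_V(x)
attn = F.softmax(Q @ K.transpose(-2, -1) / d_model ** 0.5, dim=-1)
out = attn @ V                                        # tokens routed/mixed by attn

# Only the projections hold trainable parameters; `attn` is derived, not stored.
print(all(p.requires_grad for p in W_Q.parameters()))  # True
```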

Four Training Paradigms Explained

1. Pre‑training

Training parameters: all parameters (Q/K/V projections, FFN, LayerNorm, Embedding).

What attention learns: global semantic, syntactic, and long‑range dependencies.

What is learned: world knowledge, language rules, and general reasoning, compressing vast amounts of human‑written text into billions of parameters (175 billion in the case of GPT‑3).

One‑sentence summary: massive text updates all parameters, making attention model semantics and compressing world knowledge into weights.
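A minimal sketch of what "updates all parameters" means in practice, assuming a toy causal language model and random token IDs as a stand‑in for web text: the objective is next‑token prediction, and every parameter (embeddings, attention projections, FFN, LayerNorm, output head) sits in the optimizer.

```python
import torch
import torch.nn as nn

class TinyLM(nn.Module):
    def __init__(self, vocab=1000, d=64):
        super().__init__()
        self.emb = nn.Embedding(vocab, d)
        layer = nn.TransformerEncoderLayer(d, nhead=4, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d, vocab)

    def forward(self, ids):
        h = self.emb(ids)
        mask = nn.Transformer.generate_square_subsequent_mask(ids.size(1))
        return self.head(self.blocks(h, mask=mask))   # causal next-token logits

model = TinyLM()
opt = torch.optim.AdamW(model.parameters(), lr=3e-4)  # ALL parameters are trained

tokens = torch.randint(0, 1000, (8, 32))              # stand-in for raw text batches
logits = model(tokens[:, :-1])
loss = nn.functional.cross_entropy(                   # predict token t+1 from 0..t
    logits.reshape(-1, logits.size(-1)), tokens[:, 1:].reshape(-1))
loss.backward()
opt.step()
```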

2. Supervised Fine‑tuning (SFT)

Training parameters: either full‑parameter fine‑tuning or low‑rank adapters such as LoRA/QLoRA.

Attention reshaping: focuses on instruction keywords (e.g., “summarize”, “analyze”), ignores irrelevant history, and adopts task‑oriented patterns (entity extraction → entity focus; summarization → key‑sentence focus).

Capabilities gained: instruction‑following, task‑format compliance, and domain‑specific factual knowledge.

One‑sentence summary: locally optimizes parameters, reshapes attention toward task‑oriented distribution, teaching the model to “listen” and perform tasks.
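To illustrate the "low‑rank adapters" option, here is a hand‑rolled LoRA layer in PyTorch (a sketch of the idea, not the `peft` library's implementation): the pre‑trained weight is frozen and only two small matrices A and B receive gradients, so SFT touches a tiny fraction of the parameters.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r=8, alpha=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                     # freeze pre-trained weights
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))
        self.scale = alpha / r

    def forward(self, x):
        # frozen base output + trainable low-rank correction
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scale

layer = LoRALinear(nn.Linear(512, 512))
print([n for n, p in layer.named_parameters() if p.requires_grad])  # ['A', 'B']
```

In real SFT these wrappers are typically applied to the attention (and sometimes FFN) projections, which is exactly how the task‑oriented attention reshaping described above is achieved with few trainable parameters.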

3. Preference Alignment (DPO/IPO/KTO)

Training parameters: typically only a small set via LoRA.

Attention correction: suppresses hallucination‑related activations, strengthens fact‑based attention, and aligns with genuine user intent.

Alignment goals: human preference, safety boundaries, answer style, and logical consistency.

One‑sentence summary: without adding new knowledge, fine‑tunes a few parameters to correct attention bias, making outputs follow human values.
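The DPO objective itself is compact enough to show directly. A sketch with placeholder log‑probability values (the numbers are illustrative, not from any model): the loss rewards the policy for preferring the chosen response over the rejected one relative to a frozen reference model, with no separate reward model.

```python
import torch
import torch.nn.functional as F

beta = 0.1  # strength of the implicit KL constraint to the reference model

# Log-probs of the chosen / rejected responses (illustrative placeholder values)
policy_chosen, policy_rejected = torch.tensor(-12.0), torch.tensor(-15.0)
ref_chosen, ref_rejected = torch.tensor(-13.0), torch.tensor(-14.0)

# DPO loss: -log sigmoid(beta * [(log pi(y_w) - log ref(y_w)) - (log pi(y_l) - log ref(y_l))])
margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
loss = -F.logsigmoid(beta * margin)
print(loss)  # in a real setup, gradients flow only into the (often LoRA) policy parameters
```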

4. Reinforcement Learning from Human Feedback (RLHF)

Training parameters: policy‑gradient updates to the model’s backbone.

Attention reinforcement: high‑reward responses strengthen their attention pathways; low‑reward responses are suppressed.

Optimization objective: more natural dialogue style and expressions that align with human preferences.

One‑sentence summary: uses reward signals to guide parameter updates, amplifying high‑quality attention routes without learning new facts.
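A deliberately simplified sketch of "reward signal guides parameter updates": a REINFORCE‑style update where a scalar reward (assumed to come from a reward model) scales the log‑probability of the sampled response. Production RLHF uses PPO with a KL penalty against a reference model; this toy only shows the direction of the gradient flow.

```python
import torch

logits = torch.randn(1, 5, 1000, requires_grad=True)    # toy policy outputs
log_probs = torch.log_softmax(logits, dim=-1)
sampled = torch.randint(0, 1000, (1, 5))                 # sampled response tokens
token_logp = log_probs.gather(-1, sampled.unsqueeze(-1)).squeeze(-1)

reward = torch.tensor(0.8)                               # scalar from a reward model
loss = -(reward * token_logp.sum())                      # policy-gradient loss
loss.backward()                                          # high reward -> reinforce this path
```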

Comparative Summary of Training Stages

Pre‑training: updates all parameters; attention becomes globally semantic; learns world knowledge, language rules, and reasoning.

SFT: updates all or low‑rank parameters; attention becomes task‑oriented; learns instruction following and task formats.

Alignment (DPO/IPO/KTO): typically updates only a small set of LoRA parameters; attention is corrected toward factual grounding and safety; learns human preference and safety alignment.

RLHF: updates backbone via policy gradient; attention is reinforced for high‑reward paths; learns natural dialogue style and preference‑aligned expression.

Interview‑Ready Conclusions

Regardless of the training method, the only thing truly being trained is the parameters.

Attention is not directly trained; it is a routing mechanism controlled by parameters, and training changes how attention is allocated.

Knowledge is the statistical regularities encoded into parameters during pre‑training; later stages usually do not add new knowledge, only adjust how it is accessed.

The differences among training methods lie in which parameters are updated, how attention bias is adjusted, and which capabilities are optimized.

Final Tip for the Interview

When the interviewer asks “What’s the difference between pre‑training and fine‑tuning?”, respond with a structured answer:

“Pre‑training uses massive data to update all parameters, encoding world knowledge into the weights; fine‑tuning uses a small dataset to locally optimize parameters and reshape attention toward task‑specific goals, teaching the model to follow specific instructions. Both update parameters, but the scale, objective, and attention distribution differ completely.”

This concise, layered response demonstrates deep understanding and earns full marks.
