How to Estimate Hardware Costs for Large-Model Fine-Tuning and Training (Interview Classic #1)
The article explains how to estimate GPU memory and overall hardware requirements for fine-tuning and training large dense and MoE models, detailing calculations for full-parameter and LoRA approaches, scaling rules, and hidden costs relevant to interview assessments.
Problem analysis
Evaluating hardware cost for large‑model fine‑tuning requires quick, accurate estimation of GPU memory and training time. It differs from pure algorithm questions because it tests end‑to‑end engineering judgment over a fine‑tuning task.
Standard answer – Dense models
Full‑parameter fine‑tuning
Memory is approximated by parameter count × bytes per parameter. For a 70 B‑parameter dense model stored in FP16:
Parameter count: 70 000 000 000
Bytes per parameter: 2 (FP16 = 16 bit)
Storage memory ≈ 70 000 000 000 × 2 bytes ≈ 140 GB (≈130 GiB when divided by 1024³ bytes).
During training, FP16 gradient tensors occupy a similar amount (≈140 GB). The optimizer state (e.g., AdamW) stores FP32 first‑ and second‑order moments for every parameter, about 8 bytes per parameter, i.e. roughly four times the FP16 model memory → ≈560 GB. Adding activation memory, fragmentation, and distributed‑training overhead pushes the practical requirement to about 1 TB of GPU memory.
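A quick back‑of‑the‑envelope helper makes this arithmetic reusable. The sketch below is illustrative only: the per‑parameter byte counts and the 20 % overhead factor for activations, fragmentation, and communication buffers are assumptions chosen to land near the ≈1 TB figure, not measured values.

```python
def full_finetune_memory_gb(
    params_billion: float,
    weight_bytes: float = 2,      # FP16/BF16 weights
    grad_bytes: float = 2,        # FP16 gradients
    optimizer_bytes: float = 8,   # AdamW: FP32 first + second moments
    overhead: float = 0.2,        # activations, fragmentation, comm buffers (assumed)
) -> float:
    """Rough GPU-memory estimate (in decimal GB) for full-parameter fine-tuning."""
    n = params_billion * 1e9
    total_bytes = n * (weight_bytes + grad_bytes + optimizer_bytes) * (1 + overhead)
    return total_bytes / 1e9


# 70 B dense model in FP16 with AdamW:
print(round(full_finetune_memory_gb(70)))   # ≈ 1008 GB, i.e. "about 1 TB"
```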
For smaller dense models, memory scales roughly linearly: a 13 B model (≈1/5 of 70 B) needs ≈200 GB for full‑parameter fine‑tuning. Quantizing the 70 B weights to 8 bit halves their footprint (≈70 GB) and 4 bit cuts it to a quarter (≈35 GB), but gradients and optimizer states usually stay in higher precision (FP16/FP32), so the overall savings are limited.
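Using the same hypothetical helper, these scaling and quantization claims can be sanity‑checked; lowering only weight_bytes mirrors the fact that gradients and optimizer states keep their higher precision.

```python
# 13 B dense model, same assumptions: roughly 1/5 of the 70 B budget
print(round(full_finetune_memory_gb(13)))                    # ≈ 187 GB, close to the ≈200 GB above

# 70 B model with 8-bit weights (gradients/optimizer unchanged):
print(round(full_finetune_memory_gb(70, weight_bytes=1)))    # ≈ 924 GB: weights drop to ≈70 GB, total barely moves

# 70 B model with 4-bit weights (0.5 bytes per parameter):
print(round(full_finetune_memory_gb(70, weight_bytes=0.5)))  # ≈ 882 GB: training state still dominates
```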
LoRA fine‑tuning
LoRA freezes the base model and updates only low‑rank adapters. For a 70 B model, trainable parameters are only 1–2 % of the total, so gradients and optimizer states are needed only for the adapters; the requirement drops to roughly 160 GB (the frozen FP16 weights plus a small training overhead).
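A similar sketch for LoRA, assuming 1–2 % trainable parameters, a frozen FP16 base, and AdamW states only for the adapters; the fractions and overhead factor are assumptions for illustration.

```python
def lora_memory_gb(
    params_billion: float,
    trainable_fraction: float = 0.015,  # ~1–2 % of parameters in low-rank adapters (assumed)
    weight_bytes: float = 2,            # frozen FP16 base model
    adapter_train_bytes: float = 12,    # adapter weights + gradients + AdamW moments
    overhead: float = 0.1,              # activations and buffers (assumed)
) -> float:
    """Rough GPU-memory estimate (in decimal GB) for LoRA fine-tuning."""
    n = params_billion * 1e9
    base = n * weight_bytes
    adapters = n * trainable_fraction * adapter_train_bytes
    return (base + adapters) * (1 + overhead) / 1e9


print(round(lora_memory_gb(70)))   # ≈ 168 GB, in the same range as the ≈160 GB above
```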
MoE models
Mixture‑of‑Experts (MoE) splits a large model into multiple expert sub‑networks; only a few experts are activated per input, keeping compute roughly constant while increasing total parameters.
Example: Qwen3‑235B‑A22B has 22 B active parameters and 7.8 B shared parameters.
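The relationship between total and active parameters follows directly from the expert configuration, as the sketch below shows; the expert size and counts used here are placeholder values chosen for illustration, not the published Qwen3‑235B‑A22B configuration.

```python
def moe_param_counts_billion(
    shared_b: float,          # always-active parameters (attention, embeddings, shared experts)
    expert_b: float,          # parameters per routed expert, in billions
    num_experts: int,         # total number of routed experts
    experts_per_token: int,   # experts selected for each token (top-k routing)
) -> tuple[float, float]:
    """Return (total, routed_active) parameter counts in billions."""
    total = shared_b + expert_b * num_experts
    routed_active = expert_b * experts_per_token
    return total, routed_active


# Placeholder configuration, chosen only to illustrate the formula:
total_b, active_b = moe_param_counts_billion(
    shared_b=7.8, expert_b=1.8, num_experts=126, experts_per_token=12
)
print(f"total ≈ {total_b:.0f} B, routed-active ≈ {active_b:.0f} B, "
      f"dense-equivalent ≈ {7.8 + active_b:.0f} B")
```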
Full‑parameter fine‑tuning
Treat the MoE as an equivalent dense model with shared + active parameters. The 22 B active plus 7.8 B shared parameters give an ≈30 B‑equivalent dense model, requiring ≈500 GB of GPU memory (practically ≈600 GB with offloading).
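Feeding that ≈30 B dense‑equivalent size into the hypothetical full_finetune_memory_gb sketch from earlier gives numbers in the same range; whether an FP32 master copy of the weights and offloading buffers are budgeted decides where the estimate lands between roughly 430 GB and 600 GB.

```python
# ≈30 B dense-equivalent (22 B active + 7.8 B shared), same assumptions as before:
print(round(full_finetune_memory_gb(30)))                      # ≈ 432 GB

# Budgeting an FP32 master weight copy on top of the AdamW moments
# (common in mixed-precision training) pushes the estimate toward the
# ≈500–600 GB range quoted above:
print(round(full_finetune_memory_gb(30, optimizer_bytes=12)))  # ≈ 576 GB
```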
LoRA fine‑tuning for MoE
Efficient LoRA can limit updates to selected experts, bringing the requirement down to about 110 GB of GPU memory. Exact active and shared parameter counts should be taken from the model's official documentation.
Related hot questions
Engineering challenges of MoE in enterprise environments
Beyond expert‑routing balance and high‑speed interconnects, challenges include cross‑GPU communication optimization, dynamic expert load scheduling, fault tolerance, and monitoring.
Prioritizing hardware resources on a limited budget
First ensure sufficient GPU memory capacity and bandwidth; then consider the number of GPUs—single high‑memory cards often outperform multiple low‑memory cards for small models, though they are costlier.
Hidden costs beyond GPUs when evaluating hardware budget
Additional considerations: power and cooling, rack space, cluster operations and personnel costs, and distributed communication overhead.
Fun with Large Models
Master's graduate from Beijing Institute of Technology, published four top‑journal papers, previously worked as a developer at ByteDance and Alibaba. Currently researching large models at a major state‑owned enterprise. Committed to sharing concise, practical AI large‑model development experience, believing that AI large models will become as essential as PCs in the future. Let's start experimenting now!