How to Estimate Hardware Costs for Large-Model Fine-Tuning and Training (Interview Classic #1)

The article explains how to estimate GPU memory and overall hardware requirements for fine-tuning and training large dense and MoE models, detailing calculations for full-parameter and LoRA approaches, scaling rules, and hidden costs relevant to interview assessments.


Problem analysis

Evaluating hardware cost for large‑model fine‑tuning requires quick, accurate estimation of GPU memory and training time. Unlike pure algorithm questions, this assessment tests end‑to‑end engineering command of a fine‑tuning task.

Standard answer – Dense models

Full‑parameter fine‑tuning

Memory is approximated by parameter count × bytes per parameter. For a 70 B‑parameter dense model stored in FP16: Parameter count: 70 000 000 000. Precision: FP16 = 16 bit = 2 bytes per parameter.

Storage memory ≈ 70 000 000 000 × 2 bytes ≈ 140 GB (using 1 GB = 10⁹ bytes; in binary units this is ≈130 GiB — either figure is fine for a back‑of‑envelope estimate).

During training, gradient tensors occupy a similar amount (≈140 GB). Optimizer state (e.g., AdamW) stores first‑ and second‑moment estimates for each parameter, typically adding four times the model memory → ≈560 GB. That is 140 + 140 + 560 = 840 GB before overhead; adding activation memory, fragmentation, and distributed‑training buffers pushes the practical requirement to about 1 TB of GPU memory.
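The rule of thumb above can be wrapped in a small calculator. The 4× optimizer multiplier and the 20 % overhead factor are illustrative assumptions chosen to match the estimates in this section, not exact values:

```python
def full_finetune_memory_gb(n_params: float, bytes_per_param: int = 2,
                            optimizer_multiplier: int = 4,
                            overhead_factor: float = 1.2) -> dict:
    """Rule-of-thumb GPU-memory estimate (decimal GB) for full fine-tuning.

    Assumptions (illustrative, not exact): FP16 weights (2 bytes/param),
    gradients the same size as the weights, AdamW states ~4x the weight
    memory, and a 20% overhead factor covering activations, fragmentation,
    and distributed-training buffers.
    """
    gb = 1e9  # decimal gigabytes, matching the back-of-envelope style above
    weights = n_params * bytes_per_param / gb
    gradients = weights
    optimizer = optimizer_multiplier * weights
    subtotal = weights + gradients + optimizer
    return {
        "weights_gb": weights,
        "gradients_gb": gradients,
        "optimizer_gb": optimizer,
        "total_gb": subtotal * overhead_factor,
    }

est = full_finetune_memory_gb(70e9)
# weights ≈ 140 GB, gradients ≈ 140 GB, optimizer ≈ 560 GB,
# total with overhead ≈ 1 TB
```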

For smaller dense models, memory scales roughly linearly: a 13 B model (≈1/5 of 70 B) needs ≈200 GB for full fine‑tuning. Quantizing the 70 B model's weights to 8 bit halves their storage (≈70 GB) and 4 bit quarters it (≈35 GB), but gradients and optimizer states usually remain at higher precision, limiting the overall savings.
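Weight‑only storage at different bit‑widths follows directly from the same formula; a tiny helper (decimal GB, weights only — gradients and optimizer state excluded, as noted above):

```python
def weight_storage_gb(n_params: float, bits: int) -> float:
    """Storage for the weights alone at a given bit-width (decimal GB)."""
    return n_params * bits / 8 / 1e9

# 70 B parameters at common precisions:
for bits in (16, 8, 4):
    print(f"{bits}-bit: {weight_storage_gb(70e9, bits):.0f} GB")
# 16-bit: 140 GB, 8-bit: 70 GB, 4-bit: 35 GB
```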

LoRA fine‑tuning

LoRA freezes the base model and trains only low‑rank adapter matrices. For a 70 B model, trainable parameters are typically 1–2 % of the total, so only the adapters need gradients and optimizer state; memory drops to roughly 160 GB, most of it the frozen FP16 base weights.
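A minimal sketch of the same estimate for LoRA, assuming ~1.5 % trainable parameters and a fixed activation‑memory allowance (both illustrative numbers, not measurements):

```python
def lora_memory_gb(n_params: float, trainable_frac: float = 0.015,
                   bytes_per_param: int = 2, optimizer_multiplier: int = 4,
                   activation_overhead_gb: float = 15.0) -> float:
    """Rough LoRA memory: frozen FP16 base + grads/optimizer on adapters only.

    trainable_frac and activation_overhead_gb are illustrative assumptions;
    real values depend on adapter rank, target modules, and batch size.
    """
    gb = 1e9
    base = n_params * bytes_per_param / gb                 # frozen base weights
    adapters = n_params * trainable_frac * bytes_per_param / gb
    # Adapters need weights + gradients + 4x AdamW state:
    trainable_cost = adapters * (1 + 1 + optimizer_multiplier)
    return base + trainable_cost + activation_overhead_gb

print(round(lora_memory_gb(70e9)))  # ≈ 168 GB, in line with the ~160 GB above
```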

MoE models

Mixture‑of‑Experts (MoE) splits a large model into multiple expert sub‑networks; only a few experts are activated per input, keeping compute roughly constant while increasing total parameters.
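A toy top‑k routing sketch (made‑up sizes, not any real model) shows why per‑token compute tracks the number of active experts k rather than the total expert count E:

```python
import numpy as np

# Toy MoE routing: each token activates only k of E experts, so per-token
# compute scales with k while total parameters scale with E. All sizes here
# are illustrative.
rng = np.random.default_rng(0)
E, k, d = 8, 2, 16                           # experts, active per token, hidden dim
x = rng.standard_normal(d)                   # one token's hidden state
router = rng.standard_normal((E, d))         # router weight matrix

logits = router @ x                          # one routing score per expert
topk = np.argsort(logits)[-k:]               # indices of the k highest-scoring experts
gates = np.exp(logits[topk]) / np.exp(logits[topk]).sum()  # softmax over top-k

experts = rng.standard_normal((E, d, d))     # one weight matrix per expert
y = sum(g * (experts[i] @ x) for g, i in zip(gates, topk))
# Only k expert matmuls ran; the other E - k experts cost nothing this token.
```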

Example: Qwen3‑235B‑A22B has 22 B active parameters and 7.8 B shared parameters.

Full‑parameter fine‑tuning

Treat the MoE as an equivalent dense model with shared + active parameters. The 22 B active + 7.8 B shared ≈ 30 B‑equivalent dense model, requiring ≈500 GB GPU memory (practically ≈600 GB with offloading).
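The equivalent‑dense rule can be checked numerically. This sketch reuses the same FP16/AdamW assumptions as the dense‑model estimate (×6 = weights + gradients + 4× optimizer state; ×1.2 is an assumed overhead factor); it lands somewhat below the article's ≈500 GB figure, which adds extra safety margin:

```python
# Parameter counts quoted above for Qwen3-235B-A22B (active + shared).
active_params = 22e9
shared_params = 7.8e9
equiv = active_params + shared_params   # ≈ 30 B "equivalent dense" model

weights = equiv * 2 / 1e9               # FP16 weights, decimal GB (≈ 60 GB)
# weights + gradients + 4x AdamW state = 6x weights; 1.2 = assumed overhead
total = weights * 6 * 1.2               # ≈ 430 GB by this rule of thumb
```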

LoRA fine‑tuning for MoE

Efficient LoRA can limit updates to selected experts, needing about 110 GB of GPU memory. Exact active and shared parameter counts should be taken from the model's official documentation.

Related hot questions

Engineering challenges of MoE in enterprise environments

Beyond expert‑routing balance and high‑speed interconnects, challenges include cross‑GPU communication optimization, dynamic expert load scheduling, fault tolerance, and monitoring.

Prioritizing hardware resources on a limited budget

First ensure sufficient GPU memory capacity and bandwidth; then consider the number of GPUs—single high‑memory cards often outperform multiple low‑memory cards for small models, though they are costlier.

Hidden costs beyond GPUs when evaluating hardware budget

Additional considerations: power and cooling, rack space, cluster operations and personnel costs, and distributed communication overhead.

Tags: Fine-tuning, LoRA, Mixture of Experts, large model, GPU memory, hardware cost
Written by

Fun with Large Models

Master's graduate from Beijing Institute of Technology, published four top‑journal papers, previously worked as a developer at ByteDance and Alibaba. Currently researching large models at a major state‑owned enterprise. Committed to sharing concise, practical AI large‑model development experience, believing that AI large models will become as essential as PCs in the future. Let's start experimenting now!
