How Large Model Training Dominates Compute and What New Techniques Can Change It

This article explains why pre‑training large AI models consumes 90‑99% of total compute, describes the full training and inference pipelines, introduces resource‑saving strategies such as PD‑separation, and reviews market trends and infrastructure challenges shaping the next generation of AI systems.

Value of Pre‑training Large Models

Pre-training large models extracts deep, general knowledge from massive, diverse corpora, dramatically improving fine-tuning efficiency and model generalisation while reducing both compute and development costs. Subsequent supervised fine-tuning, reward-model training and PPO-based reinforcement learning incorporate human preference feedback to align models with user intent, safety and dialogue quality.

Large‑Model Training Layer – Full Process Framework

The training framework first builds basic capabilities through pre-training and supervised fine-tuning, then applies reinforcement learning from human feedback (RLHF) to achieve critical value alignment.
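
To make the sequence concrete, the sketch below lays out the stages in order. The function names and signatures are illustrative placeholders, not any particular framework's API.

```python
# Minimal sketch of the training pipeline: pre-training -> SFT -> reward model -> PPO.
# All functions are illustrative stubs; real implementations live in dedicated frameworks.

def pretrain(model, corpus):
    # Stage 1: self-supervised next-token prediction over a massive unlabelled corpus.
    return model

def supervised_finetune(model, instruction_pairs):
    # Stage 2: fit curated (prompt, response) pairs to teach instruction following.
    return model

def train_reward_model(model, preference_pairs):
    # Stage 3a: learn a scalar reward from human preference comparisons.
    return model

def rlhf_ppo(policy, reward_model, prompts):
    # Stage 3b: optimise the policy with PPO against the learned reward, typically
    # with a KL penalty toward the SFT model to keep outputs stable.
    return policy

base    = pretrain("init-weights", corpus="web-scale text")   # dominates total compute
sft     = supervised_finetune(base, instruction_pairs=[])
rm      = train_reward_model(sft, preference_pairs=[])
aligned = rlhf_ppo(sft, rm, prompts=[])
```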

Resource Consumption in the Training Stage

The pre-training phase dominates resource use: it requires thousands to tens of thousands of GPUs, processes trillions of tokens over weeks to months, and accounts for 90-99% of total training compute.

Examples:

GPT‑3 used ~6,000 A100 GPUs for ~34 days of pre‑training and 8 days of fine‑tuning (total 42 days).

LLaMA-1 was trained on ~2,048 GPUs for 90 days over 1-1.4 trillion tokens.

LLaMA-2 completed its 2-trillion-token pre-training in 42 days.

LLaMA‑3 employed ~16,384 H100 GPUs for 54 days on 15 trillion tokens.
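
As a rough sanity check on figures like these, total training compute is commonly estimated as FLOPs ≈ 6 × N × D, where N is the parameter count and D the number of training tokens. The sketch below applies this to the LLaMA-3 figures above; the parameter count (405B, the largest LLaMA-3 model), the peak per-GPU throughput and the utilisation (MFU) are assumptions chosen for illustration, not reported values.

```python
# Back-of-envelope check of the LLaMA-3 training run using FLOPs ~= 6 * N * D.
# peak and mfu below are illustrative assumptions, not reported figures.

params = 405e9      # assumed: largest LLaMA-3 model (405B parameters)
tokens = 15e12      # 15 trillion training tokens
n_gpus = 16_384     # H100 GPUs
peak   = 989e12     # H100 peak dense BF16 throughput, ~989 TFLOP/s
mfu    = 0.45       # assumed model FLOPs utilisation

total_flops = 6 * params * tokens              # ~3.6e25 FLOPs
gpu_seconds = total_flops / (peak * mfu)       # total GPU-seconds of work
days        = gpu_seconds / n_gpus / 86_400    # wall-clock days across the cluster

print(f"total FLOPs ~ {total_flops:.2e}")
print(f"estimated duration ~ {days:.0f} days")  # same ballpark as the ~54 days above
```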

Large‑Model Inference Layer – Process Framework

Inference starts with tokenisation and embedding, passes the resulting vectors through multiple Transformer self-attention layers, uses a KV-cache to avoid recomputing attention over already-processed tokens, and finally produces token probabilities that are post-processed into coherent text.
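
A minimal end-to-end sketch of this pipeline, assuming the Hugging Face transformers library and a small model (gpt2) purely for illustration:

```python
# Tokenise -> forward through the Transformer (with KV-cache) -> sample -> detokenise.
from transformers import AutoModelForCausalLM, AutoTokenizer

tok   = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Embedding lookup and the self-attention layers run inside the forward pass;
# use_cache=True enables the KV-cache so each new token reuses cached keys/values.
inputs = tok("Large model inference starts with", return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=30, use_cache=True)

# Post-processing: map the generated token ids back to text.
print(tok.decode(output[0], skip_special_tokens=True))
```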

Key Inference Parameters

Post-processing techniques such as temperature sampling, Top-k/Top-p truncation and greedy selection balance the diversity, coherence and stability of the generated output.
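
A self-contained sketch of temperature scaling plus Top-p (nucleus) truncation over a raw logit vector, using only NumPy; greedy selection is the limiting case of simply taking the argmax:

```python
import numpy as np

def sample_next_token(logits: np.ndarray, temperature: float = 0.8, top_p: float = 0.9) -> int:
    # Temperature < 1 sharpens the distribution (more deterministic); > 1 flattens it.
    scaled = logits / max(temperature, 1e-6)
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()

    # Top-p: keep the smallest set of tokens whose cumulative probability reaches top_p.
    order  = np.argsort(probs)[::-1]
    cum    = np.cumsum(probs[order])
    cutoff = int(np.searchsorted(cum, top_p)) + 1
    keep   = order[:cutoff]

    keep_probs = probs[keep] / probs[keep].sum()
    return int(np.random.choice(keep, p=keep_probs))

print(sample_next_token(np.array([2.0, 1.0, 0.5, -1.0])))
```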

Core Inference Stages

Inference consists of two stages: a parallel Prefill stage that processes the entire input context at once and builds the KV-cache, and an incremental Decode stage that generates tokens one by one using the cached context.
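
The two stages can be made explicit with the transformers API: one forward pass over the whole prompt (Prefill) returns past_key_values, and a token-by-token loop (Decode) reuses that cache. The model choice below is illustrative only.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok   = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

prompt_ids = tok("The two stages of inference are", return_tensors="pt").input_ids

with torch.no_grad():
    # Prefill: process the full prompt in parallel and build the KV-cache.
    out     = model(prompt_ids, use_cache=True)
    past    = out.past_key_values
    next_id = out.logits[:, -1].argmax(dim=-1, keepdim=True)

    generated = [next_id]
    # Decode: one token at a time, attending only to the cached context.
    for _ in range(20):
        out     = model(next_id, past_key_values=past, use_cache=True)
        past    = out.past_key_values
        next_id = out.logits[:, -1].argmax(dim=-1, keepdim=True)
        generated.append(next_id)

print(tok.decode(torch.cat(generated, dim=-1)[0]))
```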

PD‑Separation Technique

PD‑separation decouples Prefill and Decode, allowing each stage to be optimally scheduled: Prefill benefits from batch merging and model parallelism for high GPU throughput, while Decode leverages KV‑cache, memory‑bandwidth optimisations and specialised pipelines to minimise latency and improve token‑per‑second throughput.
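
A toy scheduling sketch of this separation: prefill requests are merged on one worker pool, and the resulting KV-cache handles are handed to a separate decode pool through a queue. Every class and queue here is illustrative, not a real serving stack.

```python
from dataclasses import dataclass
from queue import Queue

@dataclass
class Request:
    request_id: int
    prompt: str

@dataclass
class DecodeJob:
    request_id: int
    kv_cache_handle: str   # in practice, a handle to GPU-resident (or transferred) KV blocks

prefill_queue = Queue()
decode_queue  = Queue()

def prefill_worker() -> None:
    # Throughput-oriented: merge queued prompts into one large batch, run a single
    # parallel forward pass, then publish KV-cache handles for the decode pool.
    batch = []
    while not prefill_queue.empty():
        batch.append(prefill_queue.get())
    for req in batch:
        decode_queue.put(DecodeJob(req.request_id, kv_cache_handle=f"kv-{req.request_id}"))

def decode_worker() -> None:
    # Latency-oriented: step each job one token at a time using its cached context.
    while not decode_queue.empty():
        job = decode_queue.get()
        print(f"decoding request {job.request_id} from {job.kv_cache_handle}")

prefill_queue.put(Request(1, "hello"))
prefill_queue.put(Request(2, "world"))
prefill_worker()
decode_worker()
```

In real systems the two pools run on different GPU groups, so prefill batching does not stall low-latency decoding, and the KV-cache is transferred or shared between them.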

Market Insight – Large‑Model Landscape

The global large‑model market shows “technology convergence, business divergence”: all vendors pursue native multimodal capabilities, but split into closed‑source platform ecosystems and open‑source community ecosystems.

Two commercial paths emerge: (1) platform-centric, high-value, high-stickiness MaaS (Model-as-a-Service) offerings; (2) open-source, community-driven models such as Llama, Qwen and DeepSeek that aim for broad adoption.

As applications explode, demand concentrates on cloud providers with massive infrastructure and diverse use‑cases.

Infrastructure Layer – Intelligent Computing Center Foundations

Core infrastructure includes power distribution, cooling, rack design, cabling, lightning and fire protection, all coordinated to ensure high availability and stable operation of compute equipment.

GPU Chip Power Increase

Successive GPU generations (Ampere → Hopper → Blackwell) add adjustable-precision compute and faster interconnects, delivering dramatic gains in FP16/INT8/FP8 performance, bandwidth (up to 16 TB/s) and memory capacity, while per-chip power consumption rises to 2.7 kW.

Cost Impact Factors

Intelligent‑computing‑center cost is driven by customer requirements, technical solutions, redundancy design, scale, location and equipment selection, resulting in highly customised and variable system‑level expenses.
