Why FP32 Remains the Benchmark for Measuring AI Compute Power
This article explains scientific notation, the IEEE‑754 floating‑point standard, the structure of FP32 and FP64 numbers, and how computational power is measured in FLOPS. It then walks through FP32 peak‑performance calculations for a CPU and a GPU and explains why FP32 is the common benchmark for AI workloads.
Scientific Notation
Scientific notation is designed to represent very large or very small numbers, using the form a × 10ⁿ where a is in the range [1,10). For example, 19971400000000 can be written as 1.99714 × 10¹³ and 0.00001 as 1 × 10⁻⁵.
It is widely applied in science, engineering, and mathematics to simplify the writing and calculation of extreme values.
Simplified representation: large or small numbers can be expressed concisely, avoiding long runs of zeros.
Convenient calculation: scientific notation simplifies arithmetic, especially for extreme values.
Easy comparison: comparing exponents quickly reveals magnitude differences.
In computer science the notation often uses E or e, e.g., 1.2E4 or 5e-3.
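For instance, Python accepts this E notation directly in float literals and format specifiers; a quick sketch:

# E notation in Python: float literals and scientific formatting.
print(1.2E4)                    # 12000.0, i.e., 1.2 × 10^4
print(5e-3)                     # 0.005,   i.e., 5 × 10^-3
print(f"{19971400000000:e}")    # 1.997140e+13, printed in scientific notation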
Floating‑Point Numbers
Floating‑point numbers are decimals whose point can “float”, based on scientific notation. For example, the decimal 1.234 can be written in several equivalent ways.
1.234 = 1.234 * 10^0
1.234 = 12.34 * 10^-1
1.234 = 123.4 * 10^-2
...
IEEE 754 Floating‑Point Standard
The IEEE 754 standard (IEEE Standard for Floating‑Point Arithmetic) defines how floating‑point numbers are represented, stored, and computed.
Reference: https://zh.wikipedia.org/wiki/IEEE_754
Floating‑point format: consists of four parts.
Sign bit (S): 0 for positive, 1 for negative.
Mantissa (M): the significant digits.
Radix (R): the base of the numeral system, 2 for binary.
Exponent (E): the power of the radix.
The complete representation of a value V is: V = (-1)^S × M × R^E.
Floating‑point types: two main categories.
Single‑precision (FP32, 32‑bit): 1 sign bit, 8 exponent bits, 23 mantissa bits.
Double‑precision (FP64, 64‑bit): 1 sign bit, 11 exponent bits, 52 mantissa bits.
To store exponent E as an unsigned integer, IEEE 754 introduces a bias. For FP32 the bias is 127; for FP64 it is 1023.
FP32 example: actual E = –1 → stored E = 127 – 1 = 126; actual E = 1 → stored E = 127 + 1 = 128.
FP64 example: bias = 1023; actual E = –1 → stored E = 1023 – 1 = 1022.
Using bias allows the exponent to be stored as an unsigned integer, simplifying exponent comparisons.
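A minimal Python sketch of this bias rule (127 and 1023 are the FP32 and FP64 biases given above):

# Stored exponent = actual exponent + bias (FP32 bias 127, FP64 bias 1023).
FP32_BIAS, FP64_BIAS = 127, 1023
for actual_e in (-1, +1, -3):
    print(f"actual E = {actual_e:+d} -> FP32 stored E = {actual_e + FP32_BIAS}, "
          f"FP64 stored E = {actual_e + FP64_BIAS}")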
Converting Decimal to FP32/FP64
Example: converting decimal 0.125 to IEEE 754 FP32 binary.
Convert 0.125 to binary: 0.001.
1) Integer part (0): repeated division by 2 leaves 0.
2) Fraction part: repeatedly multiply by 2 and record the integer digit (a short code sketch follows these steps):
- 0.125×2 = 0.25 → integer 0
- 0.25×2 = 0.5 → integer 0
- 0.5×2 = 1.0 → integer 1
Scientific notation: 1.0 × 2⁻³.
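The multiply‑by‑2 procedure can be written as a small Python helper (a sketch; the function name is illustrative):

# Convert a decimal fraction in [0, 1) to binary by repeatedly multiplying by 2
# and taking the integer digit, as in the steps above.
def fraction_to_binary(x, max_bits=23):
    bits = []
    while x > 0 and len(bits) < max_bits:
        x *= 2
        if x >= 1:
            bits.append("1")
            x -= 1
        else:
            bits.append("0")
    return "0." + "".join(bits)

print(fraction_to_binary(0.125))  # 0.001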
FP32 storage:
- S: 0 (positive)
- E: exponent bias 127, actual E = –3 → stored E = 127 – 3 = 124 = 01111100₂
- M: mantissa 1.0 → implicit leading 1 omitted, remaining 23 bits are zeros
Final binary representation of 0.125 in FP32: 0 01111100 00000000000000000000000.
For FP64 (bias 1023, stored E = 1023 – 3 = 1020 = 01111111100₂), the binary representation is 0 01111111100 0000000000000000000000000000000000000000000000000000 (52 zero mantissa bits).
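These bit patterns can be checked with Python's struct module, which exposes the raw IEEE‑754 encoding (a small sketch; the helper name is illustrative):

import struct

# Print the sign, exponent, and mantissa fields of 0.125 in FP32 and FP64.
def ieee_bits(value, fmt, nbits):
    raw = struct.pack(">" + fmt, value)                 # big-endian bytes
    return format(int.from_bytes(raw, "big"), f"0{nbits}b")

fp32 = ieee_bits(0.125, "f", 32)
fp64 = ieee_bits(0.125, "d", 64)
print(fp32[0], fp32[1:9], fp32[9:])    # 0 01111100 00000000000000000000000
print(fp64[0], fp64[1:12], fp64[12:])  # 0 01111111100 followed by 52 zeros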
Meaning of Computational Power
Computational power measures how many operations a system (CPU, GPU, TPU, NPU, etc.) can perform per second; for floating‑point workloads it is expressed in FLOPS.
Units: e.g., Peta‑FLOPS (10¹⁵ operations per second).
Types:
OPS (Operations Per Second): integer operations such as INT8, INT16.
FLOPS (Floating‑point Operations Per Second): floating‑point operations such as FP32, FP64.
Levels :
Kilo: 10³
Mega: 10⁶
Giga: 10⁹
Tera: 10¹²
Peta: 10¹⁵
Exa: 10¹⁸
Why FP32 Is Commonly Used to Measure Compute Power
Different numeric formats have distinct precision, performance, and application scenarios; the short sketch after the list below shows how π is stored at each floating‑point precision.
INT8: integer precision, ultra‑low power, high throughput; suited for inference, edge devices, image classification.
Advantage: very low power, high throughput.
Limitation: significant precision loss.
Use cases: inference, edge computing.
INT16: larger range than INT8, moderate precision; used in sensor processing, mixed‑integer workloads.
Advantage: broader range.
Limitation: fewer use cases.
FP16/BF16: reduced precision (π rounds to roughly 3.14), boosts speed and reduces memory; requires modern GPUs (e.g., A100).
Advantage: higher speed, lower memory.
Limitation: lower precision, hardware dependent.
Use cases: large‑model training, lightweight inference, mixed‑precision training.
FP32: standard precision, balances accuracy and speed (π stored as about 3.141593); widely compatible.
Advantage: broad compatibility.
Limitation: higher memory and compute cost than lower‑precision formats.
Use cases: deep‑learning training, graphics rendering, general scientific computing.
FP64: very high precision (π stored as 3.141592653589793); slower, with higher power and cost; used for high‑precision scientific, financial, and nuclear simulations.
Advantage: highest numeric precision.
Limitation: lower FLOPS, higher power and cost.
Use cases: high‑precision scientific calculations.
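The precision ladder above can be seen by storing π at each floating‑point width; a minimal sketch, assuming NumPy is available:

import numpy as np

# How π survives at different floating-point precisions.
pi = 3.14159265358979323846
for dtype in (np.float16, np.float32, np.float64):
    width = np.dtype(dtype).itemsize * 8
    print(f"{dtype.__name__} ({width}-bit): {dtype(pi)}")
# FP16 keeps roughly 3.14, FP32 roughly 3.1415927, FP64 3.141592653589793.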
AI training often uses higher precision (FP32 or FP16), while inference may use lower precision (INT8). After training, models can be quantized to lower precision without significant accuracy loss, improving inference speed.
Modern GPUs also include Tensor Cores and support mixed‑precision training, combining FP16 and FP32 to retain accuracy while reducing memory and increasing speed.
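As an illustration, one mixed‑precision training step might look like the following PyTorch‑style sketch (a sketch only, assuming torch is installed and a CUDA GPU is present; it falls back to plain FP32 on CPU):

import torch

# One mixed-precision step: forward pass largely in FP16, master weights kept in FP32.
use_amp = torch.cuda.is_available()
device = "cuda" if use_amp else "cpu"

model = torch.nn.Linear(1024, 1024).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler(enabled=use_amp)   # rescales FP16 gradients

x = torch.randn(32, 1024, device=device)
target = torch.randn(32, 1024, device=device)

amp_dtype = torch.float16 if use_amp else torch.bfloat16
with torch.autocast(device_type=device, dtype=amp_dtype, enabled=use_amp):
    loss = torch.nn.functional.mse_loss(model(x), target)   # matmuls can use Tensor Cores

scaler.scale(loss).backward()   # scale the loss so small FP16 gradients do not underflow
scaler.step(optimizer)          # unscale, then update the FP32 weights
scaler.update()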
Computational Power Calculation
CPU FP32
The theoretical FP32 performance of a single CPU equals: CPU cores × core frequency × FP32 operations per cycle.
CPU cores: number of physical cores.
Core frequency: clock rate in GHz (1 GHz = 10⁹ cycles/s).
FP32 ops per cycle: floating‑point operations each core completes per clock cycle.
Example: Intel Xeon Platinum 8280 has 28 cores, 2.7 GHz, AVX‑512 vector width 512 bits.
FP32 ops per cycle = 2 FMA × 2 MA × (512 / 32) = 64 FLOPS/Cycle.
2 FMA: two fused multiply‑add (FMA) instructions issued per cycle, one per AVX‑512 FMA unit.
2 MA: each FMA counts as two floating‑point operations (one multiply plus one add).
512 / 32 = 16 FP32 values processed per vector.
Peak FP32 per core = 64 × 2.7 × 10⁹ = 172.8 × 10⁹ FLOPS.
Total CPU peak = 28 × 172.8 × 10⁹ = 4.8384 × 10¹² FLOPS ≈ 4.8 Tera‑FLOPS.
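The same arithmetic as a few lines of Python (a sketch of the formula above, using the published core count and frequency):

# Peak FP32 estimate for the Xeon Platinum 8280.
cores = 28
freq_hz = 2.7e9                  # 2.7 GHz
fma_units = 2                    # two AVX-512 FMA units per core
flops_per_fma = 2                # one multiply + one add
lanes = 512 // 32                # 16 FP32 values per 512-bit vector

flops_per_cycle = fma_units * flops_per_fma * lanes      # 64
peak = cores * freq_hz * flops_per_cycle
print(f"{peak / 1e12:.4f} TFLOPS")                       # ≈ 4.8384 TFLOPS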
GPU FP32
Example: NVIDIA V100 GPU with 5120 CUDA cores, 1.53 GHz, FP32 FFMA instructions, 32‑bit vector width.
FP32 ops per cycle = 1 FMA × 2 MA × (32 / 32) = 2 FLOPS/Cycle.
Peak FP32 per CUDA core = 2 × 1.53 × 10⁹ = 3.06 × 10⁹ FLOPS.
Total GPU peak = 5120 × 3.06 × 10⁹ = 15.6672 × 10¹² FLOPS ≈ 15.7 Tera‑FLOPS.
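And the corresponding sketch for the V100:

# Peak FP32 estimate for the NVIDIA V100.
cuda_cores = 5120
freq_hz = 1.53e9                           # 1.53 GHz boost clock
flops_per_cycle = 1 * 2 * (32 // 32)       # one FFMA per core per cycle, 2 FLOPs each
peak = cuda_cores * freq_hz * flops_per_cycle
print(f"{peak / 1e12:.4f} TFLOPS")         # ≈ 15.6672 TFLOPS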
The Intel Xeon Platinum 8280 CPU delivers about 4.8 Tera‑FLOPS, while the NVIDIA V100 GPU delivers about 15.7 Tera‑FLOPS, reflecting their different application domains.