Why Fixed‑Point vs Floating‑Point Matters: Inside the New FP8 Formats
This article explains how integers and decimals are stored using fixed‑point and floating‑point representations, introduces the 8‑bit FP8 formats (E4M3 and E5M2) used in modern AI hardware, and traces the evolution from classic FP32/FP64 to the latest ultra‑compact numeric types.
1. Fixed‑Point Numbers
To store the integer 66, we convert it to binary: 0100 0010 . An 8‑bit binary integer is called INT8 .
For a decimal like 13.25, we convert the integer part and fractional part separately, yielding 1101.0100 . Fixed‑point representation fixes the decimal point position, e.g., the first four bits for the integer and the last four bits for the fraction.
This approach is simple but inflexible: placing the decimal point too far left limits the range, while placing it too far right reduces precision.
2. Floating‑Point Numbers
Floating‑point uses scientific notation, allowing the decimal point to “float.” For example, the number 235 can be expressed with a three‑digit mantissa and a one‑digit exponent (e.g., 2.35 × 10⁻³). Adjusting the exponent changes precision or range: a smaller exponent yields higher precision, a larger exponent expands the representable range.
The floating‑point format consists of a sign bit, exponent bits, and mantissa bits, enabling both large ranges and high relative precision.
3. FP8
FP8 is an 8‑bit floating‑point format with 1 sign bit, 4 exponent bits, and 3 mantissa bits, called Float Point 8 (FP8) . The specific allocation of exponent and mantissa bits defines different FP8 variants, such as E4M3 (4‑bit exponent, 3‑bit mantissa) and E5M2 (5‑bit exponent, 2‑bit mantissa). These formats are supported and hardware‑optimized by NVIDIA GPUs.
A more extreme variant removes the mantissa and sign bits, using an 8‑bit biased exponent only, enabling fast shift‑based calculations for speed‑critical scenarios.
4. Historical Background
The most common floating‑point formats are FP32 (single‑precision) and FP64 (double‑precision), with layouts 1‑8‑23 and 1‑11‑52 bits respectively. As deep learning grew, 32‑bit floats became too large, leading to FP16 and BF16 (16‑bit variants). The push for larger models drove the development of FP8 formats, including the E4M3 and E5M2 variants.
These formats are used for storing weights, gradients, or scaling factors in large‑scale AI models, offering a trade‑off between precision and memory efficiency.
5. Final Thoughts
The author reflects that current aggressive optimizations for large models resemble early computer days when memory was scarce and programmers squeezed every bit of performance, suggesting that today's meticulous numeric tricks may become a historical footnote.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
Java Tech Enthusiast
Sharing computer programming language knowledge, focusing on Java fundamentals, data structures, related tools, Spring Cloud, IntelliJ IDEA... Book giveaways, red‑packet rewards and other perks await!
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
