Quantitative Analysis of Transformer Architecture and Llama Model Performance
This engineering-focused document reviews transformer fundamentals, derives precise FLOP and memory formulas for the attention and feed-forward layers, and defines the MFU (model FLOPs utilization) performance metric. It then analyzes memory components and parallelism strategies, examines recent architecture variants (MQA, GQA, sliding-window attention, and MoE), and provides practice problems that apply these calculations.
