Baobao Algorithm Notes
Aug 4, 2025 · Artificial Intelligence
Why GPT‑OSS Chooses a 64‑Dimensional Attention Head and 2880 Hidden Size
This article analyzes the surprising design choices of the rumored GPT‑OSS 120B model, explaining the rationale behind its 64‑dimensional attention heads, why the hidden and intermediate sizes are equal, and other quirky choices such as MLP bias and KV‑sink SWA, backed by theoretical formulas and empirical benchmarks.
Attention Head · GPT-OSS · MLP Ratio
