Baobao Algorithm Notes
Aug 4, 2025 · Artificial Intelligence

Why GPT‑OSS Chooses a 64‑Dimensional Attention Head and 2880 Hidden Size

This article analyzes the surprising design choices of the rumored GPT‑OSS 120B model, explaining the rationale behind its 64‑dimensional attention heads, its equal hidden and intermediate sizes, and other unusual parameters such as the MLP bias and KV‑sink sliding‑window attention (SWA), backed by theoretical formulas and empirical benchmarks.

Attention Head · GPT-OSS · MLP Ratio