Why GPT‑OSS Chooses a 64‑Dimensional Attention Head and 2880 Hidden Size
This article analyzes the surprising design choices of the rumored GPT‑OSS 120B model, explaining the rationale behind a 64‑dimensional attention head, the equal hidden and intermediate sizes, and other quirky parameters such as MLP bias and KV‑sink SWA, backed by theoretical formulas and empirical benchmarks.
Origin
A leaked architecture diagram of a rumored 120‑billion‑parameter model (GPT‑OSS) prompted an analysis of several unconventional hyper‑parameters.
1. Attention head dimension (head_dim = 64)
Theoretical discussions propose a lower bound on the head dimension: n > 8.33·log₂N, where n is the head dimension and N the sequence length. For a typical LLM sequence length of 4096, the bound is 8.33 × 12 ≈ 100, which explains why many models round up to a head dimension of 128.
GPT‑OSS adopts sliding‑window attention (SWA) with window_size = 128. Because each local window attends over only 128 tokens, the bound drops to 8.33 × 7 ≈ 58, which rounds up to the hardware‑friendly 64. This justifies the smaller head dimension and frees the same parameter budget for more attention heads, increasing parallelism.
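A quick back‑of‑the‑envelope check of the bound (a minimal sketch; the two sequence lengths are the cases discussed above):

import math

def min_head_dim(seq_len):
    # Lower bound n > 8.33 * log2(N) from the entropy argument above
    return 8.33 * math.log2(seq_len)

for N in (4096, 128):
    print(f"N={N}: head_dim > {min_head_dim(N):.1f}")
# N=4096: head_dim > 100.0 -> rounded up to 128 in many models
# N=128:  head_dim > 58.3  -> rounded up to 64 in GPT-OSS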
2. Hidden size and intermediate size (MLP ratio)
Transformer FFNs normally expand the hidden dimension before projecting back down. A wider intermediate layer improves expressive power and reduces the probability of rank deficiency: a ReLU‑like activation zeroes each intermediate unit roughly half the time, so if the intermediate width is too close to the hidden width, the surviving active units can span fewer dimensions than the hidden size.
Empirical work suggests an MLP ratio of about 8/3 ≈ 2.67. For a hidden size of 4096, this yields an intermediate size near 10,922. The following Python snippets illustrate the rank‑deficiency analysis and a TFLOPS benchmark on an A100 GPU.
# Rank‑deficiency probability (illustrative)
import math

def rank_dec_ratio(n, m):
    # Probability that fewer than m of the n ReLU units are active,
    # i.e. that the post-activation output cannot reach rank m.
    # math.comb keeps the arithmetic in exact integers (no float overflow).
    s = 0
    for i in range(m, n + 1):
        s += math.comb(n, i)
    return 1 - s / (2 ** n)

for m in range(1, 300):
    print(m, rank_dec_ratio(2 * m, m), rank_dec_ratio(3 * m, m), rank_dec_ratio(4 * m, m))

The second snippet benchmarks batched matrix‑multiplication throughput on an A100 for intermediate sizes around the 8/3 baseline:

import torch
from tqdm import trange
# Model dimensions
d_hidden = 4096
# Base intermediate size using 8/3 ratio
d_ff_base = int(8/3 * d_hidden)
batch_size = 4 # 2**2
num_iterations = 100
distance = 100
def benchmark_bmm(b, m, n, k, num_iterations=100, num_matmuls=1):
    A = torch.randn((b, m, n)).half().to("cuda:0")
    B = torch.randn((b, n, k)).half().to("cuda:0")
    C = torch.empty((b, m, k)).half().to("cuda:0")
    warmup = 50
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    for i in range(warmup + num_iterations):
        if i == warmup:
            start.record()
        with torch.no_grad():
            for _ in range(num_matmuls):
                torch.bmm(A, B, out=C)
    end.record()
    torch.cuda.synchronize()
    elapsed = start.elapsed_time(end) / (1000 * num_iterations)  # seconds per timed iteration
    flops = (2 * b * m * n * k * num_matmuls) / (elapsed * 1e12)  # TFLOPS
    return flops
print(f"Searching d_ff around {d_ff_base} ± {distance}")
results = {}
for delta in trange(-distance, distance, 4):
d_ff = d_ff_base + delta
d_ff -= d_ff % 4 # ensure multiple of 4
results[d_ff] = benchmark_bmm(batch_size, m=d_hidden, n=d_ff, k=d_hidden, num_iterations=num_iterations)
baseline = benchmark_bmm(batch_size, m=d_hidden, n=d_ff_base, k=d_hidden, num_iterations=num_iterations)
print("Baseline TFLOPS:", baseline)
for d_ff, tf in sorted(results.items(), key=lambda x: -x[1])[:5]:
print(f"d_ff={d_ff}, TFLOPS={tf:.2f}, MLP params={3*d_ff*d_hidden}")The benchmark confirms that an intermediate size around 10,922 (ratio ≈ 2.67) yields the highest TFLOPS for a hidden size of 4096. GPT‑OSS actually uses a Mixture‑of‑Experts (MoE) design with 4 experts per token, each expert having an FFN dimension of 2088, resulting in an effective MLP ratio of 4.
3. Additional observations
MLP bias: GPT‑OSS retains bias terms in its MLP layers, a detail that has become uncommon in recent large‑scale models.
KV‑sink sliding‑window attention: The model applies SWA with window_size = 128 across 36 layers, so the stacked receptive field reaches 128 × 36 = 4608 tokens, larger than the base sequence length of 4096. In addition, 4 KV‑sink tokens are kept visible to every query, similar in spirit to KV‑shifting techniques; a mask sketch follows this list.
FP4 precision: The architecture reportedly stores weights in the FP4 numeric format; a toy quantizer follows the mask sketch below.
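To make the sliding‑window and sink behavior concrete, here is a minimal sketch of the attention mask (illustrative only; swa_mask is a hypothetical helper, and GPT‑OSS's exact masking may differ):

import torch

def swa_mask(seq_len, window=128, num_sinks=4):
    # True = attendable. Causal sliding window plus always-visible
    # sink tokens prepended in front of the sequence.
    total = num_sinks + seq_len
    i = torch.arange(total).unsqueeze(1)  # query positions
    j = torch.arange(total).unsqueeze(0)  # key positions
    causal = j <= i
    in_window = (i - j) < window
    sink = j < num_sinks
    return causal & (in_window | sink)

print(swa_mask(seq_len=16, window=4, num_sinks=2).int())
# Stacking 36 such layers grows the receptive field to roughly 128 * 36 = 4608 tokens.

As for FP4: the E2M1 format represents only 16 values (sign × {0, 0.5, 1, 1.5, 2, 3, 4, 6}). A toy round‑to‑nearest quantizer (illustrative; production FP4 schemes pair this with per‑block scales, e.g. MXFP4):

import torch

# Representable FP4 (E2M1) values; -0.0 omitted since it equals 0.0
FP4_VALUES = torch.tensor([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0,
                           -0.5, -1.0, -1.5, -2.0, -3.0, -4.0, -6.0])

def quantize_fp4(x):
    # Round each element to the nearest representable FP4 value
    dists = (x.unsqueeze(-1) - FP4_VALUES).abs()
    return FP4_VALUES[dists.argmin(dim=-1)]

w = torch.randn(6) * 3
print(w)
print(quantize_fp4(w))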
Conclusion
The seemingly unconventional hyper‑parameters of GPT‑OSS are grounded in entropy‑based dimension bounds and hardware‑aware design choices such as windowed attention and an optimized MLP ratio. These choices, while counter‑intuitive at first glance, can be justified by the model’s internal architecture and performance trade‑offs.
References
[1] https://zhuanlan.zhihu.com/p/1915601328211759191
[2] https://kexue.fm/archives/8711
[3] https://kexue.fm/archives/10907
[4] https://www.zhihu.com/question/665731716/answer/1888209852712600269
[5] https://transformer-circuits.pub/2022/toy_model/index.html
[6] https://transformer-circuits.pub/2021/framework/index.html
[7] https://arxiv.org/abs/2401.14489
[8] https://x.com/YouJiacheng
[9] https://arxiv.org/abs/2411.19574
[10] https://zhuanlan.zhihu.com/p/1934745342089335268
Baobao Algorithm Notes
Author of the BaiMian large model, offering technology and industry insights.