Re‑shaping Transformers: Moving Capacity Forward Makes LLMs Smarter
A new study shows that reallocating feed-forward network capacity toward the early layers of a Transformer, without adding parameters or FLOPs, lowers perplexity by up to 1.84 points. The same technique improves performance across several modern LLM architectures.
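
To make "moving capacity forward at constant cost" concrete, here is a minimal sketch. The summary above does not say how the study redistributes width, so the linear taper, the function names, the `front_ratio` knob, and the rounding to multiples of 64 are all assumptions made for illustration. The sketch relies on one fact: in a standard two-matrix feed-forward block, parameters and FLOPs both scale linearly with the hidden width, so keeping the sum of widths fixed keeps the total budget fixed.

```python
def tapered_ffn_widths(n_layers, base_width, front_ratio=1.5, multiple=64):
    """Per-layer FFN hidden widths, widest first, with the same total as a
    uniform stack. Hypothetical linear taper, not the study's schedule."""
    if not 1.0 <= front_ratio < 2.0:
        raise ValueError("front_ratio must be in [1, 2)")
    if base_width % multiple != 0:
        raise ValueError("base_width must be a multiple of `multiple`")
    if n_layers == 1:
        return [base_width]

    total = n_layers * base_width
    # Linear taper from front_ratio * base down to (2 - front_ratio) * base,
    # so the mean width stays at base_width.
    back_ratio = 2.0 - front_ratio
    raw = [
        base_width * (front_ratio + (back_ratio - front_ratio) * i / (n_layers - 1))
        for i in range(n_layers)
    ]
    # Round each width down to a hardware-friendly multiple.
    widths = [int(w // multiple) * multiple for w in raw]
    # Hand the rounding remainder back, one multiple at a time, earliest layers first.
    leftover = total - sum(widths)
    i = 0
    while leftover >= multiple:
        widths[i % n_layers] += multiple
        leftover -= multiple
        i += 1
    return widths


def ffn_params(d_model, widths):
    """Weight count of two-matrix FFN blocks (up- and down-projection, biases ignored)."""
    return sum(2 * d_model * w for w in widths)


uniform = [4096] * 12
tapered = tapered_ffn_widths(n_layers=12, base_width=4096, front_ratio=1.5)
print(tapered)
print(ffn_params(1024, uniform) == ffn_params(1024, tapered))
```

With `front_ratio=1.0` the function returns the uniform baseline. Larger values shift more width into the early layers while leaving the overall parameter count unchanged.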
