Re‑shaping Transformers: Moving Capacity Forward Makes LLMs Smarter

A new study shows that reallocating the feed‑forward network capacity toward the early layers of a Transformer—without adding parameters or FLOPs—lowers perplexity by up to 1.84 points, and the same technique improves performance across several modern LLM architectures.

Machine Heart

Jun 29, 2026

Background

Since the 2017 "Attention Is All You Need" paper, most language models have used a uniform layer design: every layer receives the same number of parameters, notably in the feed‑forward network (FFN) that stores and processes information. Recent research from Mila, Cornell, and the University of Montreal questions this "one‑size‑fits‑all" allocation.

Core Question

If the total number of parameters stays constant but their distribution across layers is changed, what happens to model performance?

Initial Experiment

The authors took a 440 M‑parameter Transformer and divided its layers into early, middle, and late groups. They widened the FFN in one group while narrowing the others, keeping the overall parameter count unchanged.

Results: concentrating capacity in the early layers reduced validation perplexity from 16.28 to 15.96, whereas moving capacity to the later layers increased perplexity to 17.29.
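
The reallocation described above can be sketched in a few lines. This is an illustration only: the 24-layer depth, the base FFN width of 4096, and the 1.5× boost are assumptions chosen for the example (the article does not state them), and `reallocate_ffn` is a hypothetical helper, not the authors' code. It relies on FFN parameters growing linearly with FFN width, so keeping the summed width constant keeps the FFN parameter count constant.

```python
def reallocate_ffn(n_layers, base_width, favored, boost):
    """Widen the FFN in one third of the layers and narrow the other
    two thirds so that the summed width (and hence the FFN parameter
    count) stays the same. All numbers used below are illustrative."""
    if n_layers % 3 != 0:
        raise ValueError("this sketch assumes three equal-sized groups")
    group_size = n_layers // 3
    # The favored group gets `boost` x width; the other two groups share
    # the remaining budget equally: boost + 2 * shrink = 3.
    shrink = (3.0 - boost) / 2.0
    widths = []
    for group in ("early", "middle", "late"):
        factor = boost if group == favored else shrink
        widths.extend([round(base_width * factor)] * group_size)
    return widths


# Hypothetical 24-layer model with a uniform FFN width of 4096.
early_heavy = reallocate_ffn(24, 4096, "early", 1.5)
late_heavy = reallocate_ffn(24, 4096, "late", 1.5)
```

In both variants the widths sum to 24 × 4096, so only the placement of the capacity differs between the two runs being compared.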

Introducing Tapered Language Models (TLMs)

Motivated by the above, the researchers defined TLMs: choose a dimension that determines parameter count (e.g., FFN width) and make it monotonically decrease with depth, while preserving the average width.

This transforms a rectangular parameter profile into a wedge‑shaped one, without changing total parameters or compute.
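
The constraint can be written down compactly. The notation here is assumed for illustration and does not come from the article: L layers, model width d, FFN width w_l in layer l, and uniform baseline width \bar{w}. The parameter count assumes a plain two-matrix FFN with biases ignored.

```latex
% Assumed notation: L layers, model width d, per-layer FFN width w_l,
% uniform baseline width \bar{w}. Requires amsmath.
\begin{gather}
  w_1 \ge w_2 \ge \dots \ge w_L,
  \qquad \frac{1}{L}\sum_{l=1}^{L} w_l = \bar{w}, \\
  P_{\mathrm{FFN}} = \sum_{l=1}^{L} 2\,d\,w_l = 2\,d\,L\,\bar{w}.
\end{gather}
```

Because the FFN parameter total depends only on the sum of the widths, any monotone profile with the same mean has the same budget as the uniform model. The FFN matrix-multiply cost per token is likewise linear in w_l, which is why compute is unchanged as well.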

Decay Curves Tested

Linear decay – uniform reduction.

Cosine decay – smooth transition with gentle ends.

Sigmoid (S‑shaped) decay – rapid reduction in the middle.

The three curves can be likened to different ways of closing a market stall: linear is steady, sigmoid is abrupt, and cosine is a balanced taper.
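
The three shapes can be compared side by side with a small sketch. The article does not give the exact formulas, so the parameterizations below are assumptions: relative depth t runs from 0 (first layer) to 1 (last layer), the sigmoid steepness of 10 is arbitrary, and `decay_shape` and `width_factors` are hypothetical names.

```python
import math


def decay_shape(t, kind, steepness=10.0):
    """Map relative depth t in [0, 1] to a value that falls from 1 to 0."""
    if kind == "linear":
        return 1.0 - t
    if kind == "cosine":
        return 0.5 * (1.0 + math.cos(math.pi * t))
    if kind == "sigmoid":
        def sig(x):
            return 1.0 / (1.0 + math.exp(-x))
        # Rescale so the curve hits exactly 1 at t=0 and 0 at t=1.
        top, bottom = sig(steepness / 2), sig(-steepness / 2)
        return (sig(steepness * (0.5 - t)) - bottom) / (top - bottom)
    raise ValueError(f"unknown decay kind: {kind}")


def width_factors(n_layers, first, last, kind):
    """Per-layer width multipliers running from `first` to `last`,
    rescaled so they average 1 (i.e. the mean width is preserved).
    Requires n_layers >= 2."""
    raw = [
        last + (first - last) * decay_shape(i / (n_layers - 1), kind)
        for i in range(n_layers)
    ]
    mean = sum(raw) / n_layers
    return [r / mean for r in raw]


for kind in ("linear", "cosine", "sigmoid"):
    factors = width_factors(12, 1.5, 0.5, kind)
    print(kind, [round(f, 2) for f in factors])
```

All three shapes are symmetric about the middle layer, so with endpoints of 1.5 and 0.5 the rescaling step changes almost nothing; it matters for endpoint pairs that do not average to 1.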

Comprehensive Scan and Best Configuration

In a scan of five width ratios across three decay types, cosine decay with a 1.5× early‑layer width and a 0.5× late‑layer width achieved the lowest perplexity, 14.44, a 1.84‑point improvement over the uniform baseline of 16.28.
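
To make the 1.5× to 0.5× cosine taper concrete, the sketch below computes per-layer FFN widths and checks the parameter budget. The 24 layers, model width of 1024, base FFN width of 4096, and rounding to multiples of 64 are all assumptions for illustration, as is the two-matrix FFN parameter formula (gated FFNs use three matrices); none of these details are given in the article.

```python
import math


def cosine_taper_widths(n_layers, base_width, first=1.5, last=0.5, multiple=64):
    """FFN width per layer under a cosine taper from `first` x to `last` x
    the base width, rounded to a hardware-friendly multiple."""
    mid = (first + last) / 2
    amp = (first - last) / 2
    widths = []
    for i in range(n_layers):
        factor = mid + amp * math.cos(math.pi * i / (n_layers - 1))
        widths.append(multiple * round(base_width * factor / multiple))
    return widths


def ffn_params(d_model, widths):
    """Parameter count of plain two-matrix FFNs (up and down projection),
    biases ignored."""
    return sum(2 * d_model * w for w in widths)


# Hypothetical model: 24 layers, d_model = 1024, uniform FFN width 4096.
widths = cosine_taper_widths(24, 4096)
uniform = ffn_params(1024, [4096] * 24)
tapered = ffn_params(1024, widths)
print(widths[0], widths[-1])          # first layer 6144, last layer 2048
print(f"budget drift: {abs(tapered - uniform) / uniform:.4%}")
```

Rounding can shift individual layers by a few units, so the check reports relative drift rather than insisting on an exact match.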

Cross‑Architecture Validation

The same cosine‑decay configuration was applied unchanged to three other architectures: gated‑attention models, Hope‑attention (self‑modifying memory), and Titans (neural long‑term memory), at 760 M and 1.3 B scales. In all eight experiments, TLM‑modified models improved average accuracy on commonsense reasoning benchmarks and reduced perplexity on LAMBADA.

Additional long‑context retrieval tests (Needle‑in‑a‑Haystack) confirmed that the reallocation does not harm the ability to handle long contexts.

Analysis of Why Front‑Layer Capacity Helps

Measurements on GPT‑2‑style models showed that deeper layers tend to produce outputs more similar to existing information, indicating they mainly repeat earlier judgments rather than create new understanding. Hence, giving extra capacity to early layers, which are responsible for introducing novel information, is more effective.
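
One way such a measurement could be set up is shown below. The article does not specify the metric, so this is an assumed stand-in rather than the authors' procedure: it takes the cosine similarity between what a layer receives (the incoming residual stream) and what the layer adds to it. `update_redundancy` is a hypothetical name, and the vectors are toy values.

```python
import math


def update_redundancy(residual_in, layer_update):
    """Cosine similarity between a layer's input and the update it writes.
    Near 1: the update mostly restates what is already in the stream.
    Near 0: the update points in a new direction."""
    dot = sum(a * b for a, b in zip(residual_in, layer_update))
    norm_in = math.sqrt(sum(a * a for a in residual_in))
    norm_up = math.sqrt(sum(b * b for b in layer_update))
    return dot / (norm_in * norm_up)


# Toy vectors: an update orthogonal to the input versus one parallel to it.
novel = update_redundancy([1.0, 0.0], [0.0, 1.0])
redundant = update_redundancy([1.0, 2.0], [2.0, 4.0])
```

Under this assumed metric, the reported pattern would show up as scores that rise with depth when averaged over many tokens.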

Conclusion

The study demonstrates a simple, zero‑cost way to boost LLM performance: reshape the parameter distribution from a uniform rectangle to a tapered wedge, moving "brain capacity" forward. While the optimal taper may vary with model size and architecture, the principle appears broadly applicable, even to vision Transformers and diffusion models.
