
Simplifying Transformer Blocks: Removing Residual Connections, LayerNorm, and Other Components without Losing Performance

A recent ETH Zurich paper shows that the standard Transformer block can be drastically simplified by removing residual connections, LayerNorm, the projection and value parameters, and even components of the MLP sub-block, achieving up to 16% fewer parameters while matching the training speed and downstream performance of standard blocks in both GPT-style decoders and BERT models.

The Transformer architecture underpins many recent breakthroughs in deep learning, typically built by stacking identical, complex blocks composed of attention, normalization, projection, and MLP sub‑components. Since its 2017 introduction, few works have altered the internal structure of these blocks.

In a new paper from ETH Zurich, researchers investigate whether the standard Transformer block can be simplified without harming convergence or downstream task performance. Using signal propagation theory and extensive empirical evidence, they demonstrate that components such as residual connections, LayerNorm, projection/value matrices, and serialized MLP sub‑blocks can be removed, yielding a leaner decoder architecture similar to GPT and an encoder‑style BERT model.

For each component, the authors examine whether its removal affects training speed (both per‑step updates and overall runtime) and what architectural modifications are required to compensate.

Why Simplify the Transformer Block?

Modern neural network designs are intricate, and the role each component plays in training dynamics remains unclear. Signal propagation theory, which studies how geometric information about the inputs evolves through a network's layers at initialization, has guided many design choices, but it does not capture the full training dynamics, such as the optimization benefits of residual connections.

Practically, reducing the cost of training and deploying large Transformers can lead to substantial savings. By eliminating non‑essential components, parameter counts drop and throughput improves. The authors report a 16% reduction in parameters and a 16% increase in training and inference throughput while matching the performance of the standard block.

How to Simplify the Transformer Block?

Starting from a Pre‑LayerNorm (Pre‑LN) block, the authors progressively remove components while preserving training speed. Experiments are conducted on an 18‑block, 768‑width causal decoder trained on the CodeParrot dataset, a setting small enough to isolate effects on training speed.
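As a point of reference, here is a minimal single-head sketch of the Pre‑LN block being simplified, in plain NumPy. The single-head restriction, the parameter-free LayerNorm, and all names are my simplifications for illustration, not the paper's actual setup:

```python
import numpy as np

def layernorm(x, eps=1e-5):
    # Normalize each token vector to zero mean, unit variance (no learned scale/shift).
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def attention(x, Wq, Wk, Wv, Wp):
    # Single-head causal self-attention with value (Wv) and projection (Wp) matrices,
    # the two matrices the paper later fixes to identity.
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    mask = np.triu(np.ones(scores.shape, dtype=bool), 1)  # mask future positions
    scores = np.where(mask, -1e9, scores)
    return softmax(scores) @ v @ Wp

def pre_ln_block(x, params):
    # Pre-LN ordering: LayerNorm precedes each sub-block, with residual
    # connections around both the attention and the MLP sub-blocks.
    x = x + attention(layernorm(x), *params["attn"])
    h = layernorm(x)
    x = x + np.maximum(h @ params["W1"], 0.0) @ params["W2"]  # ReLU MLP
    return x
```

Each simplification below removes or freezes one of the pieces in this sketch: the two residual additions, the two `layernorm` calls, and the `Wv`/`Wp` matrices.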

Removing the Residual Connection: Setting the attention sub-block's residual scaling factor to zero initially causes rank collapse, but with careful initialization the model remains trainable.

Removing Projection/Value Parameters: By fixing the value and projection matrices to identity (β_V = β_P = 0) and using appropriate initialization, the model achieves comparable performance with minimal speed loss.

Removing the MLP Sub-block Residual: This is more challenging; without this residual, training speed degrades significantly under the Adam optimizer. The authors therefore retain standard activations (e.g., ReLU) and initialization for the MLP.

Removing LayerNorm: By explicitly scaling the residual branch or biasing the attention matrix toward identity, the benefits of LayerNorm can be replicated without the layer.
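One way to realize the identity bias mentioned above is to mix the attention map with an explicit identity term, in the spirit of the shaped-attention idea the paper builds on. A hedged sketch, where the `alpha`/`beta` coefficients are purely illustrative defaults rather than the paper's parameterization or trained values:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def shaped_attention(x, Wq, Wk, alpha=1.0, beta=0.2):
    # Bias the sub-block toward the identity map:
    #   out = (alpha * I + beta * A) @ x
    # With the value/projection matrices already fixed to identity, this
    # identity-dominated mixture plays the stabilizing role at initialization
    # that LayerNorm and the residual connection otherwise provide.
    n, d = x.shape
    A = softmax((x @ Wq) @ (x @ Wk).T / np.sqrt(d))
    return (alpha * np.eye(n) + beta * A) @ x
```

At `beta=0` the sub-block is exactly the identity, so signals pass through unchanged at initialization; training can then learn how strongly to weight the attention term.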

Experimental Results

Depth Scaling: Extending the simplified block from 18 to 72 layers retains the speed advantage and matches Pre-LN performance across depths.

BERT Evaluation: On masked-language-model pre-training followed by GLUE downstream evaluation, the simplified block matches the pre-training speed of the Crammed Pre-LN baseline. As in the decoder setting, removing residual connections without also adjusting the value/projection parameters harms training speed.

Efficiency Gains: Tables in the paper show a 16% reduction in parameter count and 16%/9% faster per-iteration speed for the SAS-P and SAS variants, respectively, relative to the Pre-LN baseline. Parallel-block implementations achieve modest additional gains.

Long-Training Regime: Training on CodeParrot for three times as many tokens (≈2B tokens) demonstrates that the simplified blocks maintain or exceed the speed of Pre-LN blocks even in prolonged training.

For full experimental details, refer to the original paper.

Tags: AI, deep learning, LLM, transformer, model simplification, signal propagation
Written by

Rare Earth Juejin Tech Community

Juejin, a tech community that helps developers grow.
