
Improving BERT Pre‑training with RealFormer: Principles, Implementation, and Empirical Evaluation

This article analyzes the RealFormer modification to the Transformer architecture, details its implementation in BERT, and presents extensive experiments showing that while RealFormer can boost performance on classification tasks with few classes, its benefits diminish or disappear as the number of classes grows.

Sohu Tech Products

Feb 17, 2021

Author Qiu Zhenyu, an algorithm engineer at Huatai Securities, continues his series on BERT pre‑training by exploring the RealFormer approach, which adds an attention residual path to the standard Transformer architecture.

RealFormer Principle Overview

Post‑LN Deficiency

In the original Transformer, layer normalization (LN) is applied after the residual addition (post‑LN), so each block computes LN(x + F(x)). Because the shortcut itself passes through LN, the identity path is weakened at every layer, which can cause vanishing gradients in deep models. The article illustrates the gradient flow of post‑LN with diagrams.

Pre‑LN vs. Post‑LN

Moving LN to the input of each sublayer (pre‑LN), so that a block computes x + F(LN(x)), restores a direct gradient path through the shortcut. Empirical studies show, however, that pre‑LN often converges faster but reaches slightly lower final performance than post‑LN.
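
The two orderings can be compared in a minimal sketch. It assumes a LayerNorm without learned gain and bias, and `toy_sublayer` is a hypothetical stand-in for attention or the feed-forward network; none of these names come from the BERT codebase.

```python
import math

def layer_norm(x, eps=1e-12):
    """Normalize a vector to zero mean and unit variance (no learned gain/bias)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def post_ln_block(x, sublayer):
    """Post-LN: normalize after the residual addition, LN(x + F(x))."""
    return layer_norm([a + b for a, b in zip(x, sublayer(x))])

def pre_ln_block(x, sublayer):
    """Pre-LN: normalize only the sublayer input, x + F(LN(x))."""
    return [a + b for a, b in zip(x, sublayer(layer_norm(x)))]

def toy_sublayer(x):
    # Hypothetical stand-in for attention or the feed-forward network.
    return [0.5 * v for v in x]

x = [1.0, 2.0, 4.0]
post_out = post_ln_block(x, toy_sublayer)  # shortcut is rescaled by LN
pre_out = pre_ln_block(x, toy_sublayer)    # shortcut carries x unchanged
```

In the pre-LN variant the input `x` reaches the output untouched, which is the direct gradient path described above; in the post-LN variant the sum is renormalized at every block.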

RealFormer Mechanism

RealFormer retains the post‑LN structure but introduces an extra residual branch that carries the raw attention scores (before softmax) from each layer to the next, where they are added to that layer's own scores before the softmax is applied. This gives attention information its own shortcut through the network. The design is simple yet effective, as shown in the original RealFormer paper.
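
A minimal single-head sketch of this mechanism in plain Python follows. It is an illustration under simplifying assumptions (no attention mask, no learned projections, no dropout), and the function name `residual_attention` is hypothetical rather than taken from the paper or from BERT.

```python
import math

def softmax(row):
    m = max(row)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def residual_attention(q, k, v, prev_scores=None):
    """Single-head attention that adds the previous layer's pre-softmax scores.

    q, k, v are lists of row vectors (seq_len x dim).
    Returns (context, scores) so the caller can hand `scores` to the next layer.
    """
    scale = 1.0 / math.sqrt(len(q[0]))
    scores = [[scale * sum(a * b for a, b in zip(qi, kj)) for kj in k] for qi in q]
    if prev_scores is not None:
        # The residual branch: sum with the scores passed from the layer below.
        scores = [[s + p for s, p in zip(srow, prow)]
                  for srow, prow in zip(scores, prev_scores)]
    probs = [softmax(row) for row in scores]
    context = [[sum(p * vj[d] for p, vj in zip(prow, v)) for d in range(len(v[0]))]
               for prow in probs]
    return context, scores

q = [[1.0, 0.0], [0.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0]]
ctx1, s1 = residual_attention(q, k, v)                  # first layer: no residual
ctx2, s2 = residual_attention(q, k, v, prev_scores=s1)  # second layer: adds s1
```

With identical inputs in both calls, the second layer's scores are exactly twice the first layer's, so its attention distribution is sharper.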

Implementation Details

The author adapts the TensorFlow BERT codebase (modeling.py) with two modifications:

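def attention_layer(..., prev_attens=None):
    ...
    attention_scores = tf.multiply(attention_scores,
                                   1.0 / math.sqrt(float(size_per_head)))
    # RealFormer: add the previous layer's pre-softmax scores, if any.
    # `is not None` is an identity check and avoids any tensor `!=` overloading.
    if prev_attens is not None:
        attention_scores += prev_attens
    # Hand the summed scores on to the next layer.
    prev_attens = attention_scores
    ...
    return context_layer, prev_attens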

and

# No scores exist before the first layer.
prev_attention = None
for layer_idx in range(num_hidden_layers):
    ...
    attention_head, prev_attention = attention_layer(..., prev_attens=prev_attention)
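
One consequence of threading the scores this way is that each layer passes on the sum, not its own raw value, so the scores entering the softmax at a given layer are the cumulative sum of the raw scores of all layers up to it. The toy sketch below tracks a single scalar score across three layers; `thread_scores` and the numbers are hypothetical and only illustrate the bookkeeping.

```python
def thread_scores(raw_scores_per_layer):
    """Mimic the layer loop: each layer adds the scores handed over by the previous one."""
    prev = None
    passed_on = []
    for raw in raw_scores_per_layer:
        scores = raw if prev is None else raw + prev
        prev = scores  # the sum, not the raw value, goes to the next layer
        passed_on.append(scores)
    return passed_on

# Hypothetical raw scores for one token pair in three consecutive layers.
raw = [0.5, -0.25, 1.0]
print(thread_scores(raw))  # [0.5, 0.25, 1.25]
```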

Experimental Validation

RealFormer‑enhanced BERT was fine‑tuned on several Chinese text‑classification datasets, including a 20‑class THUCNews subset, an internal sentiment‑analysis set, and the large‑scale IFlytek dataset (119 classes). Training used 4 GPUs with a batch size of 12 per GPU and 8 gradient‑accumulation steps, and started from the Chinese‑wwm‑roberta checkpoint.
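
For reference, the effective batch size implied by these settings can be worked out as below. This assumes synchronous data-parallel training in which gradients from all GPUs are combined at every optimizer step, which the article does not state explicitly; the variable names are illustrative.

```python
num_gpus = 4
per_gpu_batch_size = 12
gradient_accumulation_steps = 8

# Sequences contributing to one optimizer update under the stated assumption.
effective_batch_size = num_gpus * per_gpu_batch_size * gradient_accumulation_steps
print(effective_batch_size)  # 384
```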

Key observations:

On tasks with few classes (e.g., THUCNews, sentiment analysis), RealFormer achieved comparable or modestly better accuracy than the baseline.

On tasks with many classes (IFlytek, an internal 140‑class set), RealFormer often converged to a sub‑optimal local minimum, yielding higher loss and lower precision, recall, and F1 than the standard model.

Increasing the number of pre‑training steps mitigated the issue up to a point (e.g., 3M steps helped in the 80‑class experiments), but performance still lagged behind the baseline when the number of classes grew large.

Discussion

The author hypothesizes that RealFormer’s attention‑score shortcut may over‑fit or bias learning when the classification space becomes too fine‑grained, especially under limited data or class imbalance. No definitive theoretical explanation is provided.

Conclusion

RealFormer can improve BERT pre‑training for tasks with a modest number of classes, particularly when the downstream task is difficult, but its advantage disappears or reverses for tasks with many categories. Practitioners should consider label cardinality before adopting RealFormer.
