Improving BERT Pre‑training with RealFormer: Principles, Implementation, and Empirical Evaluation
This article analyzes the RealFormer modification to the Transformer architecture, details its implementation in BERT, and presents extensive experiments showing that while RealFormer can boost performance on low‑label‑count classification tasks, its benefits diminish or disappear as the number of classes grows.
Author Qiu Zhenyu, an algorithm engineer at Huatai Securities, continues his series on BERT pre‑training by exploring the RealFormer approach, which adds an attention residual path to the standard Transformer architecture.
RealFormer Principle Overview
Post‑LN Deficiency
In the original Transformer, layer normalization (LN) is applied after the residual addition (post‑LN), which weakens the shortcut path and can cause gradient vanishing in deep models. The article illustrates the gradient flow of post‑LN with diagrams.
Pre‑LN vs. Post‑LN
Moving LN inside the residual branch, ahead of the sublayer (pre‑LN), leaves the shortcut path unnormalized and so restores a direct gradient path. Empirical studies show, however, that pre‑LN often converges faster but reaches slightly lower final performance than post‑LN.
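The difference between the two layouts can be sketched in a few lines of NumPy. This is an illustrative toy, not the BERT code: `layer_norm` omits the learned scale and bias, and `toy_sublayer` is a hypothetical stand-in for an attention or feed-forward sublayer.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """Normalize the last axis to zero mean and unit variance (no learned scale/bias)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def post_ln_block(x, sublayer):
    # Post-LN: LN is applied after the residual sum, so the shortcut also passes through LN.
    return layer_norm(x + sublayer(x))

def pre_ln_block(x, sublayer):
    # Pre-LN: LN is applied only to the sublayer input; the shortcut x is added back unchanged.
    return x + sublayer(layer_norm(x))

rng = np.random.default_rng(0)
w = rng.normal(size=(8, 8)) * 0.1

def toy_sublayer(h):
    # Hypothetical stand-in for an attention or feed-forward sublayer.
    return np.tanh(h @ w)

x = rng.normal(size=(2, 4, 8)) * 5.0  # [batch, seq_len, hidden]
post_out = post_ln_block(x, toy_sublayer)
pre_out = pre_ln_block(x, toy_sublayer)
```

In the post‑LN block every output position is renormalized, so the identity path is rescaled at each layer. In the pre‑LN block the input `x` appears in the output unchanged, which is the direct path referred to above.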
RealFormer Mechanism
RealFormer retains the post‑LN structure but introduces an extra residual branch that carries the raw attention scores (before softmax) forward, effectively providing a shortcut for attention information. This design is simple yet effective, as shown in the original RealFormer paper.
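As a minimal sketch of this mechanism, the NumPy code below implements single-head attention with a score shortcut and chains it across a few layers. The function name `residual_attention`, the random projection matrices, and the toy dimensions are assumptions made for illustration. Masking, multiple heads, and the surrounding residual/LN blocks are left out.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def residual_attention(q, k, v, prev_scores=None):
    """Single-head scaled dot-product attention with a RealFormer-style score shortcut.

    q, k, v: arrays of shape [seq_len, head_dim].
    prev_scores: pre-softmax scores from the previous layer, [seq_len, seq_len], or None.
    Returns the context vectors and this layer's pre-softmax scores.
    """
    head_dim = q.shape[-1]
    scores = q @ k.T / np.sqrt(head_dim)
    if prev_scores is not None:
        # The shortcut: add the previous layer's raw scores before the softmax.
        scores = scores + prev_scores
    probs = softmax(scores, axis=-1)
    context = probs @ v
    return context, scores

# Toy stack: each layer receives the pre-softmax scores of the layer below it.
rng = np.random.default_rng(0)
seq_len, head_dim, num_layers = 5, 4, 3
hidden = rng.normal(size=(seq_len, head_dim))
prev = None
all_scores = []
for _ in range(num_layers):
    wq, wk, wv = (rng.normal(size=(head_dim, head_dim)) * 0.5 for _ in range(3))
    hidden, prev = residual_attention(hidden @ wq, hidden @ wk, hidden @ wv, prev)
    all_scores.append(prev)
```

The first layer has no incoming scores and behaves like ordinary attention. From the second layer on, the scores accumulate, so each layer only needs to learn a correction to the attention pattern it inherits.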
Implementation Details
The author adapts the TensorFlow BERT codebase (modeling.py) by adding two modifications:
    def attention_layer(..., prev_attens=None):
        ...
        attention_scores = tf.multiply(attention_scores,
                                       1.0 / math.sqrt(float(size_per_head)))
        # RealFormer: add the previous layer's pre-softmax scores, if any.
        if prev_attens is not None:
            attention_scores += prev_attens
        # Hand this layer's pre-softmax scores on to the next layer.
        prev_attens = attention_scores
        ...
        return context_layer, prev_attens
and
    prev_attention = None
    for layer_idx in range(num_hidden_layers):
        ...
        attention_head, prev_attention = attention_layer(..., prev_attens=prev_attention)
Experimental Validation
RealFormer‑enhanced BERT was fine‑tuned on several Chinese text‑classification datasets, including a 20‑class THUCNews subset, an internal sentiment‑analysis set, and the larger IFlytek dataset (119 classes). Training used 4 GPUs with a batch size of 12 per GPU and 8 gradient‑accumulation steps, and started from the Chinese‑wwm‑roberta checkpoint.
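For reference, the effective batch size implied by these settings can be worked out as below. The calculation assumes plain data-parallel training, in which each GPU processes its own batch and gradients are accumulated before each optimizer step. The source does not state this explicitly, and the variable names are illustrative.

```python
num_gpus = 4
per_gpu_batch = 12
grad_accum_steps = 8

# Assumption: data parallelism, one optimizer step per accumulation cycle.
effective_batch = num_gpus * per_gpu_batch * grad_accum_steps
print(effective_batch)  # 384
```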
Key observations:
On low‑label tasks (e.g., THUCNews, sentiment), RealFormer achieved comparable or modestly better accuracy than the baseline.
On high‑label tasks (IFlytek, internal 140‑class set), RealFormer often converged to a sub‑optimal local minimum, yielding higher loss and lower precision/recall/F1 than the standard model.
Increasing the number of pre‑training steps mitigated the issue up to a point (e.g., 3M steps helped 80‑class experiments), but performance still lagged behind the baseline when label count grew large.
Discussion
The author hypothesizes that RealFormer’s attention‑score shortcut may over‑fit or bias learning when the classification space becomes too fine‑grained, especially under limited data or class imbalance. No definitive theoretical explanation is provided.
Conclusion
RealFormer can improve BERT pre‑training for tasks with a modest number of classes, particularly when the downstream task is difficult, but its advantage disappears or reverses for tasks with many categories. Practitioners should consider label cardinality before adopting RealFormer.
Sohu Tech Products
A knowledge-sharing platform for Sohu's technology products. As a leading Chinese internet brand with media, video, search, and gaming services and over 700 million users, Sohu continuously drives tech innovation and practice. We’ll share practical insights and tech news here.
