Why Can BERT’s Token, Segment, and Position Embeddings Be Added? A Deep Dive into Positional Encoding

This article revisits the long‑standing question of why BERT’s token, segment, and position embeddings are summed, critiques earlier explanations, and presents findings from the ICLR‑2021 paper “Rethinking Positional Encoding in Language Pre‑training” that show removing the token‑position cross term speeds convergence and improves downstream GLUE scores.

Baobao Algorithm Notes

Background

The author originally answered a popular Zhihu question about why the three embeddings used in BERT (token, segment, and position) can simply be added together. The answer attracted many followers but contained misconceptions that were later corrected by Guolin Ke, the lead author of LightGBM, and his co-authors in their ICLR 2021 paper.

Original Answer and Its Flaws

The initial reasoning was threefold:

1. Transformer lacks strong contextual awareness, so adding embeddings serves as a feature‑crossing mechanism that injects contextual semantics.

2. Tokens in BERT are BPE units or Chinese characters, which are coarser than words; adding positional embeddings gives these coarse tokens a more individualized representation.

3. The addition is not a pooling operation but merely a way to incorporate positional information.

Upon reflection, points 2 and 3 were flawed: relative positions of tokens matter more than absolute positions, and simple addition does not truly capture token‑position interactions.

Insights from the ICLR 2021 Paper

The paper by Guolin Ke and his co-authors asks whether summing positional and token embeddings before self‑attention is actually sensible. Because the query and the key are both linear projections of the summed embedding, the pre‑softmax attention score expands into four terms: token‑to‑token, token‑to‑position, position‑to‑token, and position‑to‑position. The original BERT formulation therefore does not ignore the cross terms; it includes them implicitly in every attention score, even though a token and a position are heterogeneous kinds of information.
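The four-term expansion follows from the linearity of the query and key projections, and it can be checked numerically. The snippet below is a minimal sketch, assuming a single attention head, pre-softmax scores only, random stand-in weights, and no segment embedding; the names `x`, `p`, `W_q`, `W_k`, and `term` are illustrative and do not come from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 8  # sequence length and hidden size (illustrative values)

x = rng.normal(size=(n, d))    # token embeddings (random stand-ins)
p = rng.normal(size=(n, d))    # absolute position embeddings (random stand-ins)
W_q = rng.normal(size=(d, d))  # query projection
W_k = rng.normal(size=(d, d))  # key projection

# BERT-style: sum the embeddings first, then project and take dot products.
h = x + p
summed = (h @ W_q) @ (h @ W_k).T / np.sqrt(d)


def term(a, b):
    """Scaled query-key scores with queries built from `a` and keys from `b`."""
    return (a @ W_q) @ (b @ W_k).T / np.sqrt(d)


# The same scores written as four separate interaction terms.
token_token = term(x, x)
token_position = term(x, p)
position_token = term(p, x)
position_position = term(p, p)

four_terms = token_token + token_position + position_token + position_position
print(np.allclose(summed, four_terms))  # prints True
```

The two middle terms, `token_position` and `position_token`, are the cross terms under discussion: they appear whenever the embeddings are summed before the projections.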

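The paper proposes removing the two token‑position cross terms while keeping the token‑to‑token and position‑to‑position components. Because the score is now a sum of two dot‑product terms, the scaling factor becomes 1/√(2d) instead of the usual 1/√d, which keeps the scores on a scale comparable to standard attention.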
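A score function with the cross terms removed might look like the following. This is a hedged sketch rather than the paper's implementation: it assumes a single head, and it assumes the position term uses its own projection matrices (`U_q`, `U_k`), which the text above does not state. The function name `untied_scores` is hypothetical.

```python
import numpy as np


def untied_scores(x, p, W_q, W_k, U_q, U_k):
    """Pre-softmax attention scores with token and position terms kept separate.

    x: (n, d) token embeddings; p: (n, d) position embeddings.
    W_q, W_k project tokens; U_q, U_k project positions (an assumption here).
    """
    d = x.shape[-1]
    scale = 1.0 / np.sqrt(2 * d)  # 1/sqrt(2d), since two terms are summed
    token_term = (x @ W_q) @ (x @ W_k).T      # token-to-token
    position_term = (p @ U_q) @ (p @ U_k).T   # position-to-position
    return scale * (token_term + position_term)  # no cross terms


rng = np.random.default_rng(0)
n, d = 5, 8  # illustrative sizes
x = rng.normal(size=(n, d))
p = rng.normal(size=(n, d))
W_q, W_k, U_q, U_k = (rng.normal(size=(d, d)) for _ in range(4))

scores = untied_scores(x, p, W_q, W_k, U_q, U_k)

# Row-wise softmax turns the scores into attention weights.
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
print(scores.shape)
```

Note that the position term depends only on the positions, not on the tokens, so in this sketch it could be computed once per sequence length and reused.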

The authors also visualize attention weights for the four interaction types, showing that position‑to‑token and token‑to‑position contributions are relatively weak.

Experimental Findings

Removing the token‑position cross term yields several practical benefits:

Faster convergence during pre‑training.

Higher average scores on downstream GLUE benchmarks.

The paper also notes a special handling of the [CLS] token’s position embedding to avoid excessive positional locality that could suppress sentence‑level information.
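The text does not spell out how [CLS] is handled. One plausible reading, assumed here, is that any position‑to‑position score involving [CLS] is replaced by a scalar that does not depend on distance, so [CLS] is not biased toward nearby tokens. The sketch below illustrates that idea; the function name `untie_cls` and the parameters `theta_from_cls` and `theta_to_cls` are hypothetical, and in a real model the two scalars would be learned rather than fixed.

```python
import numpy as np


def untie_cls(position_scores, theta_from_cls, theta_to_cls):
    """Overwrite position scores that involve [CLS], assumed to sit at index 0.

    position_scores: (n, n) position-to-position term, rows are queries.
    theta_from_cls: scalar used when [CLS] attends to any token.
    theta_to_cls: scalar used when any other token attends to [CLS].
    """
    out = position_scores.copy()  # leave the caller's array untouched
    out[:, 0] = theta_to_cls      # every token attending to [CLS]
    out[0, :] = theta_from_cls    # [CLS] attending to every token (set last)
    return out


rng = np.random.default_rng(0)
n = 4  # illustrative sequence length, [CLS] included
position_scores = rng.normal(size=(n, n))
adjusted = untie_cls(position_scores, theta_from_cls=0.5, theta_to_cls=-0.5)
print(adjusted[0], adjusted[:, 0])
```

Scores between ordinary tokens are left unchanged; only the first row and first column are affected.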

Conclusion

Incorporating positional information is essential, but naïvely adding it to token embeddings introduces token‑position cross terms that do not provide a useful interaction. Empirical evidence suggests that eliminating these cross terms simplifies the model, accelerates pre‑training, and improves downstream performance, underscoring the importance of experimentally validating intuition.

Written by

Baobao Algorithm Notes

Author of the BaiMian large model, offering technology and industry insights.
