Dec 15, 2021 · Artificial Intelligence

Why Can BERT’s Token, Segment, and Position Embeddings Be Added? A Deep Dive into Positional Encoding

This article revisits the long‑standing question of why BERT’s token, segment, and position embeddings are summed, critiques earlier explanations, and presents findings from the ICLR‑2021 paper “Rethinking Positional Encoding in Language Pre‑training” that show removing the token‑position cross term speeds convergence and improves downstream GLUE scores.
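
The "cross term" mentioned above comes from expanding the attention score when each input is the sum of a token embedding and a position embedding. The sketch below checks that expansion numerically on random vectors, then computes a score without the token-position cross terms and with separate position projections, in the spirit of the cited paper. It is a minimal illustration, not the paper's implementation. All names (`w_i`, `p_i`, `WQ`, `WK`, `UQ`, `UK`) and the dimension `d = 4` are assumptions made here, and segment embeddings and the softmax scaling factor are left out.

```python
import random

def matvec(x, W):
    # Row vector x (length d) times matrix W (d x d).
    return [sum(x[k] * W[k][j] for k in range(len(x))) for j in range(len(W[0]))]

def dot(a, b):
    return sum(u * v for u, v in zip(a, b))

def score(x, y, Q, K):
    # Unscaled attention logit: (x Q) . (y K)
    return dot(matvec(x, Q), matvec(y, K))

random.seed(0)
d = 4  # toy dimension, chosen only for illustration

def rand_vec():
    return [random.gauss(0, 1) for _ in range(d)]

def rand_mat():
    return [rand_vec() for _ in range(d)]

w_i, w_j = rand_vec(), rand_vec()   # token embeddings (hypothetical values)
p_i, p_j = rand_vec(), rand_vec()   # position embeddings (hypothetical values)
WQ, WK = rand_mat(), rand_mat()     # shared query/key projections

# BERT-style input: token and position embeddings are summed.
x_i = [a + b for a, b in zip(w_i, p_i)]
x_j = [a + b for a, b in zip(w_j, p_j)]
full = score(x_i, x_j, WQ, WK)

# The same logit split into four terms by linearity.
terms = {
    "token-token": score(w_i, w_j, WQ, WK),
    "token-position": score(w_i, p_j, WQ, WK),
    "position-token": score(p_i, w_j, WQ, WK),
    "position-position": score(p_i, p_j, WQ, WK),
}
assert abs(full - sum(terms.values())) < 1e-9

# Variant without the cross terms: positions get their own projections
# (UQ, UK are assumed names for those separate matrices).
UQ, UK = rand_mat(), rand_mat()
untied = terms["token-token"] + score(p_i, p_j, UQ, UK)
```

The assertion holds for any choice of vectors and matrices because the projection is linear. The two middle entries of `terms` are the token-position cross terms that the summary refers to.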

BERT · Embedding · Language Pretraining

6 min read