BERT Interview Q&A: Decoding CLS, Masks, Complexity, and More
An in‑depth Q&A breaks down core BERT concepts—from the purpose of the [CLS] token and masking strategies to self‑attention complexity, sparse attention tricks, subword handling of OOV words, warm‑up learning rates, GPT’s unidirectional nature, and ALBERT’s parameter sharing—providing concise explanations for each.
CLS token in BERT
The [CLS] token is prepended to every input sequence. Its final hidden state is used as a pooled representation for sentence‑level tasks (e.g., classification). Because [CLS] carries no lexical content of its own and self‑attention lets it attend to every position, its embedding can aggregate global information without bias toward any specific token.
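As a minimal sketch of what "using the [CLS] state" means in practice, the snippet below picks the position‑0 vector from each sequence. The function name cls_pool and the nested‑list layout are illustrative assumptions, not part of any BERT library.

```python
def cls_pool(hidden_states):
    """Return the position-0 ([CLS]) vector of each sequence.

    hidden_states: nested lists shaped [batch][seq_len][hidden_dim].
    Assumes [CLS] was prepended, so it sits at index 0 of every sequence.
    """
    return [sequence[0] for sequence in hidden_states]


# Toy final-layer outputs: 2 sequences, 3 tokens each, hidden size 2.
batch = [
    [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]],
    [[0.7, 0.8], [0.9, 1.0], [1.1, 1.2]],
]
sentence_vectors = cls_pool(batch)
```

A downstream classifier would then take sentence_vectors as its input features.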
Alternative sentence representations
When a [CLS] token is not desired, practitioners can concatenate max‑pooling and average‑pooling over the token embeddings and feed the result to a downstream classifier. This was common before BERT but may be less favored in interviews.
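The pooling alternative can be sketched as follows. This is an illustrative pure‑Python version under the assumption that padding positions should be excluded via an attention mask; the name mean_max_pool is hypothetical.

```python
def mean_max_pool(token_vectors, attention_mask):
    """Concatenate mean- and max-pooled vectors over non-padding tokens.

    token_vectors: [seq_len][hidden_dim]; attention_mask: 1 = real token, 0 = padding.
    """
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    if not kept:
        raise ValueError("sequence has no non-padding tokens")
    dim = len(kept[0])
    mean_part = [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]
    max_part = [max(vec[i] for vec in kept) for i in range(dim)]
    return mean_part + max_part  # length is 2 * hidden_dim


tokens = [[1.0, 4.0], [3.0, 2.0], [9.0, 9.0]]  # last row is padding
mask = [1, 1, 0]
pooled = mean_max_pool(tokens, mask)
```

Excluding padding matters here: without the mask, the padded row would dominate the max and skew the mean.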
Mask usage in BERT
Pre‑training (Masked Language Modeling): About 15% of input tokens are selected at random for prediction. Of these, 80% are replaced with [MASK], 10% with a random token, and 10% are left unchanged. This creates a cloze task in which the model learns to predict the original tokens.
Self‑attention (padding) mask: A binary mask marks padding positions. Before the softmax, a very large negative value (conceptually -inf) is added to the scores at those positions, so they receive zero attention weight.
Decoder (causal) mask: In downstream generation or seq2seq settings, a lower‑triangular mask prevents each position from attending to future tokens, avoiding information leakage.
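The padding and causal masks can be illustrated with a small pure‑Python sketch. The function names are assumptions made for this example, and it presumes every query row keeps at least one unmasked key (otherwise the softmax is undefined).

```python
import math

NEG_INF = float("-inf")


def softmax(scores):
    """Numerically stable softmax; entries equal to -inf get weight 0."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def masked_attention_weights(scores, padding_mask, causal=False):
    """Apply padding (and optionally causal) masking before the softmax.

    scores: [L][L] raw attention scores (query index first).
    padding_mask: 1 = real token, 0 = padding, indexed by key position.
    """
    length = len(scores)
    weights = []
    for q in range(length):
        row = []
        for k in range(length):
            blocked = padding_mask[k] == 0 or (causal and k > q)
            row.append(NEG_INF if blocked else scores[q][k])
        weights.append(softmax(row))
    return weights


scores = [[0.0, 0.0, 0.0] for _ in range(3)]  # uniform toy scores
padding_mask = [1, 1, 0]                      # third position is padding
bidirectional = masked_attention_weights(scores, padding_mask)
causal = masked_attention_weights(scores, padding_mask, causal=True)
```

With uniform scores, the bidirectional rows split weight evenly over the two real tokens, while the first causal row can only see itself.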
Self‑attention computational complexity
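For a hidden dimension d and sequence length L, self‑attention costs O(d·L²) time: every token attends to every other token, so both the score matrix and the weighted sum over values involve L² pairs of d‑dimensional vectors. The L×L attention matrix also takes O(L²) memory per head.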
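To make the quadratic growth concrete, here is a rough multiply‑add counter. It is a back‑of‑the‑envelope sketch that counts only the two L×L matrix products and ignores the linear projections and the softmax; the function name and the example sizes are assumptions.

```python
def attention_multiply_adds(seq_len, hidden_dim):
    """Approximate multiply-adds for one attention layer (all heads combined).

    Counts only the two products that scale with L^2; the Q/K/V/output
    projections and the softmax are deliberately left out.
    """
    score_cost = seq_len * seq_len * hidden_dim    # Q @ K^T
    context_cost = seq_len * seq_len * hidden_dim  # attention weights @ V
    return score_cost + context_cost


base = attention_multiply_adds(512, 768)
doubled = attention_multiply_adds(1024, 768)
ratio = doubled / base  # doubling L quadruples the cost
```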
Complexity‑reduction techniques
Sparse or local attention restricts each token to a subset of positions (e.g., a sliding window or block‑sparse pattern). This reduces the number of pairwise interactions, lowering both memory and compute while preserving most local semantic relationships. Variants include:
Sliding‑window attention, with cost O(d·L·w) for window size w (used in Longformer, together with a few global tokens)
Block‑sparse patterns (e.g., BigBird, which combines window, global, and random blocks)
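A sliding‑window pattern can be sketched as a boolean mask. In this illustrative version, window is assumed to be a one‑sided radius, so each query sees at most 2·window + 1 keys; the function name is hypothetical.

```python
def sliding_window_mask(seq_len, window):
    """mask[q][k] is True when key k lies within `window` positions of query q."""
    return [
        [abs(q - k) <= window for k in range(seq_len)]
        for q in range(seq_len)
    ]


mask = sliding_window_mask(6, 1)
allowed_pairs = sum(sum(row) for row in mask)  # pairs that are actually scored
full_pairs = 6 * 6                             # pairs under dense attention
```

The number of allowed pairs grows linearly with the sequence length for a fixed window, which is where the O(d·L·w) cost comes from.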
Out‑of‑vocabulary (OOV) handling
BERT employs WordPiece subword tokenization. Words are broken into frequent subword units, enabling the model to represent rare words, typos, and morphological variants.
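The splitting step can be sketched as a greedy longest‑match‑first search over a vocabulary. The toy vocabulary below is made up for illustration, and the sketch covers only single‑word splitting, not text normalization or vocabulary training.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of one word into subword pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1  # shrink the candidate from the right and retry
        if match is None:
            return [unk_token]  # no piece fits: the whole word maps to [UNK]
        pieces.append(match)
        start = end
    return pieces


toy_vocab = {"un", "##aff", "##able", "play", "##ing"}  # hypothetical vocabulary
rare_word = wordpiece_tokenize("unaffable", toy_vocab)
inflected = wordpiece_tokenize("playing", toy_vocab)
unknown = wordpiece_tokenize("xyz", toy_vocab)
```

A word absent from the vocabulary as a whole is still represented through its pieces; only when no piece matches does it fall back to the unknown token.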
Chinese OOV
Chinese text is tokenized at the character level; each character is a token, so OOV issues are largely avoided.
Why BERT outperforms earlier char/subword models
Earlier models were shallow (typically ≤2 LSTM layers) and could not scale depth, limiting capacity and generalization. BERT increases depth (12–24 transformer layers) while keeping the input tokenization simple, allowing non‑linear capacity to grow smoothly and improving both representation power and generalization.
Warm‑up learning‑rate schedule
Training starts with a small learning rate that increases linearly over the first few thousand steps (the warm‑up phase) and then decays. This stabilizes early updates, when gradients are large and noisy, keeps the optimizer from over‑fitting to the first few batches, and yields smoother convergence.
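One possible shape for such a schedule is linear warm‑up followed by linear decay to zero. The decay form, the function name, and the step counts below are assumptions for illustration; the text above only says the rate decays after warm‑up.

```python
def warmup_linear_decay(step, peak_lr, warmup_steps, total_steps):
    """Learning rate at `step`: linear warm-up, then linear decay to zero.

    Assumes 0 < warmup_steps < total_steps.
    """
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    remaining = max(total_steps - step, 0)
    return peak_lr * remaining / (total_steps - warmup_steps)


# Example values are illustrative, not taken from any specific training run.
lr_start = warmup_linear_decay(0, 1e-4, 1000, 10000)
lr_mid_warmup = warmup_linear_decay(500, 1e-4, 1000, 10000)
lr_peak = warmup_linear_decay(1000, 1e-4, 1000, 10000)
lr_end = warmup_linear_decay(10000, 1e-4, 1000, 10000)
```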
Directionality: GPT vs. BERT
GPT uses a causal mask so each token can attend only to previous tokens, making it strictly left‑to‑right (unidirectional). BERT’s self‑attention has no causal restriction; tokens attend bidirectionally, enabling richer contextual encoding.
Polysemy handling
Self‑attention allows each token’s representation to be dynamically influenced by its surrounding context. Consequently, the same word can acquire different embeddings depending on the sentence, effectively disambiguating multiple senses.
Transformer differences
The original Transformer (Vaswani et al., 2017) uses fixed sinusoidal positional encodings. BERT replaces these with learned position embeddings, allowing the model to adapt positional information during training.
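For reference, the fixed sinusoidal encoding can be written out directly. This is a small sketch of the formula from Vaswani et al. (2017); the function name is an assumption. A learned position embedding, by contrast, is simply a trainable lookup table indexed by position.

```python
import math


def sinusoidal_encoding(position, d_model):
    """Fixed positional encoding for one position.

    Even index 2i holds sin(pos / 10000^(2i / d_model)); odd index 2i + 1
    holds the cosine of the same angle.
    """
    encoding = []
    for even_index in range(0, d_model, 2):
        angle = position / (10000 ** (even_index / d_model))
        encoding.append(math.sin(angle))
        if even_index + 1 < d_model:
            encoding.append(math.cos(angle))
    return encoding


first = sinusoidal_encoding(0, 4)
second = sinusoidal_encoding(1, 4)
```

Because the values are computed rather than stored, this encoding has no trainable parameters and is defined for any position.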
ALBERT parameter compression
ALBERT shares the same transformer layer parameters across all layers (parameter‑tying). This reduces the total number of trainable parameters dramatically. However, during inference the shared layers must still be applied sequentially, so runtime latency is not reduced.
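The trade‑off can be illustrated with a toy stack. ToyLayer is a hypothetical stand‑in for a transformer layer (two scalar parameters), used here only to count parameters and sequential layer calls; it is not ALBERT's implementation.

```python
class ToyLayer:
    """Hypothetical stand-in for a transformer layer: y = w * x + b per element."""

    def __init__(self, w, b):
        self.w, self.b = w, b
        self.calls = 0  # how many times this layer object has been applied

    def __call__(self, xs):
        self.calls += 1
        return [self.w * x + self.b for x in xs]

    def num_parameters(self):
        return 2  # w and b


def build_stack(num_layers, shared):
    """Return a list of layers; with shared=True every entry is the same object."""
    if shared:
        return [ToyLayer(0.5, 1.0)] * num_layers
    return [ToyLayer(0.5, 1.0) for _ in range(num_layers)]


def count_parameters(stack):
    """Count parameters once per distinct layer object."""
    unique = {id(layer): layer for layer in stack}
    return sum(layer.num_parameters() for layer in unique.values())


def forward(stack, xs):
    for layer in stack:  # layers are applied one after another either way
        xs = layer(xs)
    return xs


shared_stack = build_stack(12, shared=True)
separate_stack = build_stack(12, shared=False)
shared_out = forward(shared_stack, [0.0])
separate_out = forward(separate_stack, [0.0])
```

The shared stack stores one twelfth of the parameters, yet its single layer is still invoked twelve times per forward pass, which is why latency does not improve.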
Baobao Algorithm Notes
Author of the BaiMian large model, offering technology and industry insights.
