BERT Interview Q&A: Decoding CLS, Masks, Complexity, and More

An in-depth Q&A that breaks down core BERT concepts: the purpose of the [CLS] token, masking strategies, self-attention complexity, sparse attention tricks, subword handling of OOV words, warm-up learning rates, GPT's unidirectional nature, and ALBERT's parameter sharing, with a concise explanation for each.

Baobao Algorithm Notes

Jan 14, 2022

CLS token in BERT

The [CLS] token is prepended to every input sequence. Its final hidden state serves as a pooled representation for sentence-level tasks such as classification. The token carries no lexical meaning of its own, and self-attention lets it attend to every position in the sequence, so its embedding can aggregate global information without being biased toward any particular word.

Alternative sentence representations

When the [CLS] representation is not used, a common alternative is to apply max-pooling and average-pooling over the token embeddings, concatenate the two vectors, and feed the result to a downstream classifier. This approach was common before BERT, but it may be a less expected answer in interviews.
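The two options can be compared side by side. The following is a minimal NumPy sketch, not BERT's own pooling code; the function names and the `hidden` and `mask` arguments are assumptions made for illustration.

```python
import numpy as np

def cls_pooling(hidden):
    """Return the hidden state at position 0, where [CLS] sits."""
    return hidden[:, 0, :]

def mean_max_pooling(hidden, mask):
    """Concatenate masked mean-pooling and max-pooling over real tokens.

    hidden: (batch, seq_len, dim) token embeddings from the last layer.
    mask:   (batch, seq_len) with 1 for real tokens and 0 for padding.
    """
    m = mask[:, :, None].astype(hidden.dtype)            # (batch, seq_len, 1)
    # Average only over real tokens; guard against an all-padding row.
    mean = (hidden * m).sum(axis=1) / np.maximum(m.sum(axis=1), 1.0)
    # Padding positions are pushed to -inf so they never win the max.
    mx = np.where(m > 0, hidden, -np.inf).max(axis=1)
    return np.concatenate([mean, mx], axis=-1)           # (batch, 2 * dim)
```

The pooled vector is twice as wide as a single token embedding, so a classifier on top of it needs an input size of 2 * dim.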

Mask usage in BERT

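Pre-training (Masked Language Modeling): A random subset of the input tokens (15% in the original BERT) is selected for prediction. Most of the selected tokens are replaced with [MASK], and the rest are swapped for a random token or left unchanged. This creates a cloze task in which the model learns to predict the original tokens.

Self-attention mask: A binary mask marks padding positions. Before the softmax, -inf (in practice, a large negative number) is added to the scores at those positions, so their attention weights become zero.

Decoder (causal) mask: When the model is used for generation or seq2seq tasks, a lower-triangular mask prevents each position from attending to future tokens, avoiding information leakage.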
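The padding mask and the causal mask both act on the score matrix before the softmax. The sketch below shows one way to apply them for a single attention head in NumPy; the function name, argument names, and the -1e9 fill value are illustrative assumptions, not taken from any particular BERT implementation.

```python
import numpy as np

def masked_attention_weights(scores, padding_mask, causal=False):
    """Softmax over attention scores with padding (and optionally causal) masking.

    scores:       (seq_len, seq_len) raw query-key scores for one head.
    padding_mask: (seq_len,) with 1 for real tokens and 0 for padding.
    """
    seq_len = scores.shape[0]
    # Keys that are padding are disallowed for every query.
    allowed = np.broadcast_to(padding_mask[None, :].astype(bool), (seq_len, seq_len))
    if causal:
        # Lower-triangular mask: query i may only see keys j <= i.
        allowed = allowed & np.tril(np.ones((seq_len, seq_len), dtype=bool))
    # A large negative number stands in for -inf (assumed fill value).
    masked = np.where(allowed, scores, -1e9)
    masked = masked - masked.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(masked)
    return weights / weights.sum(axis=-1, keepdims=True)
```

With `causal=False` this corresponds to the bidirectional attention BERT uses during pre-training; `causal=True` adds the decoder-style restriction.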

Self‑attention computational complexity

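For hidden dimension d and sequence length L, computing the attention matrix takes O(d·L²) operations: every token attends to every other token, and each of the L² query-key pairs requires a d-dimensional dot product. The attention matrix itself also takes O(L²) memory.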
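To make the quadratic growth concrete, the following sketch counts the multiply-adds in the two L × L matrix products of one attention layer. The function name is hypothetical, and the count deliberately ignores the Q/K/V projections, the softmax, and the feed-forward block.

```python
def attention_multiply_adds(seq_len, hidden_dim):
    """Count multiply-adds for the two L x L matrix products in one attention layer.

    Q K^T produces an (L, L) score matrix at a cost of L * L * d, and
    multiplying the (L, L) weights by V costs another L * L * d.
    Splitting d across several heads leaves the total unchanged.
    """
    scores = seq_len * seq_len * hidden_dim
    weighted_sum = seq_len * seq_len * hidden_dim
    return scores + weighted_sum
```

Doubling the sequence length from 128 to 256 at a fixed hidden size multiplies this count by four, which is why long inputs become expensive quickly.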

Complexity‑reduction techniques

Sparse or local attention restricts each token to a subset of positions (e.g., a sliding window or block‑sparse pattern). This reduces the number of pairwise interactions, lowering both memory and compute while preserving most local semantic relationships. Variants include:

Sliding-window attention: O(d·L·w), where w is the window size

Block-sparse and local-plus-global patterns (e.g., BigBird, Longformer)
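A sliding-window pattern can be expressed as a boolean mask over query-key pairs. The sketch below is only an illustration: it assumes `window` means the number of neighbours allowed on each side, and it builds a dense mask for clarity, whereas efficient implementations avoid materializing the full L × L matrix.

```python
import numpy as np

def sliding_window_mask(seq_len, window):
    """Boolean (seq_len, seq_len) mask; True where |i - j| <= window."""
    idx = np.arange(seq_len)
    return np.abs(idx[:, None] - idx[None, :]) <= window

mask = sliding_window_mask(seq_len=8, window=1)
# Each row allows at most 2 * window + 1 keys, so the number of allowed
# pairs grows linearly with seq_len instead of quadratically.
allowed_pairs = int(mask.sum())
```

A mask like this can be combined with a padding mask before the softmax, in the same way as a causal mask.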

Out‑of‑vocabulary (OOV) handling

BERT employs WordPiece subword tokenization. Words are broken into frequent subword units, enabling the model to represent rare words, typos, and morphological variants.
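The splitting step can be sketched as a greedy longest-match-first search over a vocabulary. This is a simplified illustration rather than BERT's actual tokenizer: the function name and the toy vocabulary are made up for the example, and real tokenizers also normalize text and split on whitespace and punctuation first.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of a single word into subword pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece   # continuation pieces carry the ## prefix
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return [unk_token]         # no piece fits: the whole word is unknown
        pieces.append(match)
        start = end
    return pieces

# Toy vocabulary, invented for this example.
toy_vocab = {"play", "un", "##ing", "##able", "##play", "##s"}
```

With this toy vocabulary, "playing" splits into "play" and "##ing", while a word with no matching pieces falls back to [UNK].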

Chinese OOV

Chinese text is tokenized at the character level; each character is a token, so OOV issues are largely avoided.

Why BERT outperforms earlier char/subword models

Earlier character- and subword-level models were shallow (typically no more than two LSTM layers) and hard to scale in depth, which limited their capacity and generalization. BERT keeps the input tokenization simple but stacks many more layers (12 in BERT-base, 24 in BERT-large), so non-linear capacity grows with depth, improving both representation power and generalization.

Warm‑up learning‑rate schedule

Training starts with a small learning rate that increases linearly over the warm-up phase (typically a few thousand steps; the original BERT used 10,000) and then decays. Early in training the weights are still random and the gradients are noisy, so small initial steps stabilize the updates, keep the optimizer from overfitting to the first few batches, and yield smoother convergence.
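A schedule of this shape fits in a few lines. The sketch below assumes linear decay to zero after the warm-up; the function name and parameter names are illustrative, and other decay shapes are equally possible.

```python
def warmup_linear_decay(step, peak_lr, warmup_steps, total_steps):
    """Learning rate at a given step: linear warm-up, then linear decay to zero."""
    if step < warmup_steps:
        # Ramp from 0 up to peak_lr over the warm-up phase.
        return peak_lr * step / warmup_steps
    # After warm-up, decay linearly until total_steps is reached.
    remaining = max(total_steps - step, 0)
    return peak_lr * remaining / max(total_steps - warmup_steps, 1)
```

For example, with `warmup_steps=1000` and `total_steps=10000`, the rate reaches `peak_lr` at step 1000 and falls back to zero at step 10000.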

Directionality: GPT vs. BERT

GPT uses a causal mask so each token can attend only to previous tokens, making it strictly left‑to‑right (unidirectional). BERT’s self‑attention has no causal restriction; tokens attend bidirectionally, enabling richer contextual encoding.

Polysemy handling

Self‑attention allows each token’s representation to be dynamically influenced by its surrounding context. Consequently, the same word can acquire different embeddings depending on the sentence, effectively disambiguating multiple senses.

Differences from the original Transformer

The original Transformer (Vaswani et al., 2017) uses fixed sinusoidal positional encodings. BERT replaces these with learned position embeddings, so the model can adapt positional information during training.
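The contrast between the two schemes can be sketched in NumPy. The sinusoidal table follows the sine/cosine formula from the original Transformer paper; the learned table is shown only as a randomly initialized array, with a shape (512 × 768) and scale (0.02) chosen to resemble BERT-base as an assumption for illustration.

```python
import numpy as np

def sinusoidal_positions(max_len, dim):
    """Fixed sinusoidal encodings from the original Transformer; dim must be even."""
    positions = np.arange(max_len)[:, None]                         # (max_len, 1)
    # Frequencies 10000^(-2i / dim) for i = 0 .. dim/2 - 1.
    freqs = np.exp(-np.log(10000.0) * np.arange(0, dim, 2) / dim)   # (dim / 2,)
    table = np.zeros((max_len, dim))
    table[:, 0::2] = np.sin(positions * freqs)   # even indices
    table[:, 1::2] = np.cos(positions * freqs)   # odd indices
    return table

# A learned table has the same layout but starts random and is updated
# by gradient descent together with the other model weights.
rng = np.random.default_rng(0)
learned_positions = rng.normal(scale=0.02, size=(512, 768))
```

One practical consequence is that a learned table only covers positions up to its fixed number of rows, whereas the sinusoidal formula can be evaluated for any position.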

ALBERT parameter compression

ALBERT shares one set of transformer-layer parameters across all layers (cross-layer parameter sharing). This dramatically reduces the total number of trainable parameters. However, during inference the shared layer must still be applied once per layer, in sequence, so neither the amount of computation nor the runtime latency is reduced.
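A rough parameter count shows the size of the saving. The sketch below is back-of-the-envelope arithmetic under stated simplifications: it counts only the attention and feed-forward weight matrices of an encoder stack, ignores biases, LayerNorm, and embeddings, and uses hypothetical function names.

```python
def encoder_layer_params(hidden_dim, ffn_dim):
    """Approximate weight count of one encoder layer (no biases or LayerNorm)."""
    attention = 4 * hidden_dim * hidden_dim      # Q, K, V and output projections
    feed_forward = 2 * hidden_dim * ffn_dim      # two linear layers
    return attention + feed_forward

def stack_params(num_layers, hidden_dim, ffn_dim, shared):
    """Weights in the whole stack, with or without cross-layer sharing."""
    per_layer = encoder_layer_params(hidden_dim, ffn_dim)
    # With sharing, one set of weights is reused by every layer.
    return per_layer if shared else num_layers * per_layer

unshared = stack_params(12, 768, 3072, shared=False)
shared = stack_params(12, 768, 3072, shared=True)
```

For 12 layers the shared stack stores one twelfth of the weights, yet a forward pass still runs the layer 12 times, which matches the point about unchanged latency.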

