Demystifying the Transformer: Step‑by‑Step PaddlePaddle Implementation
This article provides a comprehensive, code‑rich walkthrough of the Transformer architecture in PaddlePaddle: the encoder and decoder components, residual connections, layer normalization, feed‑forward networks, scaled dot‑product and multi‑head attention, and finally how to assemble the full model with its training and inference entry points.
Preface
For readers learning PaddlePaddle, the author recommends starting with the official documentation and then diving into practical code. The article presents a complete Transformer implementation for NLP, with visual diagrams and step‑by‑step code snippets.
1. Encoder Part
The encoder block consists of four main parts: Self‑Attention, Feed‑Forward, Residual Connection, and Layer Normalization.
1.1 Residuals & Layer Norm
Residual connections help mitigate information loss in deep networks, as described in the ResNet paper. Layer Normalization is used to stabilize training, following the normalization models in deep learning literature.
def pre_post_process_layer(prev_out, out, process_cmd, dropout_rate=0.):
    """Apply residual connection ("a"), layer normalization ("n") and dropout ("d")
    to `out`, in the order given by process_cmd."""
    for cmd in process_cmd:
        if cmd == "a":  # add residual connection
            out = out + prev_out if prev_out else out
        elif cmd == "n":  # add layer normalization
            out = layers.layer_norm(
                out,
                begin_norm_axis=len(out.shape) - 1,
                param_attr=fluid.initializer.Constant(1.),
                bias_attr=fluid.initializer.Constant(0.))
        elif cmd == "d":  # add dropout
            if dropout_rate:
                out = layers.dropout(
                    out,
                    dropout_prob=dropout_rate,
                    seed=dropout_seed,
                    is_test=False)
    return out
1.2 Feed Forward
The feed‑forward layer contains two linear transformations with a ReLU activation in between. It operates position‑wise, sharing parameters across positions but not across layers.
def positionwise_feed_forward(x, d_inner_hid, d_hid, dropout_rate):
    """Two linear transformations with a ReLU in between, applied position-wise."""
    hidden = layers.fc(input=x, size=d_inner_hid, num_flatten_dims=2, act="relu")
    if dropout_rate:
        hidden = layers.dropout(
            hidden, dropout_prob=dropout_rate, seed=dropout_seed, is_test=False)
    out = layers.fc(input=hidden, size=d_hid, num_flatten_dims=2)
    return out
1.3 Self‑Attention
Attention can be understood as a weighted sum. The scaled dot‑product attention follows four steps:
Each input token is embedded into a word vector.
Three learnable matrices project the embeddings into queries (Q), keys (K), and values (V).
Similarity scores are computed between Q and K.
Scores are normalized with softmax and used to weight V, producing the attention output.
A step‑by‑step visualization of this computation is provided in the accompanying figures.
def __compute_qkv(queries, keys, values, n_head, d_key, d_value):
    """Linearly project the inputs into multi-head queries, keys and values."""
    q = layers.fc(input=queries, size=d_key * n_head, bias_attr=False,
                  num_flatten_dims=2)
    # cache and static_kv come from the enclosing multi_head_attention scope;
    # when decoding against a static encoder output, the key/value projections
    # are created in the parent block so they can be reused across time steps.
    fc_layer = wrap_layer_with_block(
        layers.fc,
        fluid.default_main_program().current_block().parent_idx
    ) if cache is not None and static_kv else layers.fc
    k = fc_layer(input=keys, size=d_key * n_head, bias_attr=False,
                 num_flatten_dims=2)
    v = fc_layer(input=values, size=d_value * n_head, bias_attr=False,
                 num_flatten_dims=2)
    return q, k, v

def scaled_dot_product_attention(q, k, v, attn_bias, d_key, dropout_rate):
    """Compute softmax(Q K^T / sqrt(d_key) + attn_bias) V."""
    product = layers.matmul(x=q, y=k, transpose_y=True, alpha=d_key**-0.5)
    if attn_bias:
        product += attn_bias
    weights = layers.softmax(product)
    if dropout_rate:
        weights = layers.dropout(
            weights, dropout_prob=dropout_rate, seed=dropout_seed, is_test=False)
    out = layers.matmul(weights, v)
    return out
The attn_bias argument is used to mask padding positions in the encoder self‑attention, the decoder's encoder‑decoder attention, and the decoder's masked self‑attention; in the last case it also hides future positions, as described in section 2.1.
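As a concrete illustration (not part of the original code), the padding bias can be prepared on the data side with NumPy before being fed to the network; the helper name, the pad_idx default, and the -1e9 constant below are assumptions:
import numpy as np

def padding_attn_bias(src_words, n_head, pad_idx=0):
    # src_words: [batch_size, src_len] array of token ids; pad_idx assumed to be 0.
    batch_size, src_len = src_words.shape
    bias = (src_words == pad_idx).astype("float32") * -1e9  # [batch, src_len]
    # Broadcast to [batch, n_head, src_len (queries), src_len (keys)]
    # to match the shape of the attention scores.
    return np.tile(bias.reshape(batch_size, 1, 1, src_len),
                   (1, n_head, src_len, 1))
The resulting tensor is simply added to the attention scores before the softmax, exactly as scaled_dot_product_attention does above, so padded keys receive near‑zero attention weight.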
1.4 Multi‑Head Attention
Multi‑head attention allows the model to attend to information from different representation subspaces.
def __split_heads_qkv(queries, keys, values, n_head, d_key, d_value):
    """Split the last dimension into heads: [batch, seq_len, n_head * d_x]
    becomes [batch, n_head, seq_len, d_x]."""
    reshaped_q = layers.reshape(x=queries, shape=[0, 0, n_head, d_key], inplace=True)
    q = layers.transpose(x=reshaped_q, perm=[0, 2, 1, 3])
    reshaped_k = layers.reshape(x=keys, shape=[0, 0, n_head, d_key], inplace=True)
    k = layers.transpose(x=reshaped_k, perm=[0, 2, 1, 3])
    reshaped_v = layers.reshape(x=values, shape=[0, 0, n_head, d_value], inplace=True)
    v = layers.transpose(x=reshaped_v, perm=[0, 2, 1, 3])
    if cache is not None:  # cache comes from the enclosing multi_head_attention
        # storing k/v for incremental decoding is omitted for brevity
        pass
    return q, k, v

def __combine_heads(x):
    """Inverse of __split_heads_qkv: merge the heads back into the last dimension."""
    if len(x.shape) != 4:
        raise ValueError("Input(x) should be a 4-D Tensor.")
    trans_x = layers.transpose(x, perm=[0, 2, 1, 3])
    return layers.reshape(
        x=trans_x,
        shape=[0, 0, trans_x.shape[2] * trans_x.shape[3]],
        inplace=True)

def multi_head_attention(queries, keys, values, attn_bias, d_key, d_value,
                         d_model, n_head=1, dropout_rate=0., cache=None,
                         static_kv=False):
    """Project to Q/K/V, split into heads, attend, then recombine and project
    back to d_model."""
    keys = queries if keys is None else keys
    values = keys if values is None else values
    if not (len(queries.shape) == len(keys.shape) == len(values.shape) == 3):
        raise ValueError(
            "Inputs: queries, keys and values should all be 3-D tensors.")
    q, k, v = __compute_qkv(queries, keys, values, n_head, d_key, d_value)
    q, k, v = __split_heads_qkv(q, k, v, n_head, d_key, d_value)
    # Scale by d_key, the per-head key dimension, as in the definition above.
    ctx_multiheads = scaled_dot_product_attention(q, k, v, attn_bias, d_key,
                                                  dropout_rate)
    out = __combine_heads(ctx_multiheads)
    proj_out = layers.fc(input=out, size=d_model, bias_attr=False,
                         num_flatten_dims=2)
    return proj_out
1.5 Complete Encoder Code
The full encoder can be built by stacking the above components according to the diagram; the complete code is omitted for brevity.
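For readers who want the stacking spelled out, here is a minimal sketch built only from the helpers above; pre_process_layer and post_process_layer are assumed thin wrappers around pre_post_process_layer, and the default command strings ("n" before a sub‑layer, "da" after) reflect the usual pre‑norm arrangement rather than the article's exact code:
from functools import partial

# Assumed wrappers around pre_post_process_layer from section 1.1.
pre_process_layer = partial(pre_post_process_layer, None)  # no residual input
post_process_layer = pre_post_process_layer

def encoder_layer(enc_input, attn_bias, n_head, d_key, d_value, d_model,
                  d_inner_hid, prepostprocess_dropout, attention_dropout,
                  relu_dropout, preprocess_cmd="n", postprocess_cmd="da"):
    # Self-attention sub-layer with pre-norm, then dropout + residual.
    attn_output = multi_head_attention(
        pre_process_layer(enc_input, preprocess_cmd, prepostprocess_dropout),
        None, None, attn_bias, d_key, d_value, d_model, n_head, attention_dropout)
    attn_output = post_process_layer(enc_input, attn_output, postprocess_cmd,
                                     prepostprocess_dropout)
    # Position-wise feed-forward sub-layer.
    ffd_output = positionwise_feed_forward(
        pre_process_layer(attn_output, preprocess_cmd, prepostprocess_dropout),
        d_inner_hid, d_model, relu_dropout)
    return post_process_layer(attn_output, ffd_output, postprocess_cmd,
                              prepostprocess_dropout)

def encoder(enc_input, attn_bias, n_layer, n_head, d_key, d_value, d_model,
            d_inner_hid, prepostprocess_dropout, attention_dropout, relu_dropout,
            preprocess_cmd="n", postprocess_cmd="da"):
    # Stack n_layer identical encoder layers and normalize the final output.
    for _ in range(n_layer):
        enc_input = encoder_layer(enc_input, attn_bias, n_head, d_key, d_value,
                                  d_model, d_inner_hid, prepostprocess_dropout,
                                  attention_dropout, relu_dropout,
                                  preprocess_cmd, postprocess_cmd)
    return pre_process_layer(enc_input, preprocess_cmd, prepostprocess_dropout)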
2. Decoder Part
The decoder mirrors the encoder and adds two extra components: Masked Self‑Attention and Encoder‑Decoder Attention.
2.1 Masked Multi‑Head Attention
During decoding, a mask matrix with zeros in the upper‑triangular part prevents the model from attending to future tokens, ensuring causal generation.
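As an illustrative sketch (not part of the article's code), such a bias can be built with NumPy in the data pipeline; the helper name and the -1e9 "minus infinity" constant are assumptions:
import numpy as np

def causal_attn_bias(batch_size, n_head, trg_len):
    # Future (strictly upper-triangular) positions get a large negative bias,
    # so softmax drives their attention weights to ~0; allowed positions keep 0.
    bias = np.triu(np.ones((trg_len, trg_len), dtype="float32"), k=1) * -1e9
    # Broadcast to [batch_size, n_head, trg_len, trg_len] to match the scores.
    return np.tile(bias.reshape(1, 1, trg_len, trg_len), (batch_size, n_head, 1, 1))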
2.2 Encoder‑Decoder Attention
Queries come from the decoder’s previous layer, while keys and values are derived from the encoder’s output, allowing the decoder to attend to source representations.
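In terms of the multi_head_attention helper from section 1.4, this corresponds to a call of roughly the following form (the variable names here are illustrative, not from the original code):
# Queries come from the decoder's previous sub-layer; keys/values from the encoder output.
enc_attn_output = multi_head_attention(
    dec_hidden, enc_output, enc_output, dec_enc_attn_bias,
    d_key, d_value, d_model, n_head, attention_dropout, static_kv=True)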
2.3 Full Decoder Code
The decoder implementation follows the same pattern as the encoder, with the additional masked attention and encoder‑decoder attention steps.
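As with the encoder, a minimal decoder‑layer sketch can be assembled from the helpers above; it assumes the pre_process_layer/post_process_layer wrappers from the encoder sketch and mirrors the structure rather than reproducing the full official code:
def decoder_layer(dec_input, enc_output, slf_attn_bias, dec_enc_attn_bias,
                  n_head, d_key, d_value, d_model, d_inner_hid,
                  prepostprocess_dropout, attention_dropout, relu_dropout,
                  preprocess_cmd="n", postprocess_cmd="da", cache=None):
    # 1) Masked self-attention over the (shifted) target sequence.
    slf_attn_output = multi_head_attention(
        pre_process_layer(dec_input, preprocess_cmd, prepostprocess_dropout),
        None, None, slf_attn_bias, d_key, d_value, d_model, n_head,
        attention_dropout, cache=cache)
    slf_attn_output = post_process_layer(dec_input, slf_attn_output,
                                         postprocess_cmd, prepostprocess_dropout)
    # 2) Encoder-decoder attention: queries from the decoder, K/V from the encoder.
    enc_attn_output = multi_head_attention(
        pre_process_layer(slf_attn_output, preprocess_cmd, prepostprocess_dropout),
        enc_output, enc_output, dec_enc_attn_bias, d_key, d_value, d_model,
        n_head, attention_dropout, cache=cache, static_kv=True)
    enc_attn_output = post_process_layer(slf_attn_output, enc_attn_output,
                                         postprocess_cmd, prepostprocess_dropout)
    # 3) Position-wise feed-forward network.
    ffd_output = positionwise_feed_forward(
        pre_process_layer(enc_attn_output, preprocess_cmd, prepostprocess_dropout),
        d_inner_hid, d_model, relu_dropout)
    return post_process_layer(enc_attn_output, ffd_output, postprocess_cmd,
                              prepostprocess_dropout)
A full decoder then stacks n_layer such layers, exactly as the encoder sketch does.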
3. Full Transformer
By combining the encoder and decoder, a complete Transformer model is constructed. The top‑level function assembles inputs, handles weight sharing, computes loss with optional label smoothing, and returns training metrics.
def transformer(model_input, src_vocab_size, trg_vocab_size, max_length,
                n_layer, n_head, d_key, d_value, d_model, d_inner_hid,
                prepostprocess_dropout, attention_dropout, relu_dropout,
                preprocess_cmd, postprocess_cmd, weight_sharing,
                label_smooth_eps, bos_idx=0, is_test=False):
    if weight_sharing:
        assert src_vocab_size == trg_vocab_size, (
            "Vocabularies in source and target should be same for weight sharing.")
    enc_inputs = (model_input.src_word, model_input.src_pos,
                  model_input.src_slf_attn_bias)
    dec_inputs = (model_input.trg_word, model_input.trg_pos,
                  model_input.trg_slf_attn_bias, model_input.trg_src_attn_bias)
    label = model_input.lbl_word
    weights = model_input.lbl_weight
    enc_output = wrap_encoder(
        enc_inputs, src_vocab_size, max_length, n_layer, n_head, d_key, d_value,
        d_model, d_inner_hid, prepostprocess_dropout, attention_dropout,
        relu_dropout, preprocess_cmd, postprocess_cmd, weight_sharing,
        bos_idx=bos_idx)
    predict = wrap_decoder(
        dec_inputs, trg_vocab_size, max_length, n_layer, n_head, d_key, d_value,
        d_model, d_inner_hid, prepostprocess_dropout, attention_dropout,
        relu_dropout, preprocess_cmd, postprocess_cmd, weight_sharing,
        enc_output=enc_output)
    # Optional label smoothing turns the hard one-hot targets into soft targets.
    if label_smooth_eps:
        label = layers.label_smooth(
            label=layers.one_hot(input=label, depth=trg_vocab_size),
            epsilon=label_smooth_eps)
    cost = layers.softmax_with_cross_entropy(
        logits=predict, label=label, soft_label=bool(label_smooth_eps))
    # Zero out the loss on padding tokens and normalize by the real token count.
    weighted_cost = layers.elementwise_mul(x=cost, y=weights, axis=0)
    sum_cost = layers.reduce_sum(weighted_cost)
    token_num = layers.reduce_sum(weights)
    token_num.stop_gradient = True
    avg_cost = sum_cost / token_num
    return sum_cost, avg_cost, predict, token_num

def create_net(is_training, model_input, args):
    if is_training:
        sum_cost, avg_cost, _, token_num = transformer(
            model_input, args.src_vocab_size, args.trg_vocab_size,
            args.max_length + 1, args.n_layer, args.n_head, args.d_key,
            args.d_value, args.d_model, args.d_inner_hid,
            args.prepostprocess_dropout, args.attention_dropout,
            args.relu_dropout, args.preprocess_cmd, args.postprocess_cmd,
            args.weight_sharing, args.label_smooth_eps, args.bos_idx)
        return sum_cost, avg_cost, token_num
    else:
        out_ids, out_scores = fast_decode(
            model_input, args.src_vocab_size, args.trg_vocab_size,
            args.max_length + 1, args.n_layer, args.n_head, args.d_key,
            args.d_value, args.d_model, args.d_inner_hid,
            args.prepostprocess_dropout, args.attention_dropout,
            args.relu_dropout, args.preprocess_cmd, args.postprocess_cmd,
            args.weight_sharing, args.beam_size, args.max_out_len, args.bos_idx,
            args.eos_idx)
        return out_ids, out_scores
References
PaddlePaddle official documentation: https://www.paddlepaddle.org.cn/
Deep Residual Learning for Image Recognition (ResNet): https://arxiv.org/abs/1512.03385
Normalization models in deep learning: https://zhuanlan.zhihu.com/p/43200897
Understanding the Attention Mechanism: https://blog.csdn.net/Kaiyuan_sjtu/article/details/81806123