Demystifying the Transformer: Step‑by‑Step PaddlePaddle Implementation
This article provides a comprehensive, code‑rich walkthrough of the Transformer architecture in PaddlePaddle: the encoder and decoder components, residual connections, layer normalization, feed‑forward networks, scaled dot‑product and multi‑head attention, and finally how to assemble the full model with its training and inference entry points.
Preface
For readers learning PaddlePaddle, the author recommends starting with the official documentation and then diving into practical code. The article presents a complete Transformer implementation for NLP, with visual diagrams and step‑by‑step code snippets.
1. Encoder Part
The encoder block consists of four main parts: Self‑Attention, Feed‑Forward, Residual Connection, and Layer Normalization.
1.1 Residuals & Layer Norm
Residual connections help mitigate information loss in deep networks, as described in the ResNet paper. Layer Normalization is used to stabilize training, following the normalization models in deep learning literature.
def pre_post_process_layer(prev_out, out, process_cmd, dropout_rate=0.):
    """Apply residual connection ("a"), layer normalization ("n") and dropout ("d")
    to `out`, in the order given by process_cmd."""
    for cmd in process_cmd:
        if cmd == "a":  # add residual connection
            out = out + prev_out if prev_out else out
        elif cmd == "n":  # add layer normalization
            out = layers.layer_norm(
                out,
                begin_norm_axis=len(out.shape) - 1,
                param_attr=fluid.initializer.Constant(1.),
                bias_attr=fluid.initializer.Constant(0.))
        elif cmd == "d":  # add dropout
            if dropout_rate:
                out = layers.dropout(
                    out,
                    dropout_prob=dropout_rate,
                    seed=dropout_seed,
                    is_test=False)
    return out
1.2 Feed Forward
The feed‑forward layer contains two linear transformations with a ReLU activation in between. It operates position‑wise, sharing parameters across positions but not across layers.
def positionwise_feed_forward(x, d_inner_hid, d_hid, dropout_rate):
    """Two linear transformations with a ReLU in between, applied position-wise."""
    hidden = layers.fc(input=x, size=d_inner_hid, num_flatten_dims=2, act="relu")
    if dropout_rate:
        hidden = layers.dropout(
            hidden, dropout_prob=dropout_rate, seed=dropout_seed, is_test=False)
    out = layers.fc(input=hidden, size=d_hid, num_flatten_dims=2)
    return out
1.3 Self‑Attention
Attention can be understood as a weighted sum. The scaled dot‑product attention follows four steps:
Each input token is embedded into a word vector.
Three learnable matrices project the embeddings into queries (Q), keys (K), and values (V).
Similarity scores are computed between Q and K.
Scores are normalized with softmax and used to weight V, producing the attention output.
A step‑by‑step visualization of this computation is provided in the accompanying figures.
def __compute_qkv(queries, keys, values, n_head, d_key, d_value):
    """Linearly project the inputs into multi-head queries, keys and values."""
    q = layers.fc(input=queries, size=d_key * n_head, bias_attr=False,
                  num_flatten_dims=2)
    # cache and static_kv come from the enclosing multi_head_attention scope;
    # when decoding against a static encoder output, the key/value projections
    # are created in the parent block so they can be reused across time steps.
    fc_layer = wrap_layer_with_block(
        layers.fc,
        fluid.default_main_program().current_block().parent_idx
    ) if cache is not None and static_kv else layers.fc
    k = fc_layer(input=keys, size=d_key * n_head, bias_attr=False,
                 num_flatten_dims=2)
    v = fc_layer(input=values, size=d_value * n_head, bias_attr=False,
                 num_flatten_dims=2)
    return q, k, v

def scaled_dot_product_attention(q, k, v, attn_bias, d_key, dropout_rate):
    """Compute softmax(Q K^T / sqrt(d_key) + attn_bias) V."""
    product = layers.matmul(x=q, y=k, transpose_y=True, alpha=d_key**-0.5)
    if attn_bias:
        product += attn_bias
    weights = layers.softmax(product)
    if dropout_rate:
        weights = layers.dropout(
            weights, dropout_prob=dropout_rate, seed=dropout_seed, is_test=False)
    out = layers.matmul(weights, v)
    return out
The attn_bias argument is used to mask padding positions in the encoder self‑attention, the decoder's encoder‑decoder attention, and the decoder's masked self‑attention; in the last case it also hides future positions, as described in section 2.1.
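As a concrete illustration (not part of the original code), the padding bias can be prepared on the data side with NumPy before being fed to the network; the helper name, the pad_idx default, and the -1e9 constant below are assumptions:
import numpy as np

def padding_attn_bias(src_words, n_head, pad_idx=0):
    # src_words: [batch_size, src_len] array of token ids; pad_idx assumed to be 0.
    batch_size, src_len = src_words.shape
    bias = (src_words == pad_idx).astype("float32") * -1e9  # [batch, src_len]
    # Broadcast to [batch, n_head, src_len (queries), src_len (keys)]
    # to match the shape of the attention scores.
    return np.tile(bias.reshape(batch_size, 1, 1, src_len),
                   (1, n_head, src_len, 1))
The resulting tensor is simply added to the attention scores before the softmax, exactly as scaled_dot_product_attention does above, so padded keys receive near‑zero attention weight.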
1.4 Multi‑Head Attention
Multi‑head attention allows the model to attend to information from different representation subspaces.
def __split_heads_qkv(queries, keys, values, n_head, d_key, d_value):
    """Split the last dimension into heads: [batch, seq_len, n_head * d_x]
    becomes [batch, n_head, seq_len, d_x]."""
    reshaped_q = layers.reshape(x=queries, shape=[0, 0, n_head, d_key], inplace=True)
    q = layers.transpose(x=reshaped_q, perm=[0, 2, 1, 3])
    reshaped_k = layers.reshape(x=keys, shape=[0, 0, n_head, d_key], inplace=True)
    k = layers.transpose(x=reshaped_k, perm=[0, 2, 1, 3])
    reshaped_v = layers.reshape(x=values, shape=[0, 0, n_head, d_value], inplace=True)
    v = layers.transpose(x=reshaped_v, perm=[0, 2, 1, 3])
    if cache is not None:  # cache comes from the enclosing multi_head_attention
        # storing k/v for incremental decoding is omitted for brevity
        pass
    return q, k, v

def __combine_heads(x):
    """Inverse of __split_heads_qkv: merge the heads back into the last dimension."""
    if len(x.shape) != 4:
        raise ValueError("Input(x) should be a 4-D Tensor.")
    trans_x = layers.transpose(x, perm=[0, 2, 1, 3])
    return layers.reshape(
        x=trans_x,
        shape=[0, 0, trans_x.shape[2] * trans_x.shape[3]],
        inplace=True)

def multi_head_attention(queries, keys, values, attn_bias, d_key, d_value,
                         d_model, n_head=1, dropout_rate=0., cache=None,
                         static_kv=False):
    """Project to Q/K/V, split into heads, attend, then recombine and project
    back to d_model."""
    keys = queries if keys is None else keys
    values = keys if values is None else values
    if not (len(queries.shape) == len(keys.shape) == len(values.shape) == 3):
        raise ValueError(
            "Inputs: queries, keys and values should all be 3-D tensors.")
    q, k, v = __compute_qkv(queries, keys, values, n_head, d_key, d_value)
    q, k, v = __split_heads_qkv(q, k, v, n_head, d_key, d_value)
    # Scale by d_key, the per-head key dimension, as in the definition above.
    ctx_multiheads = scaled_dot_product_attention(q, k, v, attn_bias, d_key,
                                                  dropout_rate)
    out = __combine_heads(ctx_multiheads)
    proj_out = layers.fc(input=out, size=d_model, bias_attr=False,
                         num_flatten_dims=2)
    return proj_out
1.5 Complete Encoder Code
The full encoder can be built by stacking the above components according to the diagram; the complete code is omitted for brevity.
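For readers who want the stacking spelled out, here is a minimal sketch built only from the helpers above; pre_process_layer and post_process_layer are assumed thin wrappers around pre_post_process_layer, and the default command strings ("n" before a sub‑layer, "da" after) reflect the usual pre‑norm arrangement rather than the article's exact code:
from functools import partial

# Assumed wrappers around pre_post_process_layer from section 1.1.
pre_process_layer = partial(pre_post_process_layer, None)  # no residual input
post_process_layer = pre_post_process_layer

def encoder_layer(enc_input, attn_bias, n_head, d_key, d_value, d_model,
                  d_inner_hid, prepostprocess_dropout, attention_dropout,
                  relu_dropout, preprocess_cmd="n", postprocess_cmd="da"):
    # Self-attention sub-layer with pre-norm, then dropout + residual.
    attn_output = multi_head_attention(
        pre_process_layer(enc_input, preprocess_cmd, prepostprocess_dropout),
        None, None, attn_bias, d_key, d_value, d_model, n_head, attention_dropout)
    attn_output = post_process_layer(enc_input, attn_output, postprocess_cmd,
                                     prepostprocess_dropout)
    # Position-wise feed-forward sub-layer.
    ffd_output = positionwise_feed_forward(
        pre_process_layer(attn_output, preprocess_cmd, prepostprocess_dropout),
        d_inner_hid, d_model, relu_dropout)
    return post_process_layer(attn_output, ffd_output, postprocess_cmd,
                              prepostprocess_dropout)

def encoder(enc_input, attn_bias, n_layer, n_head, d_key, d_value, d_model,
            d_inner_hid, prepostprocess_dropout, attention_dropout, relu_dropout,
            preprocess_cmd="n", postprocess_cmd="da"):
    # Stack n_layer identical encoder layers and normalize the final output.
    for _ in range(n_layer):
        enc_input = encoder_layer(enc_input, attn_bias, n_head, d_key, d_value,
                                  d_model, d_inner_hid, prepostprocess_dropout,
                                  attention_dropout, relu_dropout,
                                  preprocess_cmd, postprocess_cmd)
    return pre_process_layer(enc_input, preprocess_cmd, prepostprocess_dropout)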
2. Decoder Part
The decoder mirrors the encoder and adds two extra components: Masked Self‑Attention and Encoder‑Decoder Attention.
2.1 Masked Multi‑Head Attention
During decoding, a mask matrix with zeros in the upper‑triangular part prevents the model from attending to future tokens, ensuring causal generation.
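As an illustrative sketch (not part of the article's code), such a bias can be built with NumPy in the data pipeline; the helper name and the -1e9 "minus infinity" constant are assumptions:
import numpy as np

def causal_attn_bias(batch_size, n_head, trg_len):
    # Future (strictly upper-triangular) positions get a large negative bias,
    # so softmax drives their attention weights to ~0; allowed positions keep 0.
    bias = np.triu(np.ones((trg_len, trg_len), dtype="float32"), k=1) * -1e9
    # Broadcast to [batch_size, n_head, trg_len, trg_len] to match the scores.
    return np.tile(bias.reshape(1, 1, trg_len, trg_len), (batch_size, n_head, 1, 1))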
2.2 Encoder‑Decoder Attention
Queries come from the decoder’s previous layer, while keys and values are derived from the encoder’s output, allowing the decoder to attend to source representations.
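In terms of the multi_head_attention helper from section 1.4, this corresponds to a call of roughly the following form (the variable names here are illustrative, not from the original code):
# Queries come from the decoder's previous sub-layer; keys/values from the encoder output.
enc_attn_output = multi_head_attention(
    dec_hidden, enc_output, enc_output, dec_enc_attn_bias,
    d_key, d_value, d_model, n_head, attention_dropout, static_kv=True)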
2.3 Full Decoder Code
The decoder implementation follows the same pattern as the encoder, with the additional masked attention and encoder‑decoder attention steps.
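As with the encoder, a minimal decoder‑layer sketch can be assembled from the helpers above; it assumes the pre_process_layer/post_process_layer wrappers from the encoder sketch and mirrors the structure rather than reproducing the full official code:
def decoder_layer(dec_input, enc_output, slf_attn_bias, dec_enc_attn_bias,
                  n_head, d_key, d_value, d_model, d_inner_hid,
                  prepostprocess_dropout, attention_dropout, relu_dropout,
                  preprocess_cmd="n", postprocess_cmd="da", cache=None):
    # 1) Masked self-attention over the (shifted) target sequence.
    slf_attn_output = multi_head_attention(
        pre_process_layer(dec_input, preprocess_cmd, prepostprocess_dropout),
        None, None, slf_attn_bias, d_key, d_value, d_model, n_head,
        attention_dropout, cache=cache)
    slf_attn_output = post_process_layer(dec_input, slf_attn_output,
                                         postprocess_cmd, prepostprocess_dropout)
    # 2) Encoder-decoder attention: queries from the decoder, K/V from the encoder.
    enc_attn_output = multi_head_attention(
        pre_process_layer(slf_attn_output, preprocess_cmd, prepostprocess_dropout),
        enc_output, enc_output, dec_enc_attn_bias, d_key, d_value, d_model,
        n_head, attention_dropout, cache=cache, static_kv=True)
    enc_attn_output = post_process_layer(slf_attn_output, enc_attn_output,
                                         postprocess_cmd, prepostprocess_dropout)
    # 3) Position-wise feed-forward network.
    ffd_output = positionwise_feed_forward(
        pre_process_layer(enc_attn_output, preprocess_cmd, prepostprocess_dropout),
        d_inner_hid, d_model, relu_dropout)
    return post_process_layer(enc_attn_output, ffd_output, postprocess_cmd,
                              prepostprocess_dropout)
A full decoder then stacks n_layer such layers, exactly as the encoder sketch does.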
3. Full Transformer
By combining the encoder and decoder, a complete Transformer model is constructed. The top‑level function assembles inputs, handles weight sharing, computes loss with optional label smoothing, and returns training metrics.
def transformer(model_input, src_vocab_size, trg_vocab_size, max_length,
                n_layer, n_head, d_key, d_value, d_model, d_inner_hid,
                prepostprocess_dropout, attention_dropout, relu_dropout,
                preprocess_cmd, postprocess_cmd, weight_sharing,
                label_smooth_eps, bos_idx=0, is_test=False):
    if weight_sharing:
        assert src_vocab_size == trg_vocab_size, (
            "Vocabularies in source and target should be same for weight sharing.")
    enc_inputs = (model_input.src_word, model_input.src_pos,
                  model_input.src_slf_attn_bias)
    dec_inputs = (model_input.trg_word, model_input.trg_pos,
                  model_input.trg_slf_attn_bias, model_input.trg_src_attn_bias)
    label = model_input.lbl_word
    weights = model_input.lbl_weight
    enc_output = wrap_encoder(
        enc_inputs, src_vocab_size, max_length, n_layer, n_head, d_key, d_value,
        d_model, d_inner_hid, prepostprocess_dropout, attention_dropout,
        relu_dropout, preprocess_cmd, postprocess_cmd, weight_sharing,
        bos_idx=bos_idx)
    predict = wrap_decoder(
        dec_inputs, trg_vocab_size, max_length, n_layer, n_head, d_key, d_value,
        d_model, d_inner_hid, prepostprocess_dropout, attention_dropout,
        relu_dropout, preprocess_cmd, postprocess_cmd, weight_sharing,
        enc_output=enc_output)
    # Optional label smoothing turns the hard one-hot targets into soft targets.
    if label_smooth_eps:
        label = layers.label_smooth(
            label=layers.one_hot(input=label, depth=trg_vocab_size),
            epsilon=label_smooth_eps)
    cost = layers.softmax_with_cross_entropy(
        logits=predict, label=label, soft_label=bool(label_smooth_eps))
    # Zero out the loss on padding tokens and normalize by the real token count.
    weighted_cost = layers.elementwise_mul(x=cost, y=weights, axis=0)
    sum_cost = layers.reduce_sum(weighted_cost)
    token_num = layers.reduce_sum(weights)
    token_num.stop_gradient = True
    avg_cost = sum_cost / token_num
    return sum_cost, avg_cost, predict, token_num

def create_net(is_training, model_input, args):
    if is_training:
        sum_cost, avg_cost, _, token_num = transformer(
            model_input, args.src_vocab_size, args.trg_vocab_size,
            args.max_length + 1, args.n_layer, args.n_head, args.d_key,
            args.d_value, args.d_model, args.d_inner_hid,
            args.prepostprocess_dropout, args.attention_dropout,
            args.relu_dropout, args.preprocess_cmd, args.postprocess_cmd,
            args.weight_sharing, args.label_smooth_eps, args.bos_idx)
        return sum_cost, avg_cost, token_num
    else:
        out_ids, out_scores = fast_decode(
            model_input, args.src_vocab_size, args.trg_vocab_size,
            args.max_length + 1, args.n_layer, args.n_head, args.d_key,
            args.d_value, args.d_model, args.d_inner_hid,
            args.prepostprocess_dropout, args.attention_dropout,
            args.relu_dropout, args.preprocess_cmd, args.postprocess_cmd,
            args.weight_sharing, args.beam_size, args.max_out_len, args.bos_idx,
            args.eos_idx)
        return out_ids, out_scores
References
PaddlePaddle official documentation: https://www.paddlepaddle.org.cn/
Deep Residual Learning for Image Recognition (ResNet): https://arxiv.org/abs/1512.03385
Normalization models in deep learning: https://zhuanlan.zhihu.com/p/43200897
Understanding the Attention Mechanism: https://blog.csdn.net/Kaiyuan_sjtu/article/details/81806123