
Implementing the Input Processing Layer of a Transformer Model: Tokenization, Embedding, and Positional Encoding

This article explains how to build the input processing stage of a Transformer—including tokenization with Hugging Face tokenizers, token‑to‑embedding conversion using BERT models, custom BPE tokenizers, and positional encoding—providing complete Python code examples and test results.

Nightwalker Tech

The article describes the implementation of the input processing layer of a Transformer model, which converts raw text into representations the encoder can consume.

It first outlines the three main steps: tokenizing the input string into tokens, mapping those tokens to dense vectors via an input embedding matrix, and adding positional encodings so the model can capture token order.
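Before diving into the library-backed code below, the three steps can be sketched end to end with a toy, dependency-free example. The vocabulary, embedding values, and dimensions here are purely illustrative (real Transformers use learned subword vocabularies and trained embedding matrices):

```python
import math

# Step 1 (toy): a tiny word-level vocabulary mapping tokens to integer IDs
vocab = {'my': 0, 'name': 1, 'is': 2, 'black': 3}

# Step 2 (toy): a small embedding matrix, one d_model-sized row per vocab entry
d_model = 4
embedding = [[0.1 * (i + j) for j in range(d_model)] for i in range(len(vocab))]

def positional_encoding(pos, d_model):
    # Step 3: sinusoidal positional encoding from the original Transformer paper
    return [math.sin(pos / 10000 ** (i / d_model)) if i % 2 == 0
            else math.cos(pos / 10000 ** ((i - 1) / d_model))
            for i in range(d_model)]

tokens = 'my name is black'.split()
ids = [vocab[t] for t in tokens]
# Encoder input = token embedding + positional encoding, element-wise
inputs = [[e + p for e, p in zip(embedding[tid], positional_encoding(pos, d_model))]
          for pos, tid in enumerate(ids)]
print(ids)                           # [0, 1, 2, 3]
print(len(inputs), len(inputs[0]))   # 4 4
```

The same additive structure (embedding lookup plus a position-dependent term) is what the real components below implement at scale.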

For tokenization, a Tokenizer class based on Hugging Face's AutoTokenizer is presented, followed by a usage example that tokenizes Chinese and English sentences and prints the token lists and IDs.

import torch
from transformers import BertTokenizer, BertModel, AutoTokenizer, AutoModel

class Tokenizer:
    """Thin wrapper around a pretrained Hugging Face tokenizer."""
    def __init__(self, model_path):
        self.tokenizer = AutoTokenizer.from_pretrained(model_path)
    def tokenize(self, text):
        # Split raw text into subword tokens
        return self.tokenizer.tokenize(text)
    def convert_tokens_to_ids(self, tokens):
        return self.tokenizer.convert_tokens_to_ids(tokens)
    def convert_ids_to_tokens(self, ids):
        return self.tokenizer.convert_ids_to_tokens(ids)
    def convert_tokens_to_string(self, tokens):
        return self.tokenizer.convert_tokens_to_string(tokens)

class InputEmbedding:
    """Maps a raw sequence to contextual embeddings via a pretrained BERT model."""
    def __init__(self, model_path):
        self.embedding_model = BertModel.from_pretrained(model_path)
        self.tokenizer = BertTokenizer.from_pretrained(model_path)
    def get_seq_embedding(self, sequence):
        # Tokenize to tensors, then run the model; element 0 of the output
        # is the last hidden state of shape (batch, seq_len, hidden_size)
        input_tokens = self.tokenizer(sequence, return_tensors='pt')
        output_tensors = self.embedding_model(**input_tokens)
        return output_tensors
    def get_input_seq_ids(self, sequence):
        return self.tokenizer(sequence, return_tensors='pt')
    def get_input_tokens_embedding(self, input_tokens):
        return self.embedding_model(**input_tokens)

A test loop iterates over two example sentences, prints tokens, IDs, and the shape of the embedding tensor, demonstrating that each token is represented by a 768‑dimensional vector.

MODEL_BERT_BASE_ZH = "D:/Data/Models/roc-bert-base-zh"
MODEL_BERT_BASE_CHINESE = "bert-base-chinese"
seqs = ['我的名字叫做黑夜路人', 'My name is Black']

tokenizer = Tokenizer(MODEL_BERT_BASE_ZH)
input_embedding = InputEmbedding(MODEL_BERT_BASE_CHINESE)

for seq in seqs:
    tokens = tokenizer.tokenize(seq)
    print(seq, ' => ', tokens)
    ids = tokenizer.convert_tokens_to_ids(tokens)
    print(seq, ' => ', ids)
    s = input_embedding.get_seq_embedding(seq)
    print(s[0].shape)
    print(s[0])

The article then introduces a custom BPE-based tokenizer (named BlackTokenizer) with encode and decode methods, showing how raw strings are turned into integer token lists and back.

# API: BlackTokenizer.encode(text)
# Encode a string into a list of integer tokens
def encode(self, text):
    """Transforms a string into an array of tokens"""
    if not isinstance(text, str):
        text = text.decode(self._DEFAULT_ENCODING)
    bpe_tokens = []
    matches = self._regex_compiled.findall(text)
    for token in matches:
        token = ''.join([self._byte_encoder[x] for x in self._encode_string(token)])
        new_tokens = [self._encoder[x] for x in self._bpe(token, self._bpe_ranks).split(' ')]
        bpe_tokens.extend(new_tokens)
    return bpe_tokens

# API: BlackTokenizer.decode(tokens)
# Convert the input list of integer tokens back into the original string
def decode(self, tokens):
    """Transforms back an array of tokens into the original string"""
    text = ''.join([self._decoder[x] for x in tokens])
    textarr = [int(self._byte_decoder[x]) for x in list(text)]
    text = bytearray(textarr).decode("utf-8")
    return text

To handle positional information, a PositionalEncoding class is provided, implementing the sinusoidal formulas described in the original Transformer paper.

import math
import torch
import torch.nn as nn

class PositionalEncoding(nn.Module):
    def __init__(self, d_model, dropout, max_len=5000):
        super(PositionalEncoding, self).__init__()
        self.dropout = nn.Dropout(p=dropout)
        # Precompute the sinusoidal table once, for up to max_len positions
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len).unsqueeze(1)
        # 10000.0 is the base used in the original Transformer paper
        div_term = torch.exp(torch.arange(0, d_model, 2) * -(math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        pe = pe.unsqueeze(0)
        self.register_buffer('pe', pe)
    def forward(self, x):
        # Add the encodings for the first x.size(1) positions, then apply dropout
        x = x + self.pe[:, :x.size(1)]
        return self.dropout(x)
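The sinusoidal table can also be exercised standalone. This short sketch (using the paper's 10000 base, with toy batch and sequence sizes chosen for illustration) builds the table directly and adds it to a batch of embeddings, confirming that the addition broadcasts cleanly and leaves the tensor shape unchanged:

```python
import math
import torch

d_model, max_len = 768, 50
position = torch.arange(0, max_len).unsqueeze(1)   # (max_len, 1)
div_term = torch.exp(torch.arange(0, d_model, 2) * -(math.log(10000.0) / d_model))
pe = torch.zeros(max_len, d_model)
pe[:, 0::2] = torch.sin(position * div_term)       # even dims: sine
pe[:, 1::2] = torch.cos(position * div_term)       # odd dims: cosine

x = torch.randn(2, 10, d_model)    # toy batch: 2 sequences of 10 token embeddings
out = x + pe[:10].unsqueeze(0)     # broadcast the positional table over the batch
print(out.shape)                   # torch.Size([2, 10, 768])
```

At position 0 the even dimensions are sin(0) = 0 and the odd dimensions are cos(0) = 1, a quick sanity check that the slicing assignment is correct.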

Additional test code demonstrates the custom tokenizer on multilingual strings, showing token lists, IDs, token counts, and round‑trip decoding results.

seqs = ['我的名字叫做黑夜路人', 'My name is Black', "我的nickname叫heiyeluren", "はじめまして", "잘 부탁 드립니다", "До свидания!", "😊😁😄😉😆🤝👋", "今天的状态很happy,表情是😁"]
print('\n------------------BlackTokenize Test------------------')

tk = BlackTokenize()
for seq in seqs:
    token_list = tk.get_token_list(seq)
    enc_seq = tk.encode(seq)
    dec_seq = tk.decode(enc_seq)
    token_count = tk.count_tokens(seq)
    print('RawText:', seq, '=> TokenList:', token_list, '=> TokenIDs', enc_seq, '=> TokenCount:', token_count, '=> DecodeText:', dec_seq)
print('------------------BlackTokenize Test------------------\n')

The article notes that many tokenizers exist (BERT, spaCy, tiktoken, etc.) and that the choice influences model performance, especially for multilingual data.

For readers who want the full source, a GitHub repository (https://github.com/heiyeluren/black-transformer) is referenced.

Written by Nightwalker Tech
Nightwalker Tech is the tech-sharing channel of "Nightwalker", focusing on AI and large-model technologies, internet architecture design, high-performance networking, and server-side development (Golang, Python, Rust, PHP, C/C++).
