Tagged articles
4 articles
AgentGuide
Apr 12, 2026 · Artificial Intelligence

What Is a Token? A Deep Dive into Tokenization Algorithms for LLMs

The article defines tokens (now officially rendered in Chinese as “词元”), explains why large language models require numeric input, and details three main tokenization strategies (word‑based, character‑based, and subword) along with the subword methods BPE, WordPiece, and Unigram, highlighting their advantages and drawbacks.

BPE · LLM · Unigram
0 likes · 6 min read
Code Mala Tang
Mar 27, 2025 · Artificial Intelligence

How Do BPE, WordPiece, and SentencePiece Shape Modern NLP Tokenization?

This article explains the fundamentals, workflows, examples, and trade‑offs of three major subword tokenization algorithms—Byte Pair Encoding, WordPiece, and SentencePiece—helping practitioners choose the right method for their large language model pipelines.

BPE · NLP · SentencePiece
0 likes · 12 min read
Nightwalker Tech
Jul 18, 2023 · Artificial Intelligence

Implementing the Input Processing Layer of a Transformer Model: Tokenization, Embedding, and Positional Encoding

This article explains how to build the input processing stage of a Transformer—including tokenization with Hugging Face tokenizers, token‑to‑embedding conversion using BERT models, custom BPE tokenizers, and positional encoding—providing complete Python code examples and test results.

BPE · Embedding · Positional Encoding
0 likes · 14 min read
Baobao Algorithm Notes
Mar 25, 2018 · Artificial Intelligence

How to Crush the Kaggle Toxic Comment Challenge: Data Prep, Models, and Ensemble Secrets

This article breaks down the Kaggle toxic comment classification competition, detailing thorough data cleaning, advanced word‑vector techniques, pseudo‑labeling, BPE tokenization, diverse neural models and ensemble strategies, and shares practical insights and pitfalls from the author's nine‑month competition journey.

BPE · Kaggle · NLP
0 likes · 9 min read