Tagged articles

Subword Tokenization

3 articles · Page 1 of 1

Feb 14, 2025 · Artificial Intelligence

Why Sub‑word Tokenizers Power Modern LLMs: From Characters to Tokens

This article explains how language models evolved from character‑level embeddings to word‑level and finally to sub‑word tokenizers, highlighting the efficiency, vocabulary coverage, and practical engineering challenges of sub‑word segmentation in modern AI systems.

AI FundamentalsLLMSubword Tokenization

0 likes · 8 min read

Why Sub‑word Tokenizers Power Modern LLMs: From Characters to Tokens

Baobao Algorithm Notes

Jan 14, 2022 · Artificial Intelligence

BERT Interview Q&A: Decoding CLS, Masks, Complexity, and More

An in‑depth Q&A breaks down core BERT concepts—from the purpose of the [CLS] token and masking strategies to self‑attention complexity, sparse attention tricks, subword handling of OOV words, warm‑up learning rates, GPT’s unidirectional nature, and ALBERT’s parameter sharing—providing concise explanations for each.

BERTMaskingSelf-Attention

0 likes · 7 min read

BERT Interview Q&A: Decoding CLS, Masks, Complexity, and More

DataFunTalk

Jul 30, 2021 · Artificial Intelligence

Fundamentals of Natural Language Processing: Language Models, Smoothing, and Basic Tasks

This article provides a comprehensive overview of natural language processing fundamentals, covering the challenges of language modeling, N‑gram and Markov assumptions, smoothing techniques such as discounting and add‑one, evaluation via perplexity, basic tasks like Chinese word segmentation, subword tokenization, POS tagging, syntactic and semantic parsing, and a range of downstream applications including information extraction, sentiment analysis, question answering, machine translation, and dialogue systems.

AILanguage ModelNLP

0 likes · 29 min read

Fundamentals of Natural Language Processing: Language Models, Smoothing, and Basic Tasks