Tagged articles

tokenizer

23 articles · Page 1 of 1

Jun 30, 2026 · Artificial Intelligence

Anthropic Releases Claude Sonnet 5: Near‑Opus 4.8 Performance and Stronger Agent Skills

Anthropic’s Claude Sonnet 5 arrives with markedly stronger reasoning, tool‑use, and programming abilities than Sonnet 4.6, closing the gap with Opus 4.8 at a lower price tier. The release also brings improved safety scores, higher rate limits, and a new tokenizer that raises token counts; developer feedback on cost is mixed.

AI agents · Anthropic · Claude Sonnet 5

0 likes · 10 min read

Machine Heart

May 29, 2026 · Artificial Intelligence

When a Celebrity Name Stumped LLMs: The Year‑Old Insight Behind Low‑Frequency Token Degradation

A fan's test involving the idol Ma Jiaqi exposed a large language model's inability to generate his name, prompting an analysis that links the failure to low‑frequency token degradation, surveys academic papers on frequency‑aware prompting and training, and points to a tokenizer change by Anthropic as confirmation.

ACL · Anthropic · EMNLP

0 likes · 14 min read

Geek Labs

May 6, 2026 · Artificial Intelligence

Build a GPT from Scratch and Decode AI Coding Jargon with Two Top GitHub Projects

The article introduces two practical GitHub repositories—how-to-train-your-gpt, a step‑by‑step guide that builds a LLaMA‑style GPT model across 12 chapters, and dictionary-of-ai-coding, a plain‑language glossary of AI‑coding terms—showing how they together provide a complete understanding of modern LLM fundamentals and terminology.

AI · GPT · GitHub

0 likes · 9 min read

AI Explorer

Apr 11, 2026 · Artificial Intelligence

How Kronos Redefines Quantitative Analysis with a Financial‑Market Language Model

Kronos, an open‑source large model trained on OHLCV data from over 45 exchanges, treats financial time‑series as a specialized language, using a custom tokenizer and a two‑stage Transformer to enable price prediction, market state detection, signal generation, and risk simulation, with easy Hugging Face integration and a live demo for BTC/USDT.

Kronos · Open-source · Transformer

0 likes · 6 min read

Weekly Large Model Application

Mar 22, 2026 · Artificial Intelligence

Inside MiMo-Audio: Dissecting the Large-Scale Audio Model

The article breaks down MiMo-Audio, a next‑token‑prediction‑style large‑scale audio model built on Qwen2, detailing its acoustic front‑end, RVQ tokenizer, patch‑based transformer architecture, streaming capabilities, performance advantages, engineering constraints, and recommended application scenarios.

Audio Modeling · Few-shot · PATCH

0 likes · 9 min read

Fun with Large Models

Jan 14, 2026 · Artificial Intelligence

Understanding Large Language Model Files: Structure, Tokens, and Inference with Qwen3

This article walks through the complete workflow of loading and running the open‑source Qwen3‑8B model, explaining each core file (weights, config, generation config, tokenizer), how the model tokenizes input, applies chat templates, generates responses, and decodes output, all illustrated with code and diagrams.

ModelScope · Python · Qwen3

0 likes · 16 min read

Tencent Technical Engineering

Dec 24, 2025 · Artificial Intelligence

Build a Mini LLM from Scratch: Step‑by‑Step Guide to Tokenizer, Attention, and Transformer

This article walks through constructing a small‑scale large language model from the ground up, covering model architecture, tokenization methods, BPE vocabulary building, embedding, positional encoding, attention mechanisms, multi‑head attention, transformer blocks, training pipelines, inference, and sampling strategies, all with runnable Python code.

Deep Learning · LLM · Python

0 likes · 34 min read

AI Frontier Lectures

Jul 31, 2025 · Artificial Intelligence

Can a 32‑Token Compressor Generate Images Without Training?

This article reviews a recent study that demonstrates how a highly compressed one‑dimensional tokenizer, using only 32 discrete tokens and gradient‑based test‑time optimization, can generate high‑quality images without training a separate generative model, and explores its methodology, findings, applications, and limitations.

1D tokenizer · AI research · TiTok

0 likes · 10 min read

Architect

May 14, 2025 · Artificial Intelligence

How Qwen3 Controls Hybrid Reasoning with the enable_thinking Parameter

This article explains how Qwen3 implements hybrid (fast/slow) reasoning by using the enable_thinking flag in the tokenizer's apply_chat_template method, detailing the underlying Jinja2 chat template, example prompts, the effect of toggling the flag, and design considerations for future autonomous thinking control.

AI model · ChatML · Hybrid Reasoning

0 likes · 13 min read

Open Source Tech Hub

Mar 31, 2025 · Backend Development

How to Implement Powerful Full‑Text Search in PHP with TNTSearch

This guide explains how to install, configure, and use the PHP‑based TNTSearch engine, covering its key features, required dependencies, index creation, various search modes, dynamic updates, custom tokenizers, geo‑search, and text classification with practical code examples.

Full-Text Search · Search Engine · geo-search

0 likes · 9 min read

AI Large Model Application Practice

Feb 14, 2025 · Artificial Intelligence

Why Sub‑word Tokenizers Power Modern LLMs: From Characters to Tokens

This article explains how language models evolved from character‑level embeddings to word‑level and finally to sub‑word tokenizers, highlighting the efficiency, vocabulary coverage, and practical engineering challenges of sub‑word segmentation in modern AI systems.

AI Fundamentals · LLM · Subword Tokenization

0 likes · 8 min read

Alibaba Cloud Developer

Nov 28, 2024 · Artificial Intelligence

Understanding Tokenizers and Embeddings in Large Language Models

This article introduces the core concepts of tokenizers and embeddings in large language models, explains how they convert text into numeric IDs and dense vectors, compares different tokenization strategies, and provides practical JavaScript and TensorFlow.js code examples for beginners.

AI Fundamentals · JavaScript · LLM

0 likes · 10 min read

Baobao Algorithm Notes

Sep 24, 2024 · Artificial Intelligence

From Zero to One: A Practical Guide to Pretraining Large Language Models

This comprehensive guide walks you through every stage of LLM pretraining—from data sourcing, cleaning, and deduplication to tokenizer design, model architecture choices, training framework selection, optimization tricks, and evaluation methods—highlighting common pitfalls and practical solutions for building robust models.

LLM pretraining · Training Framework · curriculum-learning

0 likes · 34 min read

NetEase Cloud Music Tech Team

Apr 15, 2024 · Mobile Development

Implementation and Optimization of Local Private Domain Search in Cloud Music

The Cloud Music team integrated a lightweight on‑device full‑text engine using SQLite FTS5 with a simple tokenizer, replaced JavaScript matching with SQLite’s bm25(), and parallelized queries. These changes cut search latency by 75% and raised CTR by 13% and average playback by 17 seconds while preserving user privacy.

FTS5 · Full-Text Search · Performance Optimization

0 likes · 15 min read

Zhengcaiyun Tech (政采云技术)

Dec 19, 2023 · Backend Development

Principles and Simple Implementation of a Search Engine in Go

This article explains the fundamental concepts of search engine technology—including forward and inverted indexes, tokenizers, stop words, synonym handling, ranking algorithms, and NLP integration—and provides a concise Go implementation with code examples and performance testing.

Go · Information Retrieval · NLP

0 likes · 21 min read

Tencent Cloud Developer

Jul 19, 2023 · Artificial Intelligence

Build a Full‑Scale LLM from Scratch in 61 Lines of Python

This step‑by‑step tutorial shows how to set up a GPU environment, prepare custom text data, train a tokenizer, configure and train a GPT‑2‑based large language model, test its generation, and run the entire pipeline using only 61 lines of Python code.

AI · Docker · GPT-2

0 likes · 10 min read

Tencent Cloud Developer

Feb 20, 2023 · Mobile Development

iOS WeChat Full-Text Search Technology Upgrade: Selection and Optimization

iOS WeChat’s full‑text search was upgraded by selecting SQLite FTS5, creating a VerbatimTokenizer with multi‑level delimiter support, optimizing table formats to cut index size by 30%, and improving batch index updates and parallel search logic, resulting in 40‑60% lower query latency.

Full-Text Search · Index Optimization · SQLite FTS5

0 likes · 26 min read

WeChat Client Technology Team

Feb 22, 2022 · Mobile Development

How iOS WeChat Supercharged Search with SQLite FTS5 and Custom Tokenizers

This article details the 2021 overhaul of iOS WeChat's full‑text search, covering engine selection, segment‑merge optimization, a new VerbatimTokenizer, multi‑level separator support, table schema choices, asynchronous index updates, and extensive performance gains across chat, contacts, and favorites.

FTS5 · Full-Text Search · Performance Optimization

0 likes · 27 min read

Code DAO

Dec 21, 2021 · Artificial Intelligence

Four Keras Techniques for Preprocessing Text for Deep Learning

This article explains four Keras utilities—text_to_word_sequence, hashing_trick, one_hot, and Tokenizer—showing how each converts raw text into token lists, hash indices, integer encodings, or document matrices, with code examples and sample outputs.

Keras · Text preprocessing · hashing_trick

0 likes · 6 min read

Tencent IMWeb Frontend Team

Aug 26, 2021 · Frontend Development

How to Fix HTML Entity Bugs That Break Rich Text Rendering

This article explains why HTML entities like "&lt;" and "&gt;" can disappear in rich‑text fields, analyzes the underlying tokenizer state machine, and provides a lightweight hack that inserts empty comment nodes to preserve the original text without breaking legacy rendering logic.

Bug Fix · Entity · HTML

0 likes · 12 min read

Efficient Ops

Jun 23, 2021 · Backend Development

Why Can’t Elasticsearch Find My Logs? Uncovering Full‑Text Search Pitfalls and Tokenizer Tweaks

This article explains why large‑scale Elasticsearch clusters may miss log entries during keyword searches, dives into the fundamentals of inverted indexes and tokenization, and demonstrates practical index‑time and query‑time tokenizer optimizations—including custom analyzers for English and Chinese—to dramatically improve search recall and precision.

Elasticsearch · Full-Text Search · inverted index

0 likes · 13 min read

MaGe Linux Operations

Jun 1, 2020 · Backend Development

Mastering Elasticsearch Analyzers: A Deep Dive into Tokenizers and Filters

This article explains how Elasticsearch uses Analyzer components—character filters, tokenizers, and token filters—to perform text analysis, reviews the built‑in analyzers such as standard, simple, stop, whitespace, keyword, pattern, language, ICU and IK, and provides practical _analyze API examples with code snippets and result screenshots.

Elasticsearch · ICU Plugin · Search Engine

0 likes · 11 min read

System Architect Go

Sep 3, 2018 · Fundamentals

Understanding Elasticsearch Analyzer, Tokenizer, and Token Filters

This article explains the core components of Elasticsearch's full‑text search analysis—Analyzers, Tokenizers, and Token Filters—detailing their roles, building blocks, built‑in types, and how they combine to customize text processing for effective indexing and querying.

Elasticsearch · Full-Text Search · Token Filter

0 likes · 5 min read
