Tagged articles

Tokenization

73 articles · Page 1 of 1

Jul 1, 2026 · Artificial Intelligence

How to Convert Text into Numerical Features for NLP: Tokenization, One‑Hot Encoding, and Word Embedding

This article walks through the essential steps of turning raw natural language into machine‑readable numbers, covering categorical vs. numerical features, one‑hot encoding of categorical data, tokenization, building vocabularies, and using word embeddings, illustrated with an IMDB sentiment‑analysis example in Keras.

Data preprocessing · IMDB sentiment analysis · Keras

0 likes · 7 min read
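
As a rough illustration of the vocabulary-building and one-hot encoding steps this summary mentions, the sketch below uses plain Python rather than Keras. It is not code from the article: the function names `build_vocab` and `one_hot` and the two toy reviews are hypothetical, and whitespace splitting stands in for a real tokenizer.

```python
def build_vocab(sentences):
    """Map each distinct lowercase, whitespace-separated token to an integer index."""
    vocab = {}
    for sentence in sentences:
        for token in sentence.lower().split():
            if token not in vocab:
                vocab[token] = len(vocab)
    return vocab


def one_hot(token, vocab):
    """Return a vector with a single 1 at the token's index (all zeros if unknown)."""
    vector = [0] * len(vocab)
    index = vocab.get(token)
    if index is not None:
        vector[index] = 1
    return vector


# Hypothetical toy data standing in for movie reviews.
reviews = ["a great movie", "a dull movie"]
vocab = build_vocab(reviews)
encoded = [one_hot(token, vocab) for token in reviews[1].split()]
```

One-hot vectors grow with the vocabulary and carry no notion of similarity between words, which is the usual motivation for moving on to learned word embeddings.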

Lisa Notes

Jun 30, 2026 · Artificial Intelligence

NLP Study Notes: 4 Essential Steps for Preprocessing Chinese Text Corpora

This article walks through the four core steps of Chinese NLP corpus preparation—collecting data, cleaning it with regex and encoding detection, tokenizing using dictionary‑based or statistical methods such as jieba, HMM and CRF, and finally standardizing with stop‑word removal, vocabulary building and one‑hot encoding—while illustrating each step with concrete code snippets and practical examples.

CRF · Chinese · NLP

0 likes · 12 min read

Lisa Notes

Jun 29, 2026 · Artificial Intelligence

NLP Basics: Core Concepts, Task Types, and Preprocessing Steps

The article introduces Natural Language Processing as an AI subfield, outlines its four main task categories—classification to sequence, sequence to classification, synchronous and asynchronous seq‑to‑seq—and details the typical preprocessing pipeline including corpus collection, cleaning, tokenization, stemming, lemmatization, POS tagging, NER, and chunking.

NLP · Preprocessing · Task Types

0 likes · 3 min read

Lisa Notes

Jun 27, 2026 · Artificial Intelligence

Getting Started with Stanford CoreNLP: Tokenization, POS, NER, and Parsing

This guide introduces a Python interface to Stanford CoreNLP (itself a Java toolkit) for fundamental NLP tasks such as tokenization, part‑of‑speech tagging, named‑entity recognition, and constituency and dependency parsing, showing installation steps, model download links, and example outputs.

NLP · Named Entity Recognition · POS tagging

0 likes · 4 min read

Lisa Notes

Jun 21, 2026 · Artificial Intelligence

Understanding Byte Pair Encoding (BPE): A Greedy Subword Compression Algorithm for NLP

The article explains how Byte Pair Encoding (BPE) works as a greedy, linear‑time subword segmentation technique, walks through its step‑by‑step token merging process with a concrete sentence example, discusses its strengths in handling OOV words, and outlines its limitations and alternatives such as WordPiece and SentencePiece.

BPE · Byte Pair Encoding · NLP

0 likes · 8 min read
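
The greedy merge loop this summary describes can be sketched in a few lines of Python. This is a minimal training-side illustration, not the article's code: the function names and the toy corpus are hypothetical, words are split on whitespace, and ties between equally frequent pairs are broken by first occurrence.

```python
from collections import Counter


def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    # max() returns the first pair seen among equally frequent ones.
    return max(pairs, key=pairs.get) if pairs else None


def merge_pair(words, pair):
    """Replace every occurrence of `pair` with one merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def learn_bpe(corpus, num_merges):
    """Start from single characters and greedily apply `num_merges` merges."""
    words = dict(Counter(tuple(word) for word in corpus.split()))
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        if pair is None:
            break
        merges.append(pair)
        words = merge_pair(words, pair)
    return merges, words


# Hypothetical toy corpus.
merges, words = learn_bpe("low low low lower lowest", num_merges=2)
```

After two merges on this corpus the shared stem `low` has become a single symbol, while the rarer suffixes remain split into characters. The same property lets BPE represent unseen words from known pieces.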

AI Architecture Hub

Jun 4, 2026 · Artificial Intelligence

10 Essential AI Concepts Every Developer Must Master

This article explains ten core AI concepts—including tokens, embeddings, attention, the Transformer architecture, large language models, hallucination, temperature, context windows, Retrieval‑Augmented Generation, and AI agents—so developers can understand model behavior, avoid common pitfalls, and build reliable AI applications.

AI Fundamentals · AI agents · RAG

0 likes · 15 min read

Machine Learning Algorithms & Natural Language Processing

Jun 3, 2026 · Artificial Intelligence

Can Multimodal Models Ditch Frame Sampling? LLaVA‑OneVision‑2.0’s Codec‑Stream

LLaVA‑OneVision‑2.0 replaces uniform frame sampling with a codec‑stream visual unit, integrates a OneVision‑Encoder that tokenizes video as state‑plus‑incremental evidence, and demonstrates consistent gains on 18 video, 11 spatial‑reasoning and 4 tracking benchmarks while open‑sourcing its model, data and code.

JumpScore · LLaVA-OneVision-2.0 · Multimodal

0 likes · 17 min read

Machine Heart

May 28, 2026 · Artificial Intelligence

Why Google’s AI Can’t Count the Letters in Its Own Name

The article examines why the newly AI‑powered Google Search fails at simple letter‑count questions like “how many P’s are in Google,” tracing the issue to token‑based language models, illustrating it with examples, and discussing both short‑term prompts and long‑term architectural solutions such as byte‑level models.

Google Search · Jagged Intelligence · LLM

0 likes · 13 min read

AI Engineer Programming

May 21, 2026 · Artificial Intelligence

RAG with Multimodal Inputs vs LLM + Toolchains: Handling Non‑Text Data

The article analyzes how large language models process only tokenized text, compares the traditional LLM‑plus‑toolchain pipeline with emerging multimodal models, evaluates their cost, speed, controllability, and hallucination risks, and proposes a hybrid architecture that matches each approach to specific document scenarios.

LLM · Multimodal · RAG

0 likes · 16 min read

Architects' Tech Alliance

May 8, 2026 · Artificial Intelligence

Token Fundamentals: A Technical Panorama of AI Language Units

Tokens are the smallest language building blocks that AI models process, representing characters, words, subwords, punctuation or emojis; they determine context window size and generation speed, so tokenization directly impacts model understanding accuracy and efficiency, as explained in the 2026 Token Report.

AI Fundamentals · Language Models · Model Efficiency

0 likes · 4 min read

Ops Community

Apr 21, 2026 · Artificial Intelligence

How to Tame Unstable LLM Prompts: Causes and Fixes

This article explains why large‑model prompts can yield inconsistent answers, examines the roles of temperature, top‑p/top‑k, tokenization, context windows, position bias, and model randomness, and provides a step‑by‑step debugging workflow and production‑grade best‑practice checklist to achieve stable outputs.

LLM stability · Prompt engineering · Temperature

0 likes · 13 min read
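
To make the role of temperature concrete, here is a minimal sketch of temperature-scaled softmax over a handful of logits. It is an illustration only, not the article's code: the function name `softmax_with_temperature` and the example logits are hypothetical, and real decoders combine this with top‑p/top‑k filtering and random sampling.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Convert logits to probabilities after dividing them by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for three candidate tokens.
logits = [2.0, 1.0, 0.0]
probs_sharp = softmax_with_temperature(logits, temperature=0.5)
probs_flat = softmax_with_temperature(logits, temperature=2.0)
```

A lower temperature concentrates probability on the top candidate, so repeated runs agree more often. A higher temperature flattens the distribution and increases run-to-run variation.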

Geek Labs

Apr 20, 2026 · Artificial Intelligence

A Complete Open‑Source Guide to LLM Internals: From Tokenization to Inference Optimization

This open‑source tutorial breaks down large language model internals into 11 detailed topics—covering BPE tokenization, attention mathematics, backpropagation, transformer architecture, KV‑Cache, Paged and Flash Attention, and frontier techniques—each with numeric derivations and Python code, making it ideal for developers and interview preparation.

Flash Attention · Inference Optimization · KV cache

0 likes · 5 min read

Architect's Tech Stack

Apr 18, 2026 · Artificial Intelligence

What’s New in Claude Opus 4.7? Deep Dive into Capabilities and Migration Tips

Anthropic’s Claude Opus 4.7 launches with enhanced handling of complex, long‑running tasks, higher‑resolution visual analysis, stricter instruction compliance, improved benchmark scores, expanded file‑system memory, new effort levels (xhigh), API task‑budget beta, reinforced security measures, and migration guidance on tokenization and prompt adjustments.

AI model · Anthropic · Claude Opus

0 likes · 4 min read

AgentGuide

Apr 12, 2026 · Artificial Intelligence

What Is a Token? A Deep Dive into Tokenization Algorithms for LLMs

The article defines tokens (for which the official Chinese term is now “词元”), explains why large language models require numeric input, and details three main tokenization strategies—word‑based, character‑based, and subword—along with the subword methods BPE, WordPiece, and Unigram, highlighting their advantages and drawbacks.

BPE · LLM · Tokenization

0 likes · 6 min read

Code Mala Tang

Apr 7, 2026 · Artificial Intelligence

Demystifying LLMs: From Tokens to Agents – An Engineer’s Deep Dive

This article provides a comprehensive, engineering‑focused breakdown of large language models, covering their Transformer roots, tokenization, context windows, prompt engineering, tool integration via MCP, and autonomous agents, while offering practical examples and actionable insights for developers.

AI Fundamentals · Agent · LLM

0 likes · 10 min read

AI Programming Lab

Apr 5, 2026 · Artificial Intelligence

Do You Really Understand Tokens? A Deep Dive Starting from a Claude Code Session

The article explains what tokens are, how different models tokenize text, the role of token embeddings, positional encoding, self‑attention, KV cache, and why output tokens cost far more than input tokens, while also covering pricing differences and prompt‑caching savings across major LLM providers.

KV cache · LLM pricing · Large Language Model

0 likes · 13 min read

AI Info Trend

Apr 2, 2026 · Industry Insights

What Will Shape FinTech in 2026? 9 Key Predictions Unveiled

The 2026 CB Insights FinTech report forecasts nine major trends—including new‑bank entrants, BNPL giants entering banking, Robinhood’s super‑app shift, crypto firms targeting banks, stablecoin‑driven AI payments, and the rise of trusted‑data prediction markets—offering a data‑driven roadmap for industry players and users alike.

AI agents · BNPL · FinTech

0 likes · 10 min read

Machine Learning Algorithms & Natural Language Processing

Mar 31, 2026 · Artificial Intelligence

Unified Multimodal Modeling: How LongCat-Next Bridges Understanding and Generation

The article analyzes why text models naturally combine understanding and generation, explains the fundamental conflicts that prevent images from sharing the same tokenization, and details LongCat-Next’s discrete autoregressive approach—using SAE visual encoders, residual vector quantization, and a unified LLM backbone—to achieve a single model that can both comprehend and create multimodal content.

LongCat-Next · RVQ · Tokenization

0 likes · 21 min read

PaperAgent

Mar 30, 2026 · Artificial Intelligence

How LongCat-Next Redefines Multimodal AI with Discrete Tokens

The LongCat-Next model from Meituan introduces a native multimodal architecture that uses discrete tokenization for vision and audio, achieving unified understanding and generation across modalities while delivering state‑of‑the‑art benchmark performance and simplifying training pipelines.

AI · Meituan · Tokenization

0 likes · 11 min read

Architecture Digest

Mar 24, 2026 · Databases

How to Perform Fuzzy Searches on Encrypted Data: Methods, Pros, and Cons

This article examines why encrypted data hampers fuzzy queries, categorizes three implementation approaches—from naïve in‑memory decryption to conventional token‑based indexing and advanced algorithmic schemes—evaluates their performance, storage overhead, and security trade‑offs, and provides practical references.

Fuzzy Search · Tokenization · security

0 likes · 10 min read

Full-Stack Cultivation Path

Mar 23, 2026 · Artificial Intelligence

What Exactly Is a Token in LLMs? A First‑Principles Explanation

The article explains that a token is the smallest discrete text unit a large language model processes, detailing why tokenization is essential, how tokenizers work, how tokens flow through the transformer, and how token counts affect context windows, cost, latency, and overall model behavior.

Embedding · LLM · Tokenization

0 likes · 20 min read

Weekly Large Model Application

Mar 20, 2026 · Artificial Intelligence

Inside GLM-4-Voice: An End-to-End Chinese-English Speech Dialogue Model

GLM-4-Voice is an end-to-end Chinese-English speech dialogue model that aligns discrete speech tokens with GLM-4-9B, uses VQ-based tokenization at 12.5 token/s, supports emotion, tone, speed and dialect control, and offers streaming inference with low latency, while detailing its architecture, advantages, limitations and suitable use cases.

GLM-4-Voice · Multimodal AI · Tokenization

0 likes · 10 min read

Data STUDIO

Feb 25, 2026 · Artificial Intelligence

Build a Large Language Model from Scratch with PyTorch—No Libraries, No Shortcuts

This guide walks you through building, training, and fine‑tuning a Transformer‑based large language model entirely from scratch using PyTorch, covering tokenization, self‑attention, multi‑head attention, positional encoding, model architecture, data preparation, training loops, and fine‑tuning on custom lyrics.

GPT · LLM · PyTorch

0 likes · 43 min read

AI Architecture Hub

Dec 24, 2025 · Artificial Intelligence

From LLMs to Autonomous Agents: The Three Evolution Stages of AI

This article explains the three evolutionary stages of AI—from large language models that generate text, through workflow‑enhanced systems using retrieval‑augmented generation, to fully autonomous agents capable of self‑directed decision‑making—while detailing the four core technologies that power each stage.

AI evolution · Agent · Embedding

0 likes · 9 min read

Architect

Dec 15, 2025 · Artificial Intelligence

Demystifying LLM Architecture: From Transformers to Modern MoE Designs

This comprehensive guide explains the fundamentals of large language model (LLM) architectures, covering the original Transformer, tokenization, embeddings, positional encoding, attention mechanisms, feed‑forward networks, layer stacking, a step‑by‑step translation example, and the latest open‑source and hybrid LLM designs shaping the field.

Embedding · LLM · MoE

0 likes · 41 min read

Tencent Cloud Developer

Dec 9, 2025 · Artificial Intelligence

How Do Large Language Models Turn Text into Math? A Deep Dive into Transformers

This article walks through the complete workflow of AI large language models, from turning user queries into token matrices via tokenization and embedding, through the Transformer’s self‑attention and multi‑head mechanisms, to decoding logits into human‑readable text, while also covering position encoding, long‑context strategies, generation parameters, and practical engineering tips.

Inference Optimization · Self-Attention · Tokenization

0 likes · 29 min read

ShiZhen AI

Dec 1, 2025 · Artificial Intelligence

AI Comic Episode 3: What Exactly Is a Token?

This episode explains that a token is the smallest text chunk an LLM processes—ranging from characters to subwords—covers why subword tokenization avoids vocabulary explosion, compares token counts across languages, describes the computational cost of sequential generation, and introduces visual tokens for multimodal models.

AI Fundamentals · Multimodal · Tokenization

0 likes · 7 min read

HyperAI Super Neural

Nov 24, 2025 · Artificial Intelligence

Introducing AION-1: The First Astronomical Multimodal Foundation Model Trained on 200M Targets

AION-1, developed by a consortium including UC Berkeley, Cambridge and Oxford, is the first large‑scale multimodal foundation model for astronomy that unifies images, spectra and catalog data via an early‑fusion backbone, achieving zero‑shot and linear‑probe performance that rivals or surpasses task‑specific models across diverse scientific tasks.

Multimodal AI · Tokenization · astronomy

0 likes · 18 min read

Huawei Cloud Developer Alliance

Oct 24, 2025 · Artificial Intelligence

Large Model Essentials: Parameters, Tokens, Context Window & Temperature

This article breaks down five fundamental concepts of large AI models—parameter count, tokenization, context window, context length, and temperature—explaining their impact on model capability, computational cost, generation quality, and how to balance them for optimal performance.

AI · Temperature · Tokenization

0 likes · 7 min read

Volcano Engine Developer Services

Sep 28, 2025 · Artificial Intelligence

Demystifying AI Jargon: A Beginner’s Guide to Large Language Models

This guide breaks down the complex terminology of large language models—explaining tokens, transformers, self‑attention, RAG, scaling laws, dense vs. sparse architectures, and training stages—using clear analogies and step‑by‑step explanations so readers can confidently understand and work with modern AI systems.

AI Fundamentals · Model Training · RAG

0 likes · 35 min read

Bighead's Algorithm Notes

Sep 7, 2025 · Artificial Intelligence

Paper Review: Kronos – A Temporal Foundation Model for Financial Market Language

This article reviews Kronos, a unified and scalable pre‑training framework designed for financial K‑line data, detailing its tokenization approach, autoregressive architecture, large‑scale pre‑training on 12 billion records, and experimental results that show substantial gains in price prediction, volatility forecasting, synthetic data generation, and investment simulation.

Kronos · Tokenization · autoregressive pretraining

0 likes · 9 min read

Qborfy AI

Aug 16, 2025 · Artificial Intelligence

Mastering LLM Tokens: How They Work, Cost, and Choose the Right Model

This article explains what tokens are in large language models, how they are counted and priced, compares tokenization methods across major models, and provides practical guidelines and code examples for optimizing token usage and selecting the appropriate model for different scenarios.

AI · LLM · Prompt engineering

0 likes · 8 min read

Qborfy AI

Aug 12, 2025 · Artificial Intelligence

What Powers Large Language Models? A Deep Dive into LLM Architecture and Scaling

This article explains how massive Transformer‑based large language models compress text data into mathematical representations, why scale, self‑attention, and training paradigms enable emergent general intelligence, and walks through tokenization, embedding, multi‑layer attention, architecture choices, energy costs, and hallucination mitigation.

AI · Embedding · LLM

0 likes · 6 min read

Data Party THU

Aug 5, 2025 · Artificial Intelligence

Why State Space Models May Outperform Transformers: A Deep Dive

The article provides a comprehensive technical analysis of state space models (SSM) versus Transformers, covering their core mechanisms, three essential design factors, training efficiency, scaling behavior, tokenization debates, and experimental evidence that highlights the trade‑offs and potential advantages of SSMs in modern AI systems.

Mamba · State Space Model · Tokenization

0 likes · 21 min read

Ops Development & AI Practice

Aug 4, 2025 · Blockchain

Why ERC-1400 Is the Key to Compliant Security Tokens on Ethereum

The article explains how ERC-1400 extends ERC-20 with built‑in compliance features—such as KYC checks, transfer restrictions, tranche handling, on‑chain document storage, and forced transfer mechanisms—to enable legally compliant tokenization of real‑world assets like equity, bonds, and real‑estate.

ERC-1400 · Ethereum · Regulation

0 likes · 7 min read

AI Frontier Lectures

Jul 24, 2025 · Artificial Intelligence

State Space Models vs Transformers: Uncovering the Real Trade‑offs in Sequence Modeling

This article analyzes the fundamental differences between state space models (SSM) and Transformer architectures, highlighting their three core components, training efficiency, memory handling, tokenization impact, and empirical performance trade‑offs, and argues why SSMs can outperform Transformers on many sequence tasks.

AI Architecture · Tokenization · Transformers

0 likes · 19 min read

Tencent Technical Engineering

May 12, 2025 · Artificial Intelligence

Comprehensive Summary and Expansion of Andrej Karpathy’s 7‑Hour LLM Lecture

This article provides a detailed Chinese‑to‑English summary of Andrej Karpathy’s 7‑hour LLM tutorial, covering chat process analysis, tokenization, pre‑training data pipelines, model architecture, training strategies, post‑training fine‑tuning, reinforcement learning, chain‑of‑thought reasoning, and current industry applications.

AI · LLM · Tokenization

0 likes · 25 min read

AI2ML AI to Machine Learning

Apr 17, 2025 · Artificial Intelligence

Inside Qwen: A Deep Dive into the Large Model’s Source Code

The article provides a comprehensive technical walkthrough of Qwen’s large‑model series, covering data preparation, tokenization, model tweaks, training settings, RLHF pipeline, Code‑Qwen specifics, Qwen2 and Qwen3 architectural changes, scaling‑law experiments, and detailed source‑code analysis with illustrative diagrams.

Large Language Model · MoE · Qwen

0 likes · 7 min read

DevOps

Apr 13, 2025 · Artificial Intelligence

The Amazing Magic of GPT‑4o and a Speculative Technical Roadmap

This article reviews the breakthrough image‑generation capabilities of GPT‑4o, showcases diverse examples, and offers a detailed speculation on its underlying autoregressive architecture, tokenization methods, VQ‑VAE/GAN advances, and training strategies that could explain its performance.

AI research · GPT-4o · Tokenization

0 likes · 16 min read

58 Tech

Apr 11, 2025 · Artificial Intelligence

Optimization of Multimodal Visual Large Model Inference: Pre‑processing, ViT TensorRT, CUDA Graphs, Tokenization, Prefix Cache, and Quantization

This report details a comprehensive set of optimizations for multimodal visual large‑model (VLM) inference—including image pre‑processing acceleration, TensorRT integration for the ViT module, CUDA‑Graph replay, token‑count reduction, prefix‑cache handling, and weight quantization—demonstrating up to three‑fold throughput gains while maintaining accuracy.

CUDA Graph · Multimodal · Quantization

0 likes · 19 min read

AI Algorithm Path

Apr 10, 2025 · Artificial Intelligence

Beginner-Friendly Guide to Understanding Large Language Models

This article walks readers through the fundamentals of large language models, covering what tokens are, how tokenization works, the conversion of tokens to numeric IDs, the transformer architecture—including positional encoding, self‑attention, feed‑forward networks and softmax—and explains how these components enable next‑token prediction.

Embedding · LLM · Self-Attention

0 likes · 9 min read

Code Mala Tang

Mar 27, 2025 · Artificial Intelligence

How Do BPE, WordPiece, and SentencePiece Shape Modern NLP Tokenization?

This article explains the fundamentals, workflows, examples, and trade‑offs of three major subword tokenization algorithms—Byte Pair Encoding, WordPiece, and SentencePiece—helping practitioners choose the right method for their large language model pipelines.

BPE · NLP · SentencePiece

0 likes · 12 min read

NewBeeNLP

Jan 17, 2025 · Artificial Intelligence

Unlocking Multimodal Intelligence: A Deep Dive into Next Token Prediction

This comprehensive survey examines the foundations, tokenization techniques, model architectures, training paradigms, evaluation benchmarks, and open challenges of multimodal next‑token prediction (MMNTP), offering researchers a clear roadmap for future advances in multimodal AI.

Evaluation · Multimodal AI · Next Token Prediction

0 likes · 9 min read

Alibaba Cloud Developer

Dec 17, 2024 · Frontend Development

Choosing the Best LangChain Text Splitter for Frontend LLM Apps

This article compares five LangChain text splitters—CharacterTextSplitter, RecursiveCharacterTextSplitter, TokenTextSplitter, MarkdownTextSplitter, and LatexTextSplitter—by examining their principles, pros and cons, and ideal use cases, helping developers select the most suitable splitter for their frontend large‑model applications.

JavaScript · LLM · LangChain

0 likes · 10 min read

DaTaobao Tech

Dec 9, 2024 · Artificial Intelligence

Analyzing LLM Failure Cases: Tokenization, Next‑Token Prediction, and Chain‑of‑Thought Prompting

The article explains how tokenization mismatches and biased next‑token prediction cause LLMs to miscount letters in “Strawberry” and incorrectly compare 9.9 versus 9.11, and shows that step‑by‑step Chain‑of‑Thought prompting with reason‑first output dramatically improves accuracy.

AI · Chain-of-Thought · LLM

0 likes · 13 min read

Infra Learning Club

Oct 31, 2024 · Artificial Intelligence

What Is a Token in Large Language Models?

The article explains that a token is the unit processed by large language models, describes three common tokenizer methods—word‑level, character‑level, and sub‑word level—with English and Chinese examples, discusses their advantages and limitations, and shows how OpenAI’s tokenizer varies across model versions.

NLP · Token · Tokenization

0 likes · 5 min read

Java Tech Enthusiast

Sep 15, 2024 · Fundamentals

How Source Code Is Transformed into Machine Instructions

A compiler transforms source code into executable machine instructions by first tokenizing the text into keywords, identifiers and literals, then parsing these tokens into an abstract syntax tree, generating and optimizing intermediate code such as LLVM IR, and finally assembling and linking the output for the target architecture.

AST · LLVM · Machine Code

0 likes · 4 min read
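
The first stage named in this summary, tokenizing source text into keywords, identifiers and literals, can be sketched with a small regular-expression lexer. This is an illustration under stated assumptions, not the article's code: the token categories, the keyword set and the function name `tokenize` are hypothetical and cover only a tiny expression language.

```python
import re

# Hypothetical token categories for a toy language; order matters for matching.
TOKEN_SPEC = [
    ("NUMBER", r"\d+(?:\.\d+)?"),
    ("IDENT", r"[A-Za-z_]\w*"),
    ("OP", r"[+\-*/=()]"),
    ("SKIP", r"\s+"),
]
KEYWORDS = {"if", "else", "while", "return"}
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))


def tokenize(source):
    """Split source text into (kind, text) pairs, dropping whitespace."""
    tokens = []
    pos = 0
    while pos < len(source):
        match = MASTER.match(source, pos)
        if match is None:
            raise SyntaxError(f"unexpected character {source[pos]!r} at offset {pos}")
        kind, text = match.lastgroup, match.group()
        pos = match.end()
        if kind == "SKIP":
            continue
        if kind == "IDENT" and text in KEYWORDS:
            kind = "KEYWORD"  # reserved words are identifiers with special meaning
        tokens.append((kind, text))
    return tokens


tokens = tokenize("return x = 42 + y")
```

A parser would then consume this flat token list to build the abstract syntax tree.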

Liangxu Linux

Aug 20, 2024 · Fundamentals

How Does Code Transform Into Machine Instructions? A Step‑by‑Step Compiler Guide

This article walks through how a compiler turns human‑readable source code into binary machine instructions, covering tokenization, parsing, abstract syntax tree construction, code generation, optimization, and linking, while highlighting the role of LLVM as a portable backend.

AST · LLVM · Linker

0 likes · 5 min read

21CTO

Aug 11, 2024 · Artificial Intelligence

Demystifying LLMs: How Tokens, Training, and Transformers Power Generative AI

This article explains the fundamentals of large language models, covering tokenization, probability prediction, Markov chain basics, training data limitations, context windows, and the transition to neural network architectures like Transformers, while providing Python examples and insights into model scaling and the illusion of intelligence.

AI · LLM · Tokenization

0 likes · 18 min read

Architect

Aug 11, 2024 · Artificial Intelligence

Understanding Large Language Models: Tokens, Tokenization, and the Evolution from Markov Chains to Transformers

This article explains how generative AI models work by demystifying tokens, tokenization with tools like tiktoken, simple Markov‑chain training, the limitations of small context windows, and how modern LLMs use neural networks, transformers and attention mechanisms to predict the next token.

LLM · Markov chain · Tokenization

0 likes · 20 min read
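
The "simple Markov‑chain training" idea can be sketched as a bigram model that remembers which tokens followed each token and samples from them. This is a minimal illustration, not the article's code: the function names and toy sentence are hypothetical, and whitespace splitting stands in for a real tokenizer such as tiktoken.

```python
import random
from collections import defaultdict


def train_bigram_model(tokens):
    """Record, for each token, every token that followed it in the training data."""
    followers = defaultdict(list)
    for current, following in zip(tokens, tokens[1:]):
        followers[current].append(following)
    return followers


def generate(model, start, length, seed=0):
    """Generate up to `length` tokens, each sampled from the previous token's followers."""
    rng = random.Random(seed)
    output = [start]
    for _ in range(length - 1):
        candidates = model.get(output[-1])
        if not candidates:
            break  # the last token was never followed by anything in training
        output.append(rng.choice(candidates))
    return output


# Hypothetical toy training text.
tokens = "the cat sat on the mat".split()
model = train_bigram_model(tokens)
sample = generate(model, "cat", 3)
```

Because the model only looks one token back, its context window is a single token. That is the limitation larger contexts and attention are meant to address.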

Java Architect Essentials

Aug 1, 2024 · Backend Development

Implementing Fuzzy Company Name Matching with MySQL RegExp in a Business Approval Workflow

This article describes a business approval scenario where a company name entered by a business user must be checked for duplicates, and explains how to implement fuzzy matching using MySQL RegExp, tokenization with IKAnalyzer, and Java service code to extract, preprocess, match, and rank results by relevance.

Java · Tokenization · backend

0 likes · 11 min read

IT Services Circle

Jul 17, 2024 · Artificial Intelligence

Why Large Language Models Mistake 9.11 > 9.9: Prompting, Tokenizer Effects, and Recent Findings

The article examines why leading large language models such as GPT‑4o, Gemini Advanced, and Claude 3.5 incorrectly claim that 9.11 is larger than 9.9, analyzes tokenization and prompting strategies that cause the error, and discusses recent research and OpenAI model updates.

AI reasoning · Numerical Comparison · Prompt engineering

0 likes · 7 min read

Java Tech Enthusiast

Jul 16, 2024 · Artificial Intelligence

LLMs Misjudge Simple Number Comparison: 9.11 vs 9.9

Recent tests reveal that popular large language models—including GPT‑4o, Gemini Advanced, and Claude 3.5—often claim 9.11 is larger than 9.9 because their tokenizers split the numbers, but rephrasing, zero‑shot chain‑of‑thought prompts, or treating the values as floating‑point numbers can correct the mistake, a pattern also seen variably in Chinese models.

AI evaluation · LLM · Prompt engineering

0 likes · 7 min read

JD Cloud Developers

Jun 25, 2024 · Artificial Intelligence

Why Do Large Language Models Output Text Word‑by‑Word? Inside the Transformer Mechanics

This article explains the fundamental architecture of large language models, from the two files that make up a model (parameters and code), through neural network basics, perceptrons, and weight training, to the Transformer’s tokenization, positional encoding, self‑attention, and inference processes, illustrated with diagrams and examples.

Large Language Model · Neural Network · Self-Attention

0 likes · 22 min read

Rare Earth Juejin Tech Community

Dec 15, 2023 · Artificial Intelligence

AIGC Tutorial: Tokenization, POS Tagging, and Named Entity Recognition with Transformers, NLTK, and spaCy

This tutorial introduces AIGC concepts and walks through practical implementations of tokenization, part‑of‑speech tagging, and named entity recognition using the Transformers library, NLTK, and spaCy on Google Colab, complete with code snippets and visual results.

AIGC · NLP · NLTK

0 likes · 10 min read

Python Programming Learning Circle

Nov 17, 2023 · Big Data

Building a Simple Search Engine with Bloom Filter, Tokenization, and Inverted Index in Python

This article demonstrates how to implement a basic big‑data search engine in Python by creating a Bloom filter for fast existence checks, designing tokenization functions for major and minor segmentation, building an inverted index, and supporting AND/OR queries with example code and execution results.

Big Data · Search · Tokenization

0 likes · 12 min read

Meituan Technology Team

Sep 22, 2022 · Information Security

Tokenization for Data Security: Design, Implementation, and Engineering Practices

The article explains how tokenization transforms data security into a built‑in attribute that automatically scales with data growth, detailing its design principles, generation methods, architectural layers, security safeguards, and practical engineering experiences to address exposure risks in modern digital businesses.

Data Governance · Data Security · PII

0 likes · 24 min read

Programmer DD

Aug 30, 2022 · Artificial Intelligence

How to Build a Custom HanLP Analyzer Plugin for Elasticsearch with Nginx

This guide walks through setting up a Java GraalVM 17 environment, installing Nginx to serve static dictionary files, configuring a HanLP‑based Elasticsearch analyzer plugin, packaging and deploying it, and testing the analyzer with JUnit5 and curl commands.

Elasticsearch · HanLP · Java

0 likes · 14 min read

Baidu Geek Talk

Mar 21, 2022 · Frontend Development

How WebKit Parses HTML: Decoding, Tokenization, and DOM Tree Construction

The article details WebKit’s rendering pipeline in WKWebView, describing how the network process streams HTML bytes to the rendering process, which decodes them via TextResourceDecoder, tokenizes the characters with HTMLTokenizer’s state machine, and constructs an efficient DOM tree using HTMLTreeBuilder and queued insertion tasks.

DOM · Tokenization · WebKit

0 likes · 33 min read

Baidu App Technology

Mar 7, 2022 · Mobile Development

How WKWebView Parses HTML: Decoding, Tokenization, and DOM Tree Construction

WKWebView parses HTML by streaming bytes from the network process to the rendering process, decoding them into characters, tokenizing into HTML tokens, building a DOM tree through node creation and insertion, and finally laying out and painting the document using a doubly‑linked in‑memory structure.

DOM · Tokenization · WKWebView

0 likes · 37 min read

Aikesheng Open Source Community

Jun 23, 2021 · Databases

Using MySQL Ngram Plugin to Enable Accurate Full‑Text Search for Chinese Text

This article explains why MySQL's default full‑text index struggles with Chinese, demonstrates how to configure token size parameters, activate the ngram parser plugin, and adjust queries (including Boolean mode) to achieve reliable Chinese full‑text search results.

Boolean Mode · Chinese · Full-Text Search

0 likes · 12 min read

Python Crawling & Data Mining

Jun 16, 2021 · Artificial Intelligence

Master Chinese Text Segmentation with jieba: Installation, Modes, and Advanced Tricks

This tutorial walks you through installing the jieba Python library, explains its three segmentation modes—precise, full, and search—demonstrates how to add or delete words, manage custom dictionaries, handle stop words, perform weight analysis, adjust word frequencies, and retrieve token positions, all with clear code examples and visual output.

NLP · Python · Tokenization

0 likes · 10 min read

360 Quality & Efficiency

Jul 3, 2020 · Backend Development

Understanding PHP_CodeSniffer: Tokenization, Lexical Analysis, and Custom Rule Creation

This article explains how PHP_CodeSniffer parses PHP source code into tokens using lexical analysis, demonstrates token extraction with token_get_all, and guides readers through creating a custom rule to prohibit hash‑style comments, covering rule library setup, Sniff implementation, and execution.

Custom Rules · PHP · PHP_CodeSniffer

0 likes · 12 min read

Java High-Performance Architecture

Jan 20, 2020 · Fundamentals

How Inverted Indexes Power Fast Full-Text Search

This article explains what an inverted index is, why it’s essential for full‑text search, how it is built and queried, and common token transformations such as stop‑word removal, lemmatization, and stemming.

Full-Text Search · Information Retrieval · Search Engine

0 likes · 4 min read

WecTeam

Oct 24, 2019 · Fundamentals

How to Build a JavaScript Lexer for Arithmetic Expressions Using a Finite State Machine

This article explains how to implement a lexical analyzer in JavaScript that tokenizes simple arithmetic expressions by using a finite state machine, covering the conversion from infix notation to an abstract syntax tree, token definitions, state transitions, and complete source code examples.

AST · Finite State Machine · JavaScript

0 likes · 9 min read

Java Captain

Feb 17, 2019 · Backend Development

Implementing a JSON Parser in Java: Structures, Tokenization, and Parsing

This article explains the fundamentals of JSON, its object and array structures, maps JSON types to Java equivalents, and provides a complete Java implementation of a JSON parser including token definitions, lexical analysis, and object/array construction with detailed code examples.

Java · Parser · Tokenization

0 likes · 14 min read

MaGe Linux Operations

Jan 23, 2019 · Big Data

How Bloom Filters Power Fast Big Data Searches with Python

This tutorial walks through building a simple Python search engine for big data, covering Bloom filter basics, tokenization with major and minor segmentation, inverted index creation, and implementing both simple and complex (AND/OR) queries, complete with code examples and visual illustrations.

AND/OR queries · Big Data · Python

0 likes · 15 min read

MaGe Linux Operations

Nov 27, 2018 · Big Data

How a Simple Python Bloom Filter Powers Fast Big Data Search

This article demonstrates how to implement a basic Bloom filter, tokenization, and inverted index in Python to illustrate the core principles of big‑data search, including fast negative lookups, term segmentation, and support for AND/OR queries.

AND/OR queries · Tokenization · big data search

0 likes · 13 min read

High Availability Architecture

Jun 4, 2018 · Blockchain

Key Competitive Points of Public Chains and Insights from the GIAC Shenzhen Conference

The article summarizes the author’s takeaways from the GIAC Shenzhen conference, analyzing various public‑chain projects, their architectural choices, competitive focuses such as scalability, asset tokenization, DApp support, and the role of alliance chains in finance, traceability, and anti‑counterfeiting.

Tokenization · UTXO · alliance chain

0 likes · 8 min read

Architects Research Society

Mar 31, 2018 · Blockchain

Understanding Blockchain Technology: Records, Identity, Smart Contracts, and Real‑World Applications

This article explains how blockchain, as a decentralized and trust‑less ledger, enables digital identity, tokenized assets, smart contracts, automated governance, and streamlined settlement across sectors such as finance, government, and healthcare, while also discussing its challenges and future potential.

Governance · Tokenization · decentralized ledger

0 likes · 7 min read

MaGe Linux Operations

Dec 3, 2017 · Big Data

Build a Simple Big Data Search Engine with Bloom Filters and Tokenization in Python

This article walks through implementing a basic big‑data search system in Python, covering Bloom filter basics, tokenization of text, inverted index construction, and how to combine these techniques to support fast AND/OR queries.

Big Data · Python · Search

0 likes · 13 min read

StarRing Big Data Open Lab

Sep 15, 2017 · Big Data

Boost Full‑Text Search with Search SQL: Tokenization, CONTAINS, NEAR & FUZZY

This article explains how Search SQL enables easy full‑text search on Transwarp Search by using standard SQL syntax, covering tokenization, analyzer configuration, CONTAINS queries, and advanced NEAR and FUZZY operators to improve performance and query semantics.

FUZZY · Full-Text Search · NEAR

0 likes · 9 min read

ITPUB

Dec 23, 2015 · Artificial Intelligence

How Computers Turn Words into Numbers: A Beginner’s Guide to Tokenization and Vector Similarity

This article explains how natural language processing stores word meanings as numeric vectors, builds token dictionaries, represents sentences as binary vectors, and uses dot‑product calculations to measure similarity, illustrating concepts with simple examples and highlighting current limitations and future directions.

NLP · Tokenization · artificial-intelligence

0 likes · 7 min read
