
Understanding Elasticsearch Analyzer, Tokenizer, and Token Filters

This article explains the core components of text analysis in Elasticsearch's full‑text search: Analyzers, Tokenizers, and Token Filters. It covers their roles, the built‑in types, and how they combine to customize text processing for indexing and querying.

System Architect Go

In Elasticsearch (ES), full‑text search relies on analysis: an Analyzer converts text into tokens, and ES uses those tokens to build the inverted index.
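To make the link between tokens and the inverted index concrete, here is a minimal Python sketch. It is not how ES or Lucene store an index; `toy_analyze` and `build_inverted_index` are hypothetical helpers, and the "analyzer" simply lowercases and splits on whitespace.

```python
from collections import defaultdict


def toy_analyze(text):
    # Hypothetical stand-in for an analyzer: lowercase, then split on whitespace.
    return text.lower().split()


def build_inverted_index(docs):
    """Map each term to the sorted list of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in toy_analyze(text):
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


docs = {1: "Quick brown fox", 2: "Lazy brown dog"}
inverted = build_inverted_index(docs)
print(inverted["brown"])  # [1, 2]
print(inverted["fox"])    # [1]
```

A query for "brown" is then a dictionary lookup rather than a scan of every document, which is why the quality of the tokens matters so much.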

An Analyzer consists of three building blocks:

Character filters – process the raw character stream (e.g., remove HTML tags, replace characters, apply regex).

Tokenizer – divides the character stream into individual tokens (words or terms).

Token filters – further modify the token stream (add, delete, or transform tokens).

An Analyzer does not need all three kinds of component: it has zero or more character filters, exactly one tokenizer, and zero or more token filters.
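The ordering of the three stages can be sketched in plain Python. This is an illustration of the composition only, not the ES implementation; `run_analyzer`, `strip_tags`, `split_on_whitespace`, and `lowercase` are hypothetical names, and the tag‑stripping regex is far cruder than a real HTML filter.

```python
import re


def run_analyzer(text, char_filters, tokenizer, token_filters):
    """Apply character filters, then one tokenizer, then token filters, in order."""
    for char_filter in char_filters:      # zero or more: str -> str
        text = char_filter(text)
    tokens = tokenizer(text)              # exactly one: str -> list of tokens
    for token_filter in token_filters:    # zero or more: list -> list
        tokens = token_filter(tokens)
    return tokens


# Hypothetical stand-ins for real components.
def strip_tags(text):
    return re.sub(r"<[^>]+>", "", text)


def split_on_whitespace(text):
    return text.split()


def lowercase(tokens):
    return [t.lower() for t in tokens]


tokens = run_analyzer(
    "<p>Hello Elasticsearch World</p>",
    char_filters=[strip_tags],
    tokenizer=split_on_whitespace,
    token_filters=[lowercase],
)
print(tokens)  # ['hello', 'elasticsearch', 'world']
```

Passing empty lists for both filter stages still works, which mirrors the rule that only the tokenizer is mandatory.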

Analyzer workflow diagram

Character Filters can remove HTML elements, replace specific characters (e.g., abc => 123), or apply regular‑expression replacements. ES provides three built‑in character filter types (illustrated in the following image).
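One way to try character filters in isolation is the `_analyze` API. The sketch below only builds the request body as a Python dict and does not contact a cluster. The filter types `html_strip`, `mapping`, and `pattern_replace` are the three built‑in types; the function name, sample text, and regex are assumptions chosen for illustration.

```python
import json


def char_filter_analyze_body(text):
    """Build an _analyze request body that exercises only character filters."""
    return {
        # The keyword tokenizer emits the whole input as a single token,
        # so the result reflects the character filters alone.
        "tokenizer": "keyword",
        "char_filter": [
            "html_strip",                                      # remove HTML elements
            {"type": "mapping", "mappings": ["abc => 123"]},   # replace specific characters
            {                                                  # regex replacement
                "type": "pattern_replace",
                "pattern": "(\\d+)-(?=\\d)",
                "replacement": "$1_",
            },
        ],
        "text": text,
    }


body = char_filter_analyze_body("<b>abc</b> 555-1234")
print(json.dumps(body, indent=2))
```

The printed JSON could be sent as the body of a `POST _analyze` request with whatever HTTP client is at hand; the exact response depends on the ES version in use.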

Built‑in character filters

A Tokenizer splits the character stream into tokens according to defined rules. The source article also distinguishes words, letters, and characters, as well as tokens and terms (as defined in Lucene). ES includes fifteen built‑in tokenizers, grouped into three categories:

Word‑oriented tokenizers

Tokenizers that operate on parts of a token

Structured‑text tokenizers

Word‑oriented tokenizers
Part‑of‑token tokenizers
Structured‑text tokenizers
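To give a feel for the three categories, here is a loose Python approximation of one tokenizer from each. These functions are hypothetical and only mimic the general behavior of the ES `whitespace`, `edge_ngram`, and `path_hierarchy` tokenizers; the real ones take more options and handle more edge cases.

```python
def whitespace_tokens(text):
    """Word-oriented: split full text into words on whitespace."""
    return text.split()


def edge_ngrams(word, min_gram=1, max_gram=3):
    """Part-of-token: emit prefixes of a word, min_gram to max_gram characters long."""
    longest = min(max_gram, len(word))
    return [word[:n] for n in range(min_gram, longest + 1)]


def path_hierarchy(path, delimiter="/"):
    """Structured text: emit each ancestor of a delimited path."""
    parts = path.split(delimiter)
    tokens = []
    for i in range(1, len(parts) + 1):
        prefix = parts[:i]
        if prefix != [""]:  # skip the empty piece before a leading delimiter
            tokens.append(delimiter.join(prefix))
    return tokens


print(whitespace_tokens("The QUICK fox."))  # ['The', 'QUICK', 'fox.']
print(edge_ngrams("quick"))                 # ['q', 'qu', 'qui']
print(path_hierarchy("/usr/local/bin"))     # ['/usr', '/usr/local', '/usr/local/bin']
```

Prefix tokens like these are what make search‑as‑you‑type matching possible, while path tokens let a query match a directory and everything beneath it.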

Token Filters take the token stream produced by the tokenizer and apply further transformations (add, delete, or modify tokens). ES ships with dozens of built‑in token filters (illustrated below).
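The three kinds of change (transform, delete, add) can each be shown with a small Python function. This is a sketch, not the ES implementation: the stop list and synonym table are made‑up examples, and the function names are hypothetical.

```python
STOP_WORDS = {"the", "a", "an"}   # hypothetical stop list
SYNONYMS = {"quick": ["fast"]}    # hypothetical synonym table


def lowercase(tokens):
    """Transform: rewrite each token."""
    return [t.lower() for t in tokens]


def remove_stop_words(tokens):
    """Delete: drop tokens found in the stop list."""
    return [t for t in tokens if t not in STOP_WORDS]


def add_synonyms(tokens):
    """Add: emit extra tokens next to the originals."""
    out = []
    for t in tokens:
        out.append(t)
        out.extend(SYNONYMS.get(t, []))
    return out


tokens = ["The", "Quick", "Fox"]
for token_filter in (lowercase, remove_stop_words, add_synonyms):
    tokens = token_filter(tokens)
print(tokens)  # ['quick', 'fast', 'fox']
```

Order matters: if the stop filter ran before lowercasing, "The" would not match the lowercase stop list and would survive.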

Token filters overview

ES also provides a collection of built‑in analyzers that combine character filters, a tokenizer, and token filters. By selecting and combining the desired components, users can easily create custom analyzers suited to their specific text‑processing needs.
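As an illustration, a custom analyzer is declared in the index settings under `analysis`. The sketch below only builds that settings body as a Python dict; the analyzer name `my_custom_analyzer`, the filter name `abc_to_123`, and the particular choice of components are assumptions for the example.

```python
import json


def custom_analyzer_settings(name="my_custom_analyzer"):
    """Build index settings that define one custom analyzer."""
    return {
        "settings": {
            "analysis": {
                # A named, configured character filter (hypothetical name).
                "char_filter": {
                    "abc_to_123": {"type": "mapping", "mappings": ["abc => 123"]}
                },
                "analyzer": {
                    name: {
                        "type": "custom",
                        "char_filter": ["html_strip", "abc_to_123"],  # zero or more
                        "tokenizer": "standard",                      # exactly one
                        "filter": ["lowercase", "stop"],              # zero or more
                    }
                },
            }
        }
    }


settings = custom_analyzer_settings()
print(json.dumps(settings, indent=2))
```

The resulting JSON would typically be supplied as the body when the index is created, after which fields in the mapping can refer to the analyzer by name.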

Beyond the built‑in options, many open‑source analyzers (e.g., the IK analyzer for Chinese) are available for specialized use cases.

This article offers a concise overview of the analysis stage in Elasticsearch’s full‑text search pipeline.

