
Understanding Elasticsearch Analyzer, Tokenizer, and Token Filters

This article explains the core components of Elasticsearch's full‑text search analysis—Analyzers, Tokenizers, and Token Filters—detailing their roles, building blocks, built‑in types, and how they combine to customize text processing for effective indexing and querying.

System Architect Go

Sep 3, 2018


In Elasticsearch (ES), full‑text search relies on analysis: an analyzer converts text into tokens, and those tokens are what ES stores in the inverted index and matches queries against.
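To make the idea of an inverted index concrete, here is a minimal Python sketch. It is not how ES or Lucene store data internally; the function name `build_inverted_index` and the sample documents are made up for illustration, and lowercasing plus whitespace splitting stands in for a real analyzer.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the sorted list of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        # Stand-in for analysis: lowercase the text and split on whitespace.
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

# Hypothetical sample documents, keyed by document ID.
docs = {1: "Quick brown fox", 2: "Lazy brown dog"}
index = build_inverted_index(docs)
print(index["brown"])  # [1, 2]
print(index["fox"])    # [1]
```

A lookup by term returns the matching documents directly, which is why the quality of the tokens produced during analysis determines what a search can find.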

An analyzer consists of three building blocks:

Character filters – process the raw character stream (e.g., remove HTML tags, replace characters, apply regex).

Tokenizer – divides the character stream into individual tokens (words or terms).

Token filters – further modify the token stream (add, delete, or transform tokens).

An analyzer does not need all three kinds of component: it has zero or more character filters, exactly one tokenizer, and zero or more token filters.
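The composition rule above can be sketched in a few lines of Python. This is an illustrative model only, not ES code: `make_analyzer`, `strip_tags`, `split_whitespace`, and `lowercase` are hypothetical names, and the tag-stripping regex is far cruder than a real HTML filter.

```python
import re
from typing import Callable, List, Sequence

CharFilter = Callable[[str], str]
Tokenizer = Callable[[str], List[str]]
TokenFilter = Callable[[List[str]], List[str]]

def make_analyzer(
    tokenizer: Tokenizer,
    char_filters: Sequence[CharFilter] = (),
    token_filters: Sequence[TokenFilter] = (),
) -> Callable[[str], List[str]]:
    """Compose 0..n character filters, exactly 1 tokenizer, 0..n token filters."""
    def analyze(text: str) -> List[str]:
        for char_filter in char_filters:      # stage 1: rewrite the raw characters
            text = char_filter(text)
        tokens = tokenizer(text)              # stage 2: split into tokens
        for token_filter in token_filters:    # stage 3: rewrite the token stream
            tokens = token_filter(tokens)
        return tokens
    return analyze

# Toy components (assumptions, much simpler than the ES built-ins).
def strip_tags(text: str) -> str:
    return re.sub(r"<[^>]+>", " ", text)

def split_whitespace(text: str) -> List[str]:
    return text.split()

def lowercase(tokens: List[str]) -> List[str]:
    return [t.lower() for t in tokens]

analyze = make_analyzer(split_whitespace, [strip_tags], [lowercase])
print(analyze("<p>Hello <b>World</b></p>"))  # ['hello', 'world']
```

Only the tokenizer is a required argument, which mirrors the "exactly one tokenizer" constraint; both filter lists may be empty.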

Character filters can remove HTML elements, replace specific characters (e.g., abc => 123), or apply regular‑expression replacements. ES provides three built‑in character filters for these tasks: html_strip, mapping, and pattern_replace.
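As a rough illustration of the character-replacement and regular-expression cases, the following Python sketch imitates both behaviors on plain strings. The function names are hypothetical and the real ES implementations differ (for example in how offsets are tracked); the point is only that these filters operate on characters before any tokens exist.

```python
import re

def mapping_filter(text, mappings):
    """Replace each key with its value in one pass, preferring longer keys."""
    if not mappings:
        return text
    keys = sorted(mappings, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(k) for k in keys))
    return pattern.sub(lambda m: mappings[m.group(0)], text)

def pattern_replace_filter(text, pattern, replacement):
    """Apply a regular-expression replacement to the raw character stream."""
    return re.sub(pattern, replacement, text)

# The "abc => 123" example from the text.
print(mapping_filter("abc and abcd", {"abc": "123"}))  # 123 and 123d

# Replace dashes between digit groups with underscores.
print(pattern_replace_filter("123-456-789", r"(\d+)-(?=\d)", r"\1_"))  # 123_456_789
```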

A tokenizer splits the character stream into tokens according to defined rules. The original article also distinguishes between words, letters, and characters, and between tokens and terms (as defined in Lucene). At the time of writing, ES included fifteen built‑in tokenizers, grouped into three categories:

Word‑oriented tokenizers

Partial‑word tokenizers, which break text or words into smaller fragments

Structured‑text tokenizers
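One way to picture the three categories is with a toy stand-in for each. The Python functions below are simplified imitations written for this illustration, not the ES implementations: a whitespace splitter for the word-oriented group, an edge n-gram generator for the partial-word group, and a path splitter for the structured-text group. The edge n-gram version assumes its input is a single word.

```python
def whitespace_tokenizer(text):
    """Word-oriented: one token per whitespace-separated word."""
    return text.split()

def edge_ngram_tokenizer(word, min_gram=1, max_gram=2):
    """Partial-word: prefixes of the word, from min_gram to max_gram characters."""
    longest = min(max_gram, len(word))
    return [word[:n] for n in range(min_gram, longest + 1)]

def path_hierarchy_tokenizer(path, delimiter="/"):
    """Structured text: one token per level of a delimiter-separated path."""
    tokens = []
    start = 1 if path.startswith(delimiter) else 0  # skip a leading delimiter
    idx = path.find(delimiter, start)
    while idx != -1:
        tokens.append(path[:idx])
        idx = path.find(delimiter, idx + 1)
    tokens.append(path)
    return tokens

print(whitespace_tokenizer("The quick  fox"))      # ['The', 'quick', 'fox']
print(edge_ngram_tokenizer("quick", 1, 3))         # ['q', 'qu', 'qui']
print(path_hierarchy_tokenizer("/one/two/three"))  # ['/one', '/one/two', '/one/two/three']
```

Prefix tokens like these are what make search-as-you-type matching possible, while path tokens let a query match a directory and everything beneath it.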

Token filters take the token stream produced by the tokenizer and apply further transformations (add, delete, or modify tokens). ES ships with dozens of built‑in token filters, such as lowercase, stop, and synonym.
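The three kinds of transformation (modify, delete, add) can each be shown with a small Python function. These are hypothetical simplifications: the stop-word list is an arbitrary sample, and this synonym function simply appends the extra tokens after the original rather than tracking token positions.

```python
def lowercase_filter(tokens):
    """Modify: normalize every token to lowercase."""
    return [t.lower() for t in tokens]

def stop_filter(tokens, stopwords=frozenset({"the", "a", "an", "and", "of"})):
    """Delete: drop tokens that appear in the stop-word set."""
    return [t for t in tokens if t not in stopwords]

def synonym_filter(tokens, synonyms):
    """Add: emit each token followed by any synonyms configured for it."""
    out = []
    for t in tokens:
        out.append(t)
        out.extend(synonyms.get(t, []))
    return out

tokens = ["The", "Quick", "Fox"]
tokens = lowercase_filter(tokens)                     # ['the', 'quick', 'fox']
tokens = stop_filter(tokens)                          # ['quick', 'fox']
tokens = synonym_filter(tokens, {"quick": ["fast"]})  # ['quick', 'fast', 'fox']
print(tokens)
```

Order matters: the stop filter here only recognizes lowercase words, so it has to run after the lowercase filter.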

ES also provides a collection of built‑in analyzers that combine character filters, a tokenizer, and token filters. By selecting and combining the desired components, users can easily create custom analyzers suited to their specific text‑processing needs.
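A custom analyzer is declared in the index settings by naming the components to combine. The sketch below builds such a request body as a Python dictionary and prints it as JSON; it does not contact a cluster. The analyzer name `my_html_analyzer` and the sample text are assumptions chosen for illustration, while `html_strip`, `standard`, `lowercase`, and `stop` are built‑in components.

```python
import json

# Body for creating an index (for example, PUT /my_index, a hypothetical name).
index_settings = {
    "settings": {
        "analysis": {
            "analyzer": {
                "my_html_analyzer": {                # hypothetical analyzer name
                    "type": "custom",
                    "char_filter": ["html_strip"],   # zero or more character filters
                    "tokenizer": "standard",         # exactly one tokenizer
                    "filter": ["lowercase", "stop"], # zero or more token filters
                }
            }
        }
    }
}

# Body for the _analyze API on that index, which returns the tokens produced.
analyze_request = {"analyzer": "my_html_analyzer", "text": "<p>The Quick Fox</p>"}

print(json.dumps(index_settings, indent=2))
print(json.dumps(analyze_request))
```

Sending the second body to the index's `_analyze` endpoint is a convenient way to check what an analyzer actually emits before relying on it for indexing.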

Beyond the built‑in options, many open‑source analyzers (e.g., the IK analyzer for Chinese) are available for specialized use cases.

This article offers a concise overview of the analysis stage in Elasticsearch’s full‑text search pipeline.

