
Mastering Elasticsearch Analyzers: A Deep Dive into Tokenizers and Filters

This article explains how Elasticsearch analyzers combine character filters, tokenizers, and token filters to perform text analysis. It reviews the built‑in analyzers (standard, simple, stop, whitespace, keyword, pattern, and language), covers the ICU and IK plugin analyzers, and provides practical _analyze API examples.

MaGe Linux Operations

Jun 1, 2020

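In Elasticsearch, text analysis is performed by an analyzer, which converts raw text into a stream of terms (tokens). Analysis runs when a document is indexed and again when a query is parsed, so both sides should normally use the same analyzer (or compatible ones); otherwise the query terms may not match the indexed terms. The _analyze API does not implement analysis itself: it is a testing endpoint that returns the tokens an analyzer produces for a given text.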
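To see why index‑time and query‑time analysis need to agree, here is a minimal pure‑Python sketch. It does not use Elasticsearch at all: the two toy analyzers, build_index, and search are hypothetical helpers written for this illustration, and the inverted index is a plain dictionary.

```python
from collections import defaultdict


def lowercase_analyzer(text):
    """Toy analyzer: split on whitespace, then lower-case each token."""
    return [token.lower() for token in text.split()]


def whitespace_analyzer(text):
    """Toy analyzer: split on whitespace only, keeping the original case."""
    return text.split()


def build_index(docs, analyzer):
    """Map each term produced by the analyzer to the ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in analyzer(text):
            index[term].add(doc_id)
    return index


def search(index, query, analyzer):
    """Return ids of documents that contain every term of the analyzed query."""
    terms = analyzer(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


docs = {1: "Mastering Elasticsearch", 2: "elasticsearch in Action"}
index = build_index(docs, lowercase_analyzer)  # indexed terms are lower-cased

same = search(index, "Elasticsearch", lowercase_analyzer)       # query term: "elasticsearch"
mismatch = search(index, "Elasticsearch", whitespace_analyzer)  # query term: "Elasticsearch"
print(same, mismatch)
```

With the matching analyzer the query finds both documents; with the mismatched one the capitalized query term is never found in the index, so nothing is returned.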

Analysis and Analyzers

Analysis is the process of turning full text into a series of words (terms/tokens). It is realized through an Analyzer, which can be one of Elasticsearch's built‑in analyzers or a custom‑defined one.

Components of an Analyzer

Character Filters: preprocess the original text (e.g., strip HTML).

Tokenizer: splits the text into individual tokens according to defined rules.

Token Filters: post‑process the tokens (lower‑casing, stop‑word removal, synonym expansion, etc.).
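The three stages can be sketched as a small pipeline in Python. This is only an illustration of the order in which the components run, not how Elasticsearch or Lucene implement them; every function name and the three‑word stop list are assumptions made for the example.

```python
import re


def strip_html(text):
    """Character filter: replace anything that looks like an HTML tag with a space."""
    return re.sub(r"<[^>]+>", " ", text)


def whitespace_tokenizer(text):
    """Tokenizer: split the filtered text on runs of whitespace."""
    return text.split()


def lowercase_filter(tokens):
    """Token filter: lower-case every token."""
    return [t.lower() for t in tokens]


def stop_filter(tokens, stop_words=frozenset({"a", "the", "is"})):
    """Token filter: drop tokens found in a (tiny, illustrative) stop list."""
    return [t for t in tokens if t not in stop_words]


def analyze(text, char_filters, tokenizer, token_filters):
    """Run character filters, then the tokenizer, then token filters, in that order."""
    for char_filter in char_filters:
        text = char_filter(text)
    tokens = tokenizer(text)
    for token_filter in token_filters:
        tokens = token_filter(tokens)
    return tokens


tokens = analyze(
    "<p>The Quick <b>Fox</b> is a Mammal</p>",
    char_filters=[strip_html],
    tokenizer=whitespace_tokenizer,
    token_filters=[lowercase_filter, stop_filter],
)
print(tokens)  # ['quick', 'fox', 'mammal']
```

Note that the order of token filters matters: the stop filter only matches "The" because the lowercase filter ran first.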

Built‑in Elasticsearch Analyzers

Standard Analyzer: the default analyzer; splits on word boundaries and lower‑cases tokens.

Simple Analyzer: splits on any non‑letter character, discards those characters, and lower‑cases.

Stop Analyzer: works like the simple analyzer and also removes common stop words (e.g., the, a, is).

Whitespace Analyzer: splits on whitespace only, without lower‑casing.

Keyword Analyzer: does not split; returns the entire input as a single token.

Pattern Analyzer: splits on a regular expression (default \W+, i.e. runs of non‑word characters) and lower‑cases the tokens.

Language Analyzers: language‑specific analyzers (e.g., english) that perform stemming and stop‑word filtering.

ICU Analyzer: provided by the analysis‑icu plugin; improves Unicode handling and the tokenization of Asian languages.

IK Analyzer: a third‑party Chinese analyzer plugin offering the ik_smart and ik_max_word tokenization modes.

_analyze API Overview

Typical usage examples:

// specify analyzer for testing
GET _analyze
{
  "analyzer": "standard",
  "text": "Mastering Elasticsearch, elasticsearch in Action"
}

// analyze a specific field in an index
POST my_index/_analyze
{
  "field": "title",
  "text": "Mastering Elasticsearch"
}

// ad hoc combination of a tokenizer and token filters (no named analyzer required)
POST _analyze
{
  "tokenizer": "standard",
  "filter": ["lowercase"],
  "text": "Mastering Elasticsearch"
}
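Outside the Kibana console, the same three request shapes can be built from a script. The sketch below uses only the Python standard library; BASE_URL, build_analyze_request, and fetch_tokens are hypothetical names, and the localhost address is an assumption to adjust for your own cluster. Only the request construction is exercised here; fetch_tokens needs a reachable node.

```python
import json
import urllib.request

BASE_URL = "http://localhost:9200"  # assumption: point this at your own node


def build_analyze_request(text, analyzer=None, index=None, field=None,
                          tokenizer=None, filters=None):
    """Build (but do not send) a POST request for the _analyze endpoint."""
    path = f"/{index}/_analyze" if index else "/_analyze"
    body = {"text": text}
    if analyzer:
        body["analyzer"] = analyzer    # test a named analyzer
    if field:
        body["field"] = field          # use the analyzer mapped to a field (needs index)
    if tokenizer:
        body["tokenizer"] = tokenizer  # ad hoc tokenizer ...
    if filters:
        body["filter"] = filters       # ... plus ad hoc token filters
    return urllib.request.Request(
        BASE_URL + path,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def fetch_tokens(request):
    """Send the request and return the token strings from the JSON response."""
    with urllib.request.urlopen(request) as response:
        payload = json.load(response)
    return [entry["token"] for entry in payload["tokens"]]


req = build_analyze_request("Mastering Elasticsearch", index="my_index", field="title")
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
```

Calling fetch_tokens(req) against a running cluster would return the list of token strings; it is left uncalled here so the example runs without a server.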

Standard Analyzer Example

GET _analyze
{
  "analyzer": "standard",
  "text": "3 running Quick brown-foxes leap over lazy dogs in the summer evening."
}

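The standard analyzer splits on word boundaries, drops punctuation, and lower‑cases each token: "Quick" becomes "quick", "brown-foxes" becomes "brown" and "foxes", and the number "3" is kept. Stop words such as "in" and "the" are not removed by default.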

Simple Analyzer Example

GET _analyze
{
  "analyzer": "simple",
  "text": "3 running Quick brown-foxes leap over lazy dogs in the summer evening."
}

Every non‑letter character acts as a separator and is discarded, so the leading "3" disappears and "brown-foxes" becomes "brown" and "foxes"; the remaining tokens are lower‑cased.

Stop Analyzer Example

GET _analyze
{
  "analyzer": "stop",
  "text": "3 running Quick brown-foxes leap over lazy dogs in the summer evening."
}

The stop analyzer behaves like the simple analyzer and additionally removes English stop words, so "in" and "the" are dropped along with the leading "3".
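The difference between the simple and stop analyzers can be approximated in a few lines of Python. This is a rough imitation, not the Lucene implementation: simple_analyze and stop_analyze are hypothetical helpers, the regular expression only approximates "runs of letters", and the stop list is assumed to mirror the default English stop words.

```python
import re

# Assumption: this list mirrors the default English stop words used by the stop analyzer.
ENGLISH_STOP_WORDS = frozenset(
    "a an and are as at be but by for if in into is it no not of on or such "
    "that the their then there these they this to was will with".split()
)


def simple_analyze(text):
    """Approximate the simple analyzer: keep runs of letters, lower-cased."""
    # [^\W\d_]+ matches word characters that are neither digits nor underscores.
    return [token.lower() for token in re.findall(r"[^\W\d_]+", text)]


def stop_analyze(text):
    """Approximate the stop analyzer: simple analysis plus stop-word removal."""
    return [token for token in simple_analyze(text) if token not in ENGLISH_STOP_WORDS]


text = "3 running Quick brown-foxes leap over lazy dogs in the summer evening."
simple_tokens = simple_analyze(text)
stop_tokens = stop_analyze(text)
print(simple_tokens)
print(stop_tokens)
```

The two lists differ only by "in" and "the", which is the whole effect of the stop filter on this sentence.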

Whitespace Analyzer Example

GET _analyze
{
  "analyzer": "whitespace",
  "text": "中华 人民 共 和国 3 running Quick brown-foxes leap over lazy dogs in the summer evening."
}

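Tokens are split on whitespace only and no lower‑casing occurs, so "Quick" keeps its capital letter and "brown-foxes" and "evening." stay intact, punctuation included.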

Keyword Analyzer Example

GET _analyze
{
  "analyzer": "keyword",
  "text": "中华 人民 共 和国 3 running Quick brown-foxes leap over lazy dogs in the summer evening."
}

The entire input is returned as a single token, preserving original characters and case.
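As a quick contrast between the last two analyzers, the sketch below imitates them in plain Python. Both helpers are hypothetical stand‑ins written for this example; Python's str.split() is assumed to be close enough to the whitespace tokenizer for the sample text.

```python
def whitespace_analyze(text):
    """Imitate the whitespace analyzer: split on whitespace, change nothing else."""
    return text.split()


def keyword_analyze(text):
    """Imitate the keyword analyzer: the whole input becomes one token."""
    return [text]


text = "中华 人民 共 和国 3 running Quick brown-foxes"
whitespace_tokens = whitespace_analyze(text)
keyword_tokens = keyword_analyze(text)
print(whitespace_tokens)
print(keyword_tokens)
```

The keyword behavior is what you want for identifiers, tags, or status codes that must match exactly rather than word by word.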

Pattern Analyzer Example

GET _analyze
{
  "analyzer": "pattern",
  "text": "中华 人民 共 和国 3 running Quick brown-foxes leap over lazy dogs in the summer evening."
}

Splits on runs of non‑word characters (the default pattern is \W+), so spaces, the hyphen, and the final period all act as separators, and lower‑cases the tokens.
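The idea behind the pattern analyzer, splitting on a regular expression and then lower‑casing, can be sketched with Python's re module. pattern_analyze is a hypothetical helper, and the re.ASCII flag is an assumption made here so that \W behaves like an ASCII‑only character class; the sample input is English only, so the flag does not affect the result shown.

```python
import re


def pattern_analyze(text, pattern=r"\W+", lowercase=True):
    """Split text wherever the pattern matches, drop empty pieces, optionally lower-case."""
    # Assumption: re.ASCII restricts \W to characters outside [A-Za-z0-9_].
    tokens = [piece for piece in re.split(pattern, text, flags=re.ASCII) if piece]
    return [t.lower() for t in tokens] if lowercase else tokens


default_tokens = pattern_analyze("3 running Quick brown-foxes leap over lazy dogs.")
custom_tokens = pattern_analyze("a,b;C", pattern=r"[,;]")
print(default_tokens)
print(custom_tokens)
```

Unlike the simple analyzer, the default pattern keeps the digit "3", because digits count as word characters.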

Language Analyzer (English) Example

GET _analyze
{
  "analyzer": "english",
  "text": "3 running Quick brown-foxes leap over lazy dogs in the summer evening."
}

Applies lower‑casing, stop‑word removal ("in" and "the" are dropped), and stemming (e.g., "running" → "run", "dogs" → "dog").

ICU Analyzer

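The ICU analyzer is not bundled with Elasticsearch. It ships in the analysis‑icu plugin, which can be installed with bin/elasticsearch-plugin install analysis-icu followed by a node restart. It improves Unicode handling and is especially useful for Asian languages. To confirm that the plugin is loaded, list the installed plugins (replace the host and port with those of your own node):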

curl -XGET 'http://192.168.31.215:9201/_cat/plugins?v'

ICU Analyzer Example

GET _analyze
{
  "analyzer": "icu_analyzer",
  "text": "他说的确实在理！"
}

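The standard analyzer emits each Chinese character as a separate token, whereas icu_analyzer uses dictionary‑based segmentation to group characters into words, which usually gives more meaningful tokens for Chinese text.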

IK Chinese Analyzer

IK is a third‑party Chinese analyzer that supports custom dictionaries and hot updates. Installation can be done online or offline, followed by a restart of Elasticsearch.

# Online installation
./bin/elasticsearch-plugin install https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v7.4.0/elasticsearch-analysis-ik-7.4.0.zip

# Offline installation (unzip the zip into the plugins directory)

After installation, the plugin appears in the plugin list:

elastic_node1 analysis-icu 7.4.0
elastic_node1 analysis-ik 7.4.0

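From version 5.0.0 onward, the original ik analyzer has been removed in favor of two analyzers: ik_max_word, which splits text at the finest granularity and emits every word combination it can find, and ik_smart, which splits at the coarsest granularity.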

// ik_max_word
GET _analyze
{
  "analyzer": "ik_max_word",
  "text": "他说的确实在理！"
}

// ik_smart
GET _analyze
{
  "analyzer": "ik_smart",
  "text": "他说的确实在理！"
}
