Mastering Elasticsearch Analyzers: A Deep Dive into Tokenizers and Filters
This article explains how Elasticsearch performs text analysis with the three components of an analyzer (character filters, tokenizers, and token filters), reviews the built-in standard, simple, stop, whitespace, keyword, pattern, and language analyzers along with the ICU and IK plugin analyzers, and provides practical _analyze API examples.
In Elasticsearch, text analysis is performed by an analyzer that converts raw text into a stream of terms (tokens). This article introduces the concept of analysis and uses the _analyze API to test how each analyzer processes sample text. Analysis runs both when documents are indexed and when query text is searched, so in most cases the same analyzer should be used at index time and query time; otherwise the query terms may not match the indexed terms.
Analysis and Analyzers
Analysis is the process of turning full text into a series of words (terms/tokens). It is realized through an Analyzer, which can be one of Elasticsearch's built‑in analyzers or a custom‑defined one.
Components of an Analyzer
Character Filters: preprocess the original text (e.g., strip HTML).
Tokenizer: splits the text into individual tokens according to defined rules.
Token Filters: post-process the tokens (lower-casing, stop-word removal, synonym expansion, etc.).
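The three stages can be pictured as a small function pipeline. The Python sketch below is only a toy model of that flow, not Elasticsearch's implementation: the function names (strip_html, whitespace_tokenize, lowercase_and_drop_stops) and the tiny stop-word set are invented for illustration.

```python
import re

# Hypothetical, illustrative stop-word set (not Elasticsearch's list).
STOP_WORDS = {"the", "a", "is", "in"}

def strip_html(text):
    """Character filter: remove anything that looks like an HTML tag."""
    return re.sub(r"<[^>]+>", "", text)

def whitespace_tokenize(text):
    """Tokenizer: split the text on runs of whitespace."""
    return text.split()

def lowercase_and_drop_stops(tokens):
    """Token filters: lower-case each token, then drop stop words."""
    lowered = [t.lower() for t in tokens]
    return [t for t in lowered if t not in STOP_WORDS]

def analyze(text):
    """Run the stages in the order an analyzer applies them."""
    return lowercase_and_drop_stops(whitespace_tokenize(strip_html(text)))

print(analyze("<p>The Quick brown fox is in the garden</p>"))
# → ['quick', 'brown', 'fox', 'garden']
```

The order matters: character filters see the raw string, the tokenizer sees the cleaned string, and token filters only ever see individual tokens.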
Built‑in Elasticsearch Analyzers
Standard Analyzer: the default analyzer; splits on word boundaries and lower-cases tokens.
Simple Analyzer: splits on any non-letter character, discarding digits and symbols, and lower-cases tokens.
Stop Analyzer: behaves like the Simple Analyzer and also removes common stop words (e.g., the, a, is).
Whitespace Analyzer: splits on whitespace only, without lower-casing.
Keyword Analyzer: does not split; returns the entire input as a single token.
Pattern Analyzer: splits on a regular expression (default \W+, i.e., runs of non-word characters) and lower-cases tokens.
Language Analyzers: language-specific analyzers (e.g., English) that perform stemming and stop-word filtering.
ICU Analyzer: provided by the analysis-icu plugin; adds Unicode support and better handling of Asian languages.
IK Analyzer: a third-party Chinese analyzer plugin offering the ik_smart and ik_max_word tokenization modes.
_analyze API Overview
Typical usage examples:
// test a named analyzer
GET _analyze
{
  "analyzer": "standard",
  "text": "Mastering Elasticsearch, elasticsearch in Action"
}
// analyze text with the analyzer configured for a field of an index
POST my_index/_analyze
{
  "field": "title",
  "text": "Mastering Elasticsearch"
}
// test an ad hoc combination of tokenizer and token filters
POST _analyze
{
  "tokenizer": "standard",
  "filter": ["lowercase"],
  "text": "Mastering Elasticsearch"
}
Standard Analyzer Example
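GET _analyze
{
  "analyzer": "standard",
  "text": "3 running Quick brown-foxes leap over lazy dogs in the summer evening."
}
The text is split on word boundaries, so "brown-foxes" becomes "brown" and "foxes", the trailing period is dropped, and every token is lower-cased. The numeral "3" is kept, and stop words such as "in" and "the" are not removed by default.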
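The _analyze API returns a JSON object with a "tokens" array, where each entry carries the token text together with its offsets, type, and position. As a minimal sketch, the Python helper below pulls just the token strings out of such a response. The sample_response dictionary is hand-written for the shorter input "Quick brown-foxes" to show the shape of the response; it was not captured from a real cluster, and token_strings is a hypothetical helper name.

```python
def token_strings(response):
    """Return only the token text from an _analyze-style response dict."""
    return [entry["token"] for entry in response.get("tokens", [])]

# Hand-written illustration of the response shape (assumed, not real output).
sample_response = {
    "tokens": [
        {"token": "quick", "start_offset": 0, "end_offset": 5,
         "type": "<ALPHANUM>", "position": 0},
        {"token": "brown", "start_offset": 6, "end_offset": 11,
         "type": "<ALPHANUM>", "position": 1},
        {"token": "foxes", "start_offset": 12, "end_offset": 17,
         "type": "<ALPHANUM>", "position": 2},
    ]
}

print(token_strings(sample_response))
# → ['quick', 'brown', 'foxes']
```

Reducing the response to a plain list makes it easier to compare analyzers side by side when the offsets and positions are not of interest.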
Simple Analyzer Example
GET _analyze
{
  "analyzer": "simple",
  "text": "3 running Quick brown-foxes leap over lazy dogs in the summer evening."
}
Non-letter characters (e.g., the leading "3" and the hyphen) are removed, and the remaining tokens are lower-cased.
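The behaviour described above can be approximated in a few lines of Python. This is a rough imitation for intuition only, assuming "letter" means a Unicode letter; simple_analyze is a hypothetical name, and the regular expression is a stand-in rather than the tokenizer Elasticsearch actually uses.

```python
import re

def simple_analyze(text):
    """Approximate the simple analyzer: keep runs of letters, lower-cased."""
    # [^\W\d_]+ matches one or more word characters that are neither
    # digits nor underscores, which leaves only letters.
    return [t.lower() for t in re.findall(r"[^\W\d_]+", text)]

text = "3 running Quick brown-foxes leap over lazy dogs in the summer evening."
print(simple_analyze(text))
# → ['running', 'quick', 'brown', 'foxes', 'leap', 'over', 'lazy',
#    'dogs', 'in', 'the', 'summer', 'evening']
```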
Stop Analyzer Example
GET _analyze
{
  "analyzer": "stop",
  "text": "3 running Quick brown-foxes leap over lazy dogs in the summer evening."
}
The output matches that of the simple analyzer (non-letters removed, tokens lower-cased), except that stop words such as "in" and "the" are also removed.
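Stop-word removal is just a membership test applied after tokenization. The sketch below approximates it in Python; stop_analyze is a hypothetical name, the letter-matching regular expression is a stand-in for the real tokenizer, and the ENGLISH_STOP_WORDS set is written out here on the assumption that it matches the documented default English list, so verify it against your Elasticsearch version.

```python
import re

# Assumed to match the default English stop-word list; verify before relying on it.
ENGLISH_STOP_WORDS = {
    "a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "if",
    "in", "into", "is", "it", "no", "not", "of", "on", "or", "such", "that",
    "the", "their", "then", "there", "these", "they", "this", "to", "was",
    "will", "with",
}

def stop_analyze(text, stop_words=ENGLISH_STOP_WORDS):
    """Approximate the stop analyzer: letters only, lower-cased, stop words dropped."""
    tokens = [t.lower() for t in re.findall(r"[^\W\d_]+", text)]
    return [t for t in tokens if t not in stop_words]

text = "3 running Quick brown-foxes leap over lazy dogs in the summer evening."
print(stop_analyze(text))
# → ['running', 'quick', 'brown', 'foxes', 'leap', 'over', 'lazy',
#    'dogs', 'summer', 'evening']
```

Passing a different stop_words set shows how a custom list would change the result without touching the tokenization step.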
Whitespace Analyzer Example
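GET _analyze
{
  "analyzer": "whitespace",
  "text": "中华 人民 共 和国 3 running Quick brown-foxes leap over lazy dogs in the summer evening."
}
Tokens are split on whitespace only. No lower-casing occurs, and punctuation stays attached, so "Quick", "brown-foxes", and "evening." are emitted unchanged.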
Keyword Analyzer Example
GET _analyze
{
  "analyzer": "keyword",
  "text": "中华 人民 共 和国 3 running Quick brown-foxes leap over lazy dogs in the summer evening."
}
The entire input is returned as a single token, preserving original characters and case.
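The whitespace and keyword analyzers are simple enough to mimic directly, which makes the contrast between them easy to see. The Python sketch below is an approximation for intuition, with hypothetical function names; it is not how Elasticsearch implements either analyzer.

```python
def whitespace_analyze(text):
    """Approximate the whitespace analyzer: split on whitespace, change nothing else."""
    return text.split()

def keyword_analyze(text):
    """Approximate the keyword analyzer: the whole input is one token."""
    return [text]

text = "中华 人民 共 和国 3 running Quick brown-foxes leap over lazy dogs in the summer evening."

print(whitespace_analyze(text))
# → ['中华', '人民', '共', '和国', '3', 'running', 'Quick', 'brown-foxes',
#    'leap', 'over', 'lazy', 'dogs', 'in', 'the', 'summer', 'evening.']

print(len(keyword_analyze(text)))
# → 1
```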
Pattern Analyzer Example
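GET _analyze
{
  "analyzer": "pattern",
  "text": "中华 人民 共 和国 3 running Quick brown-foxes leap over lazy dogs in the summer evening."
}
Splits on non-word characters (the default pattern \W+ matches runs of anything other than letters, digits, and underscores), so "brown-foxes" becomes two tokens, and lower-cases the results.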
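Splitting on \W+ can be tried out with any regular-expression library. The Python sketch below uses only the ASCII sentence from the earlier examples, because regex engines differ in how \W treats non-ASCII characters and Python's behaviour should not be assumed to match Elasticsearch's. pattern_analyze is a hypothetical name.

```python
import re

def pattern_analyze(text, pattern=r"\W+"):
    """Approximate the pattern analyzer: split on the pattern, lower-case, drop empties."""
    return [t.lower() for t in re.split(pattern, text) if t]

text = "3 running Quick brown-foxes leap over lazy dogs in the summer evening."
print(pattern_analyze(text))
# → ['3', 'running', 'quick', 'brown', 'foxes', 'leap', 'over', 'lazy',
#    'dogs', 'in', 'the', 'summer', 'evening']
```

Unlike the simple analyzer, the digit "3" survives here because digits count as word characters.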
Language Analyzer (English) Example
GET _analyze
{
  "analyzer": "english",
  "text": "3 running Quick brown-foxes leap over lazy dogs in the summer evening."
}
Applies stemming (e.g., "running" → "run"), lower-casing, and stop-word removal.
ICU Analyzer
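The ICU Analyzer is provided by the analysis-icu plugin, which adds ICU-based Unicode support and is especially useful for Asian languages. Once the plugin is installed, you can confirm that it is loaded by listing the node's plugins:
curl -XGET 'http://192.168.31.215:9201/_cat/plugins?v'
ICU Analyzer Example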
GET _analyze
{
  "analyzer": "icu_analyzer",
  "text": "他说的确实在理!"
}
The ICU analyzer groups the Chinese characters into multi-character words, whereas the standard analyzer emits each Chinese character as a separate token.
IK Chinese Analyzer
IK is a third‑party Chinese analyzer that supports custom dictionaries and hot updates. Installation can be done online or offline, followed by a restart of Elasticsearch.
# Online installation
./bin/elasticsearch-plugin install https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v7.4.0/elasticsearch-analysis-ik-7.4.0.zip
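# Offline installation (unzip the release zip into a subdirectory of the plugins directory)
After installation and a restart, the plugin appears in the plugin list:
elastic_node1 analysis-icu 7.4.0
elastic_node1 analysis-ik 7.4.0
From version 5.0.0 onward, the original ik analyzer is replaced by two analyzers: ik_max_word, which segments the text at the finest granularity and emits as many candidate words as possible, and ik_smart, which produces the coarsest segmentation.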
// ik_max_word
GET _analyze
{
  "analyzer": "ik_max_word",
  "text": "他说的确实在理!"
}
// ik_smart
GET _analyze
{
  "analyzer": "ik_smart",
  "text": "他说的确实在理!"
}
MaGe Linux Operations
