Elasticsearch Analysis: Analyzers, Tokenizers, Filters, and API Usage
This article explains how Elasticsearch processes text before indexing by describing the analysis pipeline, built‑in and custom analyzers, tokenizers, token filters, n‑gram techniques, the analysis API, and the IK Chinese tokenizer plugin, providing practical curl examples throughout.
What is Analysis
Analysis in Elasticsearch occurs before a document is indexed; each field passes through character filters, tokenization, token filters, and finally token indexing, producing the inverted index used for search.
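The stages above can be illustrated with a small pure-Python sketch. It is a toy model, not Elasticsearch code: the function names, the "&" replacement rule, and the sample documents are assumptions made for illustration only.

```python
from collections import defaultdict

def char_filter(text):
    # Character filter stage: rewrite raw characters before tokenization.
    # The "&" -> " and " rule is a hypothetical example mapping.
    return text.replace("&", " and ")

def tokenize(text):
    # Tokenizer stage: split the filtered text on whitespace.
    return text.split()

def token_filters(tokens):
    # Token filter stage: lowercase every token.
    return [t.lower() for t in tokens]

def analyze(text):
    # Run the three stages in the order described above.
    return token_filters(tokenize(char_filter(text)))

def build_inverted_index(docs):
    # Map each term to the sorted list of document IDs that contain it.
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in analyze(text):
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

docs = {1: "Big Data & NoSQL", 2: "big ideas"}
index = build_inverted_index(docs)
print(analyze(docs[1]))  # ['big', 'data', 'and', 'nosql']
print(index["big"])      # [1, 2]
```

A search for "big" then only needs a dictionary lookup, which is the reason the analyzed terms, rather than the raw text, are what gets indexed.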
Analysis of Documents
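Analyzers can be defined per index at creation time, which lets you attach a custom pipeline to specific fields. Older releases also accepted the same definitions globally in the elasticsearch.yml configuration file; from Elasticsearch 5.0 onward, index-level settings such as analysis are no longer read from elasticsearch.yml, so on the 6.4.3 installation used later in this article the per-index form is the one to use.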
Custom Analyzer Example
curl -XPUT '172.16.1.127:9200/myindex?pretty' -H 'Content-Type: application/json' -d '{
  "settings": {
    "number_of_shards": 2,
    "number_of_replicas": 1,
    "index": {
      "analysis": {
        "analyzer": {
          "myCustomAnalyzer": {
            "type": "custom",
            "tokenizer": "myCustomTokenizer",
            "filter": ["myCustomFilter1", "myCustomFilter2"],
            "char_filter": ["myCustomCharFilter"]
          }
        },
        "tokenizer": {
          "myCustomTokenizer": { "type": "letter" }
        },
        "filter": {
          "myCustomFilter1": { "type": "lowercase" },
          "myCustomFilter2": { "type": "kstem" }
        },
        "char_filter": {
          "myCustomCharFilter": {
            "type": "mapping",
            "mappings": ["ph=>f", "u=>you"]
          }
        }
      }
    }
  },
  "mappings": {
    "_doc": {
      "properties": {
        "description": { "type": "text", "analyzer": "myCustomAnalyzer" },
        "name": { "type": "text", "analyzer": "standard", "fields": { "raw": { "index": false, "type": "text" } } }
      }
    }
  }
}'
Analyzer Configuration in elasticsearch.yml
index:
  analysis:
    analyzer:
      myCustomAnalyzer:
        type: custom
        tokenizer: myCustomTokenizer
        filter: [myCustomFilter1, myCustomFilter2]
        char_filter: myCustomCharFilter
    tokenizer:
      myCustomTokenizer:
        type: letter
    filter:
      myCustomFilter1:
        type: lowercase
      myCustomFilter2:
        type: kstem
    char_filter:
      myCustomCharFilter:
        type: mapping
        mappings: ["ph=>f", "u=>you"]
Analysis API
The _analyze endpoint lets you test any analyzer, tokenizer, or token filter on arbitrary text and returns the resulting tokens.
curl -X GET "172.16.1.127:9200/_analyze?pretty" -H 'Content-Type: application/json' -d'{
  "analyzer": "standard",
  "text": "share your experience with NoSql & big data technologies"
}'
The response includes token text, offsets, type, and position, showing how the standard analyzer lower‑cases and splits the input.
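To make the shape of that response concrete, here is a rough Python approximation that produces the same kind of token records for the sample sentence. It is a sketch under assumptions: the function name is hypothetical, and a simple `\w+` regex stands in for the real standard tokenizer, which follows Unicode text segmentation rules and can differ on more complex input.

```python
import re

def approx_standard_analyze(text):
    # Approximation only: treat each run of word characters as one token,
    # lowercase it, and record its character offsets and position.
    tokens = []
    for position, match in enumerate(re.finditer(r"\w+", text)):
        tokens.append({
            "token": match.group().lower(),
            "start_offset": match.start(),
            "end_offset": match.end(),
            "type": "<ALPHANUM>",
            "position": position,
        })
    return tokens

text = "share your experience with NoSql & big data technologies"
for tok in approx_standard_analyze(text):
    print(tok)
```

Note that "NoSql" comes out as "nosql" and the "&" yields no token at all, while the offsets still point into the original, unmodified text.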
Built‑in Analyzers
Standard analyzer – uses the standard tokenizer and the lowercase filter; its stop‑word filter is disabled by default.
Simple analyzer – splits on non‑letter characters and lowercases the resulting tokens.
Whitespace analyzer – splits on whitespace only.
Stop analyzer – like simple but removes stop words.
Keyword analyzer – treats the whole field as a single token.
Pattern analyzer – uses a regex pattern for tokenization.
Snowball analyzer – standard tokenization with lowercase and stop filters, plus Snowball stemming.
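The difference between three of these analyzers is easiest to see side by side. The functions below are simplified Python stand-ins with hypothetical names, not the Elasticsearch implementations, and the letter-matching regex is only an approximation of the real letter handling.

```python
import re

def whitespace_analyzer(text):
    # Split on whitespace only; case and punctuation are preserved.
    return text.split()

def simple_analyzer(text):
    # Split on anything that is not a letter, then lowercase.
    return [t.lower() for t in re.findall(r"[^\W\d_]+", text)]

def keyword_analyzer(text):
    # The whole input becomes a single token, unchanged.
    return [text]

sample = "Quick-Brown fox2go!"
print(whitespace_analyzer(sample))  # ['Quick-Brown', 'fox2go!']
print(simple_analyzer(sample))      # ['quick', 'brown', 'fox', 'go']
print(keyword_analyzer(sample))     # ['Quick-Brown fox2go!']
```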
Tokenizers
Standard tokenizer – Unicode‑aware, removes punctuation.
Keyword tokenizer – single token.
Letter tokenizer – splits on non‑letters.
Lowercase tokenizer – combines letter tokenization with lower‑casing.
Whitespace tokenizer – splits on whitespace, keeps punctuation.
Pattern tokenizer – custom regex pattern.
UAX URL email tokenizer – like the standard tokenizer, but keeps URLs and email addresses as single tokens.
Path hierarchy tokenizer – splits a path on its separator and emits one token per level of the hierarchy.
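The path hierarchy tokenizer is the least obvious entry in this list, so here is a minimal Python sketch of the idea. The function name is hypothetical and the code only models the default behaviour for simple paths; options such as replacement characters, skipping levels, or reverse mode are not covered.

```python
def path_hierarchy(path, delimiter="/"):
    # Emit one token for every prefix of the path that ends at a delimiter
    # boundary, finishing with the full path itself.
    tokens = []
    end = path.find(delimiter, 1)
    while end != -1:
        tokens.append(path[:end])
        end = path.find(delimiter, end + 1)
    tokens.append(path)
    return tokens

print(path_hierarchy("/usr/local/bin"))
# ['/usr', '/usr/local', '/usr/local/bin']
```

Indexing every ancestor this way lets a term query on "/usr/local" match all documents stored anywhere below that directory.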
Token Filters
Standard (no‑op), Lowercase, Length, Stop, Truncate, Trim, Limit token count, Reverse, Unique, ASCII folding, Synonym, etc.
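Token filters are applied in the order they are listed, each one consuming the output of the previous one. The following Python sketch chains simplified versions of three of the filters named above; the function names and the sample tokens are assumptions for illustration, and the real filters expose more options.

```python
def length_filter(tokens, min_len=0, max_len=255):
    # Drop tokens whose length falls outside [min_len, max_len].
    return [t for t in tokens if min_len <= len(t) <= max_len]

def truncate_filter(tokens, length=10):
    # Cut every token down to at most `length` characters.
    return [t[:length] for t in tokens]

def unique_filter(tokens):
    # Keep only the first occurrence of each token, preserving order.
    seen = set()
    result = []
    for t in tokens:
        if t not in seen:
            seen.add(t)
            result.append(t)
    return result

tokens = ["a", "search", "search", "elasticsearch"]
tokens = length_filter(tokens, min_len=2)   # removes "a"
tokens = truncate_filter(tokens, length=6)  # "elasticsearch" -> "elasti"
tokens = unique_filter(tokens)              # removes the repeated "search"
print(tokens)  # ['search', 'elasti']
```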
N‑gram Filters
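The n‑gram filter generates character‑level n‑grams from every position in a token, while the edge n‑gram filter generates only the n‑grams anchored at the start of the token, that is, its prefixes. The example configuration creates an analyzer that reverses each token, applies an edge n‑gram filter, then reverses again, so the resulting tokens are suffixes of the original word rather than prefixes.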
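The effect of the reverse, edge n‑gram, reverse chain can be checked by hand with a few lines of Python. This is a sketch with hypothetical function names that mimics the filter chain on a single token; it uses the same gram sizes (2 to 6) as the configuration in this section.

```python
def edge_ngrams(token, min_gram, max_gram):
    # Prefixes of the token with lengths from min_gram to max_gram.
    upper = min(max_gram, len(token))
    return [token[:n] for n in range(min_gram, upper + 1)]

def suffix_ngrams(token, min_gram=2, max_gram=6):
    # reverse -> edge n-gram -> reverse, as in the analyzer's filter list.
    reversed_token = token[::-1]
    return [gram[::-1] for gram in edge_ngrams(reversed_token, min_gram, max_gram)]

print(edge_ngrams("spaghetti", 2, 6))
# ['sp', 'spa', 'spag', 'spagh', 'spaghe']
print(suffix_ngrams("spaghetti"))
# ['ti', 'tti', 'etti', 'hetti', 'ghetti']
```

Suffix grams of this kind are what make "ends with" style matching possible without a leading wildcard query.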
curl -XPUT '172.16.1.127:9200/ng?pretty' -H 'Content-Type: application/json' -d'{
  "settings": {
    "analysis": {
      "analyzer": {
        "ng1": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["reverse", "ngf1", "reverse"]
        }
      },
      "filter": {
        "ngf1": { "type": "edge_ngram", "min_gram": 2, "max_gram": 6 }
      }
    }
  }
}'
curl -X GET "172.16.1.127:9200/ng/_analyze?pretty" -H 'Content-Type: application/json' -d'{ "analyzer": "ng1", "text": "spaghetti" }'
Shingle (Sliding Window) Filter
Creates token‑level n‑grams (shingles) for phrase matching.
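As a rough illustration of what shingling produces, the Python sketch below slides a window over a token list. The function name is hypothetical; its parameters mirror the shingle filter settings used in this section (sizes 2 to 3, no unigrams), and other options of the real filter, such as custom separators or filler tokens, are left out.

```python
def shingles(tokens, min_size=2, max_size=3, output_unigrams=False):
    # For each start position, emit the joined window of every allowed size.
    result = []
    for start in range(len(tokens)):
        if output_unigrams:
            result.append(tokens[start])
        for size in range(min_size, max_size + 1):
            if start + size <= len(tokens):
                result.append(" ".join(tokens[start:start + size]))
    return result

print(shingles(["foo", "bar", "baz"]))
# ['foo bar', 'foo bar baz', 'bar baz']
```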
curl -XPUT '172.16.1.127:9200/shingle?pretty' -H 'Content-Type: application/json' -d'{
  "settings": {
    "analysis": {
      "analyzer": { "shingle1": { "type": "custom", "tokenizer": "standard", "filter": ["shingle-filter"] } },
      "filter": { "shingle-filter": { "type": "shingle", "min_shingle_size": 2, "max_shingle_size": 3, "output_unigrams": false } }
    }
  }
}'
curl -X GET "172.16.1.127:9200/shingle/_analyze?pretty" -H 'Content-Type: application/json' -d'{ "analyzer": "shingle1", "text": "foo bar baz" }'
IK Chinese Analyzer Plugin
Installation and usage examples for the IK plugin, which provides Chinese word segmentation for Elasticsearch.
/home/elasticsearch/elasticsearch-6.4.3/bin/elasticsearch-plugin install https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v6.4.3/elasticsearch-analysis-ik-6.4.3.zip
/home/elasticsearch/elasticsearch-6.4.3/bin/elasticsearch -d
curl -XPOST 'http://172.16.1.127:9200/index/fulltext/1?pretty' -H 'Content-Type: application/json' -d'{"content":"美国留给伊拉克的是个烂摊子吗"}'
# ... additional indexing and search examples ...
Big Data Technology & Architecture
Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
