Elasticsearch Analysis: Analyzers, Tokenizers, Filters, and API Usage

This article explains how Elasticsearch processes text before indexing by describing the analysis pipeline, built‑in and custom analyzers, tokenizers, token filters, n‑gram techniques, the analysis API, and the IK Chinese tokenizer plugin, providing practical curl examples throughout.

Big Data Technology & Architecture

Dec 5, 2020

What is Analysis

Analysis is the processing Elasticsearch applies to a text field before the document is indexed. The field's text passes through character filters, then a tokenizer, then token filters, and the resulting tokens are written to the inverted index used for search.
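To make the link between analysis and the inverted index concrete, here is a minimal Python sketch. It is not how Elasticsearch is implemented; the toy analyze function (split on non-word characters, then lowercase) and the sample documents are assumptions made for illustration.

```python
import re
from collections import defaultdict


def analyze(text):
    """Toy analyzer: split on non-word characters, then lowercase each token."""
    return [tok.lower() for tok in re.findall(r"\w+", text)]


def build_inverted_index(docs):
    """Map each analyzed term to the sorted list of document IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in analyze(text):
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


# Hypothetical sample documents.
docs = {1: "Big Data tools", 2: "big ideas, small data"}
index = build_inverted_index(docs)

print(index["big"])    # [1, 2]
print(index["tools"])  # [1]
```

Because both documents are lowercased before indexing, a search for "big" matches document 1 even though the original text says "Big".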

Analysis of Documents

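Analyzers are defined in an index's settings, typically when the index is created, and are then assigned to individual fields in the mapping. Elasticsearch versions before 5.0 also accepted analysis settings globally in the elasticsearch.yml configuration file; from 5.0 onward, index-level settings in that file are rejected, so define them per index or in an index template.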

Custom Analyzer Example

curl -XPUT '172.16.1.127:9200/myindex?pretty' -H 'Content-Type: application/json' -d '{
  "settings": {
    "number_of_shards": 2,
    "number_of_replicas": 1,
    "index": {
      "analysis": {
        "analyzer": {
          "myCustomAnalyzer": {
            "type": "custom",
            "tokenizer": "myCustomTokenizer",
            "filter": ["myCustomFilter1", "myCustomFilter2"],
            "char_filter": ["myCustomCharFilter"]
          }
        },
        "tokenizer": {
          "myCustomTokenizer": { "type": "letter" }
        },
        "filter": {
          "myCustomFilter1": { "type": "lowercase" },
          "myCustomFilter2": { "type": "kstem" }
        },
        "char_filter": {
          "myCustomCharFilter": {
            "type": "mapping",
            "mappings": ["ph=>f", "u=>you"]
          }
        }
      }
    }
  },
  "mappings": {
    "_doc": {
      "properties": {
        "description": { "type": "text", "analyzer": "myCustomAnalyzer" },
        "name": { "type": "text", "analyzer": "standard", "fields": { "raw": { "index": false, "type": "text" } } }
      }
    }
  }
}'
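The following Python sketch approximates what myCustomAnalyzer does to a string, so the order of the stages is easier to see. It is an illustration under assumptions, not Elasticsearch's code: the kstem stemming step (myCustomFilter2) is left out, the mapping filter is modeled as a single left-to-right pass, and the sample sentences are made up.

```python
import re

# Same rules as myCustomCharFilter in the index settings.
MAPPINGS = {"ph": "f", "u": "you"}


def mapping_char_filter(text, mappings=MAPPINGS):
    """Replace mapped keys in one left-to-right pass, trying longer keys first."""
    keys = sorted(mappings, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(k) for k in keys))
    return pattern.sub(lambda m: mappings[m.group(0)], text)


def letter_tokenizer(text):
    """Emit maximal runs of letters; digits, punctuation and spaces separate tokens."""
    return re.findall(r"[^\W\d_]+", text)


def lowercase_filter(tokens):
    return [t.lower() for t in tokens]


def analyze(text):
    # Stage order: char filter -> tokenizer -> token filters.
    # The kstem filter is intentionally omitted from this sketch.
    return lowercase_filter(letter_tokenizer(mapping_char_filter(text)))


print(analyze("Send phone pics 2 u"))  # ['send', 'fone', 'pics', 'you']
print(analyze("Photo album"))          # ['photo', 'albyoum']
```

The second call shows two side effects of these mappings: the character filter runs before lowercasing, so "Ph" is not rewritten, and the rule u=>you applies to every "u", including the one inside "album".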

Analyzer Configuration in elasticsearch.yml

# Only Elasticsearch versions before 5.0 accept index-level analysis settings in this file.
index:
  analysis:
    analyzer:
      myCustomAnalyzer:
        type: custom
        tokenizer: myCustomTokenizer
        filter: [myCustomFilter1, myCustomFilter2]
        char_filter: myCustomCharFilter
    tokenizer:
      myCustomTokenizer:
        type: letter
    filter:
      myCustomFilter1:
        type: lowercase
      myCustomFilter2:
        type: kstem
    char_filter:
      myCustomCharFilter:
        type: mapping
        mappings: ["ph=>f", "u=>you"]

Analysis API

The _analyze endpoint lets you test any analyzer, tokenizer, or token filter on arbitrary text and returns the resulting tokens.

curl -X GET "172.16.1.127:9200/_analyze?pretty" -H 'Content-Type: application/json' -d'{
  "analyzer": "standard",
  "text": "share your experience with NoSql & big data technologies"
}'

The response includes token text, offsets, type, and position, showing how the standard analyzer lower‑cases and splits the input.
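The shape of that response can be sketched locally. The Python function below only approximates the standard analyzer with a \w+ regular expression and labels every token <ALPHANUM> (the real tokenizer follows Unicode segmentation rules and assigns other types too, such as <NUM>), so treat it as a guide to the fields rather than a replacement for calling _analyze.

```python
import re


def approximate_standard_analyze(text):
    """Return a dict shaped like an _analyze response (approximation only)."""
    tokens = []
    for position, match in enumerate(re.finditer(r"\w+", text)):
        tokens.append({
            "token": match.group(0).lower(),   # lowercased term
            "start_offset": match.start(),     # character offset in the input
            "end_offset": match.end(),         # exclusive end offset
            "type": "<ALPHANUM>",              # simplified: one type for all tokens
            "position": position,              # token order, used by phrase queries
        })
    return {"tokens": tokens}


response = approximate_standard_analyze(
    "share your experience with NoSql & big data technologies"
)
for t in response["tokens"]:
    print(t["position"], t["token"], t["start_offset"], t["end_offset"])
```

Note that "&" produces no token, and that offsets still refer to the original text, which is what highlighting relies on.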

Built‑in Analyzers

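Standard analyzer – uses the standard tokenizer and the lowercase filter; a stop-word filter is included but disabled by default.

Simple analyzer – splits text on non-letter characters and lower-cases each token (it is built on the lowercase tokenizer).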

Whitespace analyzer – splits on whitespace only.

Stop analyzer – like simple but removes stop words.

Keyword analyzer – treats the whole field as a single token.

Pattern analyzer – uses a regex pattern for tokenization.

Snowball analyzer – standard tokenizer with lowercase, stop, and snowball stemming filters.
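A rough Python comparison of four of these analyzers on the same input may help. These are simplified stand-ins, not the Lucene implementations; STOP_WORDS is a small illustrative subset of the English stop list, and the sample sentence is made up.

```python
import re

# Illustrative subset only; the real English stop list is longer.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "is", "in"}


def simple_analyzer(text):
    """Split on non-letters and lowercase each token."""
    return [t.lower() for t in re.findall(r"[^\W\d_]+", text)]


def whitespace_analyzer(text):
    """Split on whitespace only; case and punctuation are kept."""
    return text.split()


def stop_analyzer(text):
    """Simple analyzer followed by stop-word removal."""
    return [t for t in simple_analyzer(text) if t not in STOP_WORDS]


def keyword_analyzer(text):
    """Return the whole input as a single token."""
    return [text]


sample = "The Quick-Brown fox, 2 of them"
print(simple_analyzer(sample))      # ['the', 'quick', 'brown', 'fox', 'of', 'them']
print(whitespace_analyzer(sample))  # ['The', 'Quick-Brown', 'fox,', '2', 'of', 'them']
print(stop_analyzer(sample))        # ['quick', 'brown', 'fox', 'them']
print(keyword_analyzer(sample))     # ['The Quick-Brown fox, 2 of them']
```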

Tokenizers

Standard tokenizer – splits on word boundaries as defined by Unicode Text Segmentation (UAX #29) and drops most punctuation.

Keyword tokenizer – single token.

Letter tokenizer – splits on non‑letters.

Lowercase tokenizer – combines letter tokenization with lower‑casing.

Whitespace tokenizer – splits on whitespace, keeps punctuation.

Pattern tokenizer – custom regex pattern.

UAX URL email tokenizer – works like the standard tokenizer but keeps URLs and email addresses as single tokens.

Path hierarchy tokenizer – tokenizes file‑system paths.
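As one example from this list, the path hierarchy tokenizer emits every ancestor of a path as its own token. The Python function below is a simplified sketch of that idea; the function name and the handling of edge cases such as trailing delimiters are assumptions and may not match Elasticsearch exactly.

```python
def path_hierarchy_tokens(path, delimiter="/"):
    """Return each ancestor prefix of `path`, shortest first."""
    tokens = []
    parts = path.split(delimiter)
    for i in range(1, len(parts) + 1):
        prefix = delimiter.join(parts[:i])
        if prefix:  # skip the empty prefix produced by a leading delimiter
            tokens.append(prefix)
    return tokens


print(path_hierarchy_tokens("/usr/local/bin"))
# ['/usr', '/usr/local', '/usr/local/bin']
```

Indexing a path this way lets a query for /usr/local match every document stored beneath that directory.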

Token Filters

Standard (no‑op), Lowercase, Length, Stop, Truncate, Trim, Limit token count, Reverse, Unique, ASCII folding, Synonym, etc.
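Token filters run in the order they are listed, each one consuming the previous filter's output. The Python sketch below chains simplified versions of four of the filters named above; the function names, parameters, and sample tokens are illustrative, and the ASCII folding step uses Unicode decomposition, which covers fewer characters than the real filter.

```python
import unicodedata


def ascii_folding(tokens):
    """Strip combining marks after NFKD decomposition (approximate folding)."""
    return [
        "".join(c for c in unicodedata.normalize("NFKD", t)
                if not unicodedata.combining(c))
        for t in tokens
    ]


def length_filter(tokens, min_len=0, max_len=255):
    """Keep tokens whose length lies within [min_len, max_len]."""
    return [t for t in tokens if min_len <= len(t) <= max_len]


def unique_filter(tokens):
    """Drop repeated tokens, keeping the first occurrence."""
    seen = set()
    out = []
    for t in tokens:
        if t not in seen:
            seen.add(t)
            out.append(t)
    return out


def truncate_filter(tokens, length=10):
    """Cut every token down to at most `length` characters."""
    return [t[:length] for t in tokens]


tokens = ["caf\u00e9", "data", "data", "elasticsearch", "a"]
result = truncate_filter(
    unique_filter(length_filter(ascii_folding(tokens), min_len=2)),
    length=7,
)
print(result)  # ['cafe', 'data', 'elastic']
```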

N‑gram Filters

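The n-gram filter emits every substring of a token whose length falls between min_gram and max_gram. The edge n-gram filter emits only the substrings anchored at the start of the token, that is, its prefixes. The example below defines an analyzer that reverses each token, applies an edge n-gram filter, and reverses the result again, which yields suffixes instead of prefixes.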

curl -XPUT '172.16.1.127:9200/ng?pretty' -H 'Content-Type: application/json' -d'{
  "settings": {
    "analysis": {
      "analyzer": {
        "ng1": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["reverse", "ngf1", "reverse"]
        }
      },
      "filter": {
        "ngf1": { "type": "edge_ngram", "min_gram": 2, "max_gram": 6 }
      }
    }
  }
}'
curl -X GET "172.16.1.127:9200/ng/_analyze?pretty" -H 'Content-Type: application/json' -d'{ "analyzer": "ng1", "text": "spaghetti" }'
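The effect of the reverse, edge n-gram, reverse chain can be worked through in a few lines of Python. This is a sketch of the technique on a single token, with hypothetical function names; the order in which Elasticsearch emits grams may differ.

```python
def edge_ngrams(token, min_gram, max_gram):
    """Prefixes of `token` with length between min_gram and max_gram."""
    return [token[:n] for n in range(min_gram, min(max_gram, len(token)) + 1)]


def ngrams(token, min_gram, max_gram):
    """Every substring of `token` with length between min_gram and max_gram."""
    return [
        token[i:i + n]
        for i in range(len(token))
        for n in range(min_gram, max_gram + 1)
        if i + n <= len(token)
    ]


def suffix_ngrams(token, min_gram, max_gram):
    """Reverse, take edge n-grams, reverse each gram back: yields suffixes."""
    return [g[::-1] for g in edge_ngrams(token[::-1], min_gram, max_gram)]


print(edge_ngrams("spaghetti", 2, 6))    # ['sp', 'spa', 'spag', 'spagh', 'spaghe']
print(suffix_ngrams("spaghetti", 2, 6))  # ['ti', 'tti', 'etti', 'hetti', 'ghetti']
print(ngrams("abc", 1, 2))               # ['a', 'ab', 'b', 'bc', 'c']
```

Suffix grams like these are useful when queries should match the end of a word, for example finding "spaghetti" from "etti".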

Shingle (Sliding Window) Filter

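The shingle filter builds n-grams out of adjacent tokens rather than characters, which helps phrase matching. The example below emits two-token and three-token shingles and suppresses the single tokens (output_unigrams: false).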

curl -XPUT '172.16.1.127:9200/shingle?pretty' -H 'Content-Type: application/json' -d'{
  "settings": {
    "analysis": {
      "analyzer": { "shingle1": { "type": "custom", "tokenizer": "standard", "filter": ["shingle-filter"] } },
      "filter": { "shingle-filter": { "type": "shingle", "min_shingle_size": 2, "max_shingle_size": 3, "output_unigrams": false } }
    }
  }
}'
curl -X GET "172.16.1.127:9200/shingle/_analyze?pretty" -H 'Content-Type: application/json' -d'{ "analyzer": "shingle1", "text": "foo bar baz" }'
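For comparison, here is a small Python sketch of shingling over an already tokenized input. The parameter names mirror the filter options used above, but the function itself is an illustration and ignores details such as filler tokens for removed stop words.

```python
def shingles(tokens, min_size=2, max_size=2, output_unigrams=True, separator=" "):
    """Join runs of adjacent tokens into shingles of min_size..max_size tokens."""
    out = []
    for i in range(len(tokens)):
        if output_unigrams:
            out.append(tokens[i])
        for n in range(min_size, max_size + 1):
            if i + n <= len(tokens):
                out.append(separator.join(tokens[i:i + n]))
    return out


print(shingles(["foo", "bar", "baz"], 2, 3, output_unigrams=False))
# ['foo bar', 'foo bar baz', 'bar baz']
```

With output_unigrams left at True, the single tokens foo, bar, and baz would be emitted alongside the shingles.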

IK Chinese Analyzer Plugin

The IK plugin provides Chinese word segmentation for Elasticsearch. The commands below install the plugin build that matches Elasticsearch 6.4.3, start the node as a daemon, and index a sample document.

/home/elasticsearch/elasticsearch-6.4.3/bin/elasticsearch-plugin install https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v6.4.3/elasticsearch-analysis-ik-6.4.3.zip
/home/elasticsearch/elasticsearch-6.4.3/bin/elasticsearch -d
curl -XPOST 'http://172.16.1.127:9200/index/fulltext/1?pretty' -H 'Content-Type: application/json' -d '{"content":"美国留给伊拉克的是个烂摊子吗"}'
# ... additional indexing and search examples ...

Written by

Big Data Technology & Architecture

Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
