
Boost Elasticsearch Prefix Search with Edge N‑Gram: A Practical Guide

This article explains how to overcome performance bottlenecks in high‑concurrency Elasticsearch prefix searches by applying an edge‑ngram tokenizer, detailing the problem, configuration steps, analysis results, and query recommendations for faster, more accurate search experiences.

Dangbei Technology Team

Jun 28, 2024

Preface

Over the past decade Elasticsearch has become a popular open‑source search and analytics engine, widely used in offline data warehouses, real‑time retrieval, and enterprise search services. However, optimization material for high‑concurrency, high‑availability consumer‑facing scenarios remains scarce.

This article shares a concrete optimization case from our large‑screen projection search service, in the hope that it offers useful ideas for Elasticsearch performance tuning.

In the video search domain, Dangbei uses Elasticsearch as its core engine and successfully handled massive traffic during past Spring Festival Gala events. As data volume grew, processing time and CPU load increased, with the main bottleneck identified in the prefix‑search stage. The solution adopted an edge n‑gram tokenization strategy, dramatically improving query efficiency.

Background

Elasticsearch is the primary search engine for the video content search business. To provide flexible and convenient search, the business logic implements a special wildcard prefix search for queries of three characters or fewer.

When users type three or fewer characters, the system automatically performs a prefix‑match query, quickly retrieving film titles that start with the entered characters, thereby enhancing user experience and search efficiency.

Pain Points of Prefix Search

Example query:

POST /xy_test_pinyin_ik/_search
{
  "query": {
    "wildcard": {
      "text": {
        "value": "杭州当贝*"
      }
    }
  }
}

This approach typically leads to several issues:

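Wildcard queries must first find every term in the index that matches the pattern. A leading wildcard is the worst case, but even a trailing wildcard like the one above expands to a large number of terms when the prefix is short, which slows down the query.

Such queries cannot be answered with a single term lookup in the inverted index, so they benefit less from the index structure and from caching than exact term queries do.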

Longer prefixes match fewer documents and perform better; very short prefixes (e.g., a single character) match too many documents, hurting performance.

Query caches are often ineffective for wildcard queries because results change frequently with index updates.

Wildcard queries may increase memory usage due to the need to maintain additional state for complex matching.
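To make the cost of short prefixes concrete, here is a minimal Python sketch of the term expansion step. It is a simplified model, not Elasticsearch's actual implementation: the term dictionary is a plain sorted list, and the terms themselves are hypothetical sample data rather than values from our index.

```python
from bisect import bisect_left


def terms_with_prefix(sorted_terms: list[str], prefix: str) -> list[str]:
    """Collect every term that starts with prefix from a sorted term list."""
    # Terms sharing a prefix are contiguous in sorted order, so seek to the
    # first candidate and walk forward until the prefix stops matching.
    start = bisect_left(sorted_terms, prefix)
    matched = []
    for term in sorted_terms[start:]:
        if not term.startswith(prefix):
            break
        matched.append(term)
    return matched


# Hypothetical term dictionary, used only for illustration.
terms = sorted(["杭州", "杭州当贝", "杭州当贝网络", "杭州西湖", "杭帮菜", "当贝", "上海"])

for prefix in ["杭", "杭州", "杭州当贝"]:
    expanded = terms_with_prefix(terms, prefix)
    print(prefix, len(expanded))
```

In this toy dictionary the one‑character prefix expands to five terms and the four‑character prefix to two. On a real index the same effect is what makes very short wildcard prefixes expensive.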

What Is n‑gram?

An n‑gram is a contiguous sequence of n characters taken from a text. N‑gram tokenization splits the text into all such sequences.

For the word “Elasticsearch”:

2‑gram (bigram) yields “El”, “la”, “as”, “st”, “ti”, “ic”, “cs”, “se”, “ea”, “ar”, “rc”, “ch”.

3‑gram (trigram) yields “Ela”, “las”, “ast”, “sti”, “tic”, “ics”, “cse”, “sea”, “ear”, “arc”, “rch”.
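The character n‑grams of a word can be reproduced with a few lines of Python. This is only an illustrative sketch; the function name char_ngrams is our own and is not part of Elasticsearch.

```python
def char_ngrams(text: str, n: int) -> list[str]:
    """Return every contiguous substring of length n, left to right."""
    if n <= 0:
        raise ValueError("n must be positive")
    # A sliding window of width n over the characters of the text.
    return [text[i:i + n] for i in range(len(text) - n + 1)]


bigrams = char_ngrams("Elasticsearch", 2)
trigrams = char_ngrams("Elasticsearch", 3)
print(bigrams)
print(trigrams)
```

A word of 13 characters yields 12 bigrams and 11 trigrams, which is why full n‑gram indexing grows the index quickly.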

What Is edge_ngram?

edge_ngram is a specialized n‑gram tokenizer designed for prefix matching; it generates n‑grams only from the beginning of a token. Compared with a full n‑gram tokenizer it produces a smaller index, and compared with wildcard queries it makes prefix lookups faster, which suits autocomplete and suggestion features.

Advantages of edge_ngram include:

Fast prefix matching by indexing only the beginning of each term.

Improved search speed, because a prefix lookup becomes an exact term match against prefixes built at index time instead of a term expansion at query time.

Smaller index size compared to full n‑gram tokenizers, saving disk space and memory.

Better autocomplete experience with instant feedback.

Higher relevance by returning only documents whose terms start with the query prefix.

Easy configuration via min_gram and max_gram parameters.

Language‑agnostic: it operates on characters, so it works for any language.

Reduced irrelevant matches, increasing precision.

Overall, edge_ngram is an ideal choice for efficient prefix search.

Applying edge_ngram to the word “Elastic” produces the following prefixes:

Length 1: E

Length 2: El

Length 3: Ela

Length 4: Elas

Length 5: Elast

Length 6: Elasti

Length 7: Elastic
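The prefix list above can be reproduced with a short Python sketch. It assumes, as a simplification, that the whole input is one token; the function name edge_ngrams is ours, while min_gram and max_gram mirror the tokenizer parameters of the same names.

```python
def edge_ngrams(token: str, min_gram: int = 1, max_gram: int = 7) -> list[str]:
    """Return the prefixes of token with length from min_gram to max_gram."""
    if min_gram < 1 or max_gram < min_gram:
        raise ValueError("require 1 <= min_gram <= max_gram")
    # Prefixes cannot be longer than the token itself.
    longest = min(max_gram, len(token))
    return [token[:n] for n in range(min_gram, longest + 1)]


print(edge_ngrams("Elastic"))
print(edge_ngrams("杭州当贝网络科技有限公司"))
```

With max_gram set to 7, a 12‑character title produces only its first seven prefixes, so longer prefixes of that title are not indexed as separate terms.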

Using edge_ngram for Prefix Search

1. Create an index with an edge_ngram tokenizer

PUT xy_test_pinyin_ik
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0,
    "index": { "max_ngram_diff": 10 },
    "analysis": {
      "analyzer": {
        "custom_analyzer": { "tokenizer": "custom_tokenizer" }
      },
      "tokenizer": {
        "custom_tokenizer": {
          "type": "edge_ngram",
          "min_gram": 1,
          "max_gram": 7
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "text": { "type": "text", "analyzer": "custom_analyzer" }
    }
  }
}

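The tokenizer generates n‑grams with a minimum length of 1 and a maximum of 7. Because no token_chars are configured, the whole field value is treated as a single token, so only its first seven prefixes are indexed. Note that index.max_ngram_diff limits the ngram tokenizer and token filter rather than edge_ngram, so it is not strictly required for this configuration.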

2. Analyze tokenization results

Running the analyzer on a sample string yields the expected tokens (the response is abridged to the token and position fields):

POST /xy_test_pinyin_ik/_analyze
{
  "text": "杭州当贝网络科技有限公司",
  "analyzer": "custom_analyzer"
}

{
  "tokens": [
    { "token": "杭", "position": 0 },
    { "token": "杭州", "position": 1 },
    { "token": "杭州当", "position": 2 },
    { "token": "杭州当贝", "position": 3 },
    { "token": "杭州当贝网", "position": 4 },
    { "token": "杭州当贝网络", "position": 5 },
    { "token": "杭州当贝网络科", "position": 6 }
  ]
}

3. Index documents

POST /xy_test_pinyin_ik/_doc/1
{ "text": "杭州当贝网络科技有限公司" }
POST /xy_test_pinyin_ik/_doc/2
{ "text": "杭州" }

4. Query data

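Because the field uses the same analyzer at search time, the query string is also split into edge n‑grams. A match query combines those tokens with OR by default, so it also returns documents that share only a short prefix with the query, such as "杭州". match_phrase requires every query token to appear in the correct order and position, which restricts the results to documents that start with the full query string and matches our expectations.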

GET /xy_test_pinyin_ik/_search
{
  "query": {
    "match_phrase": { "text": "杭州当贝" }
  }
}

The query returns only document 1, "杭州当贝网络科技有限公司", as expected. Document 2 ("杭州") is not returned because it lacks the longer prefixes "杭州当" and "杭州当贝".
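The difference between match and match_phrase on this field can be illustrated with a small Python model. It is a sketch under simplifying assumptions: the helper functions are our own, scoring and slop are ignored, and the same edge n‑gram analysis (min_gram 1, max_gram 7) is applied to both documents and the query.

```python
def analyze(text: str, min_gram: int = 1, max_gram: int = 7) -> list[tuple[str, int]]:
    """Mimic the custom analyzer: (token, position) pairs of edge n-grams."""
    longest = min(max_gram, len(text))
    return [(text[:n], n - min_gram) for n in range(min_gram, longest + 1)]


def match(doc_text: str, query_text: str) -> bool:
    """match with the default OR operator: any shared token is a hit."""
    doc_tokens = {token for token, _ in analyze(doc_text)}
    return any(token in doc_tokens for token, _ in analyze(query_text))


def match_phrase(doc_text: str, query_text: str) -> bool:
    """match_phrase: every query token must sit at the same relative position."""
    doc_positions = dict(analyze(doc_text))  # each prefix occurs once per doc
    query = analyze(query_text)
    if not query or query[0][0] not in doc_positions:
        return False
    offset = doc_positions[query[0][0]] - query[0][1]
    return all(doc_positions.get(token) == pos + offset for token, pos in query)


docs = {1: "杭州当贝网络科技有限公司", 2: "杭州"}
query = "杭州当贝"

match_hits = [doc_id for doc_id, text in docs.items() if match(text, query)]
phrase_hits = [doc_id for doc_id, text in docs.items() if match_phrase(text, query)]
print("match:", match_hits)
print("match_phrase:", phrase_hits)
```

In this model match returns both documents, because the query token "杭" alone is enough for document 2, while match_phrase returns only document 1.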

Conclusion

This article analyzed a prefix‑search performance problem in a search‑heavy scenario, selected the edge n‑gram tokenizer as the solution, and walked through its configuration, tokenization output, and query form. In our case it removed the prefix‑search bottleneck. The approach provides a reusable pattern for tackling similar Elasticsearch performance challenges.

