Boost Elasticsearch Prefix Search with Edge N‑Gram: A Practical Guide
This article explains how to overcome performance bottlenecks in high‑concurrency Elasticsearch prefix searches by applying an edge‑ngram tokenizer, detailing the problem, configuration steps, analysis results, and query recommendations for faster, more accurate search experiences.
Preface
Over the past decade Elasticsearch has become a popular open‑source search and analytics engine, widely used in offline data warehouses, real‑time retrieval, and enterprise search services. However, optimization material for high‑concurrency, high‑availability consumer‑facing scenarios remains scarce.
This article shares a concrete optimization case from large‑screen projection search to inspire innovative thinking and practice in Elasticsearch performance tuning.
In the video search domain, Dangbei uses Elasticsearch as its core engine and successfully handled massive traffic during past Spring Festival Gala events. As data volume grew, processing time and CPU load increased, with the main bottleneck identified in the prefix‑search stage. The solution adopted an edge n‑gram tokenization strategy, dramatically improving query efficiency.
Background
Elasticsearch is the primary search engine for the video content search business. To provide flexible and convenient search, the business logic implements a special wildcard prefix search for queries of three characters or fewer.
When users type three or fewer characters, the system automatically performs a prefix‑match query, quickly retrieving film titles that start with the entered characters, thereby enhancing user experience and search efficiency.
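The routing rule described above can be sketched as follows. This is a minimal illustration, not the production code: the function name build_query, the constant PREFIX_MAX_CHARS, and the match query used for longer input are all assumptions, since the article only describes the short-input branch.

```python
from typing import Any, Dict

# Threshold taken from the article: three characters or fewer.
PREFIX_MAX_CHARS = 3


def build_query(user_input: str, field: str = "text") -> Dict[str, Any]:
    """Return an Elasticsearch search body for the given user input."""
    keyword = user_input.strip()
    if not keyword:
        raise ValueError("search keyword must not be empty")
    if len(keyword) <= PREFIX_MAX_CHARS:
        # Short input: wildcard prefix search, as described in the article.
        return {"query": {"wildcard": {field: {"value": keyword + "*"}}}}
    # Longer input: ASSUMPTION. The article does not say which query runs
    # here; a plain match query is used only as a placeholder.
    return {"query": {"match": {field: keyword}}}
```

Calling build_query("杭州") produces a wildcard query on "杭州*", which is the query shape examined in the next section.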
Pain Points of Prefix Search
Example query:
POST /xy_test_pinyin_ik/_search
{
  "query": {
    "wildcard": {
      "text": {
        "value": "杭州当贝*"
      }
    }
  }
}
This approach typically leads to several issues:
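Wildcard queries force Elasticsearch to enumerate every indexed term that matches the pattern, which slows the query down. A leading wildcard is the worst case, but a trailing wildcard like the one above still has to walk all terms that share the prefix.
Because the pattern is expanded at query time, the search cannot be answered with a single term lookup in the inverted index, and it benefits less from caching.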
Longer prefixes match fewer documents and perform better; very short prefixes (e.g., a single character) match too many documents, hurting performance.
Query caches are often ineffective for wildcard queries because results change frequently with index updates.
Wildcard queries may increase memory usage due to the need to maintain additional state for complex matching.
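To make the cost of short prefixes concrete, here is a toy model of prefix expansion over a sorted term list. It is a deliberate simplification: Elasticsearch does not store its term dictionary as a Python list, and the sample terms and the helper name expand_prefix are made up for illustration.

```python
from bisect import bisect_left
from typing import List


def expand_prefix(sorted_terms: List[str], prefix: str) -> List[str]:
    """Return every term in a sorted term list that starts with prefix."""
    # Terms sharing a prefix are contiguous in sorted order, so jump to the
    # first candidate and walk forward until the prefix no longer matches.
    start = bisect_left(sorted_terms, prefix)
    matches = []
    for term in sorted_terms[start:]:
        if not term.startswith(prefix):
            break
        matches.append(term)
    return matches


# Hypothetical term dictionary, for illustration only.
terms = sorted(["杭", "杭州", "杭州当贝", "杭州西湖", "杭州东站", "上海", "北京"])

print(len(expand_prefix(terms, "杭")))      # one-character prefix
print(len(expand_prefix(terms, "杭州当")))  # three-character prefix
```

In this sample the one-character prefix expands to five terms and the three-character prefix to one. On a real index the gap is much larger, and every expanded term adds its own postings to the work the query has to do.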
What Is n‑gram?
An n‑gram is a contiguous sequence of n items taken from a text. Character n‑gram tokenization splits text into every continuous sequence of n characters.
For the word “Elasticsearch”:
2‑gram (bigram) yields “El”, “la”, “as”, “st”, “ti”, “ic”, “cs”, “se”, “ea”, “ar”, “rc”, “ch”.
3‑gram (trigram) yields “Ela”, “las”, “ast”, “sti”, “tic”, “ics”, “cse”, “sea”, “ear”, “arc”, “rch”.
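The sliding-window idea behind these lists fits in a few lines. The sketch below is plain Python written for this explanation (the name char_ngrams is hypothetical); it is not how Elasticsearch implements its ngram tokenizer.

```python
from typing import List


def char_ngrams(text: str, n: int) -> List[str]:
    """Return all contiguous character sequences of length n in text."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # Slide a window of width n across the text, one character at a time.
    return [text[i:i + n] for i in range(len(text) - n + 1)]


print(char_ngrams("Elasticsearch", 2))
print(char_ngrams("Elasticsearch", 3))
```

A string of length m produces m - n + 1 n‑grams, so the 13-character word above yields 12 bigrams and 11 trigrams.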
What Is edge_ngram?
edge_ngram is a specialized n‑gram tokenizer designed for prefix matching; it generates n‑grams only from the beginning of a token. This approach improves search performance, reduces index size, and enhances user experience, especially for autocomplete and suggestion features.
Advantages of edge_ngram include:
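Fast prefix matching, because every prefix is written to the index ahead of time and a query becomes an ordinary term lookup.
Improved search speed, because the matching work moves from query time to index time.
Smaller index size compared to full n‑gram tokenizers, which also index substrings from the middle of each term, saving disk space and memory.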
Better autocomplete experience with instant feedback.
Higher relevance by returning only documents whose terms start with the query prefix.
Easy configuration via min_gram and max_gram parameters.
Language‑agnostic: it works on characters rather than words, so it applies to any language, including Chinese.
Reduced irrelevant matches, increasing precision.
Overall, edge_ngram is an ideal choice for efficient prefix search.
Applying edge_ngram to the word “Elastic” produces the following prefixes:
Length 1: E
Length 2: El
Length 3: Ela
Length 4: Elas
Length 5: Elast
Length 6: Elasti
Length 7: Elastic
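The same prefix list can be reproduced with a short sketch. The helper edge_ngrams is a hypothetical stand-in that mimics the tokenizer's min_gram and max_gram parameters on a single token; the real tokenizer has further options that are ignored here.

```python
from typing import List


def edge_ngrams(token: str, min_gram: int, max_gram: int) -> List[str]:
    """Return the prefixes of token whose length is in [min_gram, max_gram]."""
    # A prefix cannot be longer than the token itself.
    upper = min(max_gram, len(token))
    return [token[:length] for length in range(min_gram, upper + 1)]


print(edge_ngrams("Elastic", 1, 7))
print(edge_ngrams("杭州当贝网络科技有限公司", 1, 7))
```

Note that a token longer than max_gram is indexed only up to its first max_gram characters, so the second call stops at the seven-character prefix.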
Using edge_ngram for Prefix Search
1. Create an index with an edge_ngram tokenizer
PUT xy_test_pinyin_ik
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0,
    "index": { "max_ngram_diff": 10 },
    "analysis": {
      "analyzer": {
        "custom_analyzer": { "tokenizer": "custom_tokenizer" }
      },
      "tokenizer": {
        "custom_tokenizer": {
          "type": "edge_ngram",
          "min_gram": 1,
          "max_gram": 7
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "text": { "type": "text", "analyzer": "custom_analyzer" }
    }
  }
}
The tokenizer generates n‑grams with a minimum length of 1 and a maximum of 7.
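When the gram lengths need tuning, it can help to generate the request body instead of editing JSON by hand. The sketch below is one possible way to do that, assuming the analyzer, tokenizer, and field names from the request above. The function edge_ngram_index_body is hypothetical, and it keeps only the analysis and mapping parts; the remaining settings would be merged in as needed.

```python
import json
from typing import Any, Dict


def edge_ngram_index_body(min_gram: int, max_gram: int,
                          field: str = "text") -> Dict[str, Any]:
    """Build the analysis and mapping parts of the index creation body."""
    if not 1 <= min_gram <= max_gram:
        raise ValueError("expected 1 <= min_gram <= max_gram")
    return {
        "settings": {
            "analysis": {
                "analyzer": {
                    "custom_analyzer": {"tokenizer": "custom_tokenizer"}
                },
                "tokenizer": {
                    "custom_tokenizer": {
                        "type": "edge_ngram",
                        "min_gram": min_gram,
                        "max_gram": max_gram,
                    }
                },
            }
        },
        "mappings": {
            "properties": {
                field: {"type": "text", "analyzer": "custom_analyzer"}
            }
        },
    }


# ensure_ascii=False keeps any non-ASCII field names readable in the output.
print(json.dumps(edge_ngram_index_body(1, 7), indent=2, ensure_ascii=False))
```

A reasonable rule of thumb, offered here as a suggestion rather than something the article states, is to set max_gram at least as high as the longest prefix the application will send, because longer prefixes are never written to the index.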
2. Analyze tokenization results
Running the analyzer on a sample string yields the expected tokens. The request comes first, followed by the response, shown here with only the token and position fields. Because max_gram is 7, the tokens stop at the seventh character.
POST /xy_test_pinyin_ik/_analyze
{
  "text": "杭州当贝网络科技有限公司",
  "analyzer": "custom_analyzer"
}
{
  "tokens": [
    { "token": "杭", "position": 0 },
    { "token": "杭州", "position": 1 },
    { "token": "杭州当", "position": 2 },
    { "token": "杭州当贝", "position": 3 },
    { "token": "杭州当贝网", "position": 4 },
    { "token": "杭州当贝网络", "position": 5 },
    { "token": "杭州当贝网络科", "position": 6 }
  ]
}
3. Index documents
POST /xy_test_pinyin_ik/_doc/1
{ "text": "杭州当贝网络科技有限公司" }
POST /xy_test_pinyin_ik/_doc/2
{ "text": "杭州" }
4. Query data
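The query string is analyzed with the same custom_analyzer, so "杭州当贝" is itself split into the prefixes 杭, 杭州, 杭州当, and 杭州当贝. A match query returns broader results because a document qualifies as soon as it shares any one of these prefixes, which includes document 2. A match_phrase query requires every prefix to appear in the correct order and position, which matches our expectations.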
GET /xy_test_pinyin_ik/_search
{
  "query": {
    "match_phrase": { "text": "杭州当贝" }
  }
}
The query returns the document containing "杭州当贝网络科技有限公司" as expected.
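The difference between the two query types can be reproduced with a small in-memory model. This is only a sketch under stated assumptions: analyze imitates the tokenizer output shown in step 2, match uses OR semantics, match_phrase requires all query tokens at matching relative positions, and scoring is ignored. All three helper names are hypothetical.

```python
from typing import List, Tuple


def analyze(text: str, min_gram: int = 1, max_gram: int = 7) -> List[Tuple[str, int]]:
    """Imitate the analyzer: return (token, position) pairs for each prefix."""
    upper = min(max_gram, len(text))
    return [(text[:n], n - min_gram) for n in range(min_gram, upper + 1)]


def match(doc: str, query: str) -> bool:
    """OR semantics: at least one query token occurs in the document."""
    doc_terms = {term for term, _ in analyze(doc)}
    return any(term in doc_terms for term, _ in analyze(query))


def match_phrase(doc: str, query: str) -> bool:
    """All query tokens occur, with the same relative positions as in the query."""
    doc_pos = dict(analyze(doc))
    query_tokens = analyze(query)
    if not query_tokens or any(term not in doc_pos for term, _ in query_tokens):
        return False
    # Every token must be shifted by the same amount between query and document.
    offsets = {doc_pos[term] - pos for term, pos in query_tokens}
    return len(offsets) == 1


docs = {1: "杭州当贝网络科技有限公司", 2: "杭州"}
print([doc_id for doc_id, text in docs.items() if match(text, "杭州当贝")])
print([doc_id for doc_id, text in docs.items() if match_phrase(text, "杭州当贝")])
```

In this model, match selects both documents, because document 2 shares the prefixes 杭 and 杭州, while match_phrase selects only document 1.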
Conclusion
This article analyzed a performance problem in a search‑heavy scenario, selected the edge n‑gram tokenizer as the solution, integrated it, and verified that prefix queries return the expected documents while the expensive prefix expansion moves from query time to index time. The approach provides a reusable pattern for tackling similar Elasticsearch performance challenges.
References
https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-edgengram-tokenizer.html
https://segmentfault.com/a/1190000022100153
https://blog.csdn.net/tiancityycf/article/details/114847911
