How to Enable Accurate Code Search in Elasticsearch with an NGram Analyzer
This article analyzes the shortcomings of standard Elasticsearch analyzers for code search, presents a custom NGram analyzer combined with match_phrase queries, shows configuration and query examples, compares performance of different query types, and offers best‑practice guidelines and pitfalls to avoid when building a reliable code‑search system.
Problem: code‑search limitations
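Importing source files into a standard Elasticsearch index yields poor results because the default analyzers tokenize on word boundaries rather than on arbitrary substrings. Depending on the analyzer, an identifier such as migrate_data is either split (migrate_data → ["migrate", "data"]) or kept as a single token, and in neither case can a query match a fragment such as grate_da. Wildcard queries provide partial matching, but a pattern with a leading wildcard has to scan a large part of the term dictionary, which drives up cluster load and risks downtime. Prefix queries only match the beginning of a term and cannot find middle substrings such as data in migrate_data.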
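The gap can be illustrated without a cluster. The sketch below is a plain-Python approximation, not Elasticsearch itself: it assumes an index that stores each identifier as one whole token and compares exact-token and prefix lookups with the substring lookup a code search actually needs. The token list and helper names are hypothetical.

```python
# Plain-Python approximation of term-level lookups; not Elasticsearch code.
# Assumption: each identifier was indexed as one whole, lowercased token.
indexed_tokens = ["migrate_data", "get_user_data", "async_migrate_task"]


def term_lookup(query):
    """Exact-token match, analogous to a term query."""
    return [t for t in indexed_tokens if t == query]


def prefix_lookup(query):
    """Match tokens that start with the query, analogous to a prefix query."""
    return [t for t in indexed_tokens if t.startswith(query)]


def substring_lookup(query):
    """What code search needs; on whole tokens this means checking every term."""
    return [t for t in indexed_tokens if query in t]


print(term_lookup("data"))       # []
print(prefix_lookup("migrate"))  # ['migrate_data']
print(prefix_lookup("data"))     # []
print(substring_lookup("data"))  # ['migrate_data', 'get_user_data']
```

Only the last lookup finds data inside the identifiers, and it does so by visiting every token, which is the cost a wildcard query pays at scale.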
Solution: NGram analyzer + match_phrase
Custom NGram analyzer
{
  "settings": {
    "index": {"max_ngram_diff": 8},
    "analysis": {
      "filter": {
        "code_ngram_filter": {
          "type": "ngram",
          "min_gram": 2,
          "max_gram": 10
        }
      },
      "analyzer": {
        "code_ngram_analyzer": {
          "type": "custom",
          "tokenizer": "keyword",
          "filter": ["lowercase", "code_ngram_filter"]
        }
      }
    }
  }
}
Field mapping design
{
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "analyzer": "code_analyzer",
        "fields": {
          "keyword": {"type": "keyword"},
          "ngram": {
            "type": "text",
            "analyzer": "code_ngram_analyzer",
            "search_analyzer": "code_ngram_analyzer"
          }
        }
      }
    }
  }
}
Design principle: the main name field uses code_analyzer (standard tokenizer plus lowercase filter, defined in the full Python example) for full‑text matching, name.keyword provides exact matches with optimal speed, and name.ngram enables partial matching while keeping acceptable performance.
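To see what each subfield actually stores, the tokens can be inspected through the _analyze API. The helper below is a minimal sketch: analyze_tokens is a hypothetical name, es_client is assumed to be an elasticsearch-py client, and the index is assumed to already define the analyzer. It passes the request as body=, matching the client calls used elsewhere in this article; newer client versions may prefer keyword arguments.

```python
def analyze_tokens(es_client, index, analyzer, text):
    """Return the token strings that `analyzer` produces for `text`.

    Assumes `es_client` is an elasticsearch-py client and that `index`
    already defines `analyzer` in its settings.
    """
    response = es_client.indices.analyze(
        index=index,
        body={"analyzer": analyzer, "text": text},
    )
    # The _analyze response lists one entry per emitted token.
    return [entry["token"] for entry in response["tokens"]]
```

For example, analyze_tokens(es, "code_index", "code_ngram_analyzer", "migrate_data") would return the n-grams stored for name.ngram, assuming an index named code_index exists with this analyzer.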
NGram token generation example
Indexing migrate_data with code_ngram_analyzer (min=2, max=10) produces tokens such as:
"mi", "ig", "gr", "ra", "at", "te", "e_", "_d", "da", "at", "ta", "mig", "igr", "gra", "rat", "ate", "te_", "e_d", "_da", "dat", "ata", "migr", "igra", "grat", "rate", "ate_", "te_d", "e_da", "_dat", "data", ...
Core logic: multi‑level search strategy
Query construction
from typing import Any, Dict, Optional


def _build_keyword_query(self, query: str, language: Optional[str], project_id: Optional[str], filters: Dict[str, Any]) -> Dict[str, Any]:
    # Note: language, project_id and filters are accepted but not applied in this excerpt.
    should_clauses = [
        # 1. Exact match (highest priority, best performance)
        {"term": {"name.keyword": {"value": query, "boost": 10.0}}},
        # 2. NGram phrase match (preserve order, partial match)
        {"match_phrase": {"name.ngram": {"query": query, "boost": 9.0, "slop": 0}}},
        # 3. NGram partial match (allow any token)
        {"match": {"name.ngram": {"query": query, "boost": 8.0, "operator": "or"}}},
        # 4. Standard phrase match (preserve order)
        {"match_phrase": {"name": {"query": query, "boost": 7.0}}},
        # 5. Standard match (all terms must match)
        {"match": {"name": {"query": query, "boost": 5.0, "operator": "and"}}},
        # 6. Multi‑field fallback
        {"multi_match": {
            "query": query,
            "fields": ["name^3", "name.ngram^2", "content", "signature^2"],
            "type": "best_fields",
            "fuzziness": "AUTO",
            "boost": 1.0
        }}
    ]
    # At least one of the clauses above must match for a document to be returned.
    return {
        "query": {
            "bool": {
                "must": [{"bool": {"should": should_clauses, "minimum_should_match": 1}}]
            }
        }
    }
Search effect verification
Scenario 1 – query migrate_data: the term clause (boost 10), the NGram phrase clause (boost 9), and the NGram partial clause (boost 8) all find migrate_data.
Scenario 2 – query migrate: NGram phrase and partial both find migrate_data; multi‑match finds async_migrate_task.
Scenario 3 – query data: NGram partial finds migrate_data and get_user_data with high scores.
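Scenario 3 can be approximated offline. The sketch below is a rough plain-Python stand-in for the NGram partial clause, not Elasticsearch scoring: it assumes, as in the mapping above, that the query is n-grammed with the same settings as the indexed names, and it ranks names by the number of distinct query n-grams they contain. The helper names are hypothetical, and the shared-gram count is only a crude proxy for relevance.

```python
def ngrams(text, min_gram=2, max_gram=10):
    """Set of lowercase character n-grams of the whole string."""
    text = text.lower()
    return {
        text[i:i + n]
        for n in range(min_gram, max_gram + 1)
        for i in range(len(text) - n + 1)
    }


def rank_by_shared_grams(query, names):
    """Order names by how many query n-grams they contain (crude proxy, not BM25)."""
    query_grams = ngrams(query)
    scored = [(len(query_grams & ngrams(name)), name) for name in names]
    # Keep only names sharing at least one gram, best overlap first.
    return sorted((s for s in scored if s[0] > 0), key=lambda s: (-s[0], s[1]))


names = ["migrate_data", "get_user_data", "async_migrate_task"]
for shared, name in rank_by_shared_grams("data", names):
    print(shared, name)
# 6 get_user_data
# 6 migrate_data
# 2 async_migrate_task
```

Under these assumptions the two names that contain data share all six query n-grams, while async_migrate_task still matches weakly through "at" and "ta". This suggests why the broad NGram clause benefits from a lower boost than the exact clause.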
Performance comparison
Query type – performance – matching capability – risk
term (exact) – ★★★★★ – full match – no risk
match_phrase (ngram) – ★★★★ – partial match with order – no risk
match (ngram) – ★★★★ – partial match – no risk
match_phrase (standard) – ★★★★ – exact match with order – no risk
wildcard – ★ – any pattern – high risk (removed)
Best practices
Set min_gram = 2 and max_gram = 10 to balance index size and matching ability.
Configure index.max_ngram_diff ≥ max_gram - min_gram (8 for the values above; the default is 1) to avoid index‑creation errors.
Use a three‑field strategy: name (standard), name.keyword (exact), name.ngram (partial).
Remove wildcard queries to eliminate performance risks.
Assign higher boost values to exact matches and lower boosts to broader matches.
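The first two guidelines can be tied together in code. The following is a minimal sketch with hypothetical helper names: build_ngram_settings derives max_ngram_diff from the chosen gram range so the two cannot drift apart, and grams_per_identifier counts how many n-grams a name of a given length emits, which gives a rough feel for the index-size side of the trade-off.

```python
def build_ngram_settings(min_gram=2, max_gram=10):
    """Build index settings whose max_ngram_diff covers the gram range."""
    if min_gram < 1 or max_gram < min_gram:
        raise ValueError("require 1 <= min_gram <= max_gram")
    return {
        "index": {"max_ngram_diff": max_gram - min_gram},
        "analysis": {
            "filter": {
                "code_ngram_filter": {
                    "type": "ngram",
                    "min_gram": min_gram,
                    "max_gram": max_gram,
                }
            },
            "analyzer": {
                "code_ngram_analyzer": {
                    "type": "custom",
                    "tokenizer": "keyword",
                    "filter": ["lowercase", "code_ngram_filter"],
                }
            },
        },
    }


def grams_per_identifier(length, min_gram=2, max_gram=10):
    """Number of n-grams emitted for an identifier of the given length."""
    return sum(max(length - n + 1, 0) for n in range(min_gram, max_gram + 1))


print(build_ngram_settings()["index"]["max_ngram_diff"])  # 8
print(grams_per_identifier(12))        # 63 for a 12-character name
print(grams_per_identifier(12, 3, 6))  # 34 with a narrower range
```

Narrowing the range roughly halves the token count for a 12-character name in this example, at the cost of no longer matching 2-character fragments.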
Common pitfalls
Index size growth: NGram generates many tokens, increasing index size by 2‑3×. Mitigate by tuning min_gram / max_gram and applying NGram only to necessary fields.
Search latency: Queries on NGram fields can be slower. Prefer match_phrase over match for NGram fields to limit token expansion.
Reindex cost: Changing an analyzer requires reindexing because tokens are generated at index time, so documents that are already indexed keep the tokens of the old analyzer. Use index aliases for zero‑downtime swaps and plan analyzer settings ahead of time.
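The alias swap mentioned for the reindex pitfall can be sketched as follows. This assumes the usual flow of creating a new index with the new analyzer, copying documents into it, and only then moving the alias. The alias and index names in the usage note are hypothetical, es_client is assumed to be an elasticsearch-py client, and the request is passed as body= to match the other calls in this article.

```python
def build_alias_swap(alias, old_index, new_index):
    """Body for an alias switch from old_index to new_index."""
    return {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }


def swap_alias(es_client, alias, old_index, new_index):
    """Send both actions in a single update-aliases request."""
    return es_client.indices.update_aliases(
        body=build_alias_swap(alias, old_index, new_index)
    )
```

A call such as swap_alias(es, "code_search", "code_index_v1", "code_index_v2") would repoint searches that go through the alias once the new index is fully populated. Because both actions travel in one request, clients querying the alias should not see a moment where it points to no index.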
Performance optimisation suggestions
Use match_phrase for NGram fields (better performance than match).
Adjust index settings: increase number_of_shards based on data volume, set appropriate number_of_replicas, and increase refresh_interval (e.g., "30s") to improve indexing throughput.
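One detail worth keeping in mind when applying these settings: number_of_shards is fixed when the index is created, whereas number_of_replicas and refresh_interval can be changed on a live index. The sketch below separates the two groups. The function names and default values are illustrative assumptions, not recommendations from the original article.

```python
def creation_settings(shards=3, replicas=1):
    """Settings chosen at index creation; the shard count cannot be updated in place."""
    return {"number_of_shards": shards, "number_of_replicas": replicas}


def dynamic_settings(refresh_interval="30s", replicas=1):
    """Settings that can be updated on an existing index."""
    return {
        "index": {
            "refresh_interval": refresh_interval,
            "number_of_replicas": replicas,
        }
    }


print(creation_settings())
print(dynamic_settings())
```

With an elasticsearch-py client, the second dictionary could be applied through es_client.indices.put_settings(index="code_index", body=dynamic_settings()), assuming the index name from the full example. Changing the shard count later generally means creating a new index and reindexing.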
Full Python example
from elasticsearch import Elasticsearch


def create_code_index(es_client):
    """Create code_index with the standard and NGram analyzers for the name field."""
    mapping = {
        "settings": {
            "number_of_shards": 3,
            "number_of_replicas": 1,
            # Must be >= max_gram - min_gram (8 for the filter below).
            "index": {"max_ngram_diff": 10},
            "analysis": {
                "analyzer": {
                    "code_analyzer": {
                        "type": "custom",
                        "tokenizer": "standard",
                        "filter": ["lowercase"]
                    },
                    "code_ngram_analyzer": {
                        "type": "custom",
                        "tokenizer": "keyword",
                        "filter": ["lowercase", "code_ngram_filter"]
                    }
                },
                "filter": {
                    "code_ngram_filter": {
                        "type": "ngram",
                        "min_gram": 2,
                        "max_gram": 10
                    }
                }
            }
        },
        "mappings": {
            "properties": {
                "name": {
                    "type": "text",
                    "analyzer": "code_analyzer",
                    "fields": {
                        "keyword": {"type": "keyword"},
                        "ngram": {
                            "type": "text",
                            "analyzer": "code_ngram_analyzer",
                            "search_analyzer": "code_ngram_analyzer"
                        }
                    }
                }
            }
        }
    }
    es_client.indices.create(index="code_index", body=mapping)


def search_code(es_client, query):
    """Run the multi-level query, from exact match down to the fuzzy fallback."""
    search_body = {
        "query": {
            "bool": {
                "should": [
                    {"term": {"name.keyword": {"value": query, "boost": 10.0}}},
                    {"match_phrase": {"name.ngram": {"query": query, "boost": 9.0, "slop": 0}}},
                    {"match": {"name.ngram": {"query": query, "boost": 8.0, "operator": "or"}}},
                    {"match_phrase": {"name": {"query": query, "boost": 7.0}}},
                    {"match": {"name": {"query": query, "boost": 5.0, "operator": "and"}}},
                    # content and signature are not declared in the mapping above;
                    # they are assumed to exist in the indexed documents.
                    {"multi_match": {
                        "query": query,
                        "fields": ["name^3", "name.ngram^2", "content", "signature^2"],
                        "type": "best_fields",
                        "fuzziness": "AUTO",
                        "boost": 1.0
                    }}
                ],
                "minimum_should_match": 1
            }
        }
    }
    return es_client.search(index="code_index", body=search_body)
Conclusion
Elasticsearch (or its Chinese‑made counterpart Easysearch) can fully support code search when a dedicated NGram analyzer and a multi‑level query strategy are applied. Proper parameter tuning and the removal of wildcard queries ensure good performance and system stability.
Mingyi World Elasticsearch
