How to Enable Accurate Code Search in Elasticsearch with an NGram Analyzer

This article analyzes the shortcomings of standard Elasticsearch analyzers for code search, presents a custom NGram analyzer combined with match_phrase queries, shows configuration and query examples, compares performance of different query types, and offers best‑practice guidelines and pitfalls to avoid when building a reliable code‑search system.

Mingyi World Elasticsearch

Problem: code‑search limitations

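Importing source files into a standard Elasticsearch index yields poor results because the analyzer tokenizes identifiers in ways that do not match how developers search. Depending on the analyzer, an identifier such as migrate_data is either split into ["migrate", "data"], which breaks exact matches, or kept as one whole token (the standard tokenizer does not split on underscores), in which case a search for data never matches it. Wildcard queries provide partial matching but require scanning large parts of the index, which causes high cluster load and risks downtime. Prefix queries match only the beginning of a term and cannot find middle substrings such as data in migrate_data.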

Solution: NGram analyzer + match_phrase

Custom NGram analyzer

{
  "settings": {
    "index": {"max_ngram_diff": 8},
    "analysis": {
      "filter": {
        "code_ngram_filter": {
          "type": "ngram",
          "min_gram": 2,
          "max_gram": 10
        }
      },
      "analyzer": {
        "code_ngram_analyzer": {
          "type": "custom",
          "tokenizer": "keyword",
          "filter": ["lowercase", "code_ngram_filter"]
        }
      }
    }
  }
}

Field mapping design

{
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "analyzer": "code_analyzer",
        "fields": {
          "keyword": {"type": "keyword"},
          "ngram": {
            "type": "text",
            "analyzer": "code_ngram_analyzer",
            "search_analyzer": "code_ngram_analyzer"
          }
        }
      }
    }
  }
}

Design principle: the main name field uses code_analyzer (standard tokenizer plus lowercase) for full‑text matching, name.keyword provides exact matches with the fastest lookups, and name.ngram enables partial matching while keeping acceptable performance.

NGram token generation example

Indexing migrate_data with code_ngram_analyzer (min_gram=2, max_gram=10) produces tokens such as:

"mi", "ig", "gr", "ra", "at", "te", "e_", "_d", "da", "at", "ta", "mig", "igr", "gra", "rat", "ate", "te_", "e_d", "_da", "dat", "ata", "migr", "igra", "grat", "rate", "ate_", "te_d", "e_da", "_dat", "data", ...

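To check the token list above without a cluster, the filter's output can be approximated in plain Python. This is a minimal sketch: the helper name ngram_tokens is hypothetical, and it only mimics the keyword tokenizer, lowercase, and ngram filter chain. Real Elasticsearch output also carries positions and offsets and may order the tokens differently.

```python
from typing import List

def ngram_tokens(text: str, min_gram: int = 2, max_gram: int = 10) -> List[str]:
    """Approximate code_ngram_analyzer: treat the whole input as one token,
    lowercase it, then emit every substring of length min_gram..max_gram."""
    token = text.lower()
    grams = []
    for size in range(min_gram, max_gram + 1):
        # No iterations happen when size exceeds the token length.
        for start in range(len(token) - size + 1):
            grams.append(token[start:start + size])
    return grams

tokens = ngram_tokens("migrate_data")
print(len(tokens))   # total number of grams, duplicates included
print(tokens[:11])   # the 2-character grams
```

For this 12-character identifier the sketch yields 63 grams, which illustrates why an NGram subfield inflates the index.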
Core logic: multi‑level search strategy

Query construction

from typing import Any, Dict, Optional

def _build_keyword_query(self, query: str, language: Optional[str], project_id: Optional[str], filters: Dict[str, Any]) -> Dict[str, Any]:
    # Method excerpt from the search class. language, project_id, and filters
    # are accepted here but not applied in this snippet.
    should_clauses = [
        # 1. Exact match (highest priority, best performance)
        {"term": {"name.keyword": {"value": query, "boost": 10.0}}},
        # 2. NGram phrase match (preserve order, partial match)
        {"match_phrase": {"name.ngram": {"query": query, "boost": 9.0, "slop": 0}}},
        # 3. NGram partial match (allow any token)
        {"match": {"name.ngram": {"query": query, "boost": 8.0, "operator": "or"}}},
        # 4. Standard phrase match (preserve order)
        {"match_phrase": {"name": {"query": query, "boost": 7.0}}},
        # 5. Standard match (all terms must match)
        {"match": {"name": {"query": query, "boost": 5.0, "operator": "and"}}},
        # 6. Multi‑field fallback
        {"multi_match": {
            "query": query,
            "fields": ["name^3", "name.ngram^2", "content", "signature^2"],
            "type": "best_fields",
            "fuzziness": "AUTO",
            "boost": 1.0
        }}
    ]
    # Wrapping the should clauses in must, with minimum_should_match = 1,
    # requires every hit to satisfy at least one of the six clauses.
    return {
        "query": {
            "bool": {
                "must": [{"bool": {"should": should_clauses, "minimum_should_match": 1}}]
            }
        }
    }
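The query builder above accepts language, project_id, and filters but never uses them. One plausible way to apply them, shown here only as a sketch, is to turn them into non-scoring clauses under bool.filter. The helper names build_filter_clauses and with_filters, and the field names language and project_id, are assumptions; the original does not show how these fields are mapped.

```python
from typing import Any, Dict, List, Optional

def build_filter_clauses(language: Optional[str],
                         project_id: Optional[str],
                         filters: Optional[Dict[str, Any]] = None) -> List[Dict[str, Any]]:
    """Build clauses for bool.filter (hypothetical helper).

    Filter clauses restrict the result set without affecting the score,
    so the boosts on the should clauses keep deciding the ranking.
    """
    clauses: List[Dict[str, Any]] = []
    if language:
        clauses.append({"term": {"language": language}})      # assumed keyword field
    if project_id:
        clauses.append({"term": {"project_id": project_id}})  # assumed keyword field
    for field, value in (filters or {}).items():
        # A list becomes a terms clause, a single value a term clause.
        if isinstance(value, list):
            clauses.append({"terms": {field: value}})
        else:
            clauses.append({"term": {field: value}})
    return clauses

def with_filters(search_body: Dict[str, Any], clauses: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Return a copy of a {"query": {"bool": ...}} body with the filter clauses attached."""
    bool_query = dict(search_body["query"]["bool"])
    if clauses:
        bool_query["filter"] = clauses
    return {**search_body, "query": {"bool": bool_query}}
```

Keeping the filters outside the should list means a document that fails a filter is excluded even if it matches the name exactly.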

Search effect verification

Scenario 1 – query migrate_data: the term clause (boost 10), the NGram phrase clause (boost 9), and the NGram partial clause (boost 8) all match.

Scenario 2 – query migrate: the NGram phrase and partial clauses both find migrate_data; multi‑match finds async_migrate_task.

Scenario 3 – query data: the NGram partial clause finds migrate_data and get_user_data with high scores.
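The scenarios can be reproduced approximately without Elasticsearch. The sketch below is a simplified in-memory model, not the real scoring: it treats an identifier as a hit when every gram of the query also occurs among the identifier's grams, which is stricter than the "or" clause used in the query. The function names and the extra identifier parse_config are hypothetical.

```python
from typing import List, Set

def grams(text: str, min_gram: int = 2, max_gram: int = 10) -> Set[str]:
    """All lowercase substrings of length min_gram..max_gram."""
    token = text.lower()
    return {token[i:i + n]
            for n in range(min_gram, max_gram + 1)
            for i in range(len(token) - n + 1)}

def candidates(query: str, names: List[str]) -> List[str]:
    """Names whose gram set contains every gram of the query."""
    query_grams = grams(query)
    # An empty gram set (query shorter than min_gram) matches nothing.
    return [name for name in names if query_grams and query_grams <= grams(name)]

names = ["migrate_data", "async_migrate_task", "get_user_data", "parse_config"]
print(candidates("data", names))     # ['migrate_data', 'get_user_data']
print(candidates("migrate", names))  # ['migrate_data', 'async_migrate_task']
```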

Performance comparison

Query type – Performance – Match capability – Risk

term (exact) – ★★★★★ – full match – no risk

match_phrase (ngram) – ★★★★ – partial match with order – no risk

match (ngram) – ★★★★ – partial match – no risk

match_phrase (standard) – ★★★★ – exact match with order – no risk

wildcard – ★ – any pattern – high risk (removed)

Best practices

Set min_gram = 2 and max_gram = 10 to balance index size and matching ability.

Configure max_ngram_diff ≥ max_gram - min_gram (8 for the values above; the default is 1) to avoid index-creation errors.

Use a three‑field strategy: name (standard), name.keyword (exact), name.ngram (partial).

Remove wildcard queries to eliminate performance risks.

Assign higher boost values to exact matches and lower boosts to broader matches.
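The max_ngram_diff rule can be checked before an index is created. The following sketch walks a settings dictionary shaped like the ones in this article and reports the smallest max_ngram_diff that its ngram filters need. The function name required_ngram_diff is hypothetical, and it inspects only token filters, not ngram tokenizers.

```python
from typing import Any, Dict

def required_ngram_diff(settings: Dict[str, Any]) -> int:
    """Smallest max_ngram_diff that satisfies every ngram token filter."""
    filters = settings.get("analysis", {}).get("filter", {})
    diffs = [
        f.get("max_gram", 2) - f.get("min_gram", 1)  # ngram filter defaults: 1 and 2
        for f in filters.values()
        if f.get("type") == "ngram"
    ]
    return max(diffs, default=0)

settings = {
    "index": {"max_ngram_diff": 10},
    "analysis": {
        "filter": {
            "code_ngram_filter": {"type": "ngram", "min_gram": 2, "max_gram": 10}
        }
    },
}
needed = required_ngram_diff(settings)
configured = settings.get("index", {}).get("max_ngram_diff", 1)  # default is 1
print(needed, configured >= needed)  # 8 True
```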

Common pitfalls

Index size growth: NGram generates many tokens, increasing index size by 2‑3×. Mitigate by tuning min_gram / max_gram and applying NGram only to necessary fields.

Search latency: Queries on NGram fields can be slower. Prefer match_phrase over match for NGram fields to limit token expansion.

Reindex cost: Changing an analyzer requires reindexing, because documents that are already indexed keep the tokens produced by the old analyzer. Use index aliases for zero‑downtime swaps and plan analyzer settings ahead of time.
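The alias swap mentioned under reindex cost can be expressed as a single request to the aliases API. This sketch only builds the request body; the alias name code_search and the index names code_index_v1 and code_index_v2 are made up for illustration, and the new index is assumed to be fully reindexed before the swap.

```python
from typing import Any, Dict

def alias_swap_actions(alias: str, old_index: str, new_index: str) -> Dict[str, Any]:
    """Body for POST /_aliases that moves an alias from old_index to new_index.

    Both actions are applied atomically, so searches against the alias
    never see a moment where it points at no index.
    """
    return {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }

body = alias_swap_actions("code_search", "code_index_v1", "code_index_v2")
# With the official Python client this would be sent roughly as:
#   es_client.indices.update_aliases(body=body)   # assumes a 7.x-style client
print(body["actions"][1])
```

Applications then search the alias rather than a concrete index name, so a later analyzer change needs no client-side change.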

Performance optimisation suggestions

Use match_phrase for NGram fields (better performance than match).

Adjust index settings: increase number_of_shards based on data volume, set appropriate number_of_replicas, and increase refresh_interval (e.g., "30s") to improve indexing throughput.
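Of the settings above, number_of_shards is fixed when the index is created, while number_of_replicas and refresh_interval can be changed on a live index. A hedged sketch of building such an update follows; the helper name dynamic_index_settings is hypothetical, and the values are the ones quoted in this article, not tuned recommendations.

```python
from typing import Any, Dict, Optional

def dynamic_index_settings(refresh_interval: Optional[str] = None,
                           number_of_replicas: Optional[int] = None) -> Dict[str, Any]:
    """Body for the update-settings API containing only the values that were given."""
    index_settings: Dict[str, Any] = {}
    if refresh_interval is not None:
        index_settings["refresh_interval"] = refresh_interval
    if number_of_replicas is not None:
        index_settings["number_of_replicas"] = number_of_replicas
    return {"index": index_settings}

body = dynamic_index_settings(refresh_interval="30s", number_of_replicas=1)
# Roughly: es_client.indices.put_settings(index="code_index", body=body)
print(body)  # {'index': {'refresh_interval': '30s', 'number_of_replicas': 1}}
```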

Full Python example

from elasticsearch import Elasticsearch

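def create_code_index(es_client):
    # Create code_index with both analyzers and the three-field name mapping.
    # max_ngram_diff must be >= max_gram - min_gram (8 here), so 10 is sufficient.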
    mapping = {
        "settings": {
            "number_of_shards": 3,
            "number_of_replicas": 1,
            "index": {"max_ngram_diff": 10},
            "analysis": {
                "analyzer": {
                    "code_analyzer": {
                        "type": "custom",
                        "tokenizer": "standard",
                        "filter": ["lowercase"]
                    },
                    "code_ngram_analyzer": {
                        "type": "custom",
                        "tokenizer": "keyword",
                        "filter": ["lowercase", "code_ngram_filter"]
                    }
                },
                "filter": {
                    "code_ngram_filter": {
                        "type": "ngram",
                        "min_gram": 2,
                        "max_gram": 10
                    }
                }
            }
        },
        "mappings": {
            "properties": {
                "name": {
                    "type": "text",
                    "analyzer": "code_analyzer",
                    "fields": {
                        "keyword": {"type": "keyword"},
                        "ngram": {
                            "type": "text",
                            "analyzer": "code_ngram_analyzer",
                            "search_analyzer": "code_ngram_analyzer"
                        }
                    }
                }
            }
        }
    }
    es_client.indices.create(index="code_index", body=mapping)

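def search_code(es_client, query):
    # Multi-level query: a hit must satisfy at least one clause, and the boosts
    # rank exact hits above broader ones. content and signature are not defined
    # in the mapping above; they are assumed to exist on the indexed documents.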
    search_body = {
        "query": {
            "bool": {
                "should": [
                    {"term": {"name.keyword": {"value": query, "boost": 10.0}}},
                    {"match_phrase": {"name.ngram": {"query": query, "boost": 9.0, "slop": 0}}},
                    {"match": {"name.ngram": {"query": query, "boost": 8.0, "operator": "or"}}},
                    {"match_phrase": {"name": {"query": query, "boost": 7.0}}},
                    {"match": {"name": {"query": query, "boost": 5.0, "operator": "and"}}},
                    {"multi_match": {
                        "query": query,
                        "fields": ["name^3", "name.ngram^2", "content", "signature^2"],
                        "type": "best_fields",
                        "fuzziness": "AUTO",
                        "boost": 1.0
                    }}
                ],
                "minimum_should_match": 1
            }
        }
    }
    return es_client.search(index="code_index", body=search_body)
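The search response is a nested structure, and callers usually want only names and scores. The sketch below pulls them out of a response shaped like a standard search result. The helper name top_names is hypothetical, and the sample response is hand-written to show the shape; it is not real output.

```python
from typing import Any, Dict, List, Tuple

def top_names(response: Dict[str, Any], limit: int = 10) -> List[Tuple[str, float]]:
    """Return (name, score) pairs from a search response, best score first."""
    hits = response["hits"]["hits"]
    pairs = [(hit["_source"]["name"], hit["_score"]) for hit in hits]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)[:limit]

# Hand-written stand-in for what search_code(es_client, "data") might return.
sample_response = {
    "hits": {
        "hits": [
            {"_source": {"name": "get_user_data"}, "_score": 7.5},
            {"_source": {"name": "migrate_data"}, "_score": 8.1},
        ]
    }
}
print(top_names(sample_response))  # [('migrate_data', 8.1), ('get_user_data', 7.5)]
```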

Conclusion

Elasticsearch (or its Chinese‑made counterpart Easysearch) can fully support code search when a dedicated NGram analyzer and a multi‑level query strategy are applied. Proper parameter tuning and the removal of wildcard queries ensure good performance and system stability.
