
Boost Fuzzy Search in Elasticsearch: ngram vs Wildcard Field Explained

This article compares Elasticsearch's ngram analyzer and the newer wildcard field for fuzzy searching, detailing configuration steps, performance trade‑offs, storage impact, and practical test results to help engineers choose the optimal approach for their use case.

Sohu Tech Products

Background

In production, Elasticsearch often needs to support fuzzy queries in addition to exact matches. Here "fuzzy" means substring matching in the style of SQL LIKE, not edit‑distance matching.

Solution 1 – ngram Analyzer

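The ngram tokenizer splits indexed text into overlapping, fine‑grained tokens at index time, so a substring search becomes an ordinary term lookup against precomputed tokens. It trades space for speed: the index grows considerably, and configuring it well requires a solid understanding of tokenizers. In the mapping below, index.max_ngram_diff is raised because the gap between max_gram and min_gram (7) exceeds the default limit of 1.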

PUT test-005
{
  "settings": {
    "index.max_ngram_diff": 10,
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "ngram",
          "min_gram": 3,
          "max_gram": 10,
          "token_chars": ["letter", "digit"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "my_analyzer",
        "fields": {"keyword": {"type": "keyword"}}
      }
    }
  }
}

POST test-005/_bulk
{ "index": {"_id":1}}
{ "title":"英文官网承认刘强东一度被捕的原因是涉嫌性侵"}
{ "index": {"_id":2}}
{ "title":"别提了朋友哥哥刘强东窗事发了"}
{ "index": {"_id":3}}
{ "title":"刘强东施效颦,没想到竟然收获了流量"}
{ "index": {"_id":4}}
{ "title":"刘强东是谁?我不认识"}

POST test-005/_search
{
  "query": {"match_phrase": {"title": "刘强东"}}
}
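
To see why the match_phrase query above can hit a document, it helps to look at the tokens the analyzer would produce. The following is a minimal Python sketch that approximates the configured tokenizer (min_gram 3, max_gram 10, letters and digits only). The function name ngram_tokens is hypothetical, and str.isalnum is only an approximation of Elasticsearch's letter and digit character classes.

```python
from itertools import groupby


def ngram_tokens(text, min_gram=3, max_gram=10):
    """Approximate an ngram tokenizer restricted to letters and digits."""
    tokens = []
    # Split the text into runs of letters/digits; anything else is a separator.
    for keep, group in groupby(text, key=str.isalnum):
        if not keep:
            continue
        chunk = "".join(group)
        # Emit every substring of length min_gram..max_gram at each start offset.
        for start in range(len(chunk)):
            for size in range(min_gram, max_gram + 1):
                if start + size <= len(chunk):
                    tokens.append(chunk[start:start + size])
    return tokens


if __name__ == "__main__":
    for token in ngram_tokens("刘强东是谁?我不认识"):
        print(token)
```

Because the three‑character search term is itself one of the emitted tokens, the query reduces to a lookup of a single indexed token. Any document whose title contains that character sequence matches, including titles where the final character belongs to a different word.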

Advantages: fast recall, low runtime cost.

Disadvantages: significant storage overhead, higher granularity increases space usage, and a learning curve for tokenizer configuration.

Empirical data shows the ngram‑based index can be up to ten times larger than a keyword index.
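
The storage cost follows from how many tokens each value produces. As a rough illustration, the sketch below counts the ngrams emitted for one unbroken run of characters. The function name ngram_count is hypothetical, and token count is only a proxy: actual index size also depends on compression, repeated terms, and stored positions.

```python
def ngram_count(length, min_gram=3, max_gram=10):
    """Number of ngrams emitted for one unbroken run of `length` characters."""
    largest = min(max_gram, length)
    # For each gram size there are (length - size + 1) possible start offsets.
    return sum(length - size + 1 for size in range(min_gram, largest + 1))


if __name__ == "__main__":
    for length in (5, 20, 50):
        print(length, ngram_count(length))
```

Under these settings a 20‑character value yields 116 tokens, whereas a keyword field stores the value as a single term. Narrowing the range to min_gram 3 and max_gram 3 drops that to 18 tokens.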

Solution 2 – Wildcard Query

The wildcard query provides SQL‑like LIKE functionality. Internally, Lucene builds a deterministic finite automaton (DFA) from the pattern, which can be costly for complex patterns.
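
The cost depends heavily on where the wildcard sits. The sketch below is a toy model, not Lucene's implementation: it uses Python's re module instead of an automaton, and the helpers wildcard_to_regex and scan_terms are hypothetical. It shows that a literal prefix lets the search jump to a small range of a sorted term list, while a leading wildcard forces every term to be examined.

```python
import bisect
import re


def wildcard_to_regex(pattern):
    """Translate * (any run of characters) and ? (one character) into a regex."""
    pieces = []
    for ch in pattern:
        if ch == "*":
            pieces.append(".*")
        elif ch == "?":
            pieces.append(".")
        else:
            pieces.append(re.escape(ch))
    return re.compile("".join(pieces), re.DOTALL)


def scan_terms(sorted_terms, pattern):
    """Return (matching terms, number of terms examined)."""
    regex = wildcard_to_regex(pattern)
    # The literal text before the first wildcard narrows the range to scan.
    prefix = re.split(r"[*?]", pattern, maxsplit=1)[0]
    start = bisect.bisect_left(sorted_terms, prefix)
    matches, examined = [], 0
    for term in sorted_terms[start:]:
        if not term.startswith(prefix):
            break
        examined += 1
        if regex.fullmatch(term):
            matches.append(term)
    return matches, examined


if __name__ == "__main__":
    terms = sorted(["lengthy", "quick", "quiet", "quite", "string"])
    print(scan_terms(terms, "qui*"))  # only the terms sharing the prefix are examined
    print(scan_terms(terms, "*ite"))  # the leading wildcard examines every term
```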

Advantages: simple to use, no extra storage required.

Disadvantages: high runtime cost; misuse can cause production incidents.

Elasticsearch 7.9 introduced a dedicated wildcard field type to address fuzzy matching efficiently.

Wildcard Field Usage

Define a wildcard field in the mapping, index a document, and query it with a wildcard query. The wildcard query also accepts a case_insensitive parameter.

PUT my-index-000001
{
  "mappings": {
    "properties": {
      "my_wildcard": {"type": "wildcard"}
    }
  }
}

PUT my-index-000001/_doc/1
{ "my_wildcard": "This string can be quite lengthy" }

GET my-index-000001/_search
{
  "query": {"wildcard": {"my_wildcard": "*quite*lengthy"}}
}

GET my-index-000001/_search
{
  "query": {"wildcard": {"my_wildcard": {"value": "*Quite*lengthy", "case_insensitive": true}}}
}

Wildcard Implementation Details

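The new field stores two structures: an n‑gram index of all three‑character sequences in each value, and a binary doc value holding the original string. The n‑gram index quickly narrows the search to a set of candidate documents, and the doc value is then used to verify each candidate against the full pattern, combining fast candidate generation with compact storage.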
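
The two‑step idea can be sketched in a few lines of Python. This is a simplified model under stated assumptions, not Elasticsearch's actual implementation: the class ToyWildcardField and its helpers are hypothetical, and the real field handles short patterns, string boundaries, and compression in ways this sketch ignores.

```python
import re
from collections import defaultdict


def trigrams(text):
    """All three-character sequences contained in text."""
    return {text[i:i + 3] for i in range(len(text) - 2)}


def wildcard_to_regex(pattern):
    """Translate * and ? into an equivalent regular expression."""
    pieces = []
    for ch in pattern:
        if ch == "*":
            pieces.append(".*")
        elif ch == "?":
            pieces.append(".")
        else:
            pieces.append(re.escape(ch))
    return re.compile("".join(pieces), re.DOTALL)


class ToyWildcardField:
    def __init__(self):
        self.postings = defaultdict(set)  # trigram -> ids of documents containing it
        self.doc_values = {}              # document id -> original string

    def index(self, doc_id, value):
        self.doc_values[doc_id] = value
        for gram in trigrams(value):
            self.postings[gram].add(doc_id)

    def candidates(self, pattern):
        """Step 1: documents containing every trigram of every literal fragment."""
        result = set(self.doc_values)
        for literal in re.split(r"[*?]", pattern):
            for gram in trigrams(literal):
                result &= self.postings.get(gram, set())
        return result

    def search(self, pattern):
        """Step 2: verify each candidate against the stored original value."""
        regex = wildcard_to_regex(pattern)
        return sorted(
            doc_id
            for doc_id in self.candidates(pattern)
            if regex.fullmatch(self.doc_values[doc_id])
        )


if __name__ == "__main__":
    field = ToyWildcardField()
    field.index(1, "This string can be quite lengthy")
    field.index(2, "quite short")
    field.index(3, "lengthy but quiet")
    print(field.search("*quite*lengthy"))
```

In this model a literal shorter than three characters contributes no trigrams, so every document stays a candidate and must be verified one by one. That is one plausible reason short, unselective terms benefit less from the n‑gram step.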

Performance Tests

Running the same wildcard queries against a keyword field and a wildcard field shows substantial speed gains for the wildcard type, with the largest gains when the query term is highly selective.

Query "红豆": keyword 715 ms vs wildcard 71 ms

Query "006-612014": keyword 633 ms vs wildcard 22 ms

Query "55": keyword 584 ms vs wildcard 188 ms

Query "11": keyword 1359 ms vs wildcard 357 ms

Overall, in these measurements the wildcard field cut query latency to roughly one‑third to one‑quarter for low‑selectivity terms ("55", "11") and to between one‑tenth and one‑twenty‑ninth for high‑selectivity terms ("红豆", "006-612014").
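
The ratios can be checked directly from the four measurements listed above. This short Python sketch only does the arithmetic on the reported latencies; the variable names are illustrative.

```python
# Reported latencies in milliseconds: (keyword field, wildcard field).
timings_ms = {
    "红豆": (715, 71),
    "006-612014": (633, 22),
    "55": (584, 188),
    "11": (1359, 357),
}

# Speedup = keyword latency divided by wildcard latency, rounded to one decimal.
speedups = {
    query: round(keyword_ms / wildcard_ms, 1)
    for query, (keyword_ms, wildcard_ms) in timings_ms.items()
}

if __name__ == "__main__":
    for query, factor in speedups.items():
        print(f"{query}: {factor}x faster")
```

The two selective terms come out at about 10x and 29x, while the two short numeric terms come out at about 3x and 4x.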

Conclusion

Wildcard fields satisfy most fuzzy‑search requirements. In the tests above they answered wildcard queries several times faster than a keyword field, and they consume less storage than an ngram analyzer with a wide gram range. Their efficiency still depends on how selective the query term is, so developers should benchmark both approaches against their own workloads.

