
Why Can’t Elasticsearch Find My Logs? Uncovering Full‑Text Search Pitfalls and Tokenizer Tweaks

This article explains why large‑scale Elasticsearch clusters may miss log entries during keyword searches, dives into the fundamentals of inverted indexes and tokenization, and demonstrates practical index‑time and query‑time tokenizer optimizations—including custom analyzers for English and Chinese—to dramatically improve search recall and precision.


The author works in a bank's technology department and has been using Elasticsearch (ES) since 2018, moving from ES 5.x to 7.x while handling log retrieval, data analysis, and intelligent monitoring across the organization.

The log platform stores logs from over 1,000 systems, ingesting about 160 billion entries per day with peak write rates of 4 million records per second, managing more than 15,000 indices and a total data volume of over a petabyte.

Despite successful ingestion, keyword searches often fail to return expected error logs, prompting an investigation into ES full‑text search mechanics.

1. ES Full‑Text Search Principles

1.1 Inverted Index

Full‑text search relies on one of two methods: sequential scanning or inverted indexing. An inverted index maps each term to the documents that contain it, so a keyword lookup becomes a near‑constant‑time dictionary access instead of a scan over every document, vastly improving performance.

The typical workflow for building an inverted index is: tokenize each document to extract its terms, map each term to the IDs of the documents that contain it, and store the resulting term‑to‑document dictionary as the index.
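The workflow above can be sketched in a few lines of Python (a minimal illustration, not ES internals; the sample documents are made up):

```python
# Build an inverted index: map each term to the sorted list of
# document IDs that contain it.
from collections import defaultdict

def build_inverted_index(docs):
    """docs: dict of doc_id -> text. Returns term -> sorted doc_id list."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        # Naive lowercase whitespace tokenizer; ES analyzers do far more.
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

docs = {
    1: "connection timeout error",
    2: "timeout waiting for response",
}
index = build_inverted_index(docs)
print(index["timeout"])  # [1, 2]
print(index["error"])    # [1]
```

Once the dictionary exists, answering "which documents mention timeout?" is a single lookup rather than a scan over every log line.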

1.2 Data Index Writing Process

Create document object: ES stores each record as a JSON document with fields representing columns.

Analyze document: Apply tokenization algorithms to extract keywords from field contents.

Create index: Pass keywords to the index component, which builds a dictionary (the inverted index) linking terms to documents.

1.3 Data Query Process

Query statement: Users input keywords, optionally combined with AND, OR, NOT.

Execute search: Keywords are re‑tokenized, stop words removed, and a syntax tree is built to route the query to relevant shards.

Result sorting: Documents are ranked based on relevance to the query.
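The query steps above amount to looking up each keyword's posting list and combining the lists with set operations: AND is an intersection, OR a union, NOT a difference. A minimal sketch (not ES's actual query executor; the index contents are invented):

```python
# Toy posting lists: term -> set of matching document IDs.
index = {
    "timeout": {1, 2, 4},
    "error":   {1, 3},
    "retry":   {4},
}

def query_and(index, *terms):
    """Documents containing ALL terms (intersection of posting lists)."""
    postings = [index.get(t, set()) for t in terms]
    return set.intersection(*postings) if postings else set()

def query_or(index, *terms):
    """Documents containing ANY term (union of posting lists)."""
    return set().union(*(index.get(t, set()) for t in terms))

def query_not(index, term, universe):
    """Documents NOT containing the term."""
    return universe - index.get(term, set())

print(sorted(query_and(index, "timeout", "error")))  # [1]
print(sorted(query_or(index, "error", "retry")))     # [1, 3, 4]
```

In a real cluster each shard evaluates these operations over its own posting lists, and the coordinating node merges and ranks the partial results.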

1.4 Tokenization

Analysis in ES consists of three parts: character filters, a tokenizer, and token filters.

Character filter: Pre-processes text (e.g., removes HTML tags like <b>).

Tokenizer: Splits the filtered text into individual terms (e.g., "QUICK brown fox!" → [quick, brown, fox]).

Token filter: Post‑processes tokens (e.g., lower‑casing, removing stop words).
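A toy model of this three-stage pipeline, assuming a tag-stripping character filter, a simple word tokenizer, and lowercase/stop-word token filters (real ES analyzers are configurable chains of these same parts):

```python
import re

STOP_WORDS = {"the", "a", "an"}

def char_filter(text):
    """Stage 1: strip HTML tags such as <b> before tokenizing."""
    return re.sub(r"<[^>]+>", "", text)

def tokenizer(text):
    """Stage 2: split the filtered text into individual terms."""
    return re.findall(r"\w+", text)

def token_filters(tokens):
    """Stage 3: lowercase each token and drop stop words."""
    return [t.lower() for t in tokens if t.lower() not in STOP_WORDS]

def analyze(text):
    return token_filters(tokenizer(char_filter(text)))

print(analyze("The <b>QUICK</b> brown fox!"))  # ['quick', 'brown', 'fox']
```

Swapping out any one stage (a different tokenizer, an extra filter) is exactly what defining a custom analyzer in ES does, as the next section shows.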

2. Solving the Problem

Understanding ES full‑text search revealed that optimizing tokenization—using fine‑grained tokenizers at index time and coarse‑grained tokenizers at query time—can resolve missed log entries.

2.1 Index‑Time Tokenizer Optimization

ES's default standard analyzer treats strings like java.outofmemory as a single token, so a search for outofmemory finds nothing. The following steps demonstrate the analysis and a custom-analyzer fix.

<code>POST token_test_index-2021.05.26/_doc
{
  "message":"java.outofmemory"
}</code>
<code>POST token_test_index-2021.05.26/_analyze
{
  "analyzer":"standard",
  "text": "java.outofmemory"
}</code>

The analysis returns a single token, java.outofmemory, and a search for outofmemory yields no hits.

<code>{
  "took" : 3,
  "timed_out" : false,
  "hits" : { "total" : { "value" : 0, "relation" : "eq" }, "hits" : [] }
}</code>

To split such terms, a custom analyzer is created that applies a character filter removing special characters.

<code>PUT _template/token_test_index{
  "order" : 10,
  "index_patterns" : ["token_test_index-*"],
  "settings" : {
    "analysis" : {
      "analyzer" : {
        "my_analyzer1" : {
          "type" : "custom",
          "char_filter" : ["special_char_filter"],
          "tokenizer" : "standard"
        }
      },
      "char_filter" : {
        "special_char_filter" : {
          "type" : "mapping",
          "mappings" : ["· =>  "]
        }
      }
    }
  },
  "mappings" : {
    "properties" : {
      "message" : {
        "type" : "text",
        "analyzer" : "my_analyzer1",
        "fields" : { "keyword" : { "type" : "keyword", "ignore_above" : 256 } }
      }
    }
  }
}</code>

After applying the custom analyzer, searching for outofmemory returns the expected document.

<code>{
  "took" : 2,
  "hits" : { "total" : { "value" : 2, "relation" : "eq" }, "hits" : [ { "_source" : { "message" : "java.outofmemory" } } ] }
}</code>

The custom filter replaces eight special characters (":", ".", "'", "_", "·", plus the fullwidth "：" and the curly quotes "‘" and "’") with spaces, greatly improving keyword hit rates.
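Conceptually, the mapping character filter behaves like the following sketch (hypothetical Python, not ES code): replacing those characters with spaces before tokenizing turns java.outofmemory into two searchable terms.

```python
# The eight characters the custom char filter maps to spaces.
SPECIAL_CHARS = ":.'_·：‘’"

def normalize(text):
    """Replace each special character with a space (the char-filter stage)."""
    return text.translate({ord(c): " " for c in SPECIAL_CHARS})

def tokenize(text):
    """Normalize, lowercase, then split on whitespace (the tokenizer stage)."""
    return normalize(text).lower().split()

print(tokenize("java.outofmemory"))  # ['java', 'outofmemory']
print(tokenize("user_id:42"))        # ['user', 'id', '42']
```

Because the substitution happens before tokenization, the index now contains both java and outofmemory as separate terms, which is why the earlier failing query starts matching.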

2.2 Query‑Time Tokenizer Optimization

Coarse-grained tokenization at query time filters out noise and boosts relevance. For Chinese logs, a custom tokenizer based on conditional random fields (CRF) was developed.

<code>POST test_index/_doc
{
  "message":"数据解析失败"
}
POST test_index/_analyze
{
  "analyzer":"crf_analyzer",
  "text":"数据解析失败"
}</code>
<code>{
  "tokens" : [
    {"token":"数据","position":0},
    {"token":"解析","position":1},
    {"token":"失败","position":2}
  ]
}</code>

Applying this analyzer to query processing reduces irrelevant results and speeds up log retrieval.

3. Summary

By deeply understanding ES search principles and applying targeted tokenizer optimizations—fine‑grained at indexing and coarse‑grained at querying, along with custom Chinese tokenizers—the log search recall and precision were significantly improved, offering a practical solution for large‑scale log analysis.

Tags: Elasticsearch, Inverted Index, Log Analysis, Full-Text Search, Tokenizer, Search Optimization
Written by Efficient Ops

This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.
