
Optimizing Elasticsearch for High‑Concurrency LBS Search with an RLE‑Based Inverted Index

This article details Meituan's search‑engine optimization for its food‑delivery platform, describing the performance bottlenecks of Elasticsearch's inverted‑list query and merge phases, the design of a run‑length‑encoding (RLE) index, custom hash‑map term look‑ups, sparse RoaringBitmap structures, integration steps, and the resulting 84% latency reduction.


1. Introduction

Over the past decade, Elasticsearch has become the most popular open‑source search engine, serving offline data warehouses, near‑real‑time retrieval, and business‑facing (B‑side) services. Meituan uses it as the core engine for its food‑delivery search, handling over a billion queries daily. Rapid growth in supply and data volume caused query latency and CPU load to rise sharply, especially in the inverted‑list query and merge stages.

2. Background

The LBS (Location‑Based Service) scenario requires filtering merchants by delivery range, where each merchant's range is described by multiple electronic fences (polygons). Instead of using Geo‑Polygon queries, Meituan first obtains a candidate merchant list from an upstream R‑tree service and then issues a single Elasticsearch request such as:

POST food/_search
{
  "query": {
    "bool": {
      "must": {
        "term": { "spu_name": { "value": "烤鸭" } }
        // ...
      },
      "filter": {
        "terms": {
          "wm_poi_id": [1,3,18,27,28,29,...,37465542] // 上万
        }
      }
    }
  }
  // ...
}

A terms query is cheap for a single term (under 0.001 ms), but when a search covers 20,000–30,000 merchants the total cost rises to 5–10 ms, making the merge phase a major bottleneck.

3. Challenges and Problems

Elasticsearch/Lucene stores posting lists both in memory and on disk. A typical terms query involves three steps: locating the term block on disk, reading the block into memory, and decoding the posting list. Block lookup costs O(log N) per term, and merging M terms with average posting‑list length K costs O(K · M), which becomes prohibitive at the scale of millions of postings.

The CPU flame‑graph shows that both posting‑list read and merge consume significant cycles, and the problem will worsen as data grows.

4. Technical Exploration and Practice

4.1 Inverted‑List Query Optimization

Because the terms are long integers with exact‑match semantics and a predictable candidate‑set size, an O(1) hash‑map lookup can replace the Finite‑State Transducer (FST) used by default. Benchmarks with 100,000 <Long, Long> pairs show that HashMap/LongObjectHashMap provide O(1) query time at the cost of higher memory usage; the latter was chosen for production.
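To illustrate the trade‑off, the sketch below contrasts an ordered, FST‑like binary‑search lookup with a hash lookup over the same term dictionary. The production code uses LongObjectHashMap from a third‑party collections library; this sketch substitutes java.util.HashMap, and all names are illustrative rather than taken from the actual codebase.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Compare the default ordered (FST-like) term lookup with an O(1) hash lookup.
class TermLookup {
    final long[] sortedTerms;           // stand-in for the FST's ordered term dictionary
    final Map<Long, Integer> hashIndex; // term -> posting-list ordinal

    TermLookup(long[] terms) {
        this.sortedTerms = terms.clone();
        Arrays.sort(this.sortedTerms);
        this.hashIndex = new HashMap<>(terms.length * 2);
        for (int i = 0; i < sortedTerms.length; i++) {
            hashIndex.put(sortedTerms[i], i);
        }
    }

    // Ordered lookup: O(log N) comparisons per term, like the FST path.
    int orderedLookup(long term) {
        return Arrays.binarySearch(sortedTerms, term);
    }

    // Hash lookup: O(1) expected, at the cost of extra memory for the map.
    int hashLookup(long term) {
        return hashIndex.getOrDefault(term, -1);
    }
}
```

With tens of thousands of candidate terms per query, eliminating the log‑factor per lookup adds up, which is why the memory overhead was considered acceptable.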

4.2 Inverted‑List Merge Optimization

Lucene’s native merge uses a bitset (FixedBitSet) that iterates over every document ID, which is inefficient for sparse posting lists. RoaringBitmap is the default structure for cached filters, but its container types (ArrayContainer, BitmapContainer) do not support batch merging. The following pseudocode illustrates the original merge logic:

Input:  termsEnum: the inverted index; termIterator: the candidate terms
Output: docIdSet: the final doc-ID set
for term in termIterator:
  if termsEnum.seekExact(term) != null:
    docs = read_disk()  // read the posting block from disk
    docs = decode(docs) // decode the posting list
    for doc in docs:
      docIdSet.or(doc)  // implemented as DocIdSetBuilder.add
end for
docIdSet.build() // merge and sort to produce the final doc-ID set

To avoid iterating over empty buckets, a sparse RoaringBitmap (SparseRoaringDocIdSet) was designed. It records only non‑empty buckets in an index array, allowing dense iteration over actual data.

class SparseRoaringDocIdSet {
  int[] index;           // bucket indexes that actually contain a container
  Container[] denseSets; // the densely packed posting containers
}
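A minimal model of that idea is sketched below. It assumes containers are plain arrays of low 16‑bit values rather than real RoaringBitmap containers, and the class and method names are illustrative; the point is that iteration touches only the non‑empty buckets listed in `index`.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified sketch of the SparseRoaringDocIdSet idea: keep only non-empty
// 65,536-doc buckets in `index` and iterate the densely packed containers,
// instead of scanning every bucket of a full bitmap.
class SparseDocIdSet {
    final int[] index;       // bucket numbers that hold a container
    final int[][] denseSets; // one array of low-16-bit doc values per bucket

    SparseDocIdSet(int[] sortedDocIds) {
        List<Integer> buckets = new ArrayList<>();
        List<int[]> containers = new ArrayList<>();
        int i = 0;
        while (i < sortedDocIds.length) {
            int bucket = sortedDocIds[i] >>> 16;
            int j = i;
            while (j < sortedDocIds.length && (sortedDocIds[j] >>> 16) == bucket) j++;
            int[] c = new int[j - i];
            for (int k = i; k < j; k++) c[k - i] = sortedDocIds[k] & 0xFFFF;
            buckets.add(bucket);
            containers.add(c);
            i = j;
        }
        index = buckets.stream().mapToInt(Integer::intValue).toArray();
        denseSets = containers.toArray(new int[0][]);
    }

    // Iterate only over non-empty buckets, reconstructing full doc IDs.
    int[] toDocIds() {
        int total = 0;
        for (int[] c : denseSets) total += c.length;
        int[] out = new int[total];
        int p = 0;
        for (int b = 0; b < index.length; b++) {
            for (int low : denseSets[b]) out[p++] = (index[b] << 16) | low;
        }
        return out;
    }
}
```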

4.3 RLE‑Based Posting‑List Format

After applying index sorting on fields such as city_id, geohash, and shop_id, each posting list becomes a strictly increasing, largely consecutive integer sequence. Run‑length encoding (RLE) can therefore compress a posting list into a handful of [start, length] runs, shrinking it from dozens of kilobytes to a few bytes. Merging M such lists then reduces to sorting their runs, with O(M log M) complexity instead of O(K · M).
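The encoding and the run‑based merge can be sketched as follows; the class and method names are illustrative, not the production implementation, and each run is modeled as an `int[] {start, length}` pair.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of RLE compression for a sorted posting list and a run-based union.
// Merging reduces to sorting runs by start and coalescing overlaps, so the
// cost scales with the number of runs, not the number of documents.
class RlePostings {
    // Encode a strictly increasing doc-ID list into [start, length] runs.
    static List<int[]> encode(int[] docIds) {
        List<int[]> runs = new ArrayList<>();
        int i = 0;
        while (i < docIds.length) {
            int start = docIds[i];
            int j = i + 1;
            while (j < docIds.length && docIds[j] == docIds[j - 1] + 1) j++;
            runs.add(new int[] { start, j - i });
            i = j;
        }
        return runs;
    }

    // Union several run lists: sort all runs on start, then coalesce.
    static List<int[]> merge(List<List<int[]>> postingLists) {
        List<int[]> all = new ArrayList<>();
        postingLists.forEach(all::addAll);
        all.sort((a, b) -> Integer.compare(a[0], b[0]));
        List<int[]> out = new ArrayList<>();
        for (int[] run : all) {
            int end = run[0] + run[1]; // exclusive end of this run
            if (!out.isEmpty()
                    && run[0] <= out.get(out.size() - 1)[0] + out.get(out.size() - 1)[1]) {
                int[] last = out.get(out.size() - 1); // overlaps or touches: extend
                last[1] = Math.max(last[0] + last[1], end) - last[0];
            } else {
                out.add(new int[] { run[0], run[1] });
            }
        }
        return out;
    }
}
```

Because index sorting makes each merchant's documents contiguous, a posting list that previously held thousands of individual IDs collapses to a few runs, which is where the memory and merge savings come from.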

4.4 Integration

The new format is implemented as a custom PostingsFormat that hooks into Elasticsearch’s segment loading process, generates RLE posting tables for configured fields, and associates them with the segment. The plugin also provides a fallback to the native query path when the RLE index is unavailable, ensuring safe roll‑backs.
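The roll‑back behaviour can be sketched as a simple dispatch; `DocIdSource`, `RleQueryRouter`, and the method names below are hypothetical stand‑ins, not the plugin's actual API.

```java
// Illustrative dispatch between the RLE path and the stock query path.
interface DocIdSource {
    int[] fetchDocIds(long[] terms);
}

class RleQueryRouter {
    private final DocIdSource rlePath;    // RLE posting tables, if loaded for this segment
    private final DocIdSource nativePath; // stock terms-query path

    RleQueryRouter(DocIdSource rlePath, DocIdSource nativePath) {
        this.rlePath = rlePath;
        this.nativePath = nativePath;
    }

    int[] query(long[] terms) {
        // Fall back to the native path whenever the RLE index is unavailable,
        // so the feature can be disabled or rolled back without downtime.
        DocIdSource src = (rlePath != null) ? rlePath : nativePath;
        return src.fetchDocIds(terms);
    }
}
```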

5. Performance Gains

Field‑level index sorting combined with the sparse RoaringBitmap structure and RLE compression reduced the 99th‑percentile (TP99) latency of the full search chain from 100 ms to 16 ms, an 84% improvement. CPU usage on data nodes dropped accordingly, as shown in the production flame graphs.

6. Conclusion and Outlook

The article demonstrates a systematic approach: problem identification, traffic recording, profiling, hypothesis testing with JMH benchmarks, engineering integration, and gray‑release validation. The RLE‑based posting format solved the specific LBS search bottleneck, and the methodology can be applied to other large‑scale Elasticsearch workloads.

Future work includes automatic container selection (bitmap/array/RLE) based on data distribution and persisting RLE structures directly to index files to avoid in‑memory updates for heavy incremental indexing.

7. Author Biography

Zeyu, Zhang Cong, Xiao Peng and others are members of Meituan’s Delivery Business Group – Search Engineering Team.

