Backend Development 26 min read

Elasticsearch Query and Merge Optimization Using Run-Length Encoding for Meituan Takeaway Search

Meituan's food‑delivery search team identified heavy CPU and latency hotspots in Elasticsearch's posting‑list query and merge phases, then redesigned the inverted index using Run‑Length Encoding, hash‑based term lookup, index sorting and a custom SparseRoaringDocIdSet, ultimately reducing TP99 search latency by 84% and cutting CPU usage dramatically.

Meituan Technology Team

Nov 17, 2022

Elasticsearch Query and Merge Optimization Using Run-Length Encoding for Meituan Takeaway Search

1. Introduction

Elasticsearch has become the de‑facto open‑source search engine for offline data warehouses, near‑real‑time retrieval, and B‑side search. In Meituan's food‑delivery (外卖) search, the engine handles over a billion queries per day, but rapid growth in supply and data volume caused query latency and CPU load to rise sharply.

2. Background

The service is a classic Location‑Based Service (LBS) scenario: a merchant’s deliverable area is defined by multiple electronic fences. Instead of using a Geo‑Polygon query, Meituan first filters merchants via an upstream R‑tree service, then issues a standard Elasticsearch request that matches the merchant IDs.

POST food/_search
{
  "query": {
    "bool": {
      "must": {
        "term": { "spu_name": { "value": "烤鸭" } }
        // ...
      },
      "filter": {
        "terms": {
          "wm_poi_id": [1,3,18,27,28,29,...,37465542] // tens of thousands
        }
      }
    }
  }
  // ...
}

Terms queries are fast (≈0.001 ms per term), but when a single request touches 20 000–30 000 nearby merchants, the total cost rises to 5–10 ms per term, and the subsequent posting‑list merge becomes a dominant CPU hotspot.

3. Challenges and Problems

Elasticsearch/Lucene stores posting lists on both memory and disk. A typical query consists of two steps: (1) fetch the posting list for each term, and (2) merge the posting lists (OR operation). The original flow involves:

Locate the block containing the term in the on‑disk TermDictionary.

Read the block into memory.

Decode the posting list for merging.

The decoding and merging stages are CPU‑intensive; profiling shows they dominate the flame graph. The merge complexity is O(K·M + log(K·M)), where M is the number of terms and K the average posting‑list length, leading to millions of iterations in Meituan's scenario.

4. Technical Exploration and Practice

4.1 Posting‑List Query Optimization

Because Meituan’s terms are long integers, exact‑match only, and the term cardinality is predictable, the team replaced the Finite‑State Transducer (FST) index with a hash‑based lookup (HashMap / LongObjectHashMap). Benchmarks on 100 000 <Long, Long> pairs showed:

Memory: FST is most compact; hash maps use more memory.

Query time: FST O(len(term)), hash maps O(1). LongObjectHashMap performed best in latency.

Result: LongObjectHashMap was chosen for term lookup.

4.2 Posting‑List Merge Optimization

The merge originally uses a Bitset‑based DocIdSetBuilder that iterates over every doc ID, switching between sparse ArrayContainer and dense FixedBitSet. Profiling indicated an average of 7 ms per merge on production traffic.

Alternative data structures evaluated:

Native Lucene merge (TermInSetQuery).

Batch merge with FixedBitSet.

RoaringBitmap‑based merge.

RoaringBitmap was already used for query‑cache, but its native implementation could not batch‑merge sparse doc‑ID sets efficiently.

4.3 RLE‑Based Posting‑List Format

To exploit the LBS characteristic that all results for a request are geographically clustered, Meituan applied Index Sorting on city_id, geohash and poi_id. This makes each merchant’s posting list a strictly increasing integer sequence without gaps.

Run‑Length Encoding (RLE) compresses such sequences dramatically: a continuous range [1,2,…,1000] becomes [1,1000] (start, length). In the dense case, memory drops from 200 B to 4 B per posting list.

Implementation details:

public abstract class DocIdSetIterator {
  public abstract int docID();
  public abstract int nextDoc();
  public abstract int advance(int target);
}

Because the original PostingsEnum only supports single‑doc iteration, the team added batch‑read support and designed a new RLEContainer that stores posting lists as [start, length] pairs.

class SparseRoaringDocIdSet {
  int[] index;       // bucket indices that contain data
  Container[] denseSets; // tightly packed posting lists
}

The sparse implementation records only non‑empty buckets, eliminating the overhead of traversing empty buckets during merge.

4.4 Feature Integration

To keep compatibility with future Elasticsearch releases, the new format was implemented as a plug‑in that wraps the existing PostingsFormat. The integration steps are:

Add a hook to obtain the current InternalEngine during index loading.

Iterate all segments, read existing posting data, and generate the RLE‑based posting tables for configured fields.

Associate the RLE tables with their segments so that the search path can retrieve them.

Handle segment deletions and index updates to avoid memory leaks.

The online query path was also wrapped with a custom QueryBuilder that falls back to the native implementation when the RLE index is absent, ensuring safe degradation.

5. Performance Gains

After deploying the combined optimizations (Index Sorting + Dense RoaringBitmap + RLE), Meituan measured:

Full‑link TP99 latency reduced from 100 ms to 16 ms (‑84%).

Data‑node query latency similarly dropped by ~84%.

CPU consumption on the search service decreased dramatically.

Benchmarks on 10 000 posting lists (average length = 100) showed the RLE‑based Dense RoaringBitmap cut merge time by 96% (TP99 from 13 ms to 0.5 ms).

6. Conclusion and Outlook

The article demonstrates a systematic approach: (1) capture a realistic traffic set, (2) profile to locate CPU hotspots, (3) design and benchmark multiple solutions, (4) integrate the winning design as a backward‑compatible plug‑in, and (5) validate in production. The RLE‑based inverted index solved the specific LBS bottleneck, and the methodology can be reused for other Elasticsearch performance problems. Future work includes automatic container selection based on data distribution and handling massive incremental index updates without re‑encoding.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

Java Performance Optimization Elasticsearch Index Sorting RoaringBitmap inverted index Run-Length Encoding

Written by

Meituan Technology Team

Over 10,000 engineers powering China’s leading lifestyle services e‑commerce platform. Supporting hundreds of millions of consumers, millions of merchants across 2,000+ industries. This is the public channel for the tech teams behind Meituan, Dianping, Meituan Waimai, Meituan Select, and related services.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.