How Meituan Cut Elasticsearch Search Latency by 84% with an RLE‑Based Inverted Index

This article details Meituan's search‑engine team optimization of Elasticsearch for a high‑traffic LBS scenario, describing the performance bottlenecks in term‑posting retrieval and merging, the design of a run‑length‑encoding (RLE) inverted index, its integration as a plugin, extensive benchmarking, and the resulting 84% reduction in TP99 query latency.

dbaplus Community
dbaplus Community
dbaplus Community
How Meituan Cut Elasticsearch Search Latency by 84% with an RLE‑Based Inverted Index

Meituan's food‑delivery search team faced growing latency and CPU load as the number of searchable merchants rose to tens of thousands per query. The primary hotspots were the inverted‑list (posting) retrieval and merge phases of Elasticsearch's term queries, which became CPU‑intensive at scale.

Problem Analysis

In the LBS‑driven workflow, a search request translates to a POST food/_search query with a massive terms filter containing up to hundreds of thousands of merchant IDs. While each term lookup is fast (<0.001 ms), the sheer volume (10 000+ terms) makes the total retrieval cost 5‑10 ms per request, and the subsequent merge of millions of postings dominates CPU usage.

Technical Exploration

The team examined Elasticsearch/Lucene's design: term dictionaries are stored as FSTs, posting lists are read from disk, decoded, and merged using a DocIdSetBuilder that iterates over each doc ID. This process has linear complexity O(K·M) where M is the number of terms and K the average posting length.

To address the two CPU hotspots—posting retrieval and posting merge—the following solutions were evaluated:

Replace FST lookup with a hash‑map (e.g., LongObjectHashMap) for O(1) term lookup.

Adopt RoaringBitmap as the underlying structure for posting sets.

Leverage Index Sorting to make doc IDs contiguous, turning sparse merges into dense sequential operations.

Design a custom Run‑Length Encoding (RLE) posting format that stores each posting list as [start, length] when IDs are contiguous.

Benchmarks showed that LongObjectHashMap offered the fastest term lookup, while an RLE‑based dense RoaringBitmap reduced merge time dramatically (TP99 from 13 ms to 0.5 ms for 10 000 postings of length 100).

RLE‑Based Posting Format

After applying index sorting on city_id, geohash, and poi_id, each merchant’s postings become a strictly increasing integer sequence. Encoding this sequence with RLE compresses it to two integers ( start and length), cutting memory from several kilobytes to a few bytes.

class SparseRoaringDocIdSet {
    int[] index;       // bucket indices that contain data
    Container[] denseSets; // tightly packed posting containers
}

The SparseRoaringDocIdSet records only non‑empty buckets, eliminating the overhead of empty containers during merges.

Integration into Elasticsearch

The solution was packaged as a non‑intrusive plugin:

Hook into the internal engine to load existing segments and replace their posting formats with the RLE‑based format.

Generate RLE posting tables for configured fields and associate them with segments.

Provide a new query builder that falls back to the original path when RLE data is unavailable, ensuring safe degradation.

Performance Gains

After deploying the plugin to production, the full‑stack query latency (TP99) dropped from 100 ms to 16 ms—a reduction of 84%. CPU consumption on data nodes also decreased significantly, as shown in the following charts:

Electronic fence illustration
Electronic fence illustration
CPU flame graph
CPU flame graph
Merge method evolution
Merge method evolution
Merge performance comparison
Merge performance comparison
Elasticsearch query flow
Elasticsearch query flow
Data node latency
Data node latency
Full‑stack latency
Full‑stack latency

Conclusion and Outlook

The RLE‑based inverted index, combined with hash‑map term lookup, dense RoaringBitmap storage, and index sorting, eliminated the dominant CPU hotspots in Meituan’s food‑delivery search service, achieving an 84% latency reduction and lower CPU usage. The approach is applicable to other large‑scale Elasticsearch deployments, especially where term posting lists become dense after sorting. Future work includes automatic container selection (bitmap/array/RLE) based on data distribution and persisting RLE postings at index‑write time to avoid runtime recompression.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Sign in to view source
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactadmin@besthub.devand we will review it promptly.

ElasticsearchRoaringBitmapinverted indexBackend SearchRun-Length Encoding
dbaplus Community
Written by

dbaplus Community

Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.