How Didi Doubled Elasticsearch Write Throughput and Cut Server Costs
Didi’s engineering team analyzed a severe write bottleneck in their 3000‑node Elasticsearch cluster, identified long‑tail latency caused by refresh, translog locks, write queues and GC, and applied routing‑aware bulk writes, JVM and Lucene tweaks, and data cleaning to more than double write throughput while slashing server costs.
Background
Didi operates an Elasticsearch platform serving all internal search, logging, security analytics and metric workloads, with over 3000 nodes, 5 PB of data, more than a trillion documents, peak write throughput of 20 M writes per second and nearly one billion queries per day. After implementing hot‑cold data separation, the cluster began experiencing write rejections and latency spikes.
Write Bottleneck Analysis
Initial profiling showed low CPU (<50 %) and idle I/O, yet write speed plateaued at ~200 k writes/s on a 10‑node index and ~300 k writes/s on 16 nodes, indicating a server‑side bottleneck rather than client resources.
Key observations from bulk request slow‑log entries: items=7014 totalMills=2206 max=2203 avg=37 The maximum latency of a bulk request is determined by the slowest BulkShardRequest, revealing a classic long‑tail effect in distributed systems.
Elasticsearch Write Model
A bulk request is assembled on the client node, routed by the document’s routing value into multiple BulkShardRequest s, each sent asynchronously to a shard’s data node. The client waits for all shard responses before returning.
Root Causes of the Long Tail
Lucene refresh : Refresh operations are performed on write threads, causing occasional long pauses.
Translog ReadWriteLock : Write locks block concurrent writes, sometimes adding >100 ms latency.
Write queue saturation : When the write queue fills, requests wait (e.g., waitTime ≈ 200 ms) before execution.
JVM GC pauses : GC can stall writes for tens to hundreds of milliseconds.
Optimization Strategies
1. Write Model Optimization (Eliminate Long Tail)
Introduce a logical shard layer ( number_of_routing_size) and map a batch of bulk requests to a single physical shard using a consistent random routing value. The mapping formula:
slot = hash(random(value)) % (number_of_shards/number_of_routing_size)
shard_num = hash(_routing) % number_of_routing_size + number_of_routing_size * slotClient SDK is enhanced to generate a single logical shard per bulk batch, and the server forwards the batch to only that shard, drastically reducing the number of shard round‑trips.
2. Single‑Node Write Capability Boost
Back‑ported community improvements: disable refresh on _flush and _force_merge, and translog sync optimizations ( #45765, #47790), yielding ~18 % gain.
Optional translog disabling (with a dynamic switch) for high‑throughput indices, providing 10‑30 % extra speed.
Lucene write‑flow tweak to avoid synchronous segment flush, adding 7‑10 % improvement.
3. Data Cleaning
Removed two large redundant fields (args, response) from log documents, achieving ~20 % write speed increase and ~10 % storage savings.
Production Results
Write model gains : TPS rose from ~50 k/s to 120 k/s (≈2× increase) without changing hardware.
Write rejection reduction : Queue‑full rejections dropped dramatically, improving overall CPU utilization.
Latency mitigation : End‑to‑end write latency fell substantially, as shown by before/after latency charts.
The combined optimizations allowed Didi to support hot‑cold separation and large‑spec storage on fewer SSD machines, retiring over 400 physical servers and saving roughly ten million dollars annually.
Conclusion
Didi’s deep dive into Elasticsearch’s write path uncovered multiple long‑tail contributors. By routing bulk traffic to a single shard, tuning Lucene and translog behavior, and cleaning data, the team more than doubled write throughput, cut costs, and set the stage for further performance and stability enhancements.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
dbaplus Community
Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
