Big Data 17 min read

How Didi Doubled Elasticsearch Write Throughput and Cut Server Costs

Didi’s engineering team analyzed a severe write bottleneck in their 3000‑node Elasticsearch cluster, identified long‑tail latency caused by refresh, translog locks, write queues and GC, and applied routing‑aware bulk writes, JVM and Lucene tweaks, and data cleaning to more than double write throughput while slashing server costs.

dbaplus Community
dbaplus Community
dbaplus Community
How Didi Doubled Elasticsearch Write Throughput and Cut Server Costs

Background

Didi operates an Elasticsearch platform serving all internal search, logging, security analytics and metric workloads, with over 3000 nodes, 5 PB of data, more than a trillion documents, peak write throughput of 20 M writes per second and nearly one billion queries per day. After implementing hot‑cold data separation, the cluster began experiencing write rejections and latency spikes.

Write Bottleneck Analysis

Initial profiling showed low CPU (<50 %) and idle I/O, yet write speed plateaued at ~200 k writes/s on a 10‑node index and ~300 k writes/s on 16 nodes, indicating a server‑side bottleneck rather than client resources.

Key observations from bulk request slow‑log entries: items=7014 totalMills=2206 max=2203 avg=37 The maximum latency of a bulk request is determined by the slowest BulkShardRequest, revealing a classic long‑tail effect in distributed systems.

Elasticsearch Write Model

A bulk request is assembled on the client node, routed by the document’s routing value into multiple BulkShardRequest s, each sent asynchronously to a shard’s data node. The client waits for all shard responses before returning.

Root Causes of the Long Tail

Lucene refresh : Refresh operations are performed on write threads, causing occasional long pauses.

Translog ReadWriteLock : Write locks block concurrent writes, sometimes adding >100 ms latency.

Write queue saturation : When the write queue fills, requests wait (e.g., waitTime ≈ 200 ms) before execution.

JVM GC pauses : GC can stall writes for tens to hundreds of milliseconds.

Optimization Strategies

1. Write Model Optimization (Eliminate Long Tail)

Introduce a logical shard layer ( number_of_routing_size) and map a batch of bulk requests to a single physical shard using a consistent random routing value. The mapping formula:

slot = hash(random(value)) % (number_of_shards/number_of_routing_size)
shard_num = hash(_routing) % number_of_routing_size + number_of_routing_size * slot

Client SDK is enhanced to generate a single logical shard per bulk batch, and the server forwards the batch to only that shard, drastically reducing the number of shard round‑trips.

2. Single‑Node Write Capability Boost

Back‑ported community improvements: disable refresh on _flush and _force_merge, and translog sync optimizations ( #45765, #47790), yielding ~18 % gain.

Optional translog disabling (with a dynamic switch) for high‑throughput indices, providing 10‑30 % extra speed.

Lucene write‑flow tweak to avoid synchronous segment flush, adding 7‑10 % improvement.

3. Data Cleaning

Removed two large redundant fields (args, response) from log documents, achieving ~20 % write speed increase and ~10 % storage savings.

Production Results

Write model gains : TPS rose from ~50 k/s to 120 k/s (≈2× increase) without changing hardware.

Write rejection reduction : Queue‑full rejections dropped dramatically, improving overall CPU utilization.

Latency mitigation : End‑to‑end write latency fell substantially, as shown by before/after latency charts.

The combined optimizations allowed Didi to support hot‑cold separation and large‑spec storage on fewer SSD machines, retiring over 400 physical servers and saving roughly ten million dollars annually.

Conclusion

Didi’s deep dive into Elasticsearch’s write path uncovered multiple long‑tail contributors. By routing bulk traffic to a single shard, tuning Lucene and translog behavior, and cleaning data, the team more than doubled write throughput, cut costs, and set the stage for further performance and stability enhancements.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Sign in to view source
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactadmin@besthub.devand we will review it promptly.

Elasticsearchdata optimizationLong TailDidiWrite Performancebulk API
dbaplus Community
Written by

dbaplus Community

Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.