Upgrading Didi Elasticsearch to JDK 17 with ZGC: Challenges, Solutions, and Performance Gains
Didi upgraded its self‑developed Elasticsearch from JDK 11/G1 to JDK 17, adopting ZGC for latency‑critical clusters and tuned G1 for throughput, which eliminated long GC pauses, reduced query latency by up to 96%, cut CPU usage, and dramatically improved stability across multiple production clusters.
The article introduces the background of Didi's self‑developed Elasticsearch (ES) strong‑consistency multi‑active solution and explains why upgrading the ES runtime from JDK 11 (using G1 GC) to JDK 17 (with ZGC) was necessary to improve query performance and eliminate query spikes.
Background
In 2020 Didi upgraded its ES from version 2.x to 7.6.0, which runs on JDK 11 with the G1 garbage collector. Two main workload types exist: log‑heavy write‑intensive workloads (CPU usage ~85% at peak) and non‑log workloads (e.g., POI search, orders, payments) that demand low latency and high query stability.
As data volume grew, GC‑induced latency and instability became critical bottlenecks. The main problems were:
Long‑lasting Young‑generation GC pauses (hundreds of milliseconds, sometimes >1 s) especially in large‑memory clusters (e.g., 112 GB data nodes).
Frequent Full GC events caused massive stop‑the‑world pauses, leading to query timeouts and node disconnections.
JDK 11‑G1 memory reclamation issues that could not be solved by tuning G1 parameters.
Why JDK 17?
In early 2022 Didi launched a “sweeping” project to build a higher‑performance, lower‑latency, more stable ES search engine. Tests showed that ZGC on JDK 17 could keep GC pause times under 10 ms, effectively eliminating the query spikes caused by GC.
For high‑throughput log scenarios, JDK 17‑G1 also delivered a 15 % GC performance improvement and brought additional optimizations such as vectorization and better string handling.
Key Migration Steps
Gradle version upgrade: ES uses Gradle for project management. The original ES 7.6.0 build used Gradle 6.0 (compatible with JDK 11). To build and run on JDK 17, Gradle was upgraded to ≥7.3, which required syntax updates, plugin-loading changes, and a Groovy upgrade (2.x → 3.0.7). For reference, the extended GC stats reported by a node running under ZGC look like this:

```json
{
  "gc": {
    "collectors": {
      "ZGC Cycles": {
        "collection_count": 242,
        "collection_time_in_millis": 97209
      },
      "ZGC Pauses": {
        "collection_count": 726,
        "collection_time_in_millis": 27
      },
      "ZGC AllocationStallCount": {
        "collection_count": 0,
        "collection_time_in_millis": 0
      }
    }
  }
}
```
Source compilation fixes: replaced method references with lambda expressions to work around a javac compilation bug; upgraded dependent libraries (e.g., Jetty → 9+, Jackson → 2.3.3); rewrote Groovy plugins in Java where the old syntax was no longer valid; added missing annotations such as @TaskAction.
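The method-reference workaround can be illustrated with a small, hypothetical call site (the actual fixes are scattered across the ES sources; this only shows the shape of the change):

```java
import java.util.List;
import java.util.function.Function;

public class MethodRefWorkaround {
    // A generic helper whose type inference can interact badly with
    // method references under an affected javac version.
    static int totalLength(List<String> items, Function<String, Integer> measure) {
        return items.stream().map(measure).mapToInt(Integer::intValue).sum();
    }

    public static void main(String[] args) {
        // Before (tripped the compiler bug):
        //   int n = totalLength(List.of("es", "zgc"), String::length);
        // After: a semantically equivalent explicit lambda
        int n = totalLength(List.of("es", "zgc"), s -> s.length());
        System.out.println(n); // prints 5
    }
}
```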
Building a ZGC monitoring system: extended the ES metrics to expose ZGC Cycles and ZGC Pauses, and implemented a custom GC event listener to capture Allocation Stall events (which are not reported by default). Alerts were configured for G1 old-generation usage, Full GC occurrences, and ZGC Allocation Stalls.
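The per-collector counters in the extended metrics map directly onto the JVM's GarbageCollectorMXBeans; a minimal sketch of reading them (under -XX:+UseZGC on JDK 17 the beans are named "ZGC Cycles" and "ZGC Pauses"):

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.List;

public class GcMetrics {
    public static void main(String[] args) {
        List<GarbageCollectorMXBean> beans =
                ManagementFactory.getGarbageCollectorMXBeans();
        for (GarbageCollectorMXBean gc : beans) {
            // The name/count/time triplet matches the collection_count and
            // collection_time_in_millis fields in the extended node stats.
            System.out.printf("%s: count=%d time=%dms%n",
                    gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
        }
    }
}
```

Allocation Stalls are not surfaced through these beans, which is presumably why a custom listener was needed; one way to capture them is JFR event streaming on the jdk.ZAllocationStall event.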
Production Pitfalls & Optimizations
Allocation Stalls caused by rapid allocation rates were mitigated by increasing the heap (e.g., from 31 GB to 64 GB), raising -XX:ZAllocationSpikeTolerance, and scaling concurrent GC threads (-XX:ConcGCThreads).
Enabled a dynamic GC thread count (-XX:+UseDynamicNumberOfGCThreads) to reduce CPU pressure.
Disabled NUMA auto-balancing to avoid unpredictable stalls.
Applied compressed class pointers (-XX:+UseCompressedClassPointers) on heaps ≤31 GB.
Used -XX:SoftMaxHeapSize and -XX:ZUncommitDelay to return unused memory to the OS during bulk data updates.
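Put together, a jvm.options fragment for a latency-sensitive ZGC node might look like the following sketch (all values are illustrative, not the article's exact settings):

```
## Illustrative ZGC settings for a latency-sensitive node (jvm.options style)
-Xms32g
-Xmx64g
-XX:+UseZGC
## tolerate faster allocation bursts before stalling (default: 2)
-XX:ZAllocationSpikeTolerance=5
## concurrent GC threads, resized dynamically by the JVM
-XX:ConcGCThreads=8
-XX:+UseDynamicNumberOfGCThreads
## shrink toward this target and uncommit idle memory after 300 s
-XX:SoftMaxHeapSize=48g
-XX:ZUncommitDelay=300
```

Note that ZGC only uncommits memory above the minimum heap size, so -Xms is kept below -Xmx here to let -XX:ZUncommitDelay take effect.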
Online Results
After three months of migration and tuning, JDK 17 was deployed on 15 ES clusters, delivering:
Significantly lower latency: P99 query latency dropped from 800 ms to 30 ms (a 96% reduction) in one business cluster, with reductions of roughly 50% in other clusters.
Higher throughput: CPU usage decreased by 20 % in a log‑heavy cluster, write‑reject rates fell, and write‑queue backlogs were eliminated.
Improved stability: Memory‑related alerts virtually disappeared; allocation‑stall monitoring prevented query spikes.
Overall, JDK 17 with ZGC (for latency‑sensitive clusters) and G1 (for throughput‑oriented clusters) provided a balanced solution, delivering lower latency, higher performance, and greater stability for Didi’s Elasticsearch services.
Didi Tech
Official Didi technology account