Didi's Large‑Scale Elasticsearch Upgrade: Architecture, Migration Strategy, and Performance Gains
This article systematically details Didi's migration of over 30 Elasticsearch clusters, 3,500 nodes and 8 PB of data from version 2.3.3 to 6.6.1, covering background, problem analysis, multi‑version architecture redesign, capacity planning, tiered storage, FastIndex, query replay, upgrade pitfalls, and the resulting cost reduction and performance improvements.
Background – Didi operated more than 30 Elasticsearch clusters with 3,500+ nodes and 8 PB of data on version 2.3.3. The goal was a seamless upgrade to 6.6.1 with zero impact on user queries and writes.
Problem Analysis – Four major domains were identified: engine incompatibility (file format, protocol), user‑side SDK incompatibility, resource constraints for dual‑write migration, and operational challenges such as multi‑cluster management and verification.
Upgrade Approach
Architecture redesign to support both 2.3.3 and 6.x versions via a multi‑version gateway and transparent SDK (a routing sketch follows this list).
Resource preparation with capacity‑planning algorithms, tiered storage (SSD for hot data, Ceph for cold data), and the FastIndex plugin for efficient bulk import.
Operational improvements through an ES cluster management platform and Docker‑based deployment.
Practical steps: batch dual‑write, index migration, query replay, and iterative compatibility fixes.
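The multi‑version gateway mentioned above is easiest to picture as a thin routing layer: the metadata service records which engine version currently serves each index, and the gateway forwards requests accordingly. The sketch below is a minimal illustration under that assumption; the index names, endpoints, and the INDEX_VERSION/CLUSTER_ENDPOINT tables are hypothetical, not Didi's actual implementation.

```python
import requests

# Hypothetical mapping maintained by the metadata service: which engine
# version currently serves each logical index (names are illustrative).
INDEX_VERSION = {"order_log_v1": "2.3.3", "order_log_v2": "6.6.1"}
CLUSTER_ENDPOINT = {
    "2.3.3": "http://es2-cluster:9200",
    "6.6.1": "http://es6-cluster:9200",
}

def route_search(index, dsl):
    """Forward a search request to whichever cluster version owns the index."""
    version = INDEX_VERSION.get(index, "2.3.3")  # default to the legacy cluster
    url = f"{CLUSTER_ENDPOINT[version]}/{index}/_search"
    return requests.post(url, json=dsl, timeout=10).json()

# Callers keep hitting one gateway endpoint; moving an index from 2.3.3 to
# 6.6.1 becomes a metadata change that is invisible to the application.
result = route_search("order_log_v2", {"query": {"match_all": {}}})
```

Because routing is metadata‑driven, traffic can be switched index by index once verification passes, rather than cluster by cluster.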
Detailed Solution
Architecture – Multi‑version support, gateway‑level DSL compatibility, and a suite of services (Cluster Manager, Metadata Service, Admin console) for monitoring, capacity planning, and DSL analysis.
Upgrade Process – Create primary‑backup index pairs, pause writes, migrate historical data, enable dual‑write, perform DSL replay on both versions, compare results, and switch traffic once consistency is verified.
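As a rough illustration of the replay‑and‑compare step, the sketch below runs each captured query against both versions and checks that hit counts agree. The endpoints and the captured_queries list are assumptions for the example; a real comparison would also diff scores, top documents, and aggregation results.

```python
import requests

OLD = "http://es2-cluster:9200"   # 2.3.3 cluster (illustrative endpoint)
NEW = "http://es6-cluster:9200"   # 6.6.1 cluster (illustrative endpoint)

def replay_and_compare(index, dsl):
    """Replay one captured DSL statement on both versions and flag drift."""
    old = requests.post(f"{OLD}/{index}/_search", json=dsl, timeout=30).json()
    new = requests.post(f"{NEW}/{index}/_search", json=dsl, timeout=30).json()
    # In 2.x and 6.x, hits.total is a plain integer (it becomes an object in 7.x).
    old_total, new_total = old["hits"]["total"], new["hits"]["total"]
    return old_total == new_total, old_total, new_total

# captured_queries would come from the gateway's DSL log; one sample shown here.
captured_queries = [("order_log_v1", {"query": {"term": {"status": "paid"}}})]
for index, dsl in captured_queries:
    consistent, old_n, new_n = replay_and_compare(index, dsl)
    print(f"{index}: consistent={consistent} old={old_n} new={new_n}")
```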
Gateway Compatibility – Transparent handling of HTTP/TCP SDK differences, DSL transformation, and a custom Java client to keep legacy TCP usage functional.
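One concrete example of the DSL transformation the gateway has to perform: the 2.x `filtered` query was removed in Elasticsearch 5.0, so legacy requests must be rewritten into a `bool` query before they can reach a 6.6.1 cluster. The helper below is a simplified sketch that handles only the top‑level case; the real gateway covers many more rewrites (mapping types, removed parameters, and so on).

```python
def rewrite_filtered_query(dsl):
    """Rewrite a 2.x top-level 'filtered' query into its 6.x 'bool' equivalent."""
    query = dsl.get("query", {})
    if "filtered" not in query:
        return dsl                               # already 6.x-compatible
    filtered = query["filtered"]
    rewritten = {"bool": {}}
    if "query" in filtered:
        rewritten["bool"]["must"] = filtered["query"]
    if "filter" in filtered:
        rewritten["bool"]["filter"] = filtered["filter"]
    return {**dsl, "query": rewritten}

legacy = {"query": {"filtered": {
    "query":  {"match": {"city": "beijing"}},
    "filter": {"term": {"status": "paid"}}}}}
print(rewrite_filtered_query(legacy))
# -> {'query': {'bool': {'must': {'match': ...}, 'filter': {'term': ...}}}}
```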
Cluster Management Platform – Provides cluster provisioning, scaling, upgrade orchestration, monitoring, and diagnostics.
Metadata Service – Collects cluster/node stats, DSL logs, and drives capacity planning and tiered‑storage decisions.
FastIndex – Offline data import component that reduced a 40‑node, 6‑hour import to 10 nodes and 1.5 hours for a 40 TB tag dataset.
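In node‑hour terms (simple arithmetic on the figures above), that is 40 × 6 = 240 node‑hours before versus 10 × 1.5 = 15 node‑hours after, roughly a 16× reduction in import cost for that dataset.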
Resource Optimizations
Capacity planning algorithm partitions nodes into regions, balances disk usage, and expands racks when high‑water marks are exceeded, raising average disk utilization from 40 % to 60 % (see the sketch after this list).
Tiered storage moves hot data to SSD and cold data to Ceph, expanding usable storage from 5 PB to 8 PB.
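A minimal sketch of how these two mechanisms can work together, assuming a user‑defined box_type allocation attribute that tags hot (SSD) and cold (Ceph‑backed) nodes; the endpoint, thresholds, and index names are illustrative rather than Didi's actual values. Cold indices are demoted by updating their allocation settings through the standard _settings API, while regions whose utilization exceeds the high‑water mark are flagged for expansion.

```python
import requests

GATEWAY = "http://es6-cluster:9200"   # illustrative endpoint
HIGH_WATER_MARK = 0.85                # expand a region above this utilization

def region_utilization(node_stats):
    """node_stats: [(used_bytes, total_bytes), ...] for every node in the region."""
    used = sum(u for u, _ in node_stats)
    total = sum(t for _, t in node_stats)
    return used / total

def demote_to_cold(index):
    """Pin an aged-out index to Ceph-backed nodes via a custom node attribute."""
    settings = {"index.routing.allocation.require.box_type": "ceph"}
    requests.put(f"{GATEWAY}/{index}/_settings", json=settings, timeout=10)

# 20 nodes in the region, each 10 TB with 8 TB used -> 80% utilization.
stats = [(8_000_000_000_000, 10_000_000_000_000)] * 20
if region_utilization(stats) > HIGH_WATER_MARK:
    print("high-water mark exceeded: expand the region with additional racks")

# Independently, indices past their hot period are moved to the Ceph tier.
demote_to_cold("order_log_2019-01")
```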
Upgrade Benefits
Platform upgrades lowered operational cost and improved automation.
Machine count reduced by ~400, saving ~¥800,000 per month.
Query latency dropped 40 %, write throughput increased 30 %.
New features such as sequence numbers, ingest‑node throttling, DCDR replication, and improved shard allocation further enhanced stability and performance.
Conclusion & Outlook – Successful large‑scale ES upgrade demonstrates the importance of controllable architecture, platform usability, cost‑effective resource planning, and deep engine knowledge. Future plans include supporting 1500‑node clusters, multi‑tenant capabilities, cloud‑native ES services, and stronger query optimizations.
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.