
How Didi Scales Elasticsearch: Architecture, Performance Tuning, and Cost Optimization

This article details Didi's use of Elasticsearch 7.6, covering its physical-machine deployment, the gateway and control layers, the user console, data-synchronization strategies, and engine iterations such as fine-grained tiered protection, multi-active replication (DCDR), a JDK 17/ZGC performance upgrade, cost optimization, multi-tenant isolation, and security hardening, before closing with the planned migration to ES 8.13.


1. Introduction

Elasticsearch is an open‑source, distributed, RESTful full‑text search engine built on Lucene. Didi migrated its cluster from 2.x to 7.6.0 and uses it for online text search, log analysis, and vector‑search workloads such as map POI lookup, order queries, and customer‑service retrieval‑augmented generation (RAG).

2. Architecture

ES services run on physical machines, while the gateway and control layers are containerised. The control layer provides:

Intelligent segment merge to avoid segment bloat and Full GC on data nodes.

Index lifecycle management and pre-creation of indices to prevent master/client OOM during midnight bulk creation (see the sketch after this list).

Tenant isolation and rate‑limiting at three levels (AppID, index‑template, DSL).

SQL support built on an open‑source ES‑SQL engine.
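Index pre-creation in practice is a scheduled job that creates the next day's dated index before midnight, so the 00:00 write burst never triggers a storm of index-creation and mapping-update tasks on the master. A minimal sketch with the 7.x Python client; the index naming pattern and settings are illustrative, not Didi's actual tooling:

```python
from datetime import datetime, timedelta

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

def precreate_tomorrow(template_name: str, shards: int = 3) -> None:
    """Create tomorrow's dated index ahead of time so the midnight write
    burst does not flood the master with index-creation tasks."""
    day = (datetime.utcnow() + timedelta(days=1)).strftime("%Y.%m.%d")
    index = f"{template_name}-{day}"          # e.g. app_log-2024.06.02
    if not es.indices.exists(index=index):
        es.indices.create(
            index=index,
            body={"settings": {"number_of_shards": shards,
                               "number_of_replicas": 1}},
        )

precreate_tomorrow("app_log")
```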

The gateway forwards read/write traffic, rewrites BKD queries to range/equality queries, and enforces the rate‑limiting policies.
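As a rough illustration of the rate-limiting side, the gateway check can be pictured as one token bucket per AppID consulted before a request is proxied to the cluster; the AppIDs and limits below are hypothetical and only show the shape of the check:

```python
import time
from collections import defaultdict

class TokenBucket:
    """Refill at `rate` tokens per second, up to `burst` tokens."""
    def __init__(self, rate: float, burst: float):
        self.rate, self.burst = rate, burst
        self.tokens, self.last = burst, time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

# One bucket per AppID; the numbers are made up for illustration.
buckets = defaultdict(lambda: TokenBucket(rate=200, burst=400))

def admit(app_id: str) -> bool:
    """Gateway-side check run before forwarding a search or bulk request."""
    return buckets[app_id].allow()
```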

[Figure: Architecture diagram]

3. User Console

The console enables business teams to manage applications and indices:

Application Management: Request an AppID to obtain read/write permissions.

Index Management: Create indices, request read/write rights, modify mappings, and perform index cleanup or decommissioning.

Search can be performed via Kibana, DSL, or SQL. The console also shows anomaly and slow‑query analysis and monitors index metadata (document count, size) and read/write rates.
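As a point of reference for the DSL path, a lookup that a business team might run from the console could be issued through the standard 7.x Python client as follows; the orders index, its fields, and the SQL equivalent in the comment are illustrative only:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Roughly the SQL form: SELECT * FROM orders
#   WHERE passenger_id = '42' ORDER BY create_time DESC LIMIT 10
resp = es.search(
    index="orders",
    body={
        "query": {"term": {"passenger_id": "42"}},
        "sort": [{"create_time": {"order": "desc"}}],
        "size": 10,
    },
)
for hit in resp["hits"]["hits"]:
    print(hit["_source"])
```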

[Figure: User console UI]

4. Operations & Control Platform

Provides day-to-day tooling for R&D and SRE engineers:

Cluster Management: Logical view of clusters that may span dozens of physical machines.

Tenant Management: Tenant metadata and tenant‑level rate limiting.

Template Management: Index‑template metadata, shard‑count adjustments via version upgrades, and template‑level throttling.

Anomaly Analysis: Template, slow‑query, and general anomaly detection.

Operation Logs: Records of user actions and scheduled tasks.

[Figure: Operations platform UI]

5. Application Scenarios

Online full‑text search (e.g., map POI lookup).

Real‑time MySQL snapshots for order queries.

One‑stop log search via Kibana (trace logs).

Time‑series analysis for security monitoring.

Simple OLAP for internal dashboards.

Vector search for customer‑service RAG.
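ES 7.x has no native ANN search, so RAG-style retrieval on 7.6 typically means exact scoring over a dense_vector field with a script_score query (assuming the dense_vector type is available in the distribution). A sketch; the index name, field names, and vector dimensions are illustrative:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

query_vec = [0.12, -0.33, 0.07, 0.91]        # embedding of the user question (dims illustrative)

resp = es.search(
    index="kb_chunks",                        # hypothetical knowledge-base index
    body={
        "size": 5,
        "query": {
            "script_score": {
                "query": {"match_all": {}},
                "script": {
                    # exact cosine similarity against the stored dense_vector field
                    "source": "cosineSimilarity(params.v, 'embedding') + 1.0",
                    "params": {"v": query_vec},
                },
            }
        },
    },
)
top_chunks = [hit["_source"]["text"] for hit in resp["hits"]["hits"]]
```

The `+ 1.0` keeps scores non-negative, which script_score requires.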

6. Deployment Model

Physical‑machine + small‑cluster deployment, supporting up to ~100 physical nodes.

7. Data Synchronization

Two main approaches are used:

Real-time sync: Log → ES, Kafka/Pulsar → ES, and MySQL → ES via a unified DSINK tool built on Flink (a rough illustration of this path follows after this list).

Offline sync: Hive → ES using batch load, MapReduce‑generated Lucene files, and a custom ES AppendLucene plugin.
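Didi's DSINK tool is Flink-based and not described in code here; purely as an illustration of the real-time path's shape, a Kafka-to-ES consumer that batches bulk writes might look like the following (topic, index, and ID field are hypothetical):

```python
import json

from elasticsearch import Elasticsearch, helpers
from kafka import KafkaConsumer          # kafka-python

es = Elasticsearch("http://localhost:9200")
consumer = KafkaConsumer(
    "order_binlog",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda v: json.loads(v),
)

def to_actions(batch):
    """Turn consumed messages into bulk index actions keyed by order_id."""
    for msg in batch:
        doc = msg.value
        yield {"_op_type": "index", "_index": "orders",
               "_id": doc["order_id"], "_source": doc}

batch = []
for msg in consumer:
    batch.append(msg)
    if len(batch) >= 500:                # flush in bulk to amortise request overhead
        helpers.bulk(es, to_actions(batch))
        batch.clear()
```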

[Figure: Data sync diagram]

8. Engine Iterations

8.1 Fine‑Grained Tiered Protection

Four protection levels (log, public, independent, dual‑active) isolate failures. Client‑node isolation separates read/write paths, and data‑node region isolation uses label‑based migration to move a problematic index away from healthy nodes.
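In stock Elasticsearch, this kind of label-based migration maps naturally onto shard allocation filtering: data nodes carry a custom attribute (node.attr.region, for example), and a misbehaving index can be pinned to an isolation region by updating its routing settings. A sketch assuming nodes tagged with a region attribute; this is the generic mechanism, not necessarily Didi's exact implementation:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

def quarantine_index(index: str, region: str = "isolated") -> None:
    """Move all shards of a problematic index onto nodes tagged
    node.attr.region=<region>, away from the healthy pool."""
    es.indices.put_settings(
        index=index,
        body={"index.routing.allocation.require.region": region},
    )

quarantine_index("noisy_template-2024.06.01")
```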

8.2 Multi‑Active Replication (DCDR)

Didi Cross‑Datacenter Replication (DCDR) is a self‑developed push‑based replication that copies data from a leader index to a follower index with strong consistency. Key mechanisms include:

Checkpointing to avoid full data copy.

Sequence numbers to guarantee update ordering.

Write queue to prevent OOM during large transfers.
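DCDR itself is proprietary, but how the three mechanisms fit together can be sketched abstractly: the leader pushes operations tagged with sequence numbers, the follower applies them in order, persists a checkpoint so reconnection resumes from the last applied sequence number instead of a full copy, and a bounded inbox applies back-pressure rather than buffering without limit. Illustrative pseudocode only, not Didi's implementation:

```python
import queue
from dataclasses import dataclass

@dataclass
class Op:
    seq_no: int            # global ordering assigned on the leader
    doc_id: str
    source: dict

class Follower:
    def __init__(self, max_inflight: int = 10_000):
        self.checkpoint = -1                                 # last applied seq_no (persisted in practice)
        self.inbox = queue.Queue(maxsize=max_inflight)       # bounded: back-pressure instead of OOM

    def receive(self, op: Op) -> None:
        self.inbox.put(op)                                   # blocks the pusher when the queue is full

    def apply_next(self) -> None:
        op = self.inbox.get()
        if op.seq_no <= self.checkpoint:                     # duplicate delivery, e.g. after a retry
            return
        self.index_locally(op)                               # write into the follower index
        self.checkpoint = op.seq_no                          # resume point after restart / reconnect

    def index_locally(self, op: Op) -> None:
        pass                                                 # placeholder for the actual index write
```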

Reference URL: https://mp.weixin.qq.com/s?__biz=MzU1ODEzNjI2NA==&mid=2247560579&idx=1&sn=bf11ced50067bb985a84213b4a1252db

8.3 Performance Upgrade: JDK 17 + ZGC

JDK 11 with G1 produced P99 GC pauses above 180 ms. Testing showed that ZGC keeps pauses below 10 ms, while JDK 17 with G1 alone already improves throughput by roughly 15 %.

Upgrade steps:

Upgrade ES JVM to JDK 17.

Migrate the GC algorithm from G1 to ZGC on latency-critical clusters (see the flags sketch after this list).

Upgrade Groovy syntax, refactor plugins, update dependencies.

Deploy ZGC monitoring metrics.
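For the G1-to-ZGC step, the switch itself comes down to JVM flags in Elasticsearch's jvm.options; a minimal sketch, with heap size and log path as placeholders rather than recommendations (ES requires Xms and Xmx to match):

```
-XX:+UseZGC
-Xms30g
-Xmx30g
-Xlog:gc*:file=logs/gc.log:time,level,tags
```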

Results:

Payment cluster P99 reduced from 800 ms to 30 ms (96 % reduction).

Average query latency dropped from 25 ms to 6 ms (75 % reduction).

Log cluster write throughput improved 15‑20 %.

Reference URL: https://mp.weixin.qq.com/s?__biz=MzU1ODEzNjI2NA==&mid=2247560766&idx=1&sn=5023e43660bffc5ad337affd00a7e9c0

8.4 Cost Optimization

Optimize index mappings to disable unnecessary inverted/forward indexes (see the mapping sketch after this list).

Enable ZSTD compression, reducing CPU usage by ~15 %.

Integrate with a data‑asset management platform to prune unused partitions and indices.
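Disabling unused structures is done per field in the mapping: fields that are never searched can set "index": false, fields that are never sorted or aggregated on can set "doc_values": false, and display-only fields can rely on _source alone. A minimal sketch with illustrative field names:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

es.indices.create(
    index="order_detail",
    body={
        "mappings": {
            "properties": {
                "order_id":   {"type": "keyword"},                       # searched and aggregated
                "trace_id":   {"type": "keyword", "doc_values": False},  # looked up, never aggregated
                "extra_json": {"type": "keyword", "index": False,
                               "doc_values": False},                     # display-only, read from _source
            }
        }
    },
)
```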

Reference URL: https://mp.weixin.qq.com/s?__biz=MzU1ODEzNjI2NA==&mid=2247560768&idx=1&sn=46cbb75b7294a97e7b0ff8b7bcba9b8c

8.5 Multi‑Tenant Resource Isolation

A thread‑pool‑group model, inspired by Presto Resource Groups, splits the original search thread pool into multiple sub‑pools, each configured per AppID with dedicated thread counts and queue sizes. This prevents a single tenant from exhausting CPU resources.
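The idea is easiest to see outside of ES internals: rather than one shared search pool, each AppID gets its own executor with a fixed worker count and a bounded number of admission slots, so one tenant's burst queues or gets rejected inside its own group instead of starving everyone else. A conceptual sketch, not the actual ES thread-pool code; the AppIDs and pool sizes are made up:

```python
import threading
from concurrent.futures import ThreadPoolExecutor

class ThreadPoolGroup:
    """Per-AppID pool: fixed workers plus a bounded count of queued requests."""
    def __init__(self, threads: int, queue_size: int):
        self.workers = ThreadPoolExecutor(max_workers=threads)
        self.slots = threading.BoundedSemaphore(threads + queue_size)

    def submit(self, fn, *args):
        if not self.slots.acquire(blocking=False):
            raise RuntimeError("tenant queue full")          # 429-style rejection
        future = self.workers.submit(fn, *args)
        future.add_done_callback(lambda _: self.slots.release())
        return future

# One group per AppID, sized to the tenant's quota (numbers are illustrative).
groups = {
    "map_poi":  ThreadPoolGroup(threads=16, queue_size=200),
    "order_es": ThreadPoolGroup(threads=8,  queue_size=100),
}

def search(app_id: str, query_fn):
    return groups[app_id].submit(query_fn)
```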

[Figure: Thread pool group diagram]

8.6 Security Enhancements

Two authentication layers are implemented:

Gateway level: AppID‑based authentication.

Node level: Authentication for client, data, and master nodes.

A lightweight security plugin performs HTTP‑level string validation, supports rolling restarts, and can be toggled on/off instantly.
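At the gateway level the check boils down to validating an AppID and its secret on every HTTP request before it is proxied to the cluster. The header names and credential store below are hypothetical; a real deployment would pull secrets from the tenant-metadata service:

```python
import hmac

# Hard-coded here for illustration; in practice backed by tenant metadata.
APP_SECRETS = {"map_poi": "s3cr3t-token"}

def authenticate(headers: dict) -> bool:
    """Reject the request at the gateway before it reaches any ES node."""
    app_id = headers.get("X-App-Id")
    token = headers.get("X-App-Token", "")
    expected = APP_SECRETS.get(app_id)
    # Constant-time comparison avoids leaking the secret through timing.
    return expected is not None and hmac.compare_digest(token, expected)

assert authenticate({"X-App-Id": "map_poi", "X-App-Token": "s3cr3t-token"})
assert not authenticate({"X-App-Id": "map_poi", "X-App-Token": "wrong"})
```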

Reference URL: https://mp.weixin.qq.com/s?__biz=MzU1ODEzNjI2NA==&mid=2247560768&idx=2&sn=a2ecd43e6f417e750428204f316ecd9a

8.7 Stability Governance

Stability work follows a three‑stage approach: prevention, rapid detection/mitigation, and post‑mortem analysis.

Annual “mine‑clearing” projects fixed 61 issues (e.g., Gateway Full GC, ThreadLocal leaks).

Comprehensive monitoring of hardware, shard metrics, master pending tasks, and MQ lag.

Grafana dashboards for node, shard, and cluster health.

Self‑healing mechanisms for disk failures and query spikes; dual‑active traffic switchover drills.

AppID‑based rate limiting at the gateway (by index template and DSL) to protect core services.

[Figure: Stability dashboard]

9. Summary & Outlook

Didi’s ES 7.6 deployment now serves all internal online search workloads and is a benchmark for the big‑data architecture team. The next phase targets an upgrade to ES 8.13, which offers:

Improved master node performance.

Automatic disk‑balancing based on load.

Reduced segment memory footprint.

New features such as ANN vector search.

Performance goals for the upgrade include optimizing write throughput for update‑heavy workloads and refining query‑time merge strategies, while exploring machine‑learning capabilities introduced in newer ES versions.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

Tags: Performance, Elasticsearch, Search, DataSync, Multitenant
Written by

dbaplus Community

Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.