How We Migrated a PB‑Scale Elasticsearch Cluster Across Data Centers Without Downtime
This article details the end‑to‑end migration of Qunar's massive Elasticsearch logging cluster from a saturated data‑center to a new facility, covering background constraints, migration planning, manual and automated node moves, performance tuning parameters, observed challenges, and the final zero‑downtime results.
Background
Qunar's real‑time log platform uses an ELK stack where the Elasticsearch (ES) cluster and Kibana reside in Data Center A, while Logstash runs in Data Center B. Data Center A was saturated, making it impossible to add new machines, and cross‑data‑center traffic caused performance spikes.
Migration Goal
To resolve capacity and network issues, the team decided to move the entire ES cluster to Data Center B, colocating it with Logstash to improve network reliability.
Architecture Overview
The platform ingests logs via Filebeat/Fluent‑Bit into Kafka, then Logstash and Flink write them to ES indices (one index per appcode). The cluster consists of master, data, coordinating, Kibana, and service nodes, with over 500 ES nodes, PB‑scale storage, and trillions of documents.
Key Migration Challenges
Ensuring service availability and zero user impact during migration.
Improving migration speed for PB‑level data (single machine >10 TB).
Reducing manual effort and human error.
Migration Plan
The overall plan was to sort machines in Data Center A, migrate nodes batch‑by‑batch, shut down the old service, redeploy in Data Center B, and repeat until completion.
1. Manual Migration (November)
Manually exclude nodes using the cluster‑level setting.
Adjust parameters to keep the cluster stable.
Collect issues and define next steps.
```
PUT _cluster/settings
{
  "transient": {
    "cluster.routing.allocation.exclude._name": "data1_node1,data2_node1,...,data2_node5"
  }
}
```
During this phase the team observed high load spikes during write peaks because excluded nodes caused many relocating shards, increasing disk I/O.
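The exclusion request can be generated per batch rather than typed by hand. The following is a minimal sketch in Python; the helper name `build_exclude_settings` and the example node names are hypothetical, and only the setting key comes from the request shown above. Sending the body to the cluster is left out.

```python
import json


def build_exclude_settings(node_names):
    """Build the body for PUT _cluster/settings that excludes the given nodes.

    Hypothetical helper: the article only shows the raw request.
    """
    if not node_names:
        raise ValueError("at least one node name is required")
    # The setting value is a comma-separated list of node names.
    return {
        "transient": {
            "cluster.routing.allocation.exclude._name": ",".join(node_names)
        }
    }


# Example node names, for illustration only.
body = build_exclude_settings(["data1_node1", "data2_node1"])
print(json.dumps(body, indent=2))
```

Keeping the node list in code makes each batch reviewable before it is applied, which addresses the human-error concern listed among the challenges.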
2. Automated Migration (Nov – Jan)
After the manual phase, an automation framework was built to drive the migration based on cluster health, load, and relocating shard count.
Only proceed when the cluster status is green and the load thresholds are met: at most 7 nodes with load > 30 and at most 3 nodes with load > 50.
Check that relocating shards ≤40 before starting a new batch.
After a batch finishes, bring the nodes offline.
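The gating rules above can be expressed as a single check. This is a sketch under stated assumptions, not the team's framework: the function name `can_start_batch` and its parameters are hypothetical, and the article does not say how per-node load was collected. The `status` and `relocating_shards` inputs correspond to the fields of the same name returned by `GET _cluster/health`.

```python
def can_start_batch(status, node_loads, relocating_shards,
                    max_nodes_over_30=7, max_nodes_over_50=3,
                    max_relocating=40):
    """Return True when a new migration batch may start.

    status            -- cluster health status string ("green", "yellow", "red")
    node_loads        -- iterable of per-node load values (collection method assumed)
    relocating_shards -- number of shards currently relocating
    """
    if status != "green":
        return False
    loads = list(node_loads)
    over_30 = sum(1 for load in loads if load > 30)
    over_50 = sum(1 for load in loads if load > 50)
    # Too many busy nodes: wait before adding more relocation work.
    if over_30 > max_nodes_over_30 or over_50 > max_nodes_over_50:
        return False
    return relocating_shards <= max_relocating


# Illustrative values only.
loads = [12.0] * 490 + [35.0] * 6 + [55.0]
print(can_start_batch("green", loads, relocating_shards=20))
```

Exposing the thresholds as keyword arguments makes it easy to tighten them during write peaks without changing the logic.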
```
POST _cluster/reroute
{
  "commands": [
    {
      "move": {
        "index": "log_appcode-2023.18",
        "shard": 59,
        "from_node": "data2_node1",
        "to_node": "data2_node10"
      }
    }
  ]
}
```
Adjusting batch size based on peak periods (2 nodes during high load, 5 nodes during low load) kept the cluster stable.
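The batch sizing rule (2 nodes during high load, 5 during low load) could be sketched as follows. The function names are hypothetical, and how the team decided whether the cluster was in a peak period is not stated, so it is passed in as a plain boolean here.

```python
def batch_size(is_peak, peak_size=2, off_peak_size=5):
    """Return how many nodes to migrate in the next batch."""
    return peak_size if is_peak else off_peak_size


def next_batch(pending_nodes, is_peak):
    """Split the next batch off the front of the pending node list.

    Returns (batch, remaining) without modifying the input list.
    """
    size = batch_size(is_peak)
    return pending_nodes[:size], pending_nodes[size:]


# Hypothetical node names, for illustration only.
pending = ["node1", "node2", "node3", "node4", "node5", "node6"]
batch, remaining = next_batch(pending, is_peak=True)
print(batch, remaining)
```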
3. Iterative Optimisation (Jan – Feb)
Further tuning focused on shard allocation and node failure handling:
total_shards_per_node: limit the number of shards per node to avoid skew. The limit was computed as:
total_shards_per_node = shard_num / (nodes_count * 0.95 * 0.5)
index.unassigned.node_left.delayed_timeout: increase to 120 minutes, or use a random value (100‑300 min), to spread recovery load.
```
PUT /index/_settings
{
  "settings": {
    "index.unassigned.node_left.delayed_timeout": "120m"
  }
}
```
cluster.routing.allocation.cluster_concurrent_rebalance: set to 0 to stop automatic rebalancing during migration, then perform a manual reroute when needed.
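The shard limit formula and the randomized timeout can be computed with a few lines of Python. This is a sketch only: the function names are hypothetical, rounding the limit up is an assumption (the article does not say how fractional results were handled), and `shard_num` and `nodes_count` are taken as given without further interpretation.

```python
import math
import random


def total_shards_per_node(shard_num, nodes_count):
    """Apply the article's formula: shard_num / (nodes_count * 0.95 * 0.5)."""
    if nodes_count <= 0:
        raise ValueError("nodes_count must be positive")
    # Rounding up is an assumption; the source gives only the formula.
    return math.ceil(shard_num / (nodes_count * 0.95 * 0.5))


def random_delayed_timeout(rng=random, low_min=100, high_min=300):
    """Pick a delayed_timeout between 100 and 300 minutes as a time value string."""
    return f"{rng.randint(low_min, high_min)}m"


def delayed_timeout_settings(timeout):
    """Build the body for PUT /<index>/_settings with the given timeout."""
    return {"settings": {"index.unassigned.node_left.delayed_timeout": timeout}}


# Illustrative inputs only.
print(total_shards_per_node(shard_num=60, nodes_count=20))
print(delayed_timeout_settings(random_delayed_timeout()))
```

Drawing a different timeout per index means that, if a node drops out, the delayed shard recoveries do not all start at the same moment.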
These changes reduced peak load, lowered write‑backlog, and prevented node‑left failures.
Results
Migration speed more than doubled; the whole node set was moved a week earlier than planned.
Automation eliminated most manual effort.
Load during peak hours stayed within acceptable limits, and write‑backlog dropped significantly.
Conclusion
The multi‑stage approach—initial manual trials, automated batch migrations, and targeted parameter tuning—allowed Qunar to relocate a PB‑scale Elasticsearch cluster with zero downtime. The lessons on shard allocation, delayed recovery, and controlled rebalance are applicable to any large‑scale ES or similar distributed datastore migration.
dbaplus Community