How We Migrated a PB‑Scale Elasticsearch Cluster Across Data Centers Without Downtime

This article details the end‑to‑end migration of Qunar's massive Elasticsearch logging cluster from a saturated data‑center to a new facility, covering background constraints, migration planning, manual and automated node moves, performance tuning parameters, observed challenges, and the final zero‑downtime results.

dbaplus Community

Jun 13, 2023

Background

Qunar's real‑time log platform uses an ELK stack where the Elasticsearch (ES) cluster and Kibana reside in Data Center A, while Logstash runs in Data Center B. Data Center A was saturated, making it impossible to add new machines, and cross‑data‑center traffic caused performance spikes.

Migration Goal

To resolve capacity and network issues, the team decided to move the entire ES cluster to Data Center B, colocating it with Logstash to improve network reliability.

Architecture Overview

The platform ingests logs via Filebeat/Fluent‑Bit into Kafka; Logstash and Flink then write them to ES indices (one index per appcode). The cluster consists of master, data, coordinating, Kibana, and service nodes, with over 500 ES nodes, PB‑scale storage, and trillions of documents.

Key Migration Challenges

Ensuring service availability and zero user impact during migration.

Improving migration speed for PB‑level data (more than 10 TB on a single machine).

Reducing manual effort and human error.

Migration Plan

The overall plan was to sort machines in Data Center A, migrate nodes batch‑by‑batch, shut down the old service, redeploy in Data Center B, and repeat until completion.

1. Manual Migration (November)

Manually exclude nodes using the cluster‑level setting.

Adjust parameters to keep the cluster stable.

Collect issues and define next steps.

PUT _cluster/settings
{
  "transient": {
    "cluster.routing.allocation.exclude._name": "data1_node1,data2_node1,...,data2_node5"
  }
}
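As a rough illustration of how a batch of node names could be turned into the request above, here is a minimal Python sketch. The function name `build_exclude_request` and the example node names are hypothetical; only the setting key and the endpoint come from the article. The sketch builds the request and does not send it.

```python
import json


def build_exclude_request(node_names):
    """Return (method, path, body) that excludes the given nodes from allocation.

    Hypothetical helper: it only assembles the request shown in the article.
    """
    if not node_names:
        raise ValueError("node_names must not be empty")
    body = {
        "transient": {
            # Comma-separated node names, as in the article's example
            "cluster.routing.allocation.exclude._name": ",".join(node_names)
        }
    }
    return "PUT", "/_cluster/settings", body


method, path, body = build_exclude_request(["data1_node1", "data2_node1"])
print(method, path)
print(json.dumps(body, indent=2))
```

Keeping request construction separate from the HTTP call makes the batch contents easy to log and review before anything is applied to the cluster.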

During this phase the team observed high load spikes during write peaks because excluded nodes caused many relocating shards, increasing disk I/O.

2. Automated Migration (Nov – Jan)

After the manual phase, an automation framework was built to drive the migration based on cluster health, load, and relocating shard count.

Only proceed when the cluster status is green and the load thresholds are met: at most 7 nodes with load > 30 and at most 3 nodes with load > 50.

Check that relocating shards ≤40 before starting a new batch.

After a batch finishes, take the nodes offline.
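The checks in this list can be sketched as a single gate function. This is an illustration under stated assumptions, not the team's framework: `ClusterSnapshot` and `can_start_next_batch` are hypothetical names, the load thresholds are read as "at most 7 nodes above 30 and at most 3 nodes above 50", and the snapshot is assumed to be filled from the `status` and `relocating_shards` fields of the cluster health API plus per-node load collected by monitoring.

```python
from dataclasses import dataclass


@dataclass
class ClusterSnapshot:
    status: str             # "green", "yellow" or "red"
    relocating_shards: int  # shards currently relocating
    node_loads: list        # one load value per node


def can_start_next_batch(snap, max_relocating=40):
    """Return True only if every precondition from the list above holds."""
    if snap.status != "green":
        return False
    over_30 = sum(1 for load in snap.node_loads if load > 30)
    over_50 = sum(1 for load in snap.node_loads if load > 50)
    # Assumed reading of the thresholds: a few hot nodes are tolerated
    if over_30 > 7 or over_50 > 3:
        return False
    return snap.relocating_shards <= max_relocating


calm = ClusterSnapshot("green", 12, [10.0] * 500)
busy = ClusterSnapshot("green", 12, [31.0] * 8 + [10.0] * 492)
print(can_start_next_batch(calm), can_start_next_batch(busy))
```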

POST _cluster/reroute
{
  "commands": [
    {
      "move": {
        "index": "log_appcode-2023.18",
        "shard": 59,
        "from_node": "data2_node1",
        "to_node": "data2_node10"
      }
    }
  ]
}
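When several shards need to be moved in one call, the reroute body can be assembled programmatically. The sketch below is a hypothetical helper (`move_command` and `reroute_body` are illustrative names) that reproduces the structure of the request above; it does not contact a cluster.

```python
import json


def move_command(index, shard, from_node, to_node):
    """Build one 'move' command for the cluster reroute API."""
    return {
        "move": {
            "index": index,
            "shard": shard,
            "from_node": from_node,
            "to_node": to_node,
        }
    }


def reroute_body(moves):
    """Build the body for POST _cluster/reroute from (index, shard, from, to) tuples."""
    return {"commands": [move_command(*m) for m in moves]}


body = reroute_body([("log_appcode-2023.18", 59, "data2_node1", "data2_node10")])
print(json.dumps(body, indent=2))
```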

Adjusting batch size based on peak periods (2 nodes during high load, 5 nodes during low load) kept the cluster stable.
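The batch sizing rule can be expressed in a few lines. In this sketch the sizes 2 and 5 come from the article, while `next_batch` and the `is_peak` flag are hypothetical; how the team actually detected a peak period is not described in the source.

```python
def batch_size(is_peak):
    """2 nodes per batch during high load, 5 during low load (from the article)."""
    return 2 if is_peak else 5


def next_batch(pending_nodes, is_peak):
    """Split off the next batch and return (batch, remaining)."""
    size = batch_size(is_peak)
    return pending_nodes[:size], pending_nodes[size:]


pending = [f"data1_node{i}" for i in range(1, 8)]  # illustrative node names
batch, pending = next_batch(pending, is_peak=True)
print(batch, len(pending))
```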

3. Iterative Optimisation (Jan – Feb)

Further tuning focused on shard allocation and node failure handling:

total_shards_per_node: limit the number of shards per node to avoid skew. The limit was computed as:

total_shards_per_node = shard_num / (nodes_count * 0.95 * 0.5)
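To make the formula concrete, here is a small worked calculation. The source does not define its terms, so this sketch assumes `shard_num` is the total shard count of an index and `nodes_count` is the number of data nodes. Rounding up to a whole number is also an assumption, made because the setting takes an integer.

```python
import math


def total_shards_per_node(shard_num, nodes_count):
    """Apply the article's formula; rounding up is an assumption of this sketch."""
    if nodes_count <= 0:
        raise ValueError("nodes_count must be positive")
    return math.ceil(shard_num / (nodes_count * 0.95 * 0.5))


# Illustrative inputs only: 1000 shards spread over 500 data nodes
print(total_shards_per_node(1000, 500))
```

The two factors shrink the node count used in the division, so the resulting cap is looser than a perfectly even spread would need. This presumably leaves room for nodes that are excluded or offline during the migration.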

index.unassigned.node_left.delayed_timeout: increase to 120 minutes, or use a random value (100‑300 min), to spread recovery load.

PUT /index/_settings
{
  "settings": {
    "index.unassigned.node_left.delayed_timeout": "120m"
  }
}
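The random variant mentioned above could look like the following sketch, which picks a per-index timeout between 100 and 300 minutes. The function name `delayed_timeout_body` is hypothetical, and the article does not say how the team generated its random values.

```python
import json
import random


def delayed_timeout_body(rng, low_min=100, high_min=300):
    """Build an index settings body with a random delayed_timeout in minutes."""
    minutes = rng.randint(low_min, high_min)  # inclusive on both ends
    return {
        "settings": {
            "index.unassigned.node_left.delayed_timeout": f"{minutes}m"
        }
    }


rng = random.Random()  # pass a seeded Random for reproducible runs
print(json.dumps(delayed_timeout_body(rng), indent=2))
```

Giving each index a different timeout means that, if a node leaves, the delayed shard recoveries do not all start at the same moment.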

cluster.routing.allocation.cluster_concurrent_rebalance: set to 0 to stop automatic rebalancing during migration, then reroute manually when needed.
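A minimal sketch of the corresponding cluster settings bodies follows. The helper name `rebalance_settings` is hypothetical, and using the `transient` scope is an assumption carried over from the exclude example earlier in the article. Passing `None` serialises to JSON `null`, which the cluster settings API treats as a reset to the default value.

```python
import json


def rebalance_settings(concurrent_rebalance):
    """Body for PUT _cluster/settings; None resets the setting to its default."""
    return {
        "transient": {
            "cluster.routing.allocation.cluster_concurrent_rebalance": concurrent_rebalance
        }
    }


print(json.dumps(rebalance_settings(0)))     # pause automatic rebalancing
print(json.dumps(rebalance_settings(None)))  # restore the default afterwards
```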

These changes reduced peak load, lowered write‑backlog, and prevented node‑left failures.

Results

Migration speed more than doubled; all nodes were moved a week ahead of schedule.

Automation eliminated most manual effort.

Load during peak hours stayed within acceptable limits, and write‑backlog dropped significantly.

Conclusion

The multi‑stage approach (initial manual trials, automated batch migrations, and targeted parameter tuning) allowed Qunar to relocate a PB‑scale Elasticsearch cluster with zero downtime. The lessons on shard allocation, delayed recovery, and controlled rebalancing apply to any large‑scale ES migration and to similar distributed datastores.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact us and we will review it promptly.
