Databases 21 min read

How We Migrated a Multi‑Petabyte Elasticsearch Cluster Across Data Centers Without Downtime

This article details the end‑to‑end process of moving Qunar's massive Elasticsearch logging cluster from a saturated data‑center to a new facility, covering background constraints, migration planning, manual and automated steps, performance‑tuning parameters, shard‑balancing techniques, and the final outcomes achieved.

ITPUB
ITPUB
ITPUB
How We Migrated a Multi‑Petabyte Elasticsearch Cluster Across Data Centers Without Downtime

Background

Qunar's real‑time logging platform uses an ELK stack where the Elasticsearch (ES) cluster and Kibana reside in Data Center A, while Logstash runs in Data Center B. Data Center A was saturated, making it impossible to add new machines, and cross‑data‑center traffic caused peak‑time performance issues.

To resolve these problems, the team decided to migrate the entire ES cluster to Data Center B, aligning it with Logstash to improve network reliability.

Migration Plan

The ES logging platform ingests logs via Filebeat or Fluent‑Bit into Kafka, then Logstash and Flink write them to per‑appcode indices in ES. Kibana provides access with space‑>user‑>role‑>index‑pattern permissions.

Key migration challenges were:

Ensuring service availability and zero user impact during migration.

Improving migration speed for petabyte‑scale data while reducing manual effort.

Initial Manual Migration (November)

Nodes were excluded in batches using the cluster‑wide exclude._name setting, moving shards to remaining nodes. Example request:

PUT _cluster/settings
{
  "transient": {
    "cluster.routing.allocation.exclude._name": "data1_node1,data2_node1,...,data2_node5"
  }
}

During high‑write peaks, excluded nodes experienced high load and log‑queue buildup due to massive shard relocation and disk I/O saturation.

Automation Phase (Nov – Jan)

Based on manual experience, an automated workflow was built to evaluate cluster health, shard relocation count, and load before each batch migration. The process:

Check that cluster status is green and load thresholds ( load>30 on ≤7 nodes, load>50 on ≤3 nodes ) are met.

Verify the number of relocating shards; proceed only if ≤40.

Exclude the next batch of nodes (2 nodes during peak, 5 during off‑peak).

After relocation, bring the old nodes offline.

Automation reduced human effort and kept the number of migrating nodes stable.

Iterative Optimizations (Jan – Feb)

Several tuning parameters were introduced:

total_shards_per_node : limited shards per node to avoid skew (e.g., total_shards_per_node: 2 in index templates).

index.unassigned.node_left.delayed_timeout : increased to 120 minutes or set to a random value (100‑300 min) to spread shard recovery after node failures.

cluster_concurrent_rebalance : raised to 10 during normal operation, then set to 0 during migration to prevent concurrent rebalance that would add load.

Manual POST _cluster/reroute commands were used to move hot shards to less‑loaded nodes.

These adjustments lowered peak load, stabilized CPU and I/O, and cut migration time roughly in half.

Single‑Machine‑Per‑Node Migration

Instead of moving both data nodes on a machine together, the team migrated one data node at a time (e.g., data1_nodeX then data2_nodeX). This allowed larger batch sizes (up to 50 % more during peaks) while keeping load below 10.

Coordinate and Master Node Migration

Coordinate nodes (read/write routers) were moved in weekly batches of five, updating Logstash output configurations accordingly. Master nodes required careful handling of the discovery list; new masters were deployed in Data Center B before decommissioning old masters to avoid election failures.

Results and Lessons Learned

The three‑month effort completed the full cluster migration with zero downtime, finishing a week ahead of schedule despite a New Year traffic peak. Key takeaways include:

Plan migration strategies per node type and anticipate issues.

Automate wherever possible to improve efficiency and repeatability.

Deep understanding of ES parameters (total_shards_per_node, delayed_timeout, rebalance settings) is essential for stable large‑scale migrations.

References

https://www.elastic.co/cn/blog/how-many-shards-should-i-have-in-my-elasticsearch-cluster

https://www.elastic.co/guide/en/elasticsearch/reference/7.7/allocation-total-shards.html

https://www.elastic.co/guide/en/elasticsearch/reference/7.7/delayed-allocation.html

https://www.elastic.co/guide/en/elasticsearch/reference/7.7/index-modules-translog.html

https://www.elastic.co/guide/en/elasticsearch/reference/7.7/cluster-reroute.html

https://cloud.tencent.com/developer/article/1334743?cps_key=6a15b90f1178f38fb09b07f16943cf3e

https://blog.csdn.net/laoyang360/article/details/108047071

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Sign in to view source
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactadmin@besthub.devand we will review it promptly.

Elasticsearchperformance tuningCluster MigrationData centerShard Allocation
ITPUB
Written by

ITPUB

Official ITPUB account sharing technical insights, community news, and exciting events.

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.