
Migrating from Eureka to Alibaba Nacos: A High‑Availability Sync Solution

Facing frequent service outages as their microservice count grew, MasterTeach migrated from Eureka to Alibaba Nacos, designing a high‑availability Nacos‑Eureka sync solution with consistent‑hash sharding, Zookeeper/Etcd coordination, automated DevOps integration, and extensive fault‑tolerance testing to ensure stable operation of over 660 services.

Alibaba Cloud Native

Background

Rapid growth of traffic and micro‑service count at a large online‑education platform caused the original Eureka registration center to become a single point of failure. Spring Cloud announced that Eureka is in maintenance mode, so a more robust, scalable registry was required.

Problem

Eureka could not handle the heartbeat load of ~660 services and lacked high availability, leading to service‑wide crashes.

Solution

After evaluating open‑source registries, Alibaba Nacos was selected as the new registration center for the Solar micro‑service ecosystem. A bidirectional synchronization layer between Eureka and Nacos was built to ensure seamless migration.

Sync scheme evolution

Official Nacos‑Eureka sync: A single sync server could not handle the load and offered no HA.

Consistent‑hash + Zookeeper: Multiple sync servers share the work; Zookeeper watches detect server failures and trigger a re‑hash that redistributes the failed server's services.

Master‑slave + Zookeeper: A backup server takes over when its primary fails, but cost and complexity are higher than with consistent‑hash sharding.

Consistent‑hash + Etcd: Service lists are persisted in Etcd; Etcd watches handle failure detection; Etcd also stores the hash ring and task assignments, which reduces the infrastructure to operate.

Incremental service list updates: The periodic full‑list sync was replaced by DevOps‑driven incremental updates through the Nacos API; services whose migration is complete are removed from the sync list manually.

Server scaling: The deployment grew from 8 servers (4C8G each) to 12 and then 20; upgrading the servers to 8C16G is planned if needed.

Implementation details

The sync layer uses a consistent‑hash ring to assign services to sync servers. The number of virtual nodes (replicas) is configurable to achieve even distribution.

// Virtual node count per sync server, read from configuration:
//   sync.consistent.hash.replicas = 1000

// Build the hash ring: each physical node contributes `replicas` virtual nodes
SortedMap<Integer, T> circle = new TreeMap<>();
for (int i = 0; i < replicas; i++) {
    // Virtual node key has the form "<node>##<index>"
    String nodeStr = node.toString().concat("##").concat(Integer.toString(i));
    int hashcode = getHash(nodeStr);
    circle.put(hashcode, node);
}
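To make the effect of the virtual-node count concrete, here is a minimal, self-contained sketch of such a ring. The class `VirtualNodeRing`, the method names, the server names and the service names are all hypothetical. The hash function is an assumption too: it follows the variant commonly circulated under the name FNV1_32_HASH, with a bit mask instead of an absolute value to keep positions non-negative, and may differ from the article's `getHash`.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

/** Minimal consistent-hash ring with virtual nodes (illustrative, not the article's class). */
class VirtualNodeRing {
    private final SortedMap<Integer, String> circle = new TreeMap<>();
    private final int replicas;

    VirtualNodeRing(int replicas, List<String> nodes) {
        if (replicas < 1 || nodes.isEmpty()) {
            throw new IllegalArgumentException("need at least one node and one replica");
        }
        this.replicas = replicas;
        for (String node : nodes) {
            addNode(node);
        }
    }

    /** Places `replicas` virtual nodes for one physical node on the ring. */
    void addNode(String node) {
        for (int i = 0; i < replicas; i++) {
            circle.put(fnv1Hash32(node + "##" + i), node);
        }
    }

    /** Returns the node that owns the first virtual node at or after the key's hash. */
    String nodeFor(String key) {
        SortedMap<Integer, String> tail = circle.tailMap(fnv1Hash32(key));
        return circle.get(tail.isEmpty() ? circle.firstKey() : tail.firstKey());
    }

    /** Counts how many of the given service names land on each node. */
    Map<String, Integer> distribution(List<String> services) {
        Map<String, Integer> counts = new HashMap<>();
        for (String service : services) {
            counts.merge(nodeFor(service), 1, Integer::sum);
        }
        return counts;
    }

    /** Assumed stand-in for getHash: an FNV-style 32-bit hash with extra mixing. */
    static int fnv1Hash32(String s) {
        final int prime = 16777619;
        int hash = (int) 2166136261L;
        for (int i = 0; i < s.length(); i++) {
            hash = (hash ^ s.charAt(i)) * prime;
        }
        hash += hash << 13;
        hash ^= hash >> 7;
        hash += hash << 3;
        hash ^= hash >> 17;
        hash += hash << 5;
        return hash & 0x7fffffff; // keep ring positions non-negative
    }
}

class VirtualNodeRingDemo {
    public static void main(String[] args) {
        List<String> nodes = new ArrayList<>();
        for (int i = 1; i <= 8; i++) {
            nodes.add("sync-server-" + i); // hypothetical server names
        }
        List<String> services = new ArrayList<>();
        for (int i = 0; i < 660; i++) {
            services.add("service-" + i); // hypothetical service names
        }
        // Compare the per-server counts with very few and with many virtual nodes
        for (int replicas : new int[] {1, 1000}) {
            VirtualNodeRing ring = new VirtualNodeRing(replicas, nodes);
            System.out.println(replicas + " replicas: " + ring.distribution(services));
        }
    }
}
```

Running the demo prints the number of services per server for both settings. With a single virtual node per server the counts tend to be lopsided, which is the reason for making the replica count configurable.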

Etcd watches monitor server liveness. On a DELETE event the corresponding virtual nodes are removed and pending tasks are reassigned.

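etcdManager.watchEtcdKeyAsync(REGISTER_WORKER_PATH, true, response -> {
    for (WatchEvent event : response.getEvents()) {
        // Only DELETE events (a worker registration that disappeared) matter here
        if (event.getEventType() != WatchEvent.EventType.DELETE) {
            continue;
        }
        String key = Optional.ofNullable(event.getKeyValue().getKey())
                             .map(bs -> bs.toString(Charsets.UTF_8))
                             .orElse("");
        // The worker IP is expected at index 3 after splitting the key on "/"
        String[] ks = key.split("/");
        if (ks.length < 4) {
            log.warn("unexpected worker key: {}", key);
            continue;
        }
        String workerIp = ks[3];
        log.info("{} lost heart beat", workerIp);
        // Ignore our own key; for any other worker, drop it from the node cache
        // and delete its task record so that its services can be reassigned
        if (!IPUtils.getIpAddress().equalsIgnoreCase(workerIp)) {
            nodeCaches.remove(workerIp);
            manager.deleteEtcdValueByKey(PER_WORKER_PROCESS_SERVICE.concat("/").concat(workerIp), true);
        }
    }
});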
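The effect of such a DELETE event on the ring can be sketched in memory, without Etcd. The example below is an illustration under assumptions: the class `FailoverSketch`, the IP addresses, the service names and the hand-picked ring positions are all hypothetical, and fixed integer positions replace the real hash so that the outcome is easy to follow.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

class FailoverSketch {
    /** Owner of a ring position: first virtual node at or after it, wrapping around. */
    static String ownerOf(SortedMap<Integer, String> circle, int hash) {
        SortedMap<Integer, String> tail = circle.tailMap(hash);
        return circle.get(tail.isEmpty() ? circle.firstKey() : tail.firstKey());
    }

    /** Drops every virtual node that belongs to the failed server. */
    static void removeNode(SortedMap<Integer, String> circle, String failedIp) {
        circle.values().removeIf(failedIp::equals);
    }

    /** Recomputes all owners and returns only the services whose owner changed. */
    static Map<String, String> reassign(SortedMap<Integer, String> circle,
                                        Map<String, Integer> serviceHashes,
                                        Map<String, String> assignments) {
        Map<String, String> moved = new TreeMap<>();
        for (Map.Entry<String, Integer> e : serviceHashes.entrySet()) {
            String newOwner = ownerOf(circle, e.getValue());
            String oldOwner = assignments.put(e.getKey(), newOwner);
            if (!newOwner.equals(oldOwner)) {
                moved.put(e.getKey(), newOwner);
            }
        }
        return moved;
    }

    public static void main(String[] args) {
        // Hypothetical ring: three servers with two virtual nodes each
        SortedMap<Integer, String> circle = new TreeMap<>();
        circle.put(10, "10.0.0.1");
        circle.put(20, "10.0.0.2");
        circle.put(30, "10.0.0.3");
        circle.put(40, "10.0.0.2");
        circle.put(50, "10.0.0.1");
        circle.put(60, "10.0.0.3");

        // Hypothetical services with precomputed ring positions
        Map<String, Integer> serviceHashes = new HashMap<>();
        serviceHashes.put("svc-a", 5);
        serviceHashes.put("svc-b", 15);
        serviceHashes.put("svc-c", 25);
        serviceHashes.put("svc-d", 35);
        serviceHashes.put("svc-e", 65);

        Map<String, String> assignments = new HashMap<>();
        reassign(circle, serviceHashes, assignments); // initial assignment

        removeNode(circle, "10.0.0.2");               // simulated DELETE event
        System.out.println(reassign(circle, serviceHashes, assignments));
        // prints {svc-b=10.0.0.3, svc-d=10.0.0.1}
    }
}
```

Only the two services that were owned by the failed server move, and they land on different survivors. Services owned by the other servers keep their owner, which is the property that makes consistent hashing attractive for this kind of failover.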

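Task assignment hashes the service name with FNV1_32_HASH and picks the first virtual node at or after that position on the ring, wrapping around to the first node when the hash lies beyond the last one.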

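int hash = getHash(key.toString());
// No virtual node sits exactly at this hash: take the next one clockwise,
// or wrap around to the first node on the ring
if (!circle.containsKey(hash)) {
    SortedMap<Integer, T> tailMap = circle.tailMap(hash);
    hash = tailMap.isEmpty() ? circle.firstKey() : tailMap.firstKey();
}
T node = circle.get(hash);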
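The three cases of this lookup (a position between two virtual nodes, an exact match, and the wrap-around) can be checked with a tiny hand-built ring. This is a sketch only: the class `RingLookup`, the server names and the ring positions are hypothetical, and the guard for an empty ring is an addition that the fragment above does not contain.

```java
import java.util.SortedMap;
import java.util.TreeMap;

class RingLookup {
    /** Returns the value at the first ring position >= hash, wrapping to the first position. */
    static <T> T locate(SortedMap<Integer, T> circle, int hash) {
        if (circle.isEmpty()) {
            return null; // no sync server registered (added guard, an assumption)
        }
        if (!circle.containsKey(hash)) {
            SortedMap<Integer, T> tailMap = circle.tailMap(hash);
            hash = tailMap.isEmpty() ? circle.firstKey() : tailMap.firstKey();
        }
        return circle.get(hash);
    }

    public static void main(String[] args) {
        SortedMap<Integer, String> circle = new TreeMap<>();
        circle.put(100, "server-A");
        circle.put(200, "server-B");
        circle.put(300, "server-C");

        System.out.println(locate(circle, 150)); // prints server-B (next position clockwise)
        System.out.println(locate(circle, 300)); // prints server-C (exact match)
        System.out.println(locate(circle, 350)); // prints server-A (wraps around)
    }
}
```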

High‑availability measures

Consistent‑hash sharding keeps the load spread across the surviving servers when a server fails, because only the failed server's services are reassigned.

Watches on the coordination store (Zookeeper in the earlier scheme, Etcd in the later one) trigger an immediate re‑hash and task migration.

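Etcd lease TTL (default 30 s) with periodic heartbeat renewal tracks node liveness: a server that stops renewing its lease loses its registration key once the TTL expires, which surfaces as a DELETE event.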

Automatic node recovery re‑adds virtual nodes to the ring and cleans up surplus tasks.
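The lease mechanism in this list can be modelled in a few lines. The sketch below is an in-memory imitation of the behaviour, not the Etcd client API: the class `LeaseTable`, its methods and the IP addresses are hypothetical, and time is passed in explicitly so that the example stays deterministic.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

/** In-memory model of TTL leases; it imitates the behaviour, not the Etcd client API. */
class LeaseTable {
    private final long ttlMillis;
    private final Map<String, Long> expiresAt = new HashMap<>();

    LeaseTable(long ttlMillis) {
        this.ttlMillis = ttlMillis;
    }

    /** Registers a node, or renews its lease if it is already registered. */
    void heartbeat(String node, long nowMillis) {
        expiresAt.put(node, nowMillis + ttlMillis);
    }

    /** Removes nodes whose lease has lapsed and returns them, like DELETE events. */
    List<String> expire(long nowMillis) {
        List<String> expired = new ArrayList<>();
        Iterator<Map.Entry<String, Long>> it = expiresAt.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, Long> entry = it.next();
            if (entry.getValue() <= nowMillis) {
                expired.add(entry.getKey());
                it.remove();
            }
        }
        return expired;
    }

    boolean isAlive(String node) {
        return expiresAt.containsKey(node);
    }

    public static void main(String[] args) {
        LeaseTable leases = new LeaseTable(30_000); // 30 s TTL, as in the text
        leases.heartbeat("10.0.0.1", 0);
        leases.heartbeat("10.0.0.2", 0);
        leases.heartbeat("10.0.0.1", 10_000);       // only the first server renews

        System.out.println(leases.expire(30_000));  // prints [10.0.0.2]
        System.out.println(leases.isAlive("10.0.0.1")); // prints true
    }
}
```

In the real system the expiry is performed by Etcd itself, and the sync servers only see the resulting DELETE event. A recovered server would simply register again and start renewing a new lease.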

Deployment and testing

In the FAT environment, the sync cluster handled ~660 services without issues.

In PROD, scaling from 8 to 12 then 20 sync servers mitigated heartbeat loss.

Disaster‑recovery drills showed that, with 8 of 9 sync nodes down, only one service instance was lost; the system recovered within a minute after the re‑hash.

Full upgrades of FAT, UAT and PROD environments were performed via Ansible.

Key takeaways

A consistent‑hash‑based synchronization layer, coordinated by Zookeeper/Etcd and driven by automated DevOps pipelines, can carry the migration away from a legacy registration center at the scale of hundreds of micro‑services. In the drills, the design recovered within a minute; it ran stably with about 660 services and removed the single point of failure that the Eureka deployment had become.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
