Zero‑Downtime Migration of 2000 Microservices to a Multi‑Cluster Managed Kafka at Wix
This article details how Wix migrated its 2000 microservices from a self‑hosted Kafka cluster to a multi‑cluster Confluent Cloud setup with zero downtime, describing the design decisions, partitioning strategy, Greyhound SDK usage, replication service, migration orchestrator, and best‑practice recommendations.
In 2021 our team migrated Wix's 2000 microservices from a self‑hosted Kafka cluster to a multi‑cluster Confluent Cloud platform (Confluent Enterprise's managed service) in a seamless, zero‑downtime fashion without requiring service owners to intervene.
This was the most challenging project I have led; in this article I share the key design decisions, best practices, and tips for such a migration.
1. Split Overloaded Clusters
In recent years the growing number of services in an event‑driven architecture put increasing load on Wix's OLTP services and Kafka. Our self‑hosted cluster grew from 5K topics and >45K partitions in 2019 to 20K topics and >200K partitions in 2021, with traffic rising from ~450 million records per day to >2.5 billion records per day per cluster.
Because metadata for every partition had to be loaded on controller start‑up, failover and leader election slowed down, and a single broker failure could dramatically lengthen recovery time, leaving many partitions under‑replicated.
To avoid instability in production we decided to migrate the self‑hosted Kafka clusters to Confluent Cloud and split each single‑cluster per data‑center into multiple clusters.
2. Why Managed Kafka?
Managing a Kafka cluster is hard, especially tasks such as rebalancing partitions or upgrading broker versions, which become painful when metadata is overloaded. Using a cloud‑native Kafka platform like Confluent Cloud offers four main benefits:
Better cluster performance and flexibility: rebalancing keeps individual brokers from becoming bottlenecks and makes scaling the cluster easy.
Transparent version upgrades: with KIP‑500 (KRaft), metadata is stored in Kafka itself instead of ZooKeeper, eliminating the external metadata system and enabling automatic, transparent broker upgrades.
Easy addition of new clusters: setting up a new cluster is straightforward.
Tiered storage: older records can be moved to cheap S3 storage, extending retention without high disk costs.
3. Switching 2000 Microservices to a Multi‑Cluster Kafka Architecture
Wix uses a standard JVM library and proxy service called Greyhound to interact with Kafka. Greyhound is an open‑source SDK that provides features such as concurrent message handling, batching, and retries.
Greyhound serves as the backbone for roughly 2200 microservices; introducing the multi‑cluster concept only required changes in a few places (the library and proxy). New topics must explicitly specify the target cluster, and developers simply define the logical cluster name when building a Greyhound producer or consumer.
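Greyhound's actual builder API is not shown here. As a rough illustration of the logical‑cluster idea, the sketch below uses the plain Kafka Java client plus a hypothetical ClusterRegistry that resolves a logical name to bootstrap servers; all cluster names and addresses are invented for the example.

```java
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.common.serialization.StringSerializer;

// Hypothetical sketch of the "logical cluster" idea (not Greyhound's real API):
// services refer to a cluster by name, and the SDK resolves it to bootstrap servers,
// so broker addresses never leak into application code.
final class ClusterRegistry {

    // In practice this mapping lives in configuration or service discovery, not in code.
    private static final Map<String, String> BOOTSTRAP_SERVERS = Map.of(
            "core-prod", "core-prod-kafka-1:9092,core-prod-kafka-2:9092",
            "ci-cd",     "ci-cd-kafka-1:9092");

    static KafkaProducer<String, String> producerFor(String logicalCluster) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, BOOTSTRAP_SERVERS.get(logicalCluster));
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        return new KafkaProducer<>(props);
    }
}
```

A consumer would be built the same way; the only per‑service decision is which logical cluster name to use.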
4. How to Split
We split Kafka clusters based on Service‑Level Agreements (SLAs). For example, CI/CD pipelines and data‑migration cases have different SLAs from production services. Not every data‑center hosts the same number of clusters.
5. Exhausting Traffic of a Data‑Center?
The initial design attempted to exhaust all traffic of a data‑center before switching its producers and consumers to the new clusters. This proved impossible: some services run in only a single data‑center, so a full cut‑over would be risky and could cause data loss.
Instead we planned a new design that performs the migration during live traffic.
6. Zero‑Downtime Migration
Live‑traffic migration requires meticulous planning and step‑wise execution, with automation and a rollback mechanism to minimize the “blast radius”. Producers are migrated only after all of a topic's consumers have been safely switched, ensuring no message loss.
Replication
To guarantee no loss during migration we built a dedicated replicator service. Once consumer topics are identified, the replicator creates the corresponding topics in the target cloud cluster, consumes records from the self‑hosted cluster, and writes them to the destination.
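The article does not show the replicator's internals. A minimal, single‑threaded sketch with the plain Kafka clients might look like the following; the topic name, the blocking send, and the in‑memory offset map are simplifications and assumptions, not Wix's actual implementation.

```java
import java.time.Duration;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.TopicPartition;

// Illustrative replication loop, not Wix's actual service: copy each record from the
// self-hosted cluster to the cloud cluster and remember, per partition, how source
// offsets map to destination offsets so consumers can later be moved over precisely.
final class ReplicatorSketch {

    static void replicate(KafkaConsumer<byte[], byte[]> source,
                          KafkaProducer<byte[], byte[]> target,
                          Map<TopicPartition, Map<Long, Long>> offsetMapping) throws Exception {
        source.subscribe(List.of("checkout-events"));          // illustrative topic name
        while (true) {
            for (ConsumerRecord<byte[], byte[]> rec : source.poll(Duration.ofMillis(500))) {
                long targetOffset = target
                        .send(new ProducerRecord<>(rec.topic(), rec.partition(), rec.key(), rec.value()))
                        .get()                                  // blocking per record, for clarity only
                        .offset();
                offsetMapping
                        .computeIfAbsent(new TopicPartition(rec.topic(), rec.partition()),
                                         tp -> new HashMap<>())
                        .put(rec.offset(), targetOffset);
            }
            source.commitSync();                                // mark progress on the source cluster
        }
    }
}
```

The per‑partition offset map is what later lets consumers translate their position from the self‑hosted cluster to the cloud cluster.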
Consumer Migration
The replicator also provides offset mapping for each partition so that Greyhound consumers can resume processing from the correct offset in the cloud cluster, derived from the first uncommitted offset in the self‑hosted cluster.
If a failure occurs, the orchestrator can request consumers to fall back to the self‑hosted cluster.
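How Greyhound performs this switch internally is not described; conceptually, though, the step amounts to the seek below, where offsetMapping is the table built by the replicator and all other names are illustrative.

```java
import java.util.Map;
import java.util.Set;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

// Conceptual switch-over for one partition (illustrative, not Greyhound code):
// translate the first uncommitted offset on the self-hosted cluster through the
// replicator's mapping and resume from the translated position on the cloud cluster.
final class ConsumerSwitchSketch {

    static void switchOver(KafkaConsumer<byte[], byte[]> selfHosted,
                           KafkaConsumer<byte[], byte[]> cloud,
                           TopicPartition partition,
                           Map<TopicPartition, Map<Long, Long>> offsetMapping) {
        OffsetAndMetadata committed = selfHosted.committed(Set.of(partition)).get(partition);
        long firstUncommitted = committed == null ? 0L : committed.offset();

        // Real code must handle offsets the replicator has not copied yet.
        long cloudOffset = offsetMapping.get(partition).get(firstUncommitted);

        cloud.assign(Set.of(partition));
        cloud.seek(partition, cloudOffset);   // processing continues on the cloud cluster
    }
}
```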
Producer Migration
Once all consumers for a topic have migrated, its producers can be switched. The initial design cached incoming requests in memory, which was deemed risky. We instead leveraged Wix's progressive Kubernetes deployment: a new pod only starts accepting traffic after all health checks pass, ensuring that older pods continue serving requests while new pods connect to the new cluster.
Greyhound producers simply read the database at pod start‑up to determine which cluster to connect to, avoiding dynamic switching and request caching.
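The article only states that producers read the assignment from a database once at pod start‑up; a hypothetical sketch of such a lookup (the migration_state table and its columns are assumptions) could be:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

// Hypothetical start-up lookup: which cluster should this pod's producer target?
// Old pods keep their existing connection; new pods read the updated row, so the
// switch happens gradually as the rolling deployment replaces pods.
final class ProducerClusterLookup {

    static String clusterForTopic(Connection db, String topic) throws Exception {
        try (PreparedStatement stmt = db.prepareStatement(
                "SELECT target_cluster FROM migration_state WHERE topic = ?")) {   // assumed schema
            stmt.setString(1, topic);
            try (ResultSet rs = stmt.executeQuery()) {
                return rs.next() ? rs.getString("target_cluster") : "self-hosted"; // default: old cluster
            }
        }
    }
}
```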
Replication Bottleneck
Producer migration can only finish after all consumers have left the source cluster. Many topics have multiple consumers from different services, increasing the load on the replicator. Our older self‑hosted brokers limited the number of topics a consumer could handle; attempts to increase message.max.bytes triggered the KAFKA‑9254 bug. We resolved this by scaling out replicator consumers and sharding the topics among them.
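The exact sharding scheme is not spelled out; one simple approach consistent with the description is to hash each topic name onto one of N replicator instances, as in this illustrative sketch:

```java
import java.util.List;
import java.util.stream.Collectors;

// Illustrative sketch: deterministically assign topics to replicator instances so that
// each replicator consumer subscribes to a bounded subset rather than every topic.
final class ReplicatorSharding {

    static List<String> topicsForShard(List<String> allTopics, int shardId, int shardCount) {
        return allTopics.stream()
                .filter(topic -> Math.floorMod(topic.hashCode(), shardCount) == shardId)
                .collect(Collectors.toList());
    }
}
```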
Migration Beyond – External Consumer Control
The “live‑traffic” migration design enables dynamic reconfiguration of Greyhound consumers without deploying new code. Commands can be sent to consumers to do the following (a rough sketch of such a command appears after the list):
Switch clusters (unsubscribe from the current cluster and subscribe to another).
Skip unprocessable records.
Adjust processing rate or add throttling delays.
Rebalance records across partitions when latency grows.
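The article does not define the command format; a plausible shape for such control messages, with every field name assumed, might be:

```java
// Hypothetical control message for externally reconfiguring a Greyhound consumer.
// Every field name, and the transport (e.g. a dedicated control topic), is assumed.
record ConsumerCommand(
        String consumerGroup,      // which consumer group the command targets
        String action,             // e.g. "SWITCH_CLUSTER", "SKIP_RECORD", "THROTTLE"
        String targetCluster,      // logical cluster name, for SWITCH_CLUSTER
        Long skipOffset,           // record offset to skip, for SKIP_RECORD
        Integer throttleMillis) {} // delay between records, for THROTTLE
```

A small dispatcher inside each consumer would then map the action onto the corresponding operation: unsubscribing and resubscribing for a cluster switch, advancing past a poisoned offset for a skip, or inserting a delay for throttling.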
7. Best Practices and Tips
Create a script that checks the migration state and aborts if expectations are not met.
Prepare rollback procedures for every stage and test them extensively.
Start with test/relay topics that have no production impact.
Build custom monitoring dashboards to show current and historical migration status.
Ensure self‑hosted Kafka brokers are on the latest patch version before migration.
8. Summary
Using Greyhound, a dedicated orchestrator, and automation scripts, we achieved a seamless, zero‑downtime migration of 2000 microservices during live traffic.
If you can fully exhaust a data‑center's traffic or tolerate brief processing pauses, simply switch producers and consumers over to the new clusters in one step; this greatly simplifies the migration and saves time.
Otherwise, when migrating under live traffic, follow a strict ordering (consumers before producers, or the reverse) and understand what the chosen order implies for rollback and potential data loss.
Author: Natan Silnitsky, Backend Infrastructure Engineer at Wix.
Original article: Migrating to a Multi‑Cluster Managed Kafka with 0‑Downtime
