
Migrating Kafka from EC2 to PaaSTA on Kubernetes: Strategy, Implementation, and Lessons Learned

This article details Yelp's end‑to‑end migration of Kafka clusters from EC2‑based brokers to Kubernetes‑based PaaSTA services, covering architectural changes, automated tooling, risk mitigation, rollback procedures, and practical lessons learned from the deployment process.

Cloud Native Technology Community

The article, translated from "Kafka on PaaSTA: Running Kafka on Kubernetes at Yelp (Part 2 – Migration)", explains how Yelp migrated existing Kafka clusters running on EC2 to a Kubernetes‑based internal platform called PaaSTA, ensuring zero downtime for producers and consumers.

Background: The original EC2 deployment used Auto Scaling Groups (ASGs), with Elastic Load Balancers (ELBs) as the cluster entry point, a custom rebalance algorithm, and cron jobs for automatic partitioning. The migration replaces three key components: the cluster entry point, the rebalance algorithm, and the automatic partitioning logic give way to Yelp's service mesh, Cruise Control, and Tron jobs, respectively.

The source article's comparison table contrasts the EC2 and PaaSTA deployments on three points: cluster entry point, balancing, and automatic partitioning.

Migration Strategy Overview: The goal is to switch from the EC2-specific components to their PaaSTA equivalents without client downtime. The process begins by provisioning a PaaSTA-based load balancer alongside the existing ELB. The kafka_discovery files (generated by Puppet) are then updated to include the new service-mesh endpoint, and their distribution is automated with cron jobs instead of Puppet.

Example discovery file:

---
clusters:
  uswest1-devc:
    broker_list:
    - kafka-example-cluster-elb-uswest1devc.<omitted>.<omitted>.com:9092
    - kafka-example-cluster-elb-uswest1devc.<omitted>.<omitted>.com:9092
    zookeeper: xx.xx.xx.xxx:2181,xx.xx.xx.xxx:2181,xx.xx.xx.xxx:2181/kafka-example-cluster-uswest1-devc
    local_config:
      cluster: uswest1-devc
      ...
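
The discovery update described above can be sketched in Python. This is only an illustration, not Yelp's tooling: the function name, the placeholder hostnames, and the choice to replace the broker list (rather than append to it) are all assumptions, and the file is modeled as an already-parsed mapping so the sketch needs no YAML library.

```python
import copy


def with_broker_list(discovery, cluster, brokers):
    """Return a copy of a parsed kafka_discovery mapping in which one
    cluster's broker_list is replaced; the input mapping is left untouched."""
    updated = copy.deepcopy(discovery)
    updated["clusters"][cluster]["broker_list"] = list(brokers)
    return updated


# Placeholder values standing in for the omitted hostnames in the example file.
discovery = {
    "clusters": {
        "uswest1-devc": {
            "broker_list": ["kafka-example-cluster-elb.invalid:9092"],
            "zookeeper": "zk.invalid:2181/kafka-example-cluster-uswest1-devc",
        }
    }
}

# Hypothetical service-mesh endpoint; the real name is not given in the article.
MESH_ENDPOINT = "kafka-example-cluster.mesh.invalid:9092"

migrated = with_broker_list(discovery, "uswest1-devc", [MESH_ENDPOINT])
```

Returning a copy keeps the previous mapping available, so the ELB entry can be restored if the new endpoint misbehaves.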

After the new components are in place, caches are cleared to avoid stale discovery data. The migration then proceeds in phases:

1. Deploy a dedicated Cruise Control instance with self‑healing disabled to avoid conflicts with the existing rebalance algorithm.

2. Launch a PaaSTA Kafka instance while keeping the original EC2 brokers running, effectively doubling the cluster size.

3. Once the PaaSTA brokers are healthy, create the __CruiseControlMetrics topic and disable the old automatic rebalance.

4. Use Cruise Control’s REST API to gradually move partitions from EC2 brokers to PaaSTA brokers, removing EC2 brokers from the ASG as they become empty.
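
Deciding when an EC2 broker is "empty" comes down to checking that no partition still lists it as a replica. The following is a minimal sketch of that check; the broker ids, topic names, and the shape of the assignment mapping are hypothetical, and how the assignment is fetched from the cluster is left out.

```python
def empty_brokers(assignment, candidates):
    """Return the candidate broker ids that no longer hold any replica.

    assignment maps (topic, partition) -> list of replica broker ids.
    """
    in_use = {broker for replicas in assignment.values() for broker in replicas}
    return sorted(set(candidates) - in_use)


# Hypothetical ids: 1-3 are EC2 brokers, 101-103 are PaaSTA brokers.
assignment = {
    ("events", 0): [101, 102, 3],   # broker 3 still holds a replica
    ("events", 1): [102, 103, 101],
    ("__CruiseControlMetrics", 0): [103],
}
ec2_brokers = [1, 2, 3]

# Brokers in this list hold no data and could be removed from the ASG.
drained = empty_brokers(assignment, ec2_brokers)
```

In this example brokers 1 and 2 are reported as drained, while broker 3 is held back until its last replica has moved.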

Rollback is achieved by reversing the steps, using Cruise Control's add_broker API instead of remove_broker. Terraform-managed AWS resources can be reverted with a git revert.
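
As a rough illustration of the drain and its rollback, the sketch below builds the request URLs for Cruise Control's remove_broker and add_broker endpoints, which are invoked with HTTP POST. The host name, the batch size, and the choice to undo the most recent batch first are assumptions made for the example; the article does not specify them. Requests are left in dry-run mode, so Cruise Control would only compute a proposal.

```python
from urllib.parse import urlencode

# Hypothetical address; 9090 is Cruise Control's default port.
CRUISE_CONTROL = "http://cruise-control.invalid:9090"


def broker_request_url(endpoint, broker_ids, dryrun=True):
    """Build the URL for a Cruise Control add_broker or remove_broker POST."""
    if endpoint not in ("add_broker", "remove_broker"):
        raise ValueError(f"unsupported endpoint: {endpoint}")
    query = urlencode(
        {
            "brokerid": ",".join(str(b) for b in broker_ids),
            "dryrun": str(dryrun).lower(),  # "true" only computes a proposal
            "json": "true",
        },
        safe=",",  # keep the comma-separated id list readable
    )
    return f"{CRUISE_CONTROL}/kafkacruisecontrol/{endpoint}?{query}"


def chunks(ids, size):
    """Split a list of broker ids into batches of at most `size`."""
    return [ids[i:i + size] for i in range(0, len(ids), size)]


def drain_urls(ec2_broker_ids, batch_size):
    """One remove_broker request per batch of EC2 brokers."""
    return [broker_request_url("remove_broker", b)
            for b in chunks(ec2_broker_ids, batch_size)]


def rollback_urls(drained_batches):
    """One add_broker request per drained batch, most recent batch first
    (an assumption of this sketch)."""
    return [broker_request_url("add_broker", b)
            for b in reversed(drained_batches)]


drain = drain_urls([1, 2, 3], batch_size=2)
rollback = rollback_urls([[1, 2], [3]])
```

Small batches keep each partition movement short and make it easier to stop between steps if the cluster shows signs of trouble.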

Risks, Rollback, and Canary Releases: The main risk is an unhealthy Cruise Control, so its instances are over-provisioned and heavily monitored. Doubling the number of brokers during migration temporarily increases cost, which is accepted in exchange for a faster migration. Canary migrations are performed on clusters cloned with Kafka MirrorMaker before the full production rollout.

Challenges and Learnings: Cruise Control instances became unhealthy when the cluster had offline partitions, which destabilized the migration, so existing Kafka issues had to be resolved first. Adjusting Cruise Control's configuration (e.g., reducing the back-track window) also helped. The team learned about performance differences between EC2 and Kubernetes deployments, refined resource sizing for Kafka pools, and concluded that the described migration approach was the most efficient for their environment.

Written by

Cloud Native Technology Community
