Migrating Kafka from EC2 to PaaSTA on Kubernetes: Strategy, Implementation, and Lessons Learned
This article details Yelp's end‑to‑end migration of Kafka clusters from EC2‑based brokers to Kubernetes‑based PaaSTA services, covering architectural changes, automated tooling, risk mitigation, rollback procedures, and practical lessons learned from the deployment process.
The article, translated from "Kafka on PaaSTA: Running Kafka on Kubernetes at Yelp (Part 2 – Migration)", explains how Yelp migrated existing Kafka clusters running on EC2 to a Kubernetes‑based internal platform called PaaSTA, ensuring zero downtime for producers and consumers.
Background : The original EC2 deployment used Auto Scaling Groups (ASG) with Elastic Load Balancers (ELB) as the cluster entry point, custom rebalance algorithms, and cron jobs for automatic partitioning. The migration replaces three key components – the cluster entry point, the rebalance algorithm, and the automatic partitioning logic – with Yelp’s service mesh, Cruise Control, and Tron jobs respectively.
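The comparison between the two deployments comes down to three rows:
Cluster entry point: ELB on EC2, Yelp's service mesh on PaaSTA.
Balancing: custom rebalance algorithm on EC2, Cruise Control on PaaSTA.
Automatic partitioning: cron jobs on EC2, Tron jobs on PaaSTA.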
Migration Strategy Overview : The goal is a seamless switch from EC2‑compatible components to PaaSTA‑compatible ones without client downtime. The process begins by provisioning a PaaSTA‑based load balancer alongside the existing ELB, updating kafka_discovery files (generated by Puppet) to include the new service‑mesh endpoint, and automating their distribution via cron jobs instead of Puppet.
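As a minimal sketch of the discovery-file update, the snippet below appends a service-mesh endpoint to a cluster's broker list. It assumes the file has already been parsed into a dictionary with the `clusters`, `broker_list`, and `local_config` keys used by the kafka_discovery format. The function name and the `example.invalid` hostnames are hypothetical and are not taken from Yelp's tooling.

```python
import copy


def add_mesh_endpoint(discovery, region, endpoint):
    """Return a copy of the discovery data with `endpoint` added to the region's broker list."""
    updated = copy.deepcopy(discovery)  # leave the caller's data untouched
    brokers = updated["clusters"][region]["broker_list"]
    if endpoint not in brokers:  # keep the update idempotent
        brokers.append(endpoint)
    return updated


# Hypothetical parsed discovery data; hostnames are placeholders.
discovery = {
    "clusters": {
        "uswest1-devc": {
            "broker_list": ["kafka-elb.example.invalid:9092"],
            "zookeeper": "zk.example.invalid:2181/kafka-example-cluster-uswest1-devc",
        }
    },
    "local_config": {"cluster": "uswest1-devc"},
}

updated = add_mesh_endpoint(discovery, "uswest1-devc", "kafka-mesh.example.invalid:9092")
print(updated["clusters"]["uswest1-devc"]["broker_list"])
```

Keeping the old ELB entry alongside the new endpoint means clients that read either version of the file can still reach the cluster while the switch is in progress.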
Example discovery file:
---
clusters:
  uswest1-devc:
    broker_list:
    - kafka-example-cluster-elb-uswest1devc.<omitted>.<omitted>.com:9092
    - kafka-example-cluster-elb-uswest1devc.<omitted>.<omitted>.com:9092
    zookeeper: xx.xx.xx.xxx:2181,xx.xx.xx.xxx:2181,xx.xx.xx.xxx:2181/kafka-example-cluster-uswest1-devc
local_config:
  cluster: uswest1-devc
...
After the new components are in place, caches are cleared to avoid stale discovery data. The migration then proceeds in phases:
1. Deploy a dedicated Cruise Control instance with self‑healing disabled to avoid conflicts with the existing rebalance algorithm.
2. Launch a PaaSTA Kafka instance while keeping the original EC2 brokers running, effectively doubling the cluster size.
3. Once the PaaSTA brokers are healthy, create the __CruiseControlMetrics topic and disable the old automatic rebalance.
4. Use Cruise Control’s REST API to gradually move partitions from EC2 brokers to PaaSTA brokers, removing EC2 brokers from the ASG as they become empty.
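The last phase relies on two simple checks: draining brokers in small groups, and knowing when an EC2 broker no longer hosts any replica. The sketch below illustrates both under stated assumptions. The partition assignment is a hand-written dictionary rather than data fetched from Kafka, and the function names and broker ids are hypothetical.

```python
def empty_brokers(assignment, ec2_brokers):
    """Return the EC2 broker ids that no longer host any partition replica."""
    in_use = {broker for replicas in assignment.values() for broker in replicas}
    return sorted(set(ec2_brokers) - in_use)


def batches(broker_ids, size):
    """Yield broker ids in fixed-size groups so partitions are moved gradually."""
    ids = sorted(broker_ids)
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


# Hypothetical state part-way through the migration:
# brokers 1-3 run on EC2, brokers 101-102 run on PaaSTA.
assignment = {
    ("events", 0): [101, 102],
    ("events", 1): [2, 101],
}
ec2 = [1, 2, 3]

print(empty_brokers(assignment, ec2))  # brokers safe to remove from the ASG
print(list(batches(ec2, 2)))           # order in which brokers would be drained
```

In this example broker 2 still holds a replica of one partition, so only brokers 1 and 3 would be candidates for removal from the ASG.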
Rollback is achieved by reversing the steps using Cruise Control’s add_broker API instead of remove_broker, and Terraform‑managed AWS resources can be reverted with a simple git revert.
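The symmetry between the forward path and the rollback can be seen in the request URLs. The sketch below only builds the URLs and sends nothing. It assumes Cruise Control's `remove_broker` and `add_broker` POST endpoints with their `brokerid` and `dryrun` parameters, served under the `/kafkacruisecontrol` path. The host, port, and helper name are hypothetical.

```python
from urllib.parse import urlencode


def cruise_control_url(base_url, endpoint, broker_ids, dryrun=True):
    """Build the URL for a Cruise Control broker add/remove request."""
    if endpoint not in ("remove_broker", "add_broker"):
        raise ValueError(f"unsupported endpoint: {endpoint}")
    query = urlencode(
        {
            "brokerid": ",".join(str(b) for b in broker_ids),
            "dryrun": str(dryrun).lower(),
        },
        safe=",",  # keep the comma-separated broker list readable
    )
    return f"{base_url.rstrip('/')}/kafkacruisecontrol/{endpoint}?{query}"


# Hypothetical Cruise Control address.
base = "http://cruise-control.example.invalid:9090"

forward = cruise_control_url(base, "remove_broker", [1, 2])  # drain EC2 brokers
rollback = cruise_control_url(base, "add_broker", [1, 2])    # move data back
print(forward)
print(rollback)
```

Defaulting to `dryrun=true` lets an operator inspect the proposed partition movements before issuing the same request with `dryrun=False`.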
Risks, Rollback, and Canary Releases : The main risk is the health of Cruise Control; therefore, instances are over‑provisioned and heavily monitored. Temporary cost increases arise from doubling the number of brokers during migration, but this is accepted to achieve faster migration. Canary migrations are performed using Kafka MirrorMaker to clone clusters before full production rollout.
Challenges and Learnings : Cruise Control instances became unhealthy when the cluster had offline partitions, which made the migration unstable; the underlying Kafka issues had to be resolved before proceeding. Adjusting Cruise Control’s configuration (e.g., reducing the back‑track window) also helped. The team learned how performance differs between the EC2 and Kubernetes deployments, refined resource sizing for the Kafka pools, and concluded that the migration approach described here was the most efficient one for their environment.
