
How Yelp Re‑engineered Kafka on Kubernetes with PaaSTA: Architecture Deep Dive

Yelp migrated its massive Kafka deployment from EC2 to a Kubernetes‑based PaaSTA platform, introducing a custom Kafka operator and Cruise Control to cut provisioning time, simplify upgrades, improve lifecycle management, and reduce reliance on Puppet, while maintaining high‑throughput data pipelines.

Cloud Native Technology Community

Yelp processes hundreds of billions of messages daily with Kafka and recently overhauled its deployment by running Kafka on its internal PaaSTA platform, which is built on Kubernetes. The new architecture leverages a custom Kafka Kubernetes operator and LinkedIn’s open‑source Cruise Control for lifecycle management.

Architecture Improvements and Motivation

Previously, all Kafka clusters ran on dedicated EC2 instances managed via Puppet, and creating a new cluster took over two hours. The redesign aimed to:

Reduce dependence on slow Puppet runs.

Promote internal adoption of PaaSTA and its CLI tools to boost productivity.

Improve maintainability of the lifecycle management system.

Simplify OS host upgrades and Kafka version upgrades.

Streamline creation of new Kafka clusters using the same deployment model as other services.

Accelerate broker decommissioning, simplify host‑failure recovery, and enable EBS volume re‑attachment to save network resources and costs.

Because Yelp already had experience running stateful workloads (e.g., Cassandra, Flink) on PaaSTA, it was a natural choice for Kafka.

The new deployment uses PaaSTA pools as the underlying infrastructure. Kafka broker pods are scheduled on Kubernetes nodes with detachable EBS volumes. Two key components are the Kafka operator and Cruise Control, each deployed per cluster.

Key differences from the old architecture include containerized Kafka brokers, removal of Puppet‑based configuration, and a unified YAML‑driven configuration pipeline that Jenkins propagates to the clusters.

Example PaaSTA configuration for a 15‑broker Kafka 2.4.1 cluster:

example-test-prod:
  deploy_group: prod.everything
  pool: kafka
  brokers: 15
  cpus: 5.7  # CPU unit reservation breakdown: (5.7 (kafka) + 0.1 (hacheck) + 0.1 (sensu)) + 0.1 (kiam) = 6.0
  mem: 26Gi
  data: 910Gi
  storage_class: gp2
  cluster_type: example
  cluster_name: test-prod
  use_cruise_control: true
  cruise_control_port: 12345
  service_name: kafka-2-4-1
  zookeeper:
    cluster_name: test-prod
    chroot: kafka-example-test-prod
    cluster_type: kafka_example_test
  config:
    unclean.leader.election.enable: "false"
    reserved.broker.max.id: "2113929216"
    request.timeout.ms: "300001"
    replica.fetch.max.bytes: "10485760"
    offsets.topic.segment.bytes: "104857600"
    offsets.retention.minutes: "10080"
    offsets.load.buffer.size: "15728640"
    num.replica.fetchers: "3"
    num.network.threads: "5"
    num.io.threads: "5"
    min.insync.replicas: "2"
    message.max.bytes: "1000000"
    log.segment.bytes: "268435456"
    log.roll.jitter.hours: "1"
    log.roll.hours: "22"
    log.retention.hours: "24"
    log.message.timestamp.type: "LogAppendTime"
    log.message.format.version: "2.4-IV1"
    log.cleaner.enable: "true"
    log.cleaner.threads: "3"
    log.cleaner.dedupe.buffer.size: "536870912"
    inter.broker.protocol.version: "2.4-IV1"
    group.max.session.timeout.ms: "300000"
    delete.topic.enable: "true"
    default.replication.factor: "3"
    connections.max.idle.ms: "3600000"
    confluent.support.metrics.enable: "false"
    auto.create.topics.enable: "false"
    transactional.id.expiration.ms: "86400000"
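
A few of the values in the example above depend on one another, and that can be checked mechanically. The following is a minimal Python sketch, not part of Yelp's tooling: the dictionary layout, the `other_cpus` key, and the function names are hypothetical, and the two consistency rules are general Kafka practice rather than anything the article states.

```python
# Hypothetical sanity checks over a hand-copied subset of the example config.
# The dictionary layout and the "other_cpus" key are assumptions for illustration.
config = {
    "brokers": 15,
    "cpus": 5.7,
    # Additional reservations listed in the inline comment of the example.
    "other_cpus": {"hacheck": 0.1, "sensu": 0.1, "kiam": 0.1},
    "kafka": {
        "default.replication.factor": "3",
        "min.insync.replicas": "2",
    },
}


def pod_cpu_total(cfg):
    """Sum the Kafka reservation and the other reservations from the comment."""
    return round(cfg["cpus"] + sum(cfg["other_cpus"].values()), 2)


def check(cfg):
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []
    replication = int(cfg["kafka"]["default.replication.factor"])
    min_isr = int(cfg["kafka"]["min.insync.replicas"])
    if cfg["brokers"] < replication:
        problems.append("fewer brokers than the default replication factor")
    if min_isr >= replication:
        # With min.insync.replicas equal to the replication factor, losing a
        # single replica blocks producers that use acks=all.
        problems.append("min.insync.replicas leaves no room for a broker loss")
    return problems


print(pod_cpu_total(config), check(config))
```

The CPU total reproduces the arithmetic in the `cpus` comment; the second check reflects the common rule of keeping `min.insync.replicas` below the replication factor.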

New Architecture Details

The Kafka Kubernetes operator manages the desired state of Kafka clusters. While ZooKeeper still stores metadata, broker data resides on persistent disks attached to the pods, making Kafka a stateful application in Kubernetes. Because Kubernetes lacks Kafka‑specific primitives, the operator acts as a custom controller that watches custom resources and interacts with the Cruise Control API to reconcile differences.
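
The controller pattern described above boils down to comparing a declared state with an observed one and deriving an action. The Python sketch below illustrates only that comparison step. `ClusterSpec`, `plan`, the action names, and the rule of removing the highest broker IDs first are all assumptions made for illustration; the article does not show how Yelp's operator is implemented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClusterSpec:
    """Hypothetical desired state, as it might be declared in a custom resource."""
    brokers: int


def plan(spec, running_broker_ids):
    """Compare desired and observed broker counts and return one action.

    Returns a (name, payload) tuple:
      ("noop", None)                    when the states already match
      ("add_brokers", count_to_add)     when brokers are missing
      ("remove_brokers", [broker_ids])  when there are surplus brokers
    """
    actual = len(running_broker_ids)
    if actual < spec.brokers:
        return ("add_brokers", spec.brokers - actual)
    if actual > spec.brokers:
        # Assumption: surplus brokers are the ones with the highest IDs.
        return ("remove_brokers", sorted(running_broker_ids)[spec.brokers:])
    return ("noop", None)


print(plan(ClusterSpec(brokers=2), [0, 1, 2, 3]))
```

A real controller would run this comparison every time the watched resource or the observed pods change, then hand the resulting action to Cruise Control or to the Kubernetes API.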

Cruise Control, an open‑source Kafka cluster management system from LinkedIn, reduces operational overhead by providing APIs for health checks, partition rebalancing, and broker addition/removal. Each Kafka cluster runs its own Cruise Control instance, and the operator invokes these APIs to perform lifecycle actions.

Both the operator and Cruise Control follow a similar pattern: they monitor cluster state, build an internal model, detect anomalies, and issue corrective actions via their respective APIs, replacing the previous ad‑hoc EC2‑based scripts that interacted with AWS services such as SNS and SQS.

Combined, these components form a complete architecture: a Custom Resource Definition (CRD) describes the Kafka cluster, the operator creates Kafka broker pods from a custom Docker image, and Cruise Control ensures balanced partitions and smooth scaling operations. Users can observe and interact with the cluster through the Cruise Control UI or the PaaSTA CLI.

A scale-down scenario illustrates the workflow. A developer updates the CRD to reduce the broker count. The operator detects the mismatch and asks Cruise Control to remove the specified brokers, and Cruise Control returns a task ID. The operator then annotates the affected pods for decommission and monitors the task's progress via the API. Once the task completes, the operator deletes the pods, reconciling the actual state with the desired state.
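
The scale-down steps above can be sketched end to end in Python. Everything here is hypothetical: `FakeCruiseControl` is an in-memory stand-in rather than Cruise Control's real REST API, the annotation key is invented, and choosing the highest broker IDs for removal is an assumption.

```python
import itertools


class FakeCruiseControl:
    """In-memory stand-in for Cruise Control (hypothetical, not its real API)."""

    def __init__(self, polls_until_done=2):
        self._ids = itertools.count(1)
        self._polls_until_done = polls_until_done
        self._remaining = {}

    def remove_brokers(self, broker_ids):
        """Accept a removal request and return a task ID to poll."""
        task_id = f"task-{next(self._ids)}"
        self._remaining[task_id] = self._polls_until_done
        return task_id

    def task_done(self, task_id):
        """Pretend the task finishes after a fixed number of polls."""
        self._remaining[task_id] -= 1
        return self._remaining[task_id] <= 0


def scale_down(desired, pods, cruise_control):
    """Shrink `pods` (broker_id -> annotations dict) to `desired` brokers.

    Mutates `pods` in place and returns the list of removed broker IDs.
    """
    surplus = sorted(pods)[desired:]  # assumption: remove the highest IDs
    if not surplus:
        return []
    task_id = cruise_control.remove_brokers(surplus)
    for broker_id in surplus:
        # Hypothetical annotation key marking the pod for decommission.
        pods[broker_id]["decommission-task"] = task_id
    while not cruise_control.task_done(task_id):
        pass  # a real operator would requeue and check later, not spin
    for broker_id in surplus:
        del pods[broker_id]  # delete pods only after the task has completed
    return surplus


pods = {broker_id: {} for broker_id in range(5)}
removed = scale_down(3, pods, FakeCruiseControl())
print(removed, sorted(pods))
```

Deleting the pods only after the task reports completion mirrors the ordering in the text: the removal task has to finish before any broker pod goes away.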

What Happens After the Design?

Following the architecture redesign, Yelp built a migration process to move Kafka clusters from EC2 to PaaSTA seamlessly. Numerous clusters have already been migrated, and the team continues to fine‑tune hardware selections to match varying cluster characteristics.

The next article will detail the step‑by‑step strategy for migrating existing EC2‑based Kafka clusters to the Kubernetes‑based internal compute platform.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
