
How to Achieve High Availability for Kafka Across Data Centers: Architectures, Trade‑offs, and Solutions

This article explains Kafka's cross‑data‑center high‑availability options, compares stretched and connected cluster designs, outlines typical failure scenarios, and reviews both community and commercial replication solutions, helping architects choose the most suitable deployment for their specific requirements.


Background

Kafka is a widely used messaging middleware that often serves as a core component in data pipelines, making high availability a critical concern. Organizations increasingly need Kafka to remain available across multiple data centers, whether in private IDC, public cloud, or hybrid environments.

Key Terminology

RTO (Recovery Time Objective): Maximum acceptable downtime during a failover.

RPO (Recovery Point Objective): Maximum acceptable data loss during a failure.

Disaster Recovery (DR): Strategies that allow an application to recover from a regional outage.

High Availability (HA): Ability of a system to continue operating despite failures, including across regions.

Typical Failure Scenarios

Common failure cases that can affect Kafka include:

Single‑node failure – loss of a broker or its VM.

Rack or switch failure – loss of all nodes in a rack.

Data‑center (DC) failure – loss of all nodes in a DC.

Regional outage – loss of an entire region, including all of its availability zones.

Global incidents (DNS, routing) – complete service interruption.

Human error – accidental or malicious actions that corrupt data.

Different scenarios require different mitigation strategies, ranging from simple cluster replication to more complex multi‑DC architectures.

Kafka Architecture Overview

Kafka consists of Producers, Consumers, Brokers, and ZooKeeper. Brokers host topics, which are split into partitions, each kept redundant through replicas. Within a single data center, high availability comes from partition replication across brokers; cross‑DC requirements, however, often exceed what a single‑DC cluster can provide.
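
For instance, the usual intra‑DC safety net is a topic with three replicas, created with the stock kafka-topics tool; the topic name, partition count, and broker address below are placeholders:

  bin/kafka-topics.sh --create \
    --bootstrap-server broker1:9092 \
    --topic orders \
    --partitions 6 \
    --replication-factor 3

As long as at least one replica of each partition survives, the topic's data survives the loss of any single broker.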

Cross‑Data‑Center Deployment Models

Two primary models are used:

Stretched Cluster – a single logical cluster spanning multiple data centers. Producers use acks=all together with a suitable min.insync.replicas so that every write is acknowledged by brokers in at least two locations. When inter‑DC latency is low, this model delivers an RPO of zero and a near‑zero RTO, since failover is automatic (a producer sketch follows this list).

Connected Cluster – separate clusters in each data center linked by asynchronous replication tools (e.g., MirrorMaker, Replicator). RTO and RPO depend on network latency and replication lag and are therefore greater than zero.
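
To make the stretched model concrete, here is a minimal Java producer sketch. The topic name, broker addresses, and String serializers are illustrative placeholders; the load‑bearing settings are acks=all on the producer and min.insync.replicas=2 on the topic or broker.

  import java.util.Properties;
  import org.apache.kafka.clients.producer.KafkaProducer;
  import org.apache.kafka.clients.producer.ProducerRecord;

  public class StretchedClusterProducer {
      public static void main(String[] args) {
          Properties props = new Properties();
          // Bootstrap brokers drawn from more than one data center (placeholders).
          props.put("bootstrap.servers", "dc1-broker:9092,dc2-broker:9092");
          props.put("key.serializer",
              "org.apache.kafka.common.serialization.StringSerializer");
          props.put("value.serializer",
              "org.apache.kafka.common.serialization.StringSerializer");
          // Wait for all in-sync replicas; with min.insync.replicas=2 and replicas
          // spread across zones, every acknowledged write exists in two locations.
          props.put("acks", "all");
          props.put("enable.idempotence", "true"); // no duplicates on internal retries

          try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
              producer.send(new ProducerRecord<>("orders", "order-42", "created"),
                  (metadata, exception) -> {
                      // A NotEnoughReplicasException here means fewer than
                      // min.insync.replicas replicas are currently reachable.
                      if (exception != null) exception.printStackTrace();
                  });
          }
      }
  }

The trade‑off is latency: each acknowledged write now includes at least one cross‑DC round trip, which is why the stretched model only suits data centers that are close together.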

Specific Stretched‑Cluster Topologies

Various stretched‑cluster designs balance cost, latency, and fault tolerance:

2‑AZ Stretched Cluster: Two availability zones with synchronous replication. Because the surviving zone alone cannot satisfy min.insync.replicas, a zone failure requires manual configuration changes, resulting in non‑zero RTO and RPO.

2.5‑AZ Stretched Cluster: Two full zones plus a lightweight third zone. With three replicas (a 2+1 distribution) and acks=all, failure of the lightweight zone does not affect writes, while loss of both full zones leaves the cluster readable but unable to accept writes.

3‑AZ Stretched Cluster: Three zones with three replicas, one per zone. Any single zone can fail with zero RTO and RPO, making this the simplest and most robust design.

All these designs rely on configuring min.insync.replicas and acks=all to guarantee that at least two data‑center replicas acknowledge each write.
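
A minimal broker‑side sketch of this pattern, assuming one broker per zone and placeholder zone names az1/az2/az3; broker.rack drives Kafka's rack‑aware replica placement so that each zone holds one replica:

  # server.properties on each broker
  broker.rack=az1                        # az2 / az3 on the brokers in the other zones
  default.replication.factor=3           # rack awareness spreads one replica per zone
  min.insync.replicas=2                  # every write must reach at least two zones
  unclean.leader.election.enable=false   # never elect an out-of-sync replica as leader

With these settings, a producer using acks=all keeps working through the loss of any single zone, at the cost of a cross‑zone round trip on every write.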

Connected‑Cluster Topologies

Connected clusters use asynchronous mirroring tools:

Disaster‑Recovery (DR) Architecture: A primary cluster replicates asynchronously to a standby cluster using MirrorMaker 2 (MM2). Failover requires redirecting producers and consumers to the standby site; RTO depends on how quickly the standby can be "warmed up" (see the offset‑translation sketch after this list).

Active‑Active (Dual‑Active) Architecture: Multiple data centers each run independent clusters that replicate to each other, so any site can serve traffic. This model introduces challenges such as replication lag, duplicate consumption, and configuration complexity.
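
For the DR failover step, MM2 emits checkpoints that can be used to translate a consumer group's committed offsets onto the standby cluster. A hedged Java sketch using RemoteClusterUtils from the connect-mirror-client artifact; the cluster alias "primary", the group name, and the broker address are placeholders:

  import java.time.Duration;
  import java.util.HashMap;
  import java.util.Map;
  import org.apache.kafka.clients.consumer.KafkaConsumer;
  import org.apache.kafka.clients.consumer.OffsetAndMetadata;
  import org.apache.kafka.common.TopicPartition;
  import org.apache.kafka.connect.mirror.RemoteClusterUtils;

  public class DrFailover {
      public static void main(String[] args) throws Exception {
          Map<String, Object> props = new HashMap<>();
          props.put("bootstrap.servers", "standby-broker:9092"); // placeholder address
          props.put("group.id", "my-consumer-group");
          props.put("key.deserializer",
              "org.apache.kafka.common.serialization.StringDeserializer");
          props.put("value.deserializer",
              "org.apache.kafka.common.serialization.StringDeserializer");

          // Read MM2's checkpoints on the standby cluster and map the group's
          // committed offsets from the "primary" cluster to standby equivalents.
          Map<TopicPartition, OffsetAndMetadata> translated =
              RemoteClusterUtils.translateOffsets(
                  props, "primary", "my-consumer-group", Duration.ofSeconds(30));

          try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
              // With the default replication policy, mirrored topics on the standby
              // carry the source alias as a prefix, e.g. "primary.orders".
              consumer.assign(translated.keySet());
              translated.forEach((tp, om) -> consumer.seek(tp, om.offset()));
              // Resume polling here; the group continues close to where it left off.
          }
      }
  }

Because replication is asynchronous, records committed on the primary after the last checkpoint may be reprocessed or missing after failover, which is exactly the non‑zero RPO described above.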

Community Replication Solutions

Open‑source tools include:

MirrorMaker 1 (MM1): Basic asynchronous mirroring with several limitations (no ACL synchronization, no exactly‑once semantics, manual partition management).

MirrorMaker 2 (MM2): Built on Kafka Connect; resolves most MM1 issues, synchronizes topics, configurations, ACLs, and consumer offsets, and supports complex topologies (e.g., A↔B, A→C, chain replication).

uReplicator (Uber): An improved MM1 variant that uses Apache Helix for partition assignment and exposes a REST API for topic management.

MM2 configuration details can be found in the official Kafka documentation (KIP‑382).
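
As a rough reference, here is a minimal MM2 properties sketch for the active‑active case described above, run with the stock connect-mirror-maker.sh driver; the cluster aliases, addresses, and topic pattern are illustrative:

  # mm2.properties, launched with: bin/connect-mirror-maker.sh mm2.properties
  clusters = dc1, dc2
  dc1.bootstrap.servers = dc1-broker:9092
  dc2.bootstrap.servers = dc2-broker:9092

  # replicate in both directions (active-active)
  dc1->dc2.enabled = true
  dc2->dc1.enabled = true
  dc1->dc2.topics = .*
  dc2->dc1.topics = .*

  # translate and sync consumer-group offsets to the remote cluster
  sync.group.offsets.enabled = true

The default replication policy renames mirrored topics with the source alias as a prefix (dc1.orders on dc2), which is what keeps records from bouncing between the two clusters indefinitely.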

Commercial Replication Solutions

Vendors provide more integrated products:

Confluent Replicator: Copies topics, configurations, and offsets between clusters via the Connect framework. It offers monitoring metrics but does not migrate ACLs.

Confluent Cluster Linking: Links clusters directly, replicating topics with identical partitions and offsets; this removes the need for a separate Connect cluster and reduces latency.

Confluent Multi‑Region Cluster (MRC): Combines synchronous and asynchronous replication within one cluster, introduces "Observer" replicas to reduce producer latency while preserving consistency, and supports automatic observer promotion.

Summary

Kafka cross‑data‑center high availability can be achieved with either a stretched (synchronous) cluster, which offers zero RTO/RPO but requires low inter‑DC latency, or a connected (asynchronous) cluster, which is more flexible but incurs non‑zero RTO/RPO and additional operational complexity. Architects should evaluate failure scenarios, latency requirements, and management overhead to select the appropriate topology. Open‑source tools like MirrorMaker 2 provide a solid baseline, while commercial offerings such as Confluent Cluster Linking or Multi‑Region Cluster add features and simplify operations for production‑grade deployments.

Tags: high availability, Kafka, replication, cross‑data‑center, MirrorMaker 2, connected cluster, stretched cluster
Written by Tencent Cloud Middleware

Official account of Tencent Cloud Middleware. Focuses on microservices, messaging middleware, and other cloud‑native technology trends, publishing product updates, case studies, and technical insights. Regularly hosts tech salons to share effective solutions.