
Master Kafka High Availability: Replica Sync & Disaster Recovery Strategies

This article provides a comprehensive guide to building enterprise‑grade, highly available Kafka clusters, covering architecture design, hardware planning, production‑level broker configurations, ISR management, monitoring, fault‑tolerance procedures, rolling upgrades, capacity planning, and automation scripts for seamless operations.


Kafka Cluster High Availability Deployment and Practical Configuration: Replica Sync and Disaster Recovery

Foreword: In the era of big data, Kafka's high availability directly impacts the stability of the entire data architecture. This article explores practical deployment solutions, including replica mechanisms, fault recovery strategies, and best practices for production environments.

Architecture Design: Building Enterprise‑Grade High‑Availability Kafka Clusters

Core Architecture Planning

For enterprise deployments, run an odd number of nodes (3 or 5) so the ZooKeeper ensemble always holds a clear quorum majority during leader election.

# Recommended architecture configuration
cluster_size: 3-5 nodes (odd)
replication_factor: 3 (minimum recommended)
partition_strategy: adjust dynamically based on throughput
network_topology: rack‑aware deployment
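
Before bringing brokers up, it is worth confirming the ZooKeeper ensemble actually has an elected leader. A minimal sketch using ZooKeeper's standard srvr four-letter command (the zk-node hostnames are placeholders; on ZooKeeper 3.5+ the command must be allowed via 4lw.commands.whitelist):

# Check which ensemble member is leader (hostnames are assumptions)
for zk in zk-node1 zk-node2 zk-node3; do
  mode=$(echo srvr | nc "$zk" 2181 | awk '/^Mode:/{print $2}')
  echo "$zk: ${mode:-unreachable}"
done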

Hardware Resource Planning

Resource recommendations for different workloads:

High‑throughput (TB‑scale per day): CPU 16+ cores, Memory 32 GB+ (heap 6‑8 GB), SSD RAID10 10 TB+ per node, 10 GbE network.

Medium load: CPU 8 cores, Memory 16 GB (heap 4‑6 GB), SSD 2 TB+, 1 GbE network.

Practical Production‑Level Parameter Tuning

Broker Core Configuration

# ===== High‑Availability Core Settings =====
# Cluster identification
broker.id=1
listeners=PLAINTEXT://kafka-node1:9092
advertised.listeners=PLAINTEXT://kafka-node1:9092

# Replica settings (critical)
default.replication.factor=3
min.insync.replicas=2
unclean.leader.election.enable=false

# Log settings
log.dirs=/data/kafka-logs
num.network.threads=8
num.io.threads=16
socket.send.buffer.bytes=102400
socket.receive.buffer.bytes=102400
socket.request.max.bytes=104857600

# Replica sync settings
replica.lag.time.max.ms=30000
replica.fetch.max.bytes=1048576
num.replica.fetchers=4

# Failure detection settings
zookeeper.connection.timeout.ms=18000
zookeeper.session.timeout.ms=18000
controller.socket.timeout.ms=30000
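
With default.replication.factor=3 and min.insync.replicas=2, a partition remains writable under acks=all as long as at most one replica is down. For topics created before these defaults were in place, the same guarantee can be applied per topic; a sketch (topic name is a placeholder):

kafka-configs.sh --bootstrap-server localhost:9092 \
  --entity-type topics --entity-name your-topic \
  --alter --add-config min.insync.replicas=2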

JVM Tuning Parameters

# kafka-server-start.sh JVM configuration
export KAFKA_HEAP_OPTS="-Xmx6g -Xms6g"
export KAFKA_JVM_PERFORMANCE_OPTS="
  -server
  -XX:+UseG1GC
  -XX:MaxGCPauseMillis=20
  -XX:InitiatingHeapOccupancyPercent=35
  -XX:+ExplicitGCInvokesConcurrent
  -XX:MaxInlineLevel=15
  -Djava.awt.headless=true
  -XX:+HeapDumpOnOutOfMemoryError
  -XX:HeapDumpPath=/var/log/kafka/
"

Deep Dive into Replica Synchronization Mechanism

ISR (In‑Sync Replicas) Management

The ISR set is the core mechanism behind Kafka high availability; understanding how it grows and shrinks is essential for operating a cluster.

# View ISR status for a topic
kafka-topics.sh --bootstrap-server localhost:9092 \
  --describe --topic your-topic

# Monitor ISR shrinkage: list partitions whose ISR has fallen below the full replica set
kafka-topics.sh --bootstrap-server localhost:9092 \
  --describe --under-replicated-partitions
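
ISR shrink and expand rates are also exposed over JMX. A sketch using the JmxTool class that ships with Kafka 2.x (this assumes the broker was started with JMX_PORT=9999):

kafka-run-class.sh kafka.tools.JmxTool \
  --object-name kafka.server:type=ReplicaManager,name=IsrShrinksPerSec \
  --jmx-url service:jmx:rmi:///jndi/rmi://localhost:9999/jmxrmi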

Replica Lag Monitoring

#!/bin/bash
# replica-lag-monitor.sh – monitor replica lag
KAFKA_HOME="/opt/kafka"
BOOTSTRAP_SERVERS="kafka-node1:9092,kafka-node2:9092,kafka-node3:9092"

# Get lag info for all topics; output lines look like "... Max lag is N for partition ..."
$KAFKA_HOME/bin/kafka-replica-verification.sh \
  --broker-list $BOOTSTRAP_SERVERS \
  --time -1 | grep -i "max lag" | while read -r line; do
    # Extract the numeric lag that follows "lag is"
    lag=$(echo "$line" | grep -oEi 'lag is [0-9]+' | awk '{print $3}')
    if [ -n "$lag" ] && [ "$lag" -gt 1000 ]; then
      echo "WARNING: High replica lag detected: $line"
      curl -X POST "http://alert-manager:9093/api/v1/alerts" \
        -H "Content-Type: application/json" \
        -d "[{\"labels\":{\"alertname\":\"KafkaHighReplicaLag\",\"severity\":\"warning\"}}]"
    fi
  done
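
Note that kafka-replica-verification.sh keeps running and reports at intervals, so this script is best supervised rather than launched per cron run; a simple approach (paths are assumptions):

nohup /opt/scripts/replica-lag-monitor.sh >> /var/log/replica-lag.log 2>&1 &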

Disaster Recovery Playbook

Common Failure Scenarios and Handling

Scenario 1: Broker Node Crash

Symptoms: the ISR shrinks and some partition leaders fail over to other brokers.

# 1. Verify node status
systemctl status kafka
journalctl -u kafka -f

# 2. Check cluster health
kafka-broker-api-versions.sh --bootstrap-server kafka-node2:9092

# 3. Restart the failed node
systemctl restart kafka

# 4. Verify node re‑joins the cluster
kafka-topics.sh --bootstrap-server localhost:9092 --describe
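
After the broker rejoins, leadership does not always move back on its own if auto.leader.rebalance.enable is off. On Kafka 2.4+ the preferred leaders can be restored explicitly; a sketch:

# Move leadership back to the preferred replicas after recovery
kafka-leader-election.sh --bootstrap-server localhost:9092 \
  --election-type preferred --all-topic-partitions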

Scenario 2: Split‑Brain Issue

Prevention Configuration:

# Broker side
min.insync.replicas=2

# Producer side
acks=all
retries=2147483647
max.in.flight.requests.per.connection=1

Recovery Script:

#!/bin/bash
# split-brain-recovery.sh – stop all brokers, clear stale registrations, restart
for node in kafka-node1 kafka-node2 kafka-node3; do
  ssh $node "systemctl stop kafka"
done

# Remove stale broker registrations (on ZooKeeper 3.5+ use "deleteall" instead of "rmr")
zkCli.sh -server zk-node1:2181 <<EOF
rmr /brokers/ids
quit
EOF

for node in kafka-node1 kafka-node2 kafka-node3; do
  ssh $node "systemctl start kafka"
  sleep 30
done
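
Once all brokers are back, confirm that each one re-registered; a quick check against ZooKeeper:

# Expect one ephemeral znode per live broker, e.g. [1, 2, 3]
zkCli.sh -server zk-node1:2181 ls /brokers/ids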

Scenario 3: Data Inconsistency Repair

# Use kafka-reassign-partitions for data repair
cat > reassignment.json <<EOF
{
  "version": 1,
  "partitions": [
    {"topic": "your-topic", "partition": 0, "replicas": [1,2,3]}
  ]
}
EOF

kafka-reassign-partitions.sh --bootstrap-server localhost:9092 \
  --reassignment-json-file reassignment.json --execute

kafka-reassign-partitions.sh --bootstrap-server localhost:9092 \
  --reassignment-json-file reassignment.json --verify

Monitoring and Alerting System

Key Metrics Monitoring

Prometheus + Grafana monitoring setup:

# prometheus-kafka-exporter configuration
global:
  scrape_interval: 15s
scrape_configs:
  - job_name: 'kafka'
    static_configs:
      - targets: ['kafka-node1:9308','kafka-node2:9308','kafka-node3:9308']

# Critical metrics
- kafka_server_replicamanager_underreplicatedpartitions
- kafka_server_replicamanager_isrexpands_per_sec
- kafka_server_replicamanager_isrshrinks_per_sec
- kafka_server_brokertopicmetrics_messages_in_per_sec
- kafka_server_kafkarequesthandlerpool_requesthandleravgidlepercent
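
Before wiring up dashboards, a cheap spot check that the exporters are actually serving these series (this assumes an exporter listening on port 9308, as configured above):

curl -s http://kafka-node1:9308/metrics | grep -i underreplicated | head -5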

Automated Alert Scripts

#!/bin/bash
# kafka-health-check.sh – automated health check
KAFKA_HOME="/opt/kafka"
LOG_FILE="/var/log/kafka-health.log"
ALERT_THRESHOLD=2   # alert when two or more issues are found in one pass

check_cluster_health() {
  local unhealthy=0
  for broker in kafka-node1:9092 kafka-node2:9092 kafka-node3:9092; do
    if ! timeout 5 bash -c "echo > /dev/tcp/${broker/:/\/}"; then
      echo "$(date): Broker $broker is unreachable" >> $LOG_FILE
      ((unhealthy++))
    fi
  done

  under_replicated=$($KAFKA_HOME/bin/kafka-topics.sh --bootstrap-server kafka-node1:9092 \
    --describe --under-replicated-partitions | grep -c "Partition:")
  if [ "$under_replicated" -gt 0 ]; then
    echo "$(date): Found $under_replicated under‑replicated partitions" >> $LOG_FILE
    ((unhealthy++))
  fi

  if [ $unhealthy -ge $ALERT_THRESHOLD ]; then
    send_alert "Kafka cluster health degraded: $unhealthy issues detected"
  fi
}

send_alert() {
  curl -X POST "https://oapi.dingtalk.com/robot/send?access_token=YOUR_TOKEN" \
    -H "Content-Type: application/json" \
    -d "{\"msgtype\": \"text\", \"text\": {\"content\": \"$1\"}}"
}

while true; do
  check_cluster_health
  sleep 60
done
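
Rather than running the loop from a stray shell, it can be supervised; a minimal systemd unit sketch (the script path is an assumption):

# /etc/systemd/system/kafka-health.service
[Unit]
Description=Kafka cluster health check loop
After=network-online.target

[Service]
ExecStart=/opt/scripts/kafka-health-check.sh
Restart=always

[Install]
WantedBy=multi-user.target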

Advanced Operations Techniques

Rolling Upgrade Strategy

#!/bin/bash
KAFKA_VERSION_NEW="2.8.1"
NODES=("kafka-node1" "kafka-node2" "kafka-node3")

for node in "${NODES[@]}"; do
  echo "Upgrading $node..."
  ssh $node "systemctl stop kafka"
  ssh $node "cp -r /opt/kafka/config /opt/kafka/config.backup.$(date +%Y%m%d)"
  ssh $node "wget -O /tmp/kafka_2.13-${KAFKA_VERSION_NEW}.tgz https://archive.apache.org/dist/kafka/2.8.1/kafka_2.13-${KAFKA_VERSION_NEW}.tgz"
  ssh $node "tar -xzf /tmp/kafka_2.13-${KAFKA_VERSION_NEW}.tgz -C /opt/"
  ssh $node "ln -sfn /opt/kafka_2.13-${KAFKA_VERSION_NEW} /opt/kafka"
  ssh $node "systemctl start kafka"
  sleep 30
  kafka-broker-api-versions.sh --bootstrap-server $node:9092
  echo "$node upgrade completed, waiting before next node..."
  sleep 60
done
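
Note that Kafka's official rolling-upgrade procedure is two-phase: pin inter.broker.protocol.version (and, pre-3.0, log.message.format.version) to the old version before swapping binaries, then raise it and perform a second rolling restart once every broker runs the new release. A sketch of the second phase (version numbers are examples, and the property is assumed to already exist in server.properties):

# Phase 2 (after all brokers run the new binaries) – on each broker, one at a time:
sed -i 's/^inter.broker.protocol.version=.*/inter.broker.protocol.version=2.8/' \
  /opt/kafka/config/server.properties
systemctl restart kafka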

Performance Tuning Checklist

Disk I/O Optimization:

# Check disk I/O performance
iostat -x 1 10
# Optimize filesystem parameters
echo 'vm.swappiness=1' >> /etc/sysctl.conf
echo 'vm.dirty_ratio=80' >> /etc/sysctl.conf
echo 'vm.dirty_background_ratio=5' >> /etc/sysctl.conf
sysctl -p
# SSD scheduler (prefer noop; on multi-queue kernels the equivalent is "none")
echo noop > /sys/block/sda/queue/scheduler
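
To validate that the data disk can sustain the expected write rate, a quick sequential-write benchmark with fio is a reasonable sanity check (directory and sizes are assumptions; run it before the broker goes live, not against a loaded production disk):

fio --name=kafka-seq-write --directory=/data/kafka-logs \
    --rw=write --bs=1M --size=2G --numjobs=1 --direct=1 --group_reporting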

Network Parameter Tuning:

# TCP tuning
echo 'net.core.rmem_default=262144' >> /etc/sysctl.conf
echo 'net.core.rmem_max=16777216' >> /etc/sysctl.conf
echo 'net.core.wmem_default=262144' >> /etc/sysctl.conf
echo 'net.core.wmem_max=16777216' >> /etc/sysctl.conf
sysctl -p

Capacity Planning and Scaling

Capacity Evaluation Model

#!/usr/bin/env python3
# kafka-capacity-planner.py

def calculate_storage_requirements(message_size_kb, messages_per_second, retention_days, replication_factor=3, compression_ratio=0.7):
    """Calculate Kafka storage requirements"""
    daily_data_gb = (message_size_kb * messages_per_second * 86400) / (1024 * 1024)
    total_data_gb = daily_data_gb * retention_days * compression_ratio
    cluster_storage_gb = total_data_gb * replication_factor
    recommended_storage_gb = cluster_storage_gb * 1.2
    return {
        'daily_data_gb': daily_data_gb,
        'total_logical_data_gb': total_data_gb,
        'cluster_storage_requirement_gb': cluster_storage_gb,
        'recommended_storage_gb': recommended_storage_gb
    }

result = calculate_storage_requirements(message_size_kb=2, messages_per_second=10000, retention_days=7)
print(f"Cluster storage requirement: {result['recommended_storage_gb']:.2f} GB")

Online Scaling Practice

#!/bin/bash
# online-scaling.sh – add a new broker
NEW_BROKER_ID=4
NEW_BROKER_HOST="kafka-node4"

# Prepare new broker config
cat > /tmp/new-broker-config.properties <<EOF
broker.id=${NEW_BROKER_ID}
listeners=PLAINTEXT://${NEW_BROKER_HOST}:9092
log.dirs=/data/kafka-logs
zookeeper.connect=zk-node1:2181,zk-node2:2181,zk-node3:2181/kafka
EOF

scp /tmp/new-broker-config.properties ${NEW_BROKER_HOST}:/opt/kafka/config/server.properties
ssh ${NEW_BROKER_HOST} "systemctl start kafka"

# Create the topics-to-move file (documented format for --topics-to-move-json-file)
cat > topics.json <<EOF
{"topics": [{"topic": "your-topic"}], "version": 1}
EOF

# Generate a reassignment plan. The output prints BOTH the current assignment
# (save it for rollback) and the proposed one; copy the proposed JSON into
# reassignment.json before running the execute step below.
kafka-reassign-partitions.sh --bootstrap-server localhost:9092 \
  --topics-to-move-json-file topics.json \
  --broker-list "1,2,3,4" --generate

# Execute reassignment
kafka-reassign-partitions.sh --bootstrap-server localhost:9092 \
  --reassignment-json-file reassignment.json --execute --throttle 50000000

# Monitor progress; once the reassignment completes, --verify also removes the throttle
watch "kafka-reassign-partitions.sh --bootstrap-server localhost:9092 \
  --reassignment-json-file reassignment.json --verify"

Conclusion and Best Practices

By following the detailed guidance above, operators can establish a complete Kafka high‑availability operation system. Key takeaways include:

Deploy an odd number of nodes to ensure Zookeeper election stability.

Set replication factor to 3 (minimum) and configure min.insync.replicas=2 for data consistency.

Use rack‑aware placement for disaster resilience.

Fine‑tune JVM heap (6 GB) and critical broker parameters.

Monitor ISR shrinkage, replica lag, and under‑replicated partitions.

Automate health checks, alerts, rolling upgrades, and capacity planning.

Mastering these practices equips engineers to maintain stable, reliable Kafka clusters in production environments.

Tags: Monitoring, operations, Kafka, scaling, high-availability, replica-sync, disaster-recovery
Written by MaGe Linux Operations

Founded in 2009, MaGe Education is a leading Chinese high‑end IT training brand that has trained tens of thousands of students, with graduates earning 12K+ RMB salaries. It offers courses in Linux cloud operations, Python full‑stack development, automation, data analysis, AI, and Go high‑concurrency architecture, and its course quality and reputation have earned it talent partnerships with numerous internet firms.
