
Kafka at Trillion-Scale: Ensuring High Availability, Performance, and Operational Best Practices


vivo Internet Technology

This article, authored by the vivo Internet Server Team, provides a comprehensive overview of operating a Kafka cluster when daily traffic reaches the trillion‑record level or higher. It outlines the capabilities required to maintain high availability, reliability, performance, throughput, and security.

The discussion focuses on Kafka 2.1.1 and covers topics such as version upgrades, data migration, traffic throttling, monitoring and alerting, load balancing, resource isolation, security authentication, disaster recovery, parameter tuning, platformization, performance evaluation, future roadmap, and community contribution.

1.1 Version Upgrade

The article explains how to perform rolling upgrades and rollbacks of the open-source version, and how to keep modified and unmodified nodes compatible when both run in the cluster during the rollout.

Official documentation: http://kafka.apache.org
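The official rolling-upgrade procedure pins the inter-broker protocol version until every broker runs the new binaries, then bumps it in a second rolling restart. A sketch of the relevant server.properties entries (the 1.1 source version here is an illustrative assumption):

```properties
# During the first rolling restart: new binaries, old protocol/format
inter.broker.protocol.version=1.1   # keep the old version until all brokers are upgraded
log.message.format.version=1.1      # keep the old format while old clients remain
# After all brokers are upgraded and stable, set the following and restart again:
# inter.broker.protocol.version=2.1
```

Keeping the protocol pinned is also what makes a rollback possible mid-upgrade: brokers still on the old binaries can talk to upgraded ones.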

1.2 Data Migration

Kafka’s built‑in script bin/kafka-reassign-partitions.sh can be used for manual rebalancing, but for large clusters an automated solution is recommended. Two migration plan examples are provided:

{
    "version":1,
    "partitions":[
        {"topic":"yyj4","partition":0,"replicas":[1000003,1000004]},
        {"topic":"yyj4","partition":1,"replicas":[1000003,1000004]},
        {"topic":"yyj4","partition":2,"replicas":[1000003,1000004]}
    ]
}

Plan with explicit log directories:

{
    "version":1,
    "partitions":[
        {"topic":"yyj1","partition":0,"replicas":[1000006,1000005],"log_dirs":["/data1/bigdata/mydata1","/data1/bigdata/mydata3"]},
        {"topic":"yyj1","partition":1,"replicas":[1000006,1000005],"log_dirs":["/data1/bigdata/mydata1","/data1/bigdata/mydata3"]},
        {"topic":"yyj1","partition":2,"replicas":[1000006,1000005],"log_dirs":["/data1/bigdata/mydata1","/data1/bigdata/mydata3"]}
    ]
}

Migration can take place between brokers or between disks within a single broker, and can run serially or concurrently. The open-source 2.1.1 release only supports serial migration; concurrent migration requires upgrading to Kafka 2.6.0 or later, or custom code changes.
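As a sketch of the manual path, the built-in script applies a plan like the ones above; the plan file name, ZooKeeper address, and throttle value here are illustrative:

```shell
# Start the reassignment, throttling replication traffic to 50 MB/s
bin/kafka-reassign-partitions.sh --zookeeper zk1:2181 \
  --reassignment-json-file migration-plan.json \
  --throttle 52428800 --execute

# Poll until every partition reports "completed successfully";
# --verify also removes the throttle once migration finishes
bin/kafka-reassign-partitions.sh --zookeeper zk1:2181 \
  --reassignment-json-file migration-plan.json --verify
```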

1.3 Traffic Limiting

To protect the cluster from sudden spikes, production and consumption traffic can be throttled per user, client‑id, or default settings. Configuration paths such as /config/users/<user>/clients/<client-id> are described. JMX metrics for producer and consumer byte‑rate and throttle‑time are used for dynamic adjustments.

Official documentation: http://kafka.apache.org
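For example, a per-user/client quota lands under the /config/users/&lt;user&gt;/clients/&lt;client-id&gt; path when written with the built-in tool; the user and client names, byte rates, and ZooKeeper address below are illustrative:

```shell
# Limit user1/clientA to 10 MB/s on both the produce and consume side
bin/kafka-configs.sh --zookeeper zk1:2181 --alter \
  --add-config 'producer_byte_rate=10485760,consumer_byte_rate=10485760' \
  --entity-type users --entity-name user1 \
  --entity-type clients --entity-name clientA
```

The broker then reports throttle-time via JMX for clients that exceed these rates, which is what the dynamic-adjustment loop described above reads.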

1.4 Monitoring & Alerting

Open‑source tools (Kafka Manager, Kafka Eagle, Kafka Monitor, KafkaOffsetMonitor) are mentioned, but a custom, fine‑grained monitoring platform is recommended. Monitoring dimensions include hardware, OS, broker service, client applications, Zookeeper, and end‑to‑end pipelines.

Key broker parameters: network threads, I/O threads, replica fetchers, compression type, request queue size, GC settings, etc. Sample configuration snippets:

num.network.threads           # recommended: CPU cores * 2
num.io.threads                # recommended: number of disks * 2
num.replica.fetchers          # recommended: CPU cores / 4
compression.type              # use lz4
queued.max.requests           # >= 500 in production
auto.leader.rebalance.enable  # set to false

Client‑side metrics reporter example:

import java.util.Properties;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.common.serialization.IntegerSerializer;
import org.apache.kafka.common.serialization.StringSerializer;

Properties props = new Properties();
props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "");  // broker address list elided in the original
props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, IntegerSerializer.class.getName());
props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
// ClientMetricsReporter is a custom class implementing org.apache.kafka.common.metrics.MetricsReporter
props.put(ProducerConfig.METRIC_REPORTER_CLASSES_CONFIG, ClientMetricsReporter.class.getName());
...

1.5 Resource Isolation

Physical isolation of business groups within the same cluster is described, using resource groups (e.g., Group1, Group2) to prevent one business’s traffic surge from affecting another.

1.6 Cluster Classification

Clusters are categorized by workload (log, monitoring, billing, search, offline, online) to avoid cross‑impact.

1.7 Scaling (Expand/Shrink)

Guidelines for adding or removing broker nodes, including intelligent evaluation, automated planning, and platform‑driven execution.

1.8 Load Balancing

Manual and automated rebalancing using tools like LinkedIn Cruise Control are discussed. The migration plan generation should consider core metrics (in/out traffic, rack awareness, partition dispersion) and avoid moving already balanced partitions.
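If Cruise Control is deployed, a rebalance can be previewed and then triggered through its REST API; the host and port below are illustrative:

```shell
# Dry run: returns the proposed partition movements without executing them
curl -X POST "http://cruise-control:9090/kafkacruisecontrol/rebalance?dryrun=true"

# Execute the rebalance once the proposal looks safe
curl -X POST "http://cruise-control:9090/kafkacruisecontrol/rebalance?dryrun=false"
```

The dry run is what lets an automated planner apply the checks described above (traffic, rack awareness, dispersion) before any partition actually moves.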

1.9 Security Authentication

ACLs for producers, consumers, and data‑directory migrations are required. Documentation: http://kafka.apache.org
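A minimal sketch of granting produce and consume permissions with the built-in tool on 2.1.1; the principals, consumer group, and ZooKeeper address are illustrative:

```shell
# Allow User:writer to produce to topic yyj4
bin/kafka-acls.sh --authorizer-properties zookeeper.connect=zk1:2181 \
  --add --allow-principal User:writer --producer --topic yyj4

# Allow User:reader to consume from yyj4 with consumer group g1
bin/kafka-acls.sh --authorizer-properties zookeeper.connect=zk1:2181 \
  --add --allow-principal User:reader --consumer --topic yyj4 --group g1
```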

1.10 Disaster Recovery

Cross‑rack and cross‑cluster replication (MirrorMaker 2.0) are suggested for high availability.
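MirrorMaker 2.0 ships with Kafka 2.4 and later; a minimal mm2.properties sketch for one-way cross-cluster replication, with cluster names and broker addresses chosen for illustration:

```properties
clusters = primary, backup
primary.bootstrap.servers = primary-broker:9092
backup.bootstrap.servers = backup-broker:9092
primary->backup.enabled = true
primary->backup.topics = .*     # replicate all topics to the backup cluster
```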

1.11 Parameter & Configuration Optimization

Key broker and client parameters are listed to improve throughput and latency.

1.12 Linux Server Tuning

File‑handle limits, page‑cache settings, and other OS‑level tweaks are recommended (see the linked article on PageCache tuning).
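Typical OS-level settings for a Kafka broker look like the following; the values are illustrative starting points, not the article's exact configuration:

```shell
ulimit -n 100000                        # raise the per-process file-handle limit
sysctl -w vm.dirty_background_ratio=5   # start background page-cache writeback earlier
sysctl -w vm.dirty_ratio=60             # allow more dirty pages before writers block
sysctl -w vm.swappiness=1               # keep the broker's memory out of swap
```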

1.13 Platformization

Operations such as configuration management, rolling restarts, cluster management, mock data generation, permission management, and automated scaling are advocated to reduce manual SSH work.

1.14 Performance Evaluation

Evaluations at broker, topic‑partition, disk, and cluster scale levels help guide capacity planning and threshold settings.
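The perf tools bundled with Kafka can drive such evaluations; the topic name, record counts, and broker address below are illustrative:

```shell
# Producer-side throughput test: one million 1 KB records, unthrottled
bin/kafka-producer-perf-test.sh --topic perf-test \
  --num-records 1000000 --record-size 1024 --throughput -1 \
  --producer-props bootstrap.servers=broker1:9092 acks=1

# Consumer-side test reading the same records back
bin/kafka-consumer-perf-test.sh --broker-list broker1:9092 \
  --topic perf-test --messages 1000000
```

Running the same test at different partition counts and disk layouts gives the per-broker and per-disk ceilings used to set alerting thresholds.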

1.15 Network Architecture (DNS + LVS)

For very large clusters, using a DNS name backed by LVS load balancers simplifies client bootstrap configuration.

2 Open‑Source Version Limitations

Identified gaps in Kafka 2.1.1 include lack of incremental migration, concurrent migration, migration termination, and bugs related to data‑directory migrations. Workarounds and custom patches are described.

3 Future Trends

Links to Kafka community roadmap, KIP‑500 (ZooKeeper removal), KIP‑630 (Raft controller), tiered storage, partition reduction, MirrorMaker2 exactly‑once, and other upcoming features.

4 Contributing to the Community

Guidance on where to submit patches, wiki contributions, issue trackers, and a list of main committers.

Tags: data migration, monitoring, performance, load balancing, Kafka, security, high availability