Kafka at Trillion-Scale: Ensuring High Availability, Performance, and Operational Best Practices
The article presents a comprehensive guide for running Kafka at trillion‑record daily traffic, detailing version upgrades, data migration, traffic throttling, monitoring, load balancing, resource isolation, security, disaster recovery, Linux tuning, platform automation, performance evaluation, future roadmap, and community contribution practices.
This article, authored by the vivo Internet Server Team, provides a comprehensive overview of operating a Kafka cluster when daily traffic reaches the trillion‑record level or higher. It outlines the capabilities required to maintain high availability, reliability, performance, throughput, and security.
The discussion focuses on Kafka 2.1.1 and covers topics such as version upgrades, data migration, traffic throttling, monitoring and alerting, load balancing, resource isolation, security authentication, disaster recovery, parameter tuning, platformization, performance evaluation, future roadmap, and community contribution.
1.1 Version Upgrade
The article explains how to perform rolling upgrades and rollbacks for the open‑source version, and how to handle source code modifications when mixing upgraded and unmodified nodes.
Official documentation: http://kafka.apache.org
1.2 Data Migration
Kafka’s built‑in script bin/kafka-reassign-partitions.sh can be used for manual rebalancing, but for large clusters an automated solution is recommended. Two migration plan examples are provided:
{
"version":1,
"partitions":[
{"topic":"yyj4","partition":0,"replicas":[1000003,1000004]},
{"topic":"yyj4","partition":1,"replicas":[1000003,1000004]},
{"topic":"yyj4","partition":2,"replicas":[1000003,1000004]}
]
}
Plan with explicit log directories:
{
"version":1,
"partitions":[
{"topic":"yyj1","partition":0,"replicas":[1000006,1000005],"log_dirs":["/data1/bigdata/mydata1","/data1/bigdata/mydata3"]},
{"topic":"yyj1","partition":1,"replicas":[1000006,1000005],"log_dirs":["/data1/bigdata/mydata1","/data1/bigdata/mydata3"]},
{"topic":"yyj1","partition":2,"replicas":[1000006,1000005],"log_dirs":["/data1/bigdata/mydata1","/data1/bigdata/mydata3"]}
]
}
Migration can be broker-to-broker, intra-broker (disk-to-disk), or concurrent. The open-source version supports only serial migration; concurrency requires upgrading to Kafka 2.6.0 or custom code changes.
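In practice the script is driven in three phases; a minimal sketch, assuming a ZooKeeper-based 2.1.x cluster (topic name, broker IDs, throttle rate, and addresses below are illustrative, and the commands are printed rather than executed since they need a live cluster):

```shell
#!/usr/bin/env sh
# Phase 0: describe which topics to move.
cat > topics-to-move.json <<'EOF'
{"version":1,"topics":[{"topic":"yyj4"}]}
EOF

# --generate proposes a plan, --execute applies it under a replication
# throttle (bytes/sec) so the move does not starve normal traffic, and
# --verify also clears the throttle once replicas are back in sync.
echo "bin/kafka-reassign-partitions.sh --zookeeper zk:2181 \
  --topics-to-move-json-file topics-to-move.json \
  --broker-list 1000003,1000004 --generate"
echo "bin/kafka-reassign-partitions.sh --zookeeper zk:2181 \
  --reassignment-json-file plan.json --throttle 50000000 --execute"
echo "bin/kafka-reassign-partitions.sh --zookeeper zk:2181 \
  --reassignment-json-file plan.json --verify"
```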
1.3 Traffic Limiting
To protect the cluster from sudden spikes, production and consumption traffic can be throttled per user, client‑id, or default settings. Configuration paths such as /config/users/<user>/clients/<client-id> are described. JMX metrics for producer and consumer byte‑rate and throttle‑time are used for dynamic adjustments.
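Those quota paths are normally written via kafka-configs.sh rather than edited in ZooKeeper directly; a hedged sketch (the entity names and byte rates are illustrative, and the command is printed rather than executed because it needs a live cluster):

```shell
#!/usr/bin/env sh
# Sets a 10 MB/s produce and 20 MB/s consume quota for client "clientA"
# of user "user1", i.e. the path /config/users/user1/clients/clientA.
echo "bin/kafka-configs.sh --zookeeper zk:2181 --alter \
  --add-config 'producer_byte_rate=10485760,consumer_byte_rate=20971520' \
  --entity-type users --entity-name user1 \
  --entity-type clients --entity-name clientA" | tee quota-cmd.txt
```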
Official documentation: http://kafka.apache.org
1.4 Monitoring & Alerting
Open‑source tools (Kafka Manager, Kafka Eagle, Kafka Monitor, KafkaOffsetMonitor) are mentioned, but a custom, fine‑grained monitoring platform is recommended. Monitoring dimensions include hardware, OS, broker service, client applications, Zookeeper, and end‑to‑end pipelines.
Key broker metrics: network threads, I/O threads, replica fetchers, compression type, request queues, GC, etc. Sample configuration snippets:
num.network.threads # recommended = CPU cores * 2
num.io.threads # recommended = number of disks * 2
num.replica.fetchers # recommended = CPU cores / 4
compression.type # use lz4
queued.max.requests # >= 500 in production
auto.leader.rebalance.enable # set to false
Client-side metrics reporter example:
import java.util.Properties;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.common.serialization.IntegerSerializer;
import org.apache.kafka.common.serialization.StringSerializer;

Properties props = new Properties();
props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "");
props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, IntegerSerializer.class.getName());
props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
// ClientMetricsReporter is a custom class implementing org.apache.kafka.common.metrics.MetricsReporter
props.put(ProducerConfig.METRIC_REPORTER_CLASSES_CONFIG, ClientMetricsReporter.class.getName());
...
1.5 Resource Isolation
Physical isolation of business groups within the same cluster is described, using resource groups (e.g., Group1, Group2) to prevent one business’s traffic surge from affecting another.
1.6 Cluster Classification
Clusters are categorized by workload (log, monitoring, billing, search, offline, online) to avoid cross‑impact.
1.7 Scaling (Expand/Shrink)
Guidelines for adding or removing broker nodes, including intelligent evaluation, automated planning, and platform‑driven execution.
1.8 Load Balancing
Manual and automated rebalancing using tools like LinkedIn Cruise Control are discussed. The migration plan generation should consider core metrics (in/out traffic, rack awareness, partition dispersion) and avoid moving already balanced partitions.
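For the automated route, the core metrics listed above map naturally onto Cruise Control's goal configuration; a minimal sketch of cruisecontrol.properties, assuming the project's stock goal classes (the goal names come from Cruise Control itself, not the article):

```properties
# Goals are satisfied in priority order; this subset mirrors the
# metrics above: rack awareness, capacity, and replica dispersion.
default.goals=com.linkedin.kafka.cruisecontrol.analyzer.goals.RackAwareGoal,\
  com.linkedin.kafka.cruisecontrol.analyzer.goals.ReplicaCapacityGoal,\
  com.linkedin.kafka.cruisecontrol.analyzer.goals.DiskCapacityGoal,\
  com.linkedin.kafka.cruisecontrol.analyzer.goals.NetworkInboundCapacityGoal,\
  com.linkedin.kafka.cruisecontrol.analyzer.goals.NetworkOutboundCapacityGoal,\
  com.linkedin.kafka.cruisecontrol.analyzer.goals.ReplicaDistributionGoal
```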
1.9 Security Authentication
ACLs for producers, consumers, and data‑directory migrations are required. Documentation: http://kafka.apache.org
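Grants of this kind are typically applied with kafka-acls.sh; a minimal sketch (the principals, topic, and group are illustrative, and the commands are printed rather than executed since they need a live cluster):

```shell
#!/usr/bin/env sh
# --producer / --consumer are convenience flags that expand to the
# WRITE/READ/DESCRIBE operations each role needs.
{
echo "bin/kafka-acls.sh --authorizer-properties zookeeper.connect=zk:2181 \
  --add --allow-principal User:appA --producer --topic yyj4"
echo "bin/kafka-acls.sh --authorizer-properties zookeeper.connect=zk:2181 \
  --add --allow-principal User:appB --consumer --topic yyj4 --group groupB"
} | tee acl-cmds.txt
```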
1.10 Disaster Recovery
Cross‑rack and cross‑cluster replication (MirrorMaker 2.0) are suggested for high availability.
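A MirrorMaker 2.0 replication flow is declared in a properties file; a minimal sketch, assuming an MM2 deployment from Kafka 2.4 or later (cluster aliases and bootstrap addresses are illustrative):

```properties
# mm2.properties: replicate everything from cluster A to cluster B.
clusters = A, B
A.bootstrap.servers = kafka-a1:9092,kafka-a2:9092
B.bootstrap.servers = kafka-b1:9092,kafka-b2:9092
A->B.enabled = true
A->B.topics = .*
replication.factor = 3
```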
1.11 Parameter & Configuration Optimization
Key broker and client parameters are listed to improve throughput and latency.
1.12 Linux Server Tuning
File‑handle limits, page‑cache settings, and other OS‑level tweaks are recommended (see the linked article on PageCache tuning).
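A hedged sketch of the usual OS-level knobs (the values are common starting points for write-heavy Kafka brokers, not the article's figures):

```properties
# /etc/sysctl.conf -- page cache and swap behaviour
vm.swappiness = 1
vm.dirty_background_ratio = 5
vm.dirty_ratio = 60
fs.file-max = 2000000

# /etc/security/limits.conf -- per-process file handles
# *  soft  nofile  1000000
# *  hard  nofile  1000000
```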
1.13 Platformization
Operations such as configuration management, rolling restarts, cluster management, mock data generation, permission management, and automated scaling are advocated to reduce manual SSH work.
1.14 Performance Evaluation
Evaluations at broker, topic‑partition, disk, and cluster scale levels help guide capacity planning and threshold settings.
1.15 Network Architecture (DNS + LVS)
For very large clusters, using a DNS name backed by LVS load balancers simplifies client bootstrap configuration.
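From the client's perspective this reduces bootstrap configuration to one stable name; a sketch (the hostname is illustrative):

```properties
# Clients bootstrap through the DNS name fronting LVS; after the
# initial metadata fetch they connect directly to the brokers.
bootstrap.servers=kafka.example.com:9092
```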
2 Open‑Source Version Limitations
Identified gaps in Kafka 2.1.1 include lack of incremental migration, concurrent migration, migration termination, and bugs related to data‑directory migrations. Workarounds and custom patches are described.
3 Future Trends
Links to Kafka community roadmap, KIP‑500 (ZooKeeper removal), KIP‑630 (Raft controller), tiered storage, partition reduction, MirrorMaker2 exactly‑once, and other upcoming features.
4 Contributing to the Community
Guidance on where to submit patches, wiki contributions, issue trackers, and a list of main committers.
vivo Internet Technology
Sharing practical vivo Internet technology insights and salon events, plus the latest industry news and hot conferences.