
Tencent Cloud Kafka Automated Operations Practices

Tencent Cloud senior engineer Yang Yuan explains how the managed Kafka service handles version diversity, resource allocation, dynamic scaling, broker addition and removal, and partition migration. The approach combines versioned clusters, a bin-packing-style placement algorithm, and penalty weighting, with predictive scheduling planned for the future, at a scale of trillions of messages and peaks of billions of messages per minute.

Tencent Cloud Developer

The speaker, Yang Yuan, a senior engineer in Tencent Cloud Infrastructure, presented "Tencent Cloud Kafka Automated Operations Practice", covering the problems encountered while operating Kafka and the solutions adopted.

Tencent Cloud Kafka is a highly scalable, high-throughput cloud service based on Apache Kafka. Users do not need to deploy or maintain clusters; the service offers instance-based billing, dynamic scaling, and integration with big-data suites and cloud storage.

The service currently handles trillions of messages, over 10 PB of traffic, with peak rates of billions of messages per minute, thousands of brokers, hundreds of clusters, and thousands of topics.

Five major challenge categories are identified:

Multiple Kafka versions used by different customers.

Efficient allocation of instance resources across cloud nodes.

Dynamic instance scaling (up/down) based on user needs.

Decision‑making for broker addition or removal.

Partition creation, expansion, and migration criteria.

To address version heterogeneity, multiple versioned clusters are deployed, and a unified message format conversion layer is added so users do not need to know the underlying version.

For broker selection, a bin‑packing‑like algorithm evaluates bandwidth and disk capacity of candidate brokers. Instances are assigned to brokers to keep bandwidth and disk utilization in a 1:1 ratio, and a penalty mechanism reduces the weight of nodes that cannot accommodate the smallest future instance.
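
The broker selection described above can be sketched roughly as follows. This is a minimal illustration, not the service's actual formula: the `Broker` fields, the balance score, the `PENALTY` factor, and the smallest-instance size are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Broker:
    name: str
    bw_total: float    # bandwidth capacity, MB/s
    bw_used: float
    disk_total: float  # disk capacity, GB
    disk_used: float


# Assumed values for illustration; none of these come from the talk.
MIN_INSTANCE_BW = 40.0     # bandwidth of the smallest sellable instance
MIN_INSTANCE_DISK = 300.0  # disk of the smallest sellable instance
PENALTY = 0.5              # weight multiplier for fragment-producing placements


def placement_score(b: Broker, bw: float, disk: float) -> Optional[float]:
    """Score placing an instance (bw, disk) on broker b; None if it does not fit."""
    new_bw = b.bw_used + bw
    new_disk = b.disk_used + disk
    if new_bw > b.bw_total or new_disk > b.disk_total:
        return None
    # Highest score when bandwidth and disk utilization are equal (1:1 ratio).
    score = 1.0 - abs(new_bw / b.bw_total - new_disk / b.disk_total)
    # Penalize placements whose leftover cannot hold the smallest instance.
    left_bw = b.bw_total - new_bw
    left_disk = b.disk_total - new_disk
    if left_bw < MIN_INSTANCE_BW or left_disk < MIN_INSTANCE_DISK:
        score *= PENALTY
    return score


def pick_broker(brokers: List[Broker], bw: float, disk: float) -> Optional[Broker]:
    """Return the feasible broker with the highest score, or None."""
    best, best_score = None, -1.0
    for b in brokers:
        s = placement_score(b, bw, disk)
        if s is not None and s > best_score:
            best, best_score = b, s
    return best
```

With two brokers that end up equally balanced after placement, the penalty breaks the tie in favor of the one that can still host another minimum-size instance.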

Instance scaling (upgrade) is handled by checking whether the current broker has sufficient residual resources. If not, the service either migrates the instance to a new broker or distributes its load across additional nodes, using the same resource‑allocation formula.
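
The upgrade decision above might look like the sketch below. The function name `plan_upgrade`, the `(bandwidth, disk)` tuples, and the simple best-fit and greedy rules are assumptions standing in for the resource-allocation formula, which the talk summary does not spell out.

```python
from typing import Dict, List, Tuple

Spec = Tuple[float, float]  # (bandwidth, disk)


def plan_upgrade(current: str, free: Dict[str, Spec],
                 old_spec: Spec, new_spec: Spec) -> Tuple[str, List[str]]:
    """Decide how to grow an instance from old_spec to new_spec.

    free maps broker name -> (free bandwidth, free disk).
    Returns (action, brokers involved).
    """
    extra_bw = new_spec[0] - old_spec[0]
    extra_disk = new_spec[1] - old_spec[1]
    cur_bw, cur_disk = free[current]

    # 1. Enough residual resources on the current broker: grow in place.
    if extra_bw <= cur_bw and extra_disk <= cur_disk:
        return "in_place", [current]

    # 2. Another broker can host the whole upgraded instance: migrate.
    others = {n: f for n, f in free.items() if n != current}
    fits = [n for n, (bw, disk) in others.items()
            if bw >= new_spec[0] and disk >= new_spec[1]]
    if fits:
        # Placeholder best-fit rule: the candidate with the least free bandwidth.
        return "migrate", [min(fits, key=lambda n: others[n][0])]

    # 3. Otherwise spread the shortfall greedily over additional brokers.
    need_bw = max(0.0, extra_bw - cur_bw)
    need_disk = max(0.0, extra_disk - cur_disk)
    chosen = [current]
    for name, (bw, disk) in sorted(others.items(),
                                   key=lambda kv: kv[1][0], reverse=True):
        if need_bw <= 0 and need_disk <= 0:
            break
        take_bw, take_disk = min(need_bw, bw), min(need_disk, disk)
        if take_bw <= 0 and take_disk <= 0:
            continue  # this broker cannot contribute anything useful
        need_bw -= take_bw
        need_disk -= take_disk
        chosen.append(name)
    if need_bw > 0 or need_disk > 0:
        return "add_broker", []  # the pool cannot satisfy the upgrade
    return "spread", chosen
```

The ordering of the three steps reflects the least disruptive option first: growing in place moves no data, while migration and spreading both do.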

Node addition occurs when resource pools cannot satisfy new instance sales or when fragmentation arises; node removal happens after instance shrinkage or when broker failures occur.
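
The node-addition triggers can be illustrated with a small check. This sketch covers only the addition side and assumes one particular reading of "fragmentation": the pool has enough free resources in aggregate, but no single broker can host the requested instance. The function name and return labels are hypothetical.

```python
from typing import List, Tuple

Spec = Tuple[float, float]  # (bandwidth, disk)


def should_add_broker(free: List[Spec], demand: Spec) -> Tuple[bool, str]:
    """Return (add a broker?, reason) for a new instance of size demand.

    free lists the (free bandwidth, free disk) of each broker in the pool.
    """
    bw, disk = demand
    # Some broker can host the instance as-is: no new node needed.
    if any(f_bw >= bw and f_disk >= disk for f_bw, f_disk in free):
        return False, "fits"
    total_bw = sum(f_bw for f_bw, _ in free)
    total_disk = sum(f_disk for _, f_disk in free)
    # Enough resources overall, but scattered across brokers.
    if total_bw >= bw and total_disk >= disk:
        return True, "fragmentation"
    # The pool as a whole is too small for the sale.
    return True, "pool_exhausted"
```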

Migration decisions prioritize partitions with smaller data size and lower production/consumption rates to minimize impact, and target nodes with higher overall resource utilization to achieve load balance.
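
The partition-selection rule can be sketched as a cost ranking. The `Partition` fields, the linear cost with weights `w_size` and `w_rate`, and the stopping condition based on bandwidth to free are assumptions for illustration; the sketch does not model the choice of nodes.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Partition:
    topic: str
    index: int
    size_gb: float       # data currently stored
    produce_mbps: float  # production rate
    consume_mbps: float  # consumption rate


def migration_cost(p: Partition, w_size: float = 1.0, w_rate: float = 1.0) -> float:
    """Lower cost means less impact: small data size and low traffic."""
    return w_size * p.size_gb + w_rate * (p.produce_mbps + p.consume_mbps)


def pick_partitions(partitions: List[Partition], bw_to_free: float) -> List[Partition]:
    """Pick the cheapest partitions until enough bandwidth would be freed."""
    chosen: List[Partition] = []
    freed = 0.0
    for p in sorted(partitions, key=migration_cost):
        if freed >= bw_to_free:
            break
        chosen.append(p)
        freed += p.produce_mbps + p.consume_mbps
    return chosen
```

Ranking by a single combined cost keeps the rule simple; in practice the relative weights of data size and traffic would need tuning against measured migration times.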

Future outlook includes expanding the scheduling dimensions to CPU, memory, and I/O, improving migration‑time measurement, and developing predictive scheduling that can anticipate scaling needs before problems arise.
