How Huolala Scaled Kafka: From Integrated Design to Cloud‑Native Elastic Architecture
This article chronicles the evolution of Huolala’s Kafka infrastructure—from an integrated compute‑storage design to a separated compute‑storage model with multi‑tenant deployment, and finally to a cloud‑native elastic architecture—detailing the challenges of capacity awareness, alarm configuration, and cost‑effective performance optimization.
Background
As Huolala’s business grew rapidly, traffic and load on its Kafka clusters increased significantly. Kafka is Huolala’s core messaging middleware, and this article details how its architecture has evolved.
Kafka Architecture Evolution
2.1 Kafka Architecture 1.0 – Integrated Compute and Storage
Before 2020, part of Huolala’s Kafka clusters used an integrated compute‑storage architecture, often experiencing cluster jitter due to local disk failures.
2.2 Kafka Architecture 2.0 – Compute‑Storage Separation
All current Kafka clusters have been refactored onto a compute‑storage separation architecture, improving stability, resource utilization, and operational efficiency.
2.2.1 Multi‑tenant Architecture
The clusters are deployed at large scale with multi‑tenant architecture, assigning different instance specifications to different traffic scenarios, greatly improving resource utilization and providing fault isolation.
2.2.2 Automatic Capacity Bottleneck Detection
The clusters span hundreds of nodes across more than 60 ECS instance types, and each instance spec has its own network bandwidth and cloud‑disk throughput limits.
Key metrics: network bandwidth (NetIn/NetOut) and maximum cloud‑disk read/write throughput.
Challenge 1: How to determine whether current traffic poses a capacity risk given many instance types?
Challenge 2: How to configure alerts for different instance specs in a multi‑tenant environment?
Solution:
Identify each ECS instance’s network bandwidth and cloud‑disk throughput limits.
The DMS platform collects instance metadata and calculates each node’s limits in real time.
The DMS alarm system was upgraded to support node‑level percentage‑based alerts.
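The idea behind percentage‑based alerting can be sketched as follows. This is a minimal illustration, not Huolala’s DMS code: the instance specs, limit values, and 70 % warning threshold are all hypothetical, standing in for the metadata the platform collects per ECS type.

```python
# Sketch of node-level percentage alerting against per-spec limits.
# All specs, limits, and thresholds below are illustrative values.
from dataclasses import dataclass

@dataclass
class InstanceSpec:
    name: str
    net_bandwidth_mbs: float    # NIC bandwidth limit, MB/s
    disk_throughput_mbs: float  # cloud-disk read/write cap, MB/s

# Hypothetical per-ECS-type metadata, as a DMS-like platform might store it.
SPECS = {
    "ecs.g6.2xlarge": InstanceSpec("ecs.g6.2xlarge", 320.0, 350.0),
    "ecs.g6.4xlarge": InstanceSpec("ecs.g6.4xlarge", 640.0, 700.0),
}

def utilization(traffic_mbs: float, limit_mbs: float) -> float:
    """Current traffic as a percentage of the spec's hard limit."""
    return 100.0 * traffic_mbs / limit_mbs

def check_node(spec_name: str, net_out_mbs: float, disk_write_mbs: float,
               warn_pct: float = 70.0) -> list[str]:
    """Return an alert for any metric above warn_pct of its spec limit."""
    spec = SPECS[spec_name]
    alerts = []
    for metric, value, limit in [
        ("NetOut", net_out_mbs, spec.net_bandwidth_mbs),
        ("DiskWrite", disk_write_mbs, spec.disk_throughput_mbs),
    ]:
        pct = utilization(value, limit)
        if pct >= warn_pct:
            alerts.append(f"{metric} at {pct:.0f}% of {limit} MB/s limit")
    return alerts

# A node writing 285 MB/s on a 350 MB/s disk is already at ~81% of its cap.
print(check_node("ecs.g6.2xlarge", net_out_mbs=120.0, disk_write_mbs=285.0))
```

Because thresholds are expressed as percentages of each spec’s own limit, one alert rule covers every instance type in a multi‑tenant cluster.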
2.3 Kafka Architecture 3.0 – Cloud‑Native Elastic Architecture
For clusters handling massive data (e.g., security, big data), cloud‑disk performance becomes a bottleneck.
Challenges
Different cloud‑disk performance levels (PL0‑PL3) impose strict throughput caps, risking disk saturation during traffic spikes.
Example: a node on a PL1 disk (350 MB/s cap) sees peak traffic of 285 MB/s, already roughly 81 % of the cap.
Solution: AutoPL Elastic Cloud‑Disk
AutoPL provides a configurable performance model (PL1‑N) allowing fine‑grained throughput settings.
Performance = baseline (PL1) + pre‑configured boost + burst capacity.
Benefits:
Enhanced stability through burst capability.
Improved resource efficiency by customizing throughput, reducing costs up to 50 %.
Higher operational efficiency: DBAs can adjust disk performance without hardware changes.
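The performance model above can be sketched numerically. This is an illustration of the baseline‑plus‑boost‑plus‑burst formula as described in the text, not the cloud provider’s actual AutoPL pricing or limits; the 150 MB/s pre‑configured boost is an assumed value.

```python
# Sketch of the AutoPL model described above:
# effective throughput = PL1 baseline + pre-configured boost + burst capacity.
# The boost and burst figures below are illustrative assumptions.

PL1_BASELINE_MBS = 350.0  # PL1 throughput cap cited in the example above

def effective_throughput(boost_mbs: float, burst_mbs: float) -> float:
    """Total throughput available to an AutoPL disk at peak."""
    return PL1_BASELINE_MBS + boost_mbs + burst_mbs

def utilization_pct(peak_traffic_mbs: float, boost_mbs: float,
                    burst_mbs: float) -> float:
    """Observed peak as a percentage of the disk's total capability."""
    return 100.0 * peak_traffic_mbs / effective_throughput(boost_mbs, burst_mbs)

# The 285 MB/s peak from the earlier example: a hypothetical 150 MB/s
# pre-configured boost lowers utilization from ~81% to ~57%.
print(f"{utilization_pct(285.0, boost_mbs=150.0, burst_mbs=0.0):.0f}%")
```

Because the boost is a configuration knob rather than a disk swap, the same node can be tuned up for traffic spikes and back down afterward, which is where the cost savings come from.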
Conclusion
Through the progression from integrated to separated to cloud‑native elastic architectures, Huolala’s Kafka clusters have achieved higher resource utilization, stability, fault isolation, and elastic scaling. The architecture will continue to evolve with business growth and emerging technologies.