How Huolala Scaled Kafka: From Integrated Design to Cloud‑Native Elastic Architecture
This article chronicles the evolution of Huolala’s Kafka infrastructure—from an integrated compute‑storage design to a separated compute‑storage model with multi‑tenant deployment, and finally to a cloud‑native elastic architecture—detailing the challenges of capacity awareness, alarm configuration, and cost‑effective performance optimization.
Background
As Huolala’s business grew rapidly, traffic and load on its Kafka clusters increased significantly. Kafka is Huolala’s core messaging middleware, and this article details how its architecture has evolved.
Kafka Architecture Evolution
2.1 Kafka Architecture 1.0 – Integrated Compute and Storage
Before 2020, part of Huolala’s Kafka clusters used an integrated compute‑storage architecture, often experiencing cluster jitter due to local disk failures.
2.2 Kafka Architecture 2.0 – Compute‑Storage Separation
All current Kafka clusters have been refactored onto a compute‑storage separation architecture, improving stability, resource utilization, and operational efficiency.
2.2.1 Multi‑tenant Architecture
The clusters are deployed at large scale with multi‑tenant architecture, assigning different instance specifications to different traffic scenarios, greatly improving resource utilization and providing fault isolation.
2.2.2 Automatic Capacity Bottleneck Detection
The clusters span hundreds of nodes across more than 60 ECS instance types, and each instance spec has its own network bandwidth and cloud‑disk throughput limits.
Key metrics: network bandwidth (NetIn/NetOut) and maximum cloud‑disk read/write throughput.
Challenge 1: How to determine whether current traffic poses a capacity risk given many instance types?
Challenge 2: How to configure alerts for different instance specs in a multi‑tenant environment?
Solution:
Identify each ECS instance’s network bandwidth and cloud‑disk throughput limits.
The DMS platform collects instance metadata and calculates each node’s limits in real time.
The DMS alarm system was upgraded to support node‑level percentage‑based alerts.
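The idea behind percentage‑based alerting can be sketched as follows. This is a minimal illustration, not Huolala’s DMS code: the instance specs, limit values, and 70 % warning threshold are all hypothetical, standing in for the metadata the platform collects per ECS type.

```python
# Sketch of node-level percentage alerting against per-spec limits.
# All specs, limits, and thresholds below are illustrative values.
from dataclasses import dataclass

@dataclass
class InstanceSpec:
    name: str
    net_bandwidth_mbs: float    # NIC bandwidth limit, MB/s
    disk_throughput_mbs: float  # cloud-disk read/write cap, MB/s

# Hypothetical per-ECS-type metadata, as a DMS-like platform might store it.
SPECS = {
    "ecs.g6.2xlarge": InstanceSpec("ecs.g6.2xlarge", 320.0, 350.0),
    "ecs.g6.4xlarge": InstanceSpec("ecs.g6.4xlarge", 640.0, 700.0),
}

def utilization(traffic_mbs: float, limit_mbs: float) -> float:
    """Current traffic as a percentage of the spec's hard limit."""
    return 100.0 * traffic_mbs / limit_mbs

def check_node(spec_name: str, net_out_mbs: float, disk_write_mbs: float,
               warn_pct: float = 70.0) -> list[str]:
    """Return an alert for any metric above warn_pct of its spec limit."""
    spec = SPECS[spec_name]
    alerts = []
    for metric, value, limit in [
        ("NetOut", net_out_mbs, spec.net_bandwidth_mbs),
        ("DiskWrite", disk_write_mbs, spec.disk_throughput_mbs),
    ]:
        pct = utilization(value, limit)
        if pct >= warn_pct:
            alerts.append(f"{metric} at {pct:.0f}% of {limit} MB/s limit")
    return alerts

# A node writing 285 MB/s on a 350 MB/s disk is already at ~81% of its cap.
print(check_node("ecs.g6.2xlarge", net_out_mbs=120.0, disk_write_mbs=285.0))
```

Because thresholds are expressed as percentages of each spec’s own limit, one alert rule covers every instance type in a multi‑tenant cluster.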
2.3 Kafka Architecture 3.0 – Cloud‑Native Elastic Architecture
For clusters handling massive data (e.g., security, big data), cloud‑disk performance becomes a bottleneck.
Challenges
Different cloud‑disk performance levels (PL0‑PL3) impose strict throughput caps, risking disk saturation during traffic spikes.
Example: a node on a PL1 disk (350 MB/s cap) sees peak traffic of 285 MB/s, already roughly 81 % of the cap.
Solution: AutoPL Elastic Cloud‑Disk
AutoPL provides a configurable performance model (PL1‑N) allowing fine‑grained throughput settings.
Performance = baseline (PL1) + pre‑configured boost + burst capacity.
Benefits:
Enhanced stability through burst capability.
Improved resource efficiency by customizing throughput, reducing costs up to 50 %.
Higher operational efficiency: DBAs can adjust disk performance without hardware changes.
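The performance model above can be sketched numerically. This is an illustration of the baseline‑plus‑boost‑plus‑burst formula as described in the text, not the cloud provider’s actual AutoPL pricing or limits; the 150 MB/s pre‑configured boost is an assumed value.

```python
# Sketch of the AutoPL model described above:
# effective throughput = PL1 baseline + pre-configured boost + burst capacity.
# The boost and burst figures below are illustrative assumptions.

PL1_BASELINE_MBS = 350.0  # PL1 throughput cap cited in the example above

def effective_throughput(boost_mbs: float, burst_mbs: float) -> float:
    """Total throughput available to an AutoPL disk at peak."""
    return PL1_BASELINE_MBS + boost_mbs + burst_mbs

def utilization_pct(peak_traffic_mbs: float, boost_mbs: float,
                    burst_mbs: float) -> float:
    """Observed peak as a percentage of the disk's total capability."""
    return 100.0 * peak_traffic_mbs / effective_throughput(boost_mbs, burst_mbs)

# The 285 MB/s peak from the earlier example: a hypothetical 150 MB/s
# pre-configured boost lowers utilization from ~81% to ~57%.
print(f"{utilization_pct(285.0, boost_mbs=150.0, burst_mbs=0.0):.0f}%")
```

Because the boost is a configuration knob rather than a disk swap, the same node can be tuned up for traffic spikes and back down afterward, which is where the cost savings come from.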
Conclusion
Through the progression from integrated to separated to cloud‑native elastic architectures, Huolala’s Kafka clusters have achieved higher resource utilization, stability, fault isolation, and elastic scaling. The architecture will continue to evolve with business growth and emerging technologies.