How AutoMQ Transforms Kafka into a Cloud‑Native, Elastic Messaging Service
This article examines the limitations of traditional Kafka in large‑scale deployments and presents AutoMQ’s cloud‑native redesign—detailing its stateless architecture, storage separation, automatic scaling, read/write isolation, performance benchmarks, and real‑world migration case studies that demonstrate reduced latency, higher throughput, and lower resource costs.
Why Traditional Kafka Becomes a Bottleneck
Kafka brokers store partitions on local disks, making the cluster stateful. Adding nodes requires cross-network partition migration, which can take hours to days and causes performance jitter. Cold-read workloads (e.g., log replay, historical back-fill) pollute the page cache and block network threads via the sendfile system call, producing P99 write-latency spikes that affect even unrelated topics. This issue is tracked as KAFKA-7504.
AutoMQ Design and Architectural Advantages
AutoMQ is a diskless Kafka implementation that replaces the local‑disk storage layer with object storage (e.g., PoleFS or S3) while reusing the Apache Kafka source code for compute and protocol handling. The key benefits are:
100 % Kafka protocol compatibility – existing producers, consumers, and SDKs require no code changes (see the producer sketch after this list).
On‑demand scaling with theoretically unlimited throughput – brokers become stateless, so scaling only moves metadata.
Cold‑hot isolation – separate write, hot‑read, and cold‑read paths prevent cold‑read traffic from impacting write latency.
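Because the wire protocol is unchanged, a stock Kafka client works against AutoMQ as-is; in practice only the bootstrap address differs. A minimal sketch is shown below, with a placeholder endpoint and topic name (automq-cluster.example.internal:9092 and demo-topic are assumptions, not real addresses):

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class PlainKafkaProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Only the bootstrap address changes when pointing at an AutoMQ cluster;
        // the endpoint below is a placeholder.
        props.put("bootstrap.servers", "automq-cluster.example.internal:9092");
        props.put("acks", "all");
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        // The rest is ordinary Kafka client usage, untouched.
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("demo-topic", "key", "hello"));
            producer.flush();
        }
    }
}
```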
Cold‑Hot Isolation Design
Three independent data paths are implemented (a routing sketch follows the list):
Write path: Direct I/O bypasses the page cache, ensuring cold reads cannot interfere with writes.
Hot read (real-time): a write-ahead log (WAL) plus an in-memory cache provides millisecond-level latency.
Cold read (catch-up): data is fetched from object storage with prefetch logic, over a dedicated channel that does not compete for write resources.
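As a rough illustration of how a fetch might be dispatched between the hot and cold paths, consider the sketch below. The HotCache and ColdReader interfaces and their methods are hypothetical stand-ins, not AutoMQ's internal API:

```java
import java.nio.ByteBuffer;

/**
 * Hypothetical sketch of hot/cold read routing. All type and method names
 * here are illustrative only and do not reflect AutoMQ's actual internals.
 */
public class ReadPathRouter {

    /** In-memory cache fed by the write-ahead log (hot path). */
    interface HotCache {
        boolean covers(String topicPartition, long offset);
        ByteBuffer read(String topicPartition, long offset, int maxBytes);
    }

    /** Prefetching reader over object storage (cold path). */
    interface ColdReader {
        void prefetch(String topicPartition, long offset, int maxBytes);
        ByteBuffer read(String topicPartition, long offset, int maxBytes);
    }

    private final HotCache hotCache;
    private final ColdReader coldReader;

    public ReadPathRouter(HotCache hotCache, ColdReader coldReader) {
        this.hotCache = hotCache;
        this.coldReader = coldReader;
    }

    public ByteBuffer fetch(String topicPartition, long offset, int maxBytes) {
        // Hot path: recently produced data is still in the WAL-backed cache,
        // so tail consumers get millisecond-level latency.
        if (hotCache.covers(topicPartition, offset)) {
            return hotCache.read(topicPartition, offset, maxBytes);
        }
        // Cold path: catch-up reads are served from object storage on a
        // dedicated channel with prefetch, so they never compete with the
        // write path or pollute the page cache.
        coldReader.prefetch(topicPartition, offset, maxBytes);
        return coldReader.read(topicPartition, offset, maxBytes);
    }
}
```

The important property is that the cold branch never touches the resources used by the write and hot-read branches, which is what keeps cold reads from causing write-latency spikes.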
Performance Evaluation
1. Baseline Benchmark
Test environment: 8-node broker cluster, OpenMessaging Benchmark, acks=all. Two load levels were measured (a latency-sampling sketch follows the figures).
Send latency (ms)
100 MiB/s – Avg 1.28, P50 0.99, P95 1.55, P99 11.98
500 MiB/s – Avg 1.51, P50 0.84, P95 2.83, P99 19.13
End‑to‑end latency (ms)
100 MiB/s – Avg 2.2, P50 2.0, P95 3.0, P99 14.0
500 MiB/s – Avg 22.55, P50 19.0, P95 46.0, P99 65.0
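The figures above were produced with the OpenMessaging Benchmark harness. As a much simpler illustration of how per-record send latency is commonly sampled under acks=all, a hand-rolled probe might look like the following (the endpoint and topic are placeholders, and this is not the benchmark's own code):

```java
import java.util.Properties;
import java.util.concurrent.ConcurrentLinkedQueue;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class SendLatencyProbe {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "automq-cluster.example.internal:9092"); // placeholder
        props.put("acks", "all"); // same durability setting as the benchmark above
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        ConcurrentLinkedQueue<Long> latenciesMicros = new ConcurrentLinkedQueue<>();
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            for (int i = 0; i < 10_000; i++) {
                long start = System.nanoTime();
                // Record the ack round-trip time of each send in its callback.
                producer.send(new ProducerRecord<>("latency-probe", Integer.toString(i), "payload"),
                        (metadata, exception) -> {
                            if (exception == null) {
                                latenciesMicros.add((System.nanoTime() - start) / 1_000);
                            }
                        });
            }
            producer.flush();
        }

        long[] sorted = latenciesMicros.stream().mapToLong(Long::longValue).sorted().toArray();
        System.out.printf("p50=%dus p95=%dus p99=%dus%n",
                sorted[sorted.length / 2],
                sorted[(int) (sorted.length * 0.95)],
                sorted[(int) (sorted.length * 0.99)]);
    }
}
```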
2. Cold‑Read Isolation Test
Setup: a 2-node cluster with a producer running at 100 MiB/s accumulated 100 GiB of backlog (exceeding the memory cache); consumers then started from the earliest offset. Production throughput and latency remained stable while the cold-read peak reached ~461 MiB/s, confirming effective isolation.
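A catch-up consumer for this kind of test simply starts from the earliest offset while the producer keeps writing. A minimal sketch, again with a placeholder endpoint and topic, might look like this:

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class CatchUpReader {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "automq-cluster.example.internal:9092"); // placeholder
        props.put("group.id", "cold-read-test");
        props.put("auto.offset.reset", "earliest"); // start from the earliest offset, as in the test
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("backlog-topic"));
            long consumedBytes = 0;
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    consumedBytes += record.serializedValueSize();
                }
                // Track catch-up throughput here while the producer keeps writing
                // at 100 MiB/s, to verify that write latency stays flat.
                System.out.printf("caught up %d bytes so far%n", consumedBytes);
            }
        }
    }
}
```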
3. Elastic Scaling Test
Scenario: a cluster handling 1 GiB/s was expanded from a single broker to 16 brokers. Automatic load balancing allowed the new brokers to absorb the traffic in approximately 4 minutes (60 s monitoring alert + 60 s batch scaling + 120 s auto-balancing).
Production Deployment Patterns
HA Deployment Architecture
Each AutoMQ cluster is paired with a standby cluster that periodically synchronizes metadata. Health‑check alerts trigger DNS or endpoint switches; the standby cluster scales on demand without pre‑loading data.
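A minimal sketch of that failover loop is shown below; the HealthCheck and EndpointSwitch interfaces and the standby address are hypothetical stand-ins for whatever monitoring and DNS or configuration tooling a deployment actually uses:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

/**
 * Hypothetical controller that flips client traffic to the standby cluster
 * when the primary fails its health check. Names are illustrative only.
 */
public class FailoverController {

    interface HealthCheck { boolean primaryHealthy(); }
    interface EndpointSwitch { void pointBootstrapAliasTo(String clusterEndpoint); }

    private final HealthCheck healthCheck;
    private final EndpointSwitch endpointSwitch;

    public FailoverController(HealthCheck healthCheck, EndpointSwitch endpointSwitch) {
        this.healthCheck = healthCheck;
        this.endpointSwitch = endpointSwitch;
    }

    public void start() {
        Executors.newSingleThreadScheduledExecutor().scheduleAtFixedRate(() -> {
            if (!healthCheck.primaryHealthy()) {
                // The standby already holds synchronized metadata and scales on
                // demand, so switching the bootstrap alias is sufficient; no data
                // needs to be pre-loaded. The address below is a placeholder.
                endpointSwitch.pointBootstrapAliasTo("standby.automq.example.internal:9092");
            }
        }, 0, 30, TimeUnit.SECONDS);
    }
}
```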
Log Retrieval Platform Refactor
The original pipeline (Kafka → Elasticsearch) suffered a backlog of more than 10⁹ messages (~200 GB) and a P99 write latency of ~10 s during peaks. After migrating to AutoMQ:
Backlog reduced ~40× to ~5 GB.
P99 write latency dropped to ~500 ms.
Peak throughput reached 1.4 GB/s with smooth write curves.
Hardware cost decreased ~50 % (30 pods with 4 CPUs and 16 GB of memory each handled the peak load).
Conclusion
Moving from a traditional stateful Kafka deployment to the cloud‑native, stateless AutoMQ architecture delivers minute‑level elastic scaling, strong read/write isolation, and lower operational overhead while preserving full Kafka protocol compatibility.
360 Zhihui Cloud Developer
360 Zhihui Cloud is an enterprise open service platform that aims to "aggregate data value and empower an intelligent future," leveraging 360's extensive product and technology resources to deliver platform services to customers.