How DeWu Halved Observability Costs Using AutoMQ and ClickHouse Storage‑Compute Separation

DeWu’s observability platform faced scalability, cost, and operational challenges from petabyte‑scale trace data, prompting a shift to a storage‑compute separated architecture that leverages AutoMQ’s Kafka‑compatible service and ClickHouse Enterprise’s SharedMergeTree engine, ultimately achieving up to 50% cost reduction and five‑fold cold‑read performance gains.

dbaplus Community

Introduction

DeWu, a leading fashion e‑commerce community, generates several petabytes of trace data and trillions of span records daily, creating severe demands for real‑time processing and low‑cost storage.

The previous architecture, which coupled storage and compute on the same nodes, caused three main problems:

Limited scalability: Compute and storage resources could not be expanded independently, forcing synchronized scaling and increasing costs.

Low resource utilization: Inflexible resource allocation led to idle compute or storage capacity.

High operational complexity: Cluster scaling required complex data migration, raising operational risk.

To address these issues, DeWu adopted a storage‑compute separated architecture built on AutoMQ, a Kafka‑compatible streaming service, and ClickHouse.
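The scalability and utilization problems listed above can be illustrated with a little arithmetic. The sketch below assumes a hypothetical node type that bundles a fixed amount of CPU and disk; the workload numbers and node sizes are invented for illustration and are not DeWu's figures.

```python
import math

def coupled_nodes(cpu_needed, disk_tb_needed, cpu_per_node, disk_tb_per_node):
    """Nodes required when every node bundles a fixed amount of CPU and disk."""
    return max(
        math.ceil(cpu_needed / cpu_per_node),
        math.ceil(disk_tb_needed / disk_tb_per_node),
    )

def idle_cpu_fraction(cpu_needed, disk_tb_needed, cpu_per_node, disk_tb_per_node):
    """Share of provisioned CPU that sits idle because disk drove the node count."""
    nodes = coupled_nodes(cpu_needed, disk_tb_needed, cpu_per_node, disk_tb_per_node)
    return 1 - cpu_needed / (nodes * cpu_per_node)

# Hypothetical storage-heavy workload: 200 cores of work, 1000 TB of data,
# on nodes that each offer 32 cores and 10 TB of disk.
nodes = coupled_nodes(200, 1000, 32, 10)
idle = idle_cpu_fraction(200, 1000, 32, 10)
print(nodes, idle)
```

With these assumed numbers the disk requirement alone dictates the node count, so most of the provisioned CPU is unused. Separating storage from compute removes the `max` and lets each side be sized on its own.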

Kafka Challenges at Scale

Kafka is central to DeWu’s data pipeline, but rapid data growth exposed several limitations:

High storage cost: Kafka’s storage accounted for roughly three‑quarters of cloud resource expenses.

Poor cold‑read efficiency: Disk throughput saturated during cold‑read workloads, creating performance bottlenecks.

Operational difficulty: Scaling the Kafka cluster required risky data migration and downtime.

These issues stem from Kafka’s shared‑nothing architecture, in which each broker keeps partition data on its own local disks; that design is ill‑suited to elastic cloud environments.

Why AutoMQ?

AutoMQ was chosen as a drop‑in replacement for Kafka because it offers:

100% Kafka protocol compatibility: Seamless migration without client changes.

Storage‑compute separation: AutoMQ replaces Kafka’s storage layer with a shared object‑storage‑based repository (S3Stream), dramatically lowering storage costs and enabling independent scaling of compute and storage.

Elastic scaling: Resources can be adjusted dynamically without data migration or service interruption.

Future‑proof extensibility: Supports large‑scale data growth and integration with modern storage/compute tools.

Cold‑Read Optimizations

AutoMQ mitigates Kafka’s cold‑read penalties (e.g., KAFKA‑7504) through:

Object‑storage and compute decoupling: Eliminates interference between cold reads and writes.

Efficient query path: Optimized query handling maintains stable performance under high concurrency.

In the cited benchmarks, AutoMQ sustains the same cold‑read throughput as Kafka without degrading write latency, with cold‑read efficiency reported as roughly five times higher.

Rapid Elastic Scaling

Because partition data lives in shared storage, AutoMQ can reassign a partition in seconds. During scaling, new nodes are added through an AWS Auto Scaling Group (ASG) or the Kubernetes Horizontal Pod Autoscaler (HPA), and partitions are reassigned to them, typically within ten seconds. Kafka, by contrast, must copy a partition’s data to the new broker before the move completes; AutoMQ avoids that rebalancing cost.
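A minimal sketch of why reassignment can be this fast under shared storage: only the partition‑to‑broker ownership map changes, and no partition data is copied. The `reassign` function and the broker names are hypothetical and do not reflect AutoMQ's actual balancing algorithm.

```python
from collections import Counter

def reassign(ownership, brokers):
    """Balance partitions across brokers by changing ownership only.

    ownership: dict mapping partition id -> current broker name.
    Returns (new_ownership, moved_partitions). No data is copied, because
    in this model partition data lives in shared object storage.
    """
    target = -(-len(ownership) // len(brokers))  # ceiling division
    load = Counter()
    new_ownership, pending = {}, []
    for partition, broker in sorted(ownership.items()):
        if broker in brokers and load[broker] < target:
            new_ownership[partition] = broker  # stays where it is
            load[broker] += 1
        else:
            pending.append(partition)  # must move to a less loaded broker
    moved = []
    for partition in pending:
        dest = min(brokers, key=lambda b: (load[b], b))
        new_ownership[partition] = dest
        load[dest] += 1
        moved.append(partition)
    return new_ownership, moved

# Scale out from two brokers to three: b3 joins the cluster.
before = {0: "b1", 1: "b1", 2: "b1", 3: "b2", 4: "b2", 5: "b2"}
after, moved = reassign(before, ["b1", "b2", "b3"])
print(moved, Counter(after.values()))
```

In a shared‑nothing cluster, each entry in `moved` would also imply copying that partition's full log to the destination broker, which is the step that takes hours at large data volumes.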

ClickHouse Evolution: Storage‑Compute Separation in Practice

DeWu’s platform also migrated its trace index storage from the open‑source ClickHouse community edition to ClickHouse Enterprise, which introduces a storage‑compute separation architecture.

Key innovations include:

SharedMergeTree engine: Fully compatible with the MergeTree engine but optimized for shared object storage (OSS, S3, MinIO). It automatically converts community‑edition DDL to Enterprise‑edition syntax.

Serverless compute model: Nodes scale automatically based on load, reducing idle resources.

Benefits of SharedMergeTree:

Data resides exclusively in shared storage; compute nodes are stateless.

Cluster management is simplified—only a single table definition is required.

Horizontal scaling can be performed in minutes without service interruption.

Scaling Workflow

A new compute node registers with the metadata service (Keeper) and begins listening for metadata changes.

The node synchronizes metadata immediately, without locking the cluster.

The node starts handling queries right away, reading from shared storage as needed.
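The workflow above can be modeled in a few lines. This is a toy simulation under stated assumptions, not ClickHouse code: `MetadataService` stands in for Keeper, a plain dictionary stands in for object storage, and all class and part names are hypothetical.

```python
class MetadataService:
    """Stand-in for Keeper: tracks which data parts exist and notifies nodes."""

    def __init__(self):
        self.parts = set()
        self.nodes = []

    def register(self, node):
        self.nodes.append(node)
        node.known_parts = set(self.parts)  # metadata sync only, no data copy

    def commit_part(self, part):
        self.parts.add(part)
        for node in self.nodes:  # push the metadata change to every listener
            node.known_parts.add(part)


class ComputeNode:
    """Stateless node: holds metadata plus a reference to shared storage."""

    def __init__(self, name, shared_storage):
        self.name = name
        self.storage = shared_storage
        self.known_parts = set()

    def count_rows(self):
        return sum(len(self.storage[part]) for part in self.known_parts)


shared_storage = {"part_1": [1, 2, 3], "part_2": [4, 5]}
keeper = MetadataService()
keeper.parts.update(shared_storage)  # the two parts already exist

node_a = ComputeNode("a", shared_storage)
keeper.register(node_a)

node_b = ComputeNode("b", shared_storage)  # scale-out: a new node joins
keeper.register(node_b)
rows_seen_on_join = node_b.count_rows()    # it can answer queries at once

shared_storage["part_3"] = [6]             # a later insert lands in storage
keeper.commit_part("part_3")
print(rows_seen_on_join, node_b.count_rows())
```

The point of the model is that joining a node costs one metadata copy, regardless of how much data the table holds.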

Performance and Cost Optimizations

Write throughput: Supports up to 20 million rows per second; a typical batch of 40 k rows is written in about 1 second.

Query acceleration: The Parallel Replicas feature speeds up queries by as much as 2.5×.

Index tuning: Optimized ORDER BY and filter ordering reduce data scans.

Sample query used for trace analysis:

select trace_id, span_id, duration
from span_index
where service = 'order-xxx'
  and startTime between '2024-11-23 16:00:00' and '2024-11-23 17:00:00'
order by duration desc
limit 0, 30
settings max_threads = 16, allow_experimental_parallel_reading_from_replicas = 1;
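To see why the ORDER BY choice mentioned under index tuning reduces scans for a query like this one, here is a toy model of a sparse primary index. It assumes the table is sorted by (service, startTime); the article does not state the real sorting key of span_index, so that ordering, the integer timestamps, and the tiny granule size are all illustrative assumptions.

```python
from bisect import bisect_left, bisect_right

GRANULE = 4  # rows per index mark; tiny on purpose (ClickHouse defaults to 8192)

# Hypothetical rows (service, start_time, duration), stored sorted by
# (service, start_time) as the assumed ORDER BY would arrange them.
rows = sorted(
    (service, t, (t * 7 + len(service)) % 10)
    for service in ("cart", "order", "pay")
    for t in range(8)
)

def build_sparse_index(sorted_rows):
    """Keep only the sorting-key prefix of the first row of each granule."""
    return [row[:2] for row in sorted_rows[::GRANULE]]

def granules_to_scan(index, service, t_from, t_to):
    """Binary-search the marks for granules that may hold matching rows."""
    first = max(bisect_left(index, (service, t_from)) - 1, 0)
    last = bisect_right(index, (service, t_to))  # exclusive
    return range(first, last)

def query(sorted_rows, index, service, t_from, t_to):
    """Read only the candidate granules, then filter row by row."""
    hits = []
    for g in granules_to_scan(index, service, t_from, t_to):
        for row in sorted_rows[g * GRANULE:(g + 1) * GRANULE]:
            if row[0] == service and t_from <= row[1] <= t_to:
                hits.append(row)
    return hits

index = build_sparse_index(rows)
scanned = granules_to_scan(index, "order", 5, 6)
print(f"scanned {len(scanned)} of {len(index)} granules")
```

If the same rows were sorted by a column the query does not filter on, the index marks could not narrow the range and every granule would have to be read.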

Reliability Enhancements

A three‑node Keeper ensemble plus at least two compute nodes provides fault tolerance; data resides in highly redundant object storage.

Automatic failover ensures continuous service despite single‑node failures.
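The choice of three Keeper nodes follows from majority‑quorum arithmetic. The sketch below assumes a coordination service that stays available only while a strict majority of its nodes is alive, which is how Raft‑style consensus behaves; the function names are illustrative.

```python
def majority(ensemble_size):
    """Smallest number of nodes that forms a strict majority."""
    return ensemble_size // 2 + 1

def tolerated_failures(ensemble_size):
    """How many nodes can fail while a majority is still reachable."""
    return ensemble_size - majority(ensemble_size)

for size in (2, 3, 5):
    print(size, majority(size), tolerated_failures(size))
```

Under this assumption two nodes tolerate no failures, while three is the smallest ensemble that survives the loss of one node.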

Elastic Cost Model

Compute resources are billed per second, scaling in 1 CCU (~1 CPU + 4 GB) increments.

Storage is billed by actual usage, cutting storage costs by over 70 % compared to pre‑purchased capacity.
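As a rough illustration of per‑second CCU billing, the sketch below compares an elastic day against a cluster sized for peak all day. The load profile is hypothetical; no real price is used, because the ratio of CCU‑seconds is independent of the unit price.

```python
def ccu_seconds(usage):
    """Total billed CCU-seconds for a list of (ccu, seconds) intervals."""
    return sum(ccu * seconds for ccu, seconds in usage)

HOUR = 3600

# Hypothetical day: 8 CCU during a 2-hour peak, 2 CCU for the other 22 hours.
elastic_usage = [(8, 2 * HOUR), (2, 22 * HOUR)]
# The same day on capacity provisioned for peak around the clock.
fixed_usage = [(8, 24 * HOUR)]

elastic = ccu_seconds(elastic_usage)
fixed = ccu_seconds(fixed_usage)
saving = 1 - elastic / fixed
print(elastic, fixed, saving)
```

The saving depends entirely on how spiky the load is; a flat load profile would show no benefit from elastic billing.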

Results

After six months in production, AutoMQ had replaced the entire Kafka stack, and ClickHouse Enterprise powered the trace index. The platform achieved:

≈50 % reduction in cloud bill.

Replacement of nearly a thousand CPU cores, delivering tens of GiB/s aggregate throughput.

Zero downtime during Double‑11 peak, handling 100 % of traffic with stable latency.

Overall, the combined storage‑compute separation strategy lowered total cost by ~60 % (20 % compute, >70 % storage) while meeting high‑concurrency, low‑latency requirements.
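The overall figure is a weighted average of the compute and storage savings. The sketch below shows the arithmetic only; the 25/75 compute‑to‑storage cost split is a hypothetical input, and the storage saving is taken as exactly 70 % even though the article reports "over 70 %".

```python
def blended_saving(compute_share, compute_saving, storage_saving):
    """Total saving when compute_share of the old bill was compute."""
    storage_share = 1 - compute_share
    return compute_share * compute_saving + storage_share * storage_saving

def implied_storage_share(total_saving, compute_saving, storage_saving):
    """Storage share of the old bill that would yield total_saving."""
    return (total_saving - compute_saving) / (storage_saving - compute_saving)

# Hypothetical split: 25% of the old bill was compute, 75% was storage.
saving = blended_saving(0.25, 0.20, 0.70)
# Storage share needed to reach 60% overall with 20% and 70% savings.
share = implied_storage_share(0.60, 0.20, 0.70)
print(saving, share)
```

The takeaway is that a total close to 60 % is only reachable when storage dominates the original bill, since the compute saving on its own is much smaller.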

References

AutoMQ S3‑based shared streaming storage: https://docs.automq.com/zh/automq/architecture/s3stream-shared-streaming-storage/overview

KAFKA‑7504 cold‑read issue: https://issues.apache.org/jira/browse/KAFKA-7504

Linux Page Cache: https://en.wikipedia.org/wiki/Page_cache

Linux SendFile: https://man7.org/linux/man-pages/man2/sendfile.2.html

AutoMQ performance whitepaper: https://docs.automq.com/zh/automq/benchmarks/benchmark-automq-vs-apache-kafka

AutoMQ partition reassignment in seconds: https://docs.automq.com/zh/automq/architecture/technical-advantage/partition-reassignment-in-seconds

AWS Auto Scaling Groups: https://docs.aws.amazon.com/autoscaling/ec2/userguide/auto-scaling-groups.html

Kubernetes HPA: https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/

AutoMQ continuous self‑balancing: https://docs.automq.com/zh/automq/architecture/technical-advantage/continuous-self-balancing

Alibaba Cloud ClickHouse: https://help.aliyun.com/zh/clickhouse/?spm=a2c4g.11174283.0.0.61f5735a0zfJIS
