Cloud Native

How Tencent Music Cut Kafka Costs by 50% with Cloud‑Native AutoMQ

Tencent Music migrated its massive Kafka streaming infrastructure to the cloud‑native AutoMQ platform, slashing operational costs by over half, achieving second‑level partition migration, and dramatically improving scaling efficiency while maintaining high‑throughput, low‑latency data processing for its music services.

DataFunTalk

Background

Tencent Music Entertainment Group is a leading online music service in China, generating massive user behavior and business data daily. A robust, stable, and efficient Kafka streaming system underpins its core services.

Rapid growth exposed limitations of self‑built Kafka clusters in operational complexity and cost. To handle increasing data volume and achieve cost efficiency, the operations team explored next‑generation Kafka solutions.

They adopted cloud‑native AutoMQ, reducing costs by over 50% and enabling second‑level partition migration, which greatly improved scaling efficiency and lowered the operational burden.

Technical Architecture

AutoMQ serves as the core data bus, connecting data sources, ingestion, the Kafka stream system, real‑time computation, storage, and data applications.

Data sources include observability data (logs, metrics, traces) and analytical data (metadata, user behavior). Data ingestion is handled by the internal “Data Channel” platform, which preprocesses and routes data to AutoMQ.

AutoMQ clusters provide high‑throughput, low‑latency streaming. Real‑time computation uses Flink for aggregation, filtering, and complex calculations. Processed data is stored in OLAP databases for BI and in Elasticsearch for log search.

Data applications are divided into observability (monitoring, alerting) and analytics (personalized recommendation, BI, data science).
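The data flow described above can be sketched as a small staged pipeline. This is purely illustrative: the function names and routing rules are assumptions, not Tencent Music's actual components, and the aggregation is a trivial stand-in for the real Flink jobs.

```python
# Illustrative sketch of the described flow:
# sources -> "Data Channel" (preprocess/route) -> AutoMQ topics ->
# Flink-style aggregation -> OLAP / Elasticsearch sinks.

def data_channel(event):
    """Hypothetical routing: observability signals vs. analytical data."""
    topic = ("observability"
             if event["kind"] in {"log", "metric", "trace"}
             else "analytics")
    return topic, event

def aggregate(routed_events):
    """Flink-like aggregation (trivial stand-in): count events per topic."""
    counts = {}
    for topic, _ in routed_events:
        counts[topic] = counts.get(topic, 0) + 1
    return counts

events = [data_channel(e) for e in (
    {"kind": "log"}, {"kind": "metric"}, {"kind": "user_behavior"},
)]
print(aggregate(events))  # {'observability': 2, 'analytics': 1}
```

In production the aggregation step would be a windowed Flink job reading from AutoMQ, but the topology — route, stream, compute, sink — is the same.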

Kafka Challenges

Cost pressures: high resource reservation, expensive storage, and replication overhead increase total cost of ownership.

Operational difficulties: scaling requires lengthy partition migration (≈1 day) and manual interventions for hotspots, leading to risk and inefficiency.

Why AutoMQ

Eliminates operational bottlenecks with second‑level partition migration and automated scaling.

Separates compute and storage, allowing independent scaling and reducing both compute reservation and storage costs.

Stateless brokers fit Kubernetes, enabling cloud‑native deployment.

Native Iceberg support via Table Topic simplifies data‑lake ingestion.

100% Kafka protocol compatibility ensures zero‑code migration.
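Because AutoMQ speaks the Kafka wire protocol, "zero-code migration" in principle means only the bootstrap address changes; serializers, acks, and application code stay untouched. A minimal sketch (both endpoints are placeholders):

```python
# Hypothetical client configuration before and after migration.
# Only bootstrap.servers changes; everything else is identical.
base_config = {
    "acks": "all",
    "key.serializer": "org.apache.kafka.common.serialization.StringSerializer",
    "value.serializer": "org.apache.kafka.common.serialization.StringSerializer",
}

kafka_config = {**base_config, "bootstrap.servers": "kafka.internal:9092"}
automq_config = {**base_config, "bootstrap.servers": "automq.internal:9092"}

# Diff the two configs to confirm the "zero-code" claim in miniature.
changed = {k for k in kafka_config if kafka_config[k] != automq_config[k]}
print(changed)  # {'bootstrap.servers'}
```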

Evaluation and Migration

Two load‑testing phases (high‑throughput, high‑QPS) demonstrated AutoMQ’s stability and performance meeting production requirements.

Migration proceeded in three steps: switch producers, drain the old cluster, then switch consumers, achieving seamless data flow without downtime.
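The producer-first ordering of those three steps can be modeled as a tiny state table — a hedged sketch of the sequencing, not Tencent Music's actual runbook or tooling:

```python
# Producer-first cutover: new writes land in AutoMQ immediately, consumers
# finish draining the old Kafka cluster, then consumers switch over too.
PHASES = ["switch_producers", "drain_old_cluster", "switch_consumers"]

def cutover(phase):
    """Return (producer_target, consumer_target) for a migration phase."""
    targets = {
        "switch_producers": ("automq", "kafka"),   # new data -> AutoMQ
        "drain_old_cluster": ("automq", "kafka"),  # consume old backlog
        "switch_consumers": ("automq", "automq"),  # old cluster now idle
    }
    return targets[phase]

for phase in PHASES:
    print(phase, cutover(phase))
```

Because consumers drain the old cluster before switching, no message is skipped and no downtime window is needed.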

Online Results

Six AutoMQ clusters are now in production, reaching 1.6 GiB/s write throughput and ~480 K QPS.

Cost reduced by >50%, and scaling can be performed in seconds with a self‑balancing mechanism.
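As a back-of-the-envelope check on the figures above (assuming, purely for illustration, that load is spread evenly across the six clusters):

```python
# Reported aggregate figures from the article.
clusters = 6
total_write_gib_s = 1.6
total_qps = 480_000

# Per-cluster averages under the even-spread assumption.
per_cluster_mib_s = total_write_gib_s * 1024 / clusters
per_cluster_qps = total_qps / clusters
print(round(per_cluster_mib_s), per_cluster_qps)  # ~273 MiB/s, 80,000 QPS each
```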

Future Outlook

Complete migration of remaining Kafka clusters.

Deploy Table Topic to stream data into Iceberg tables.

Standardize AutoMQ as an internal infrastructure component.

Fully migrate Kafka services to Kubernetes.

Tags: operations, Kafka, cost optimization, AutoMQ, data streaming
Written by DataFunTalk

Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.
