How Tencent Music Cut Kafka Costs by 50% with Cloud‑Native AutoMQ
Tencent Music migrated its massive Kafka streaming infrastructure to the cloud‑native AutoMQ platform, cutting operational costs by more than half. The move brought second‑level partition migration and far faster scaling while preserving high‑throughput, low‑latency data processing for its music services.
Background
Tencent Music Entertainment Group is a leading online music service in China, generating massive user behavior and business data daily. A robust, stable, and efficient Kafka streaming system underpins its core services.
Rapid growth exposed limitations of self‑built Kafka clusters in operational complexity and cost. To handle increasing data volume and achieve cost efficiency, the operations team explored next‑generation Kafka solutions.
They adopted the cloud‑native AutoMQ, reducing costs by over 50% and enabling second‑level partition migration, greatly improving scaling efficiency and lowering operational burden.
Technical Architecture
AutoMQ serves as the core data bus, connecting data sources, ingestion, the Kafka stream system, real‑time computation, storage, and data applications.
Data sources include observability data (logs, metrics, traces) and analytical data (metadata, user behavior). Data ingestion is handled by the internal “Data Channel” platform, which preprocesses and routes data to AutoMQ.
AutoMQ clusters provide high‑throughput, low‑latency streaming. Real‑time computation uses Flink for aggregation, filtering, and complex calculations. Processed data is stored in OLAP databases for BI and in Elasticsearch for log search.
Data applications are divided into observability (monitoring, alerting) and analytics (personalized recommendation, BI, data science).
Kafka Challenges
Cost pressures: high resource reservation, expensive storage, and replication overhead increase total cost of ownership.
Operational difficulties: scaling requires lengthy partition migration (≈1 day) and manual interventions for hotspots, leading to risk and inefficiency.
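To make the cost pressure concrete, here is a back‑of‑the‑envelope comparison (with hypothetical unit prices and data volumes, not Tencent Music's actual figures): classic Kafka keeps three replicas on block storage and reserves capacity for peaks, while a shared‑storage design keeps a single copy in cheaper object storage.

```python
# Illustrative storage-cost comparison. All prices and volumes below are
# assumptions for the sketch, not figures from the article.

REPLICATION_FACTOR = 3          # typical Kafka replica count
BLOCK_STORAGE_PRICE = 0.08      # $/GiB-month, assumed block-storage price
OBJECT_STORAGE_PRICE = 0.02     # $/GiB-month, assumed object-storage price
DATA_SET_GIB = 100_000          # retained data before replication

kafka_storage_cost = DATA_SET_GIB * REPLICATION_FACTOR * BLOCK_STORAGE_PRICE
shared_storage_cost = DATA_SET_GIB * 1 * OBJECT_STORAGE_PRICE

savings = 1 - shared_storage_cost / kafka_storage_cost
print(f"storage cost cut by {savings:.0%}")  # → storage cost cut by 92%
```

Under these assumed prices, eliminating replication and moving to object storage dominates the storage line item; real savings depend on retention, traffic shape, and cloud pricing.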
Why AutoMQ
Eliminates operational bottlenecks with second‑level partition migration and automated scaling.
Separates compute and storage, allowing independent scaling and reducing both compute reservation and storage costs.
Stateless brokers fit Kubernetes, enabling cloud‑native deployment.
Native Iceberg support via Table Topic simplifies data‑lake ingestion.
100% Kafka protocol compatibility ensures zero‑code migration.
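Because the broker is wire‑compatible, "zero‑code migration" means clients keep all their existing settings and only repoint the bootstrap address. A minimal sketch (hostnames and config values are hypothetical):

```python
# Hypothetical Kafka client config: with a wire-compatible broker, the
# only change a producer or consumer needs is the bootstrap address.
old_config = {
    "bootstrap.servers": "kafka.internal:9092",   # hypothetical old cluster
    "acks": "all",
    "compression.type": "lz4",
}
new_config = {**old_config,
              "bootstrap.servers": "automq.internal:9092"}  # hypothetical

changed_keys = {k for k in old_config if old_config[k] != new_config[k]}
print(changed_keys)  # → {'bootstrap.servers'}
```

No serializers, interceptors, or consumer‑group logic change, which is what keeps the migration risk low.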
Evaluation and Migration
Two load‑testing phases (high‑throughput and high‑QPS) demonstrated that AutoMQ's stability and performance meet production requirements.
Migration proceeded in three steps: producers were switched to the new cluster first, the old cluster's backlog was drained, and consumers were then switched over, achieving a seamless cutover without downtime.
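The ordering of the three steps is what guarantees no loss: once producers move first, the old cluster's backlog is finite, so the drain step terminates. A toy model of that sequencing (assumed message flow, not AutoMQ's actual tooling):

```python
from collections import deque

# Toy three-step cutover: (1) producers switch to the new cluster,
# (2) consumers drain the old cluster, (3) consumers switch over.
old_cluster = deque(f"old-{i}" for i in range(3))  # backlog at cutover time
new_cluster = deque()
consumed = []

# Step 1: producers now write to the new cluster only.
for i in range(3):
    new_cluster.append(f"new-{i}")

# Step 2: consumers drain the old cluster until it is empty.
while old_cluster:
    consumed.append(old_cluster.popleft())

# Step 3: consumers switch over and continue on the new cluster.
while new_cluster:
    consumed.append(new_cluster.popleft())

print(consumed)  # old messages first, then new — nothing lost
```

Switching consumers before the drain completes would risk skipping the old cluster's tail; the producer‑first order avoids that.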
Online Results
Six AutoMQ clusters are now in production, reaching 1.6 GiB/s write throughput and ~480 K QPS.
Cost reduced by >50%, and scaling can be performed in seconds with a self‑balancing mechanism.
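A quick sanity check on the reported figures gives a feel for the workload, assuming QPS here approximates messages written per second (an assumption, since QPS may also count reads):

```python
# Back-of-the-envelope average message size from the reported numbers:
# 1.6 GiB/s write throughput across ~480K QPS.
write_throughput_bytes = 1.6 * 1024**3   # 1.6 GiB/s in bytes
qps = 480_000

avg_message_bytes = write_throughput_bytes / qps
print(f"~{avg_message_bytes / 1024:.1f} KiB per message")
```

That works out to roughly 3.5 KiB per message, a plausible size for batched user‑behavior and log events.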
Future Outlook
Complete migration of remaining Kafka clusters.
Deploy Table Topic to stream data into Iceberg tables.
Standardize AutoMQ as an internal infrastructure component.
Fully migrate Kafka services to Kubernetes.
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.