
Practical Application of Flink + Kafka at NetEase Cloud Music: Architecture, Platform Design, and Lessons Learned

This article presents a detailed case study of NetEase Cloud Music’s real‑time analytics platform built on Kafka and Flink, covering background, architectural choices, platform‑level design, operational challenges, solutions such as the Magina framework, and a Q&A on reliability and monitoring.

DataFunTalk

NetEase Cloud Music operates more than 200 Kafka broker nodes across 10+ clusters, handling peak QPS of over 4 million and running 500+ real‑time Flink jobs. The speaker, a senior real‑time computing engineer, outlines the system’s background and the motivations for choosing Kafka as the messaging backbone and Flink as the unified stream‑and‑batch engine.

Kafka was selected for its high throughput, low latency, support for massive concurrency, fault tolerance, and easy horizontal scaling. Flink was chosen for its high performance, flexible windowing, exactly-once state semantics, lightweight fault tolerance, event-time handling, and ability to run both streaming and batch workloads.

The combined Kafka‑Flink stack forms the core of a platform‑level architecture that ingests logs from client/web sources, processes them in real time, and writes results to various downstream stores. The platform has been refactored into a “Magina” layer that provides a unified SQL/SDK API, catalog management, topic‑as‑table abstraction, and schema handling.

Key platform features include catalog‑level management of Kafka clusters, treating topics as streaming tables, and automatic schema registration. Users can create and maintain Kafka tables in a metadata center, then query them via Flink without dealing with low‑level details.
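To make the topic-as-table idea concrete, here is a minimal sketch of how table metadata could be turned into a Flink SQL `CREATE TABLE` statement for the open-source Kafka connector. It is not Magina's actual API, which the talk does not show. The `KafkaTable` class, the `CLUSTERS` mapping, the `render_ddl` helper, and every cluster, topic, and column name are assumptions made for illustration; only the connector option keys (`connector`, `topic`, `properties.bootstrap.servers`, `properties.group.id`, `scan.startup.mode`, `format`) come from Flink itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KafkaTable:
    """Hypothetical metadata-center entry: one Kafka topic exposed as a table."""
    catalog: str    # assumed convention: one catalog per Kafka cluster
    name: str       # table name that SQL users see
    topic: str      # underlying Kafka topic
    columns: tuple  # (column name, Flink SQL type) pairs


# Hypothetical catalog registry: catalog name -> bootstrap servers.
CLUSTERS = {"music_log": "kafka-a-1:9092,kafka-a-2:9092"}


def render_ddl(table: KafkaTable, group_id: str) -> str:
    """Render a Flink SQL DDL statement for the Kafka connector."""
    cols = ",\n".join(f"  `{name}` {sql_type}" for name, sql_type in table.columns)
    options = {
        "connector": "kafka",
        "topic": table.topic,
        "properties.bootstrap.servers": CLUSTERS[table.catalog],
        "properties.group.id": group_id,
        "scan.startup.mode": "latest-offset",
        "format": "json",
    }
    opts = ",\n".join(f"  '{key}' = '{value}'" for key, value in options.items())
    return f"CREATE TABLE `{table.name}` (\n{cols}\n) WITH (\n{opts}\n)"


play_events = KafkaTable(
    catalog="music_log",
    name="play_events",
    topic="raw_log_play",
    columns=(("user_id", "BIGINT"), ("song_id", "BIGINT"), ("ts", "TIMESTAMP(3)")),
)
ddl = render_ddl(play_events, group_id="demo_group")
print(ddl)
```

With something like this in place, a user only names the catalog and table; the cluster address, topic, and schema are filled in from metadata rather than repeated in every job.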

Operationally, the team faced several challenges: cluster pressure from massive topics, I/O spikes, duplicate consumption when a job writes to multiple sinks, and latency spikes caused by shared network switches. The solutions were topic-level data sharding, dynamic routing rules, isolation of compute and storage clusters, and dedicated network paths for real-time and batch workloads.
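The summary does not describe how the routing rules are expressed, so the following is only a sketch of one plausible reading: records from a large source topic are split into smaller topics according to rules that can be swapped at runtime. The `Rule` and `Router` classes, the `action` field, and all topic names are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Rule:
    """Hypothetical routing rule: records matching the predicate go to target_topic."""
    name: str
    matches: Callable[[Dict], bool]
    target_topic: str


class Router:
    """First matching rule wins; unmatched records fall back to a default topic."""

    def __init__(self, default_topic: str) -> None:
        self.default_topic = default_topic
        self.rules: List[Rule] = []

    def set_rules(self, rules: List[Rule]) -> None:
        # Rebind to a fresh list so rules can be replaced while routing continues.
        self.rules = list(rules)

    def route(self, record: Dict) -> str:
        for rule in self.rules:
            if rule.matches(record):
                return rule.target_topic
        return self.default_topic


router = Router(default_topic="raw_log_other")
router.set_rules([
    Rule("play", lambda r: r.get("action") == "play", "raw_log_play"),
    Rule("search", lambda r: r.get("action") == "search", "raw_log_search"),
])

play_topic = router.route({"action": "play", "song_id": 42})
other_topic = router.route({"action": "share"})

# Rules can be changed later without touching the producers or the router itself.
router.set_rules([Rule("share", lambda r: r.get("action") == "share", "raw_log_share")])
share_topic = router.route({"action": "share"})
print(play_topic, other_topic, share_topic)
```

Under this reading, a downstream job subscribes only to the sub-topic it needs instead of reading and filtering the whole source topic, which is how sharding would relieve cluster pressure.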

A monitoring system was built to surface cluster health, topic statistics, and Flink job metrics (input bandwidth, TPS, latency, lag). This enables rapid diagnosis of abnormal consumption patterns and cluster‑wide issues.
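Of the metrics listed, lag is the most direct signal of abnormal consumption. As a rough sketch, consumer lag per partition is the partition's latest offset minus the consumer group's committed offset. The function names, the threshold, and the sample offsets below are illustrative assumptions and do not come from the talk's monitoring system.

```python
from typing import Dict, List


def partition_lag(end_offsets: Dict[int, int], committed: Dict[int, int]) -> Dict[int, int]:
    """Lag per partition = latest offset in the log minus the committed offset.

    Assumption: a partition with no committed offset counts as fully unread.
    """
    lag = {}
    for partition, end in end_offsets.items():
        done = committed.get(partition, 0)
        lag[partition] = max(end - done, 0)  # clamp in case offsets were read at different times
    return lag


def lagging_partitions(lag: Dict[int, int], threshold: int) -> List[int]:
    """Partitions whose lag exceeds the alert threshold, in ascending order."""
    return sorted(p for p, value in lag.items() if value > threshold)


# Illustrative offsets for a three-partition topic.
end_offsets = {0: 1000, 1: 500, 2: 750}
committed = {0: 990, 1: 100}

lag = partition_lag(end_offsets, committed)
total_lag = sum(lag.values())
alerts = lagging_partitions(lag, threshold=100)
print(lag, total_lag, alerts)
```

Tracking lag per partition rather than only the total helps separate a job that is slow overall from one with a few skewed or stuck partitions.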

The Q&A section discusses data reliability in real‑time warehouses, learning from production problems, and mechanisms for detecting and handling anomalous Kafka records.
