
Practical Application of Flink + Kafka in NetEase Cloud Music Real‑Time Computing Platform

This article presents NetEase Cloud Music's real‑time computing platform built on Flink and Kafka, covering the background, the architectural design, the reasons for choosing Kafka and Flink, platformization work, use of Kafka in the real‑time data warehouse, the challenges encountered, and the solutions implemented to improve reliability and performance.


Introduction – NetEase Cloud Music real‑time computing platform engineer Yue Meng shares a practical case study of using Flink + Kafka in production, organized into four parts: background, platform design, Kafka in real‑time data warehouse, and problems & improvements.

Background – The streaming platform typically consists of a message queue, a compute engine, and storage. NetEase collects logs from clients/web into a queue, processes them in real time, and stores results in append‑only or update‑style stores.

Why Kafka? – Kafka is chosen for its high throughput (hundreds of thousands of QPS), low latency (millisecond level), high concurrency (thousands of clients), fault tolerance, and seamless horizontal scaling.
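Kafka's horizontal scaling and per‑key ordering both come from partitioning. The toy Python sketch below illustrates key‑based partition assignment; Kafka's actual default partitioner uses murmur2 hashing, so the CRC32 here is an illustrative stand‑in, not the real algorithm.

```python
import zlib

def assign_partition(key: bytes, num_partitions: int) -> int:
    """Simplified stand-in for Kafka's key-based partitioner.
    (Kafka's default uses murmur2; CRC32 here is illustrative only.)"""
    return zlib.crc32(key) % num_partitions

# Messages with the same key always land on the same partition,
# so per-key ordering is preserved while load spreads across partitions.
keys = [b"user-1", b"user-2", b"user-1", b"user-3"]
partitions = [assign_partition(k, 4) for k in keys]
```

Because adding consumers up to the partition count scales reads linearly, this is what makes "thousands of clients" and high throughput practical.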

Why Flink? – Flink offers high throughput, low latency, flexible windowing, exactly‑once state semantics, lightweight fault‑tolerance, event‑time support, and a unified batch‑stream engine, making it ideal for NetEase's streaming needs.
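To make the event‑time windowing point concrete, here is a toy batch sketch of tumbling‑window counting. Flink computes this incrementally with watermarks handling late data; this sketch only shows how events are assigned to windows by their event time, not Flink's API.

```python
from collections import defaultdict

def tumbling_window_counts(events, window_ms):
    """Toy event-time tumbling window: count events per (key, window start).
    Flink does this incrementally with watermarks; this batch version
    only illustrates the window-assignment logic."""
    counts = defaultdict(int)
    for key, event_time_ms in events:
        window_start = (event_time_ms // window_ms) * window_ms
        counts[(key, window_start)] += 1
    return dict(counts)

# Hypothetical play/like events with event-time timestamps in ms.
events = [("play", 1000), ("play", 1500), ("play", 2100), ("like", 1200)]
result = tumbling_window_counts(events, window_ms=1000)
# The [1000, 2000) window holds two "play" events; [2000, 3000) holds one.
```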

Kafka + Flink Architecture – Logs from apps and web pages are ingested into Kafka, and Flink then performs ETL, global aggregation, and windowed computations, as shown in the architecture diagram.
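The ETL and global‑aggregation stages can be sketched as a small Python pipeline. The tab‑separated log format and field names below are hypothetical; the point is the shape of the flow (parse and filter malformed records, then aggregate), not NetEase's actual schema.

```python
def etl(raw_lines):
    """Sketch of the ETL stage: parse tab-separated log lines
    (hypothetical format: user, action, timestamp) and drop
    malformed records."""
    for line in raw_lines:
        parts = line.split("\t")
        if len(parts) == 3:
            user, action, ts = parts
            yield {"user": user, "action": action, "ts": int(ts)}

def global_count(records):
    """Sketch of the global-aggregation stage: count events per action."""
    counts = {}
    for r in records:
        counts[r["action"]] = counts.get(r["action"], 0) + 1
    return counts

logs = ["u1\tplay\t100", "bad-line", "u2\tplay\t101", "u1\tlike\t102"]
stats = global_count(etl(logs))
```

In the real platform each stage runs as a Flink operator reading from and writing back to Kafka, so stages scale and fail independently.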

NetEase Cloud Music Kafka Usage – Over 10 Kafka clusters serve different roles (business, mirror, compute) with more than 200 nodes, peak QPS > 4M, and >500 real‑time Flink tasks.

Platform Design (Flink + Kafka) – To reduce development and operations costs, NetEase rebuilt the platform (Magina) on Flink 1.0, providing Magina SQL and an SDK, catalog‑based metadata, and three key Kafka integrations: cluster cataloging, topic‑to‑stream‑table mapping, and message schema management, as shown in the architecture diagram.
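A minimal sketch of what topic‑to‑stream‑table mapping in a catalog could look like: a Kafka topic plus a declared schema is registered under a logical database.table name that SQL jobs can resolve. All names here (`music.user_actions`, cluster and topic identifiers) are hypothetical, not Magina's actual API.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class StreamTable:
    """A Kafka topic exposed as a logical table: cluster + topic + schema."""
    cluster: str
    topic: str
    schema: tuple  # ordered field names

catalog = {}

def register(db: str, table: str, cluster: str, topic: str, schema: tuple):
    """Register a topic under a db.table name so SQL jobs can resolve it."""
    catalog[f"{db}.{table}"] = StreamTable(cluster, topic, schema)

register("music", "user_actions", "kafka-biz-01", "ua_topic",
         ("user_id", "action", "ts"))
resolved = catalog["music.user_actions"]
```

Centralizing this mapping is what lets users write SQL against table names without knowing cluster addresses or topic layouts.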

Kafka in Real‑Time Data Warehouse – Early on, large topics caused I/O pressure and latency. Flink 1.5 was used to split large topics into smaller ones, initially with static rules and later with dynamic rules. Subsequent challenges (cluster pressure, I/O spikes, duplicate consumption, migration difficulty) were addressed by isolating clusters (DS, log‑collect, dispatch) and layering the data processing, as shown in the layered‑architecture diagram.
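Rule‑based topic splitting can be sketched as a router that fans a large source topic out to smaller per‑category topics. The predicate list stands in for the rules; swapping the list at runtime is what "dynamic rules" amounts to. Topic names and record fields are illustrative.

```python
def route(record, rules, default_topic):
    """Sketch of rule-based topic splitting: each record from the large
    source topic is routed to the first matching smaller topic. Rules are
    (predicate, target_topic) pairs; replacing the list at runtime gives
    the "dynamic rules" behavior."""
    for predicate, target in rules:
        if predicate(record):
            return target
    return default_topic

# Hypothetical split of a user-actions topic by action type.
rules = [
    (lambda r: r["action"] == "play", "ua_play"),
    (lambda r: r["action"] == "like", "ua_like"),
]
topic = route({"action": "play"}, rules, "ua_other")
```

Downstream jobs then subscribe only to the small topic they need, which is what relieves the I/O pressure of everyone reading the large topic.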

Problems & Improvements – Two major issues were identified: (1) duplicate consumption of Kafka sources under multiple sinks, solved by merging StreamGraph DAGs and buffering modify operations; (2) latency spikes caused by traffic bursts on the same switch affecting both offline and real‑time clusters, mitigated by separating network switches for offline and streaming workloads.
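The source‑merging idea in fix (1) can be sketched as deduplicating identical Kafka sources when building the job plan, so multiple sinks share one logical source instead of each opening its own consumer. This is a simplified model of merging StreamGraph DAGs, not Flink's internal representation; the job fields are hypothetical.

```python
def merge_sources(jobs):
    """Sketch of the duplicate-consumption fix: jobs reading the same
    (cluster, topic, group) share one logical source rather than each
    consuming the topic independently."""
    shared = {}  # (cluster, topic, group) -> shared source id
    plan = []
    for job in jobs:
        key = (job["cluster"], job["topic"], job["group"])
        source_id = shared.setdefault(key, f"source-{len(shared)}")
        plan.append({"sink": job["sink"], "source": source_id})
    return plan

jobs = [
    {"cluster": "c1", "topic": "ua", "group": "g1", "sink": "hbase"},
    {"cluster": "c1", "topic": "ua", "group": "g1", "sink": "redis"},
    {"cluster": "c1", "topic": "other", "group": "g1", "sink": "kudu"},
]
plan = merge_sources(jobs)
# The hbase and redis sinks share one source; only two sources total.
```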

Q&A – Answers cover Kafka data reliability (depends on definition and fault‑tolerance), learning from production problems (problem‑driven learning), and anomaly detection (routing abnormal data to a dedicated topic for later inspection).

Tags: big data, Flink, real-time streaming, Kafka, data warehouse, platform design
Written by: Big Data Technology Architecture (Exploring Open Source Big Data and AI Technologies)
