Real‑Time Platform Construction at NetEase Yanxuan: Architecture, SQL‑Based Streaming, Serviceization, and Data Governance

This article details NetEase Yanxuan's evolution of a real‑time data platform from 2017 to present, covering background, current scale, layered architecture, Flink‑SQL development IDE, service‑oriented task execution, resource‑optimizing deployment modes, cloud‑native migration, comprehensive data governance, and future batch‑stream integration plans.

DataFunSummit

Dec 10, 2021

NetEase Yanxuan, an e‑commerce platform, needs accurate, low‑latency data for uses such as real‑time dashboards, risk control, and monitoring. That need prompted the construction of a large‑scale real‑time computing platform.

Background – Since 2017 the platform has progressed through phases: initial platform exploration (2017), Streaming SQL launch (June 2018), service‑oriented development and Flink on K8s (2019), governance system (2020), and recent batch‑stream convergence experiments.

Current Scale – Over 5,000 tasks run daily, with peak throughput of about 20 million events per second and end‑to‑end latency measured in seconds. Use cases include real‑time dashboards, risk algorithms, log monitoring, and APM alerts.

Architecture – The stack consists of a data‑infrastructure layer (Kafka, Pulsar, YARN/K8s, storage), a service‑oriented abstraction that hides Flink‑task details, a platform layer offering development, operations, monitoring, metadata, and lineage tools, and a suite of applications built on top (ETL, data‑warehouse, risk control).

SQL‑Based Streaming (Real‑Time Task SQLization) – To lower the development barrier, an Atom IDE provides Flink SQL development, debugging, and deployment, integrating unified metadata, a UDF repository, and version control. The design focuses on unified metadata management, extensible UDFs, and functional extensions such as connectors, dimension‑table enhancements, window triggers, and DDL extensions.

Task Submission & Debugging – SQL is compiled to a JobGraph and submitted to the cluster, loading required connectors and UDFs. A debug mode rewrites SQL, intercepts output, and allows custom online data sampling for precise troubleshooting.
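The debug‑mode rewrite described above can be illustrated with a minimal sketch. It assumes the rewrite swaps each INSERT target for a debug sink and each source table for a sampled copy. The names `DEBUG_SINK`, `rewrite_for_debug`, and the sample table names are hypothetical, and a real implementation would most likely rewrite the parsed SQL tree rather than use regular expressions.

```python
import re
from typing import Dict, List, Tuple

# Hypothetical debug sink name; the source only says that output is intercepted.
DEBUG_SINK = "debug_print_sink"

_INSERT_RE = re.compile(r"\bINSERT\s+INTO\s+([A-Za-z_][\w.]*)", re.IGNORECASE)


def rewrite_for_debug(sql: str, sampled_sources: Dict[str, str]) -> Tuple[str, List[str]]:
    """Redirect sinks to DEBUG_SINK and sources to their sampled replacements.

    Returns the rewritten SQL and the list of original sink tables.
    This is a regex approximation for illustration, not a full SQL parser.
    """
    original_sinks: List[str] = []

    def _swap_sink(match: "re.Match[str]") -> str:
        original_sinks.append(match.group(1))
        return f"INSERT INTO {DEBUG_SINK}"

    rewritten = _INSERT_RE.sub(_swap_sink, sql)

    # Replace each source table that follows FROM or JOIN with its sampled copy.
    for table, sample in sampled_sources.items():
        pattern = re.compile(rf"\b(FROM|JOIN)\s+{re.escape(table)}\b", re.IGNORECASE)
        rewritten = pattern.sub(lambda m, s=sample: f"{m.group(1)} {s}", rewritten)

    return rewritten, original_sinks


if __name__ == "__main__":
    query = "INSERT INTO orders_agg SELECT user_id, COUNT(*) FROM orders GROUP BY user_id"
    print(rewrite_for_debug(query, {"orders": "orders_sample"}))
```

Keeping the list of original sinks lets the debugging front end label the intercepted rows with the table they would have been written to.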

Resource Optimization & Cloud‑Native Deployment – The team compared Flink's per‑job mode (isolated) with session mode (shared), then introduced a hybrid session mode with resource‑strategy pools to balance isolation against utilization. The platform also migrated from YARN to Kubernetes for cloud‑native deployment, adding node‑selector scheduling, REST exposure through ingress, API‑server resilience, sidecar logging, Service‑Mesh support, fixes for native memory leaks, and ZooKeeper‑based JobManager HA.
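The source does not describe how the resource‑strategy pools work internally. As one possible reading, the sketch below groups tasks into shared session clusters by strategy, so that tasks with similar resource profiles share a cluster while different strategies stay isolated from each other. The class names, strategy names, and slot counts are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class SessionCluster:
    """One shared session cluster belonging to a single strategy pool."""
    strategy: str
    capacity_slots: int
    tasks: Dict[str, int] = field(default_factory=dict)  # task name -> slots used

    @property
    def free_slots(self) -> int:
        return self.capacity_slots - sum(self.tasks.values())


class StrategyPools:
    """Places tasks on session clusters, one pool of clusters per strategy."""

    def __init__(self, slots_per_cluster: Dict[str, int]) -> None:
        self.slots_per_cluster = slots_per_cluster
        self.clusters: List[SessionCluster] = []

    def place(self, task: str, strategy: str, slots: int) -> SessionCluster:
        capacity = self.slots_per_cluster[strategy]
        if slots > capacity:
            raise ValueError(f"{task} needs {slots} slots, pool clusters have {capacity}")
        # Reuse the first cluster of the same strategy with enough free slots.
        for cluster in self.clusters:
            if cluster.strategy == strategy and cluster.free_slots >= slots:
                cluster.tasks[task] = slots
                return cluster
        # Otherwise start a new session cluster for this strategy.
        cluster = SessionCluster(strategy, capacity)
        cluster.tasks[task] = slots
        self.clusters.append(cluster)
        return cluster
```

Under this reading, sharing inside a pool recovers the utilization of session mode, while the boundary between pools keeps some of the isolation that per‑job mode provides.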

Data Governance – Full‑chain governance ensures data quality and resource efficiency. Monitoring uses OpenTSDB for metrics at operator granularity. Lineage is captured by parsing SQL ASTs or reflecting DAG structures of jar tasks. Governance operates on table (e.g., Kafka topic) and task dimensions, enabling hot‑table scaling, cold‑table cleanup, and automated task‑level diagnostics for latency, skew, back‑pressure, and resource shortages.
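To make the SQL‑based lineage idea concrete, here is a minimal sketch that maps each sink table to the source tables it reads from. It uses regular expressions as a stand‑in for the AST parsing the platform performs, so it will misread constructs such as `EXTRACT(YEAR FROM ts)` or semicolons inside string literals. The function name and table names are hypothetical.

```python
import re
from typing import Dict, Set

_SINK_RE = re.compile(r"\bINSERT\s+INTO\s+([\w.]+)", re.IGNORECASE)
_SOURCE_RE = re.compile(r"\b(?:FROM|JOIN)\s+([\w.]+)", re.IGNORECASE)


def extract_lineage(sql: str) -> Dict[str, Set[str]]:
    """Return {sink_table: {source_tables}} for each INSERT statement in the script."""
    lineage: Dict[str, Set[str]] = {}
    for statement in sql.split(";"):
        sink = _SINK_RE.search(statement)
        if sink is None:
            continue  # DDL or plain SELECT statements contribute no edge here
        sources = set(_SOURCE_RE.findall(statement))
        lineage.setdefault(sink.group(1), set()).update(sources)
    return lineage


if __name__ == "__main__":
    script = """
    INSERT INTO dws_order_stats
    SELECT o.user_id, u.level, COUNT(*) AS cnt
    FROM ods_orders AS o
    JOIN dim_users AS u ON o.user_id = u.id
    GROUP BY o.user_id, u.level;
    INSERT INTO ads_alerts SELECT * FROM dws_order_stats WHERE cnt > 100
    """
    for sink_table, source_tables in extract_lineage(script).items():
        print(sink_table, "<-", sorted(source_tables))
```

Edges collected this way across all tasks form the table‑level graph that governance features such as cold‑table cleanup would need, since a table with no downstream readers shows up as a node without outgoing edges.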

Future Plans – Continue batch‑stream integration by incorporating Iceberg data‑lake support, explore unified compute/storage architectures, and enhance intelligent job diagnosis with an automated control service that applies optimization suggestions without manual intervention.

Q&A Highlights – The Q&A covered idempotent sink handling in Flink SQL (retract and upsert modes) and the strong demand for debugging features. Debug mode replaces sources with sampled Kafka data and sinks with WebSocket or file outputs so that results can be inspected quickly.
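The reason upsert‑style sinks are idempotent can be shown with a small sketch. It models the sink as a keyed store in which each change overwrites the row for its key and a `None` value deletes the row. This is a simplified model for illustration, not Flink's sink interface, and the function and key names are hypothetical.

```python
from typing import Dict, Iterable, Optional, Tuple

# (key, value); a value of None stands for a delete of that key.
Change = Tuple[str, Optional[int]]


def apply_upserts(store: Dict[str, int], changes: Iterable[Change]) -> Dict[str, int]:
    """Apply keyed changes in order; later changes for a key overwrite earlier ones."""
    for key, value in changes:
        if value is None:
            store.pop(key, None)  # deleting a missing key is a no-op
        else:
            store[key] = value
    return store


if __name__ == "__main__":
    changelog = [("u1", 1), ("u1", 2), ("u2", 5), ("u2", None)]
    first = apply_upserts({}, changelog)
    # Replaying the same changes, as happens after a restart, leaves the store unchanged.
    replayed = apply_upserts(dict(first), changelog)
    print(first, replayed == first)
```

An append‑only sink receiving the same replay would hold every row twice, which is why keyed writes are the usual way to tolerate reprocessing after a failure.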

Overall, the platform achieves second‑level end‑to‑end latency, high stability, and a scalable, cloud‑native architecture for real‑time data processing at NetEase Yanxuan.
