Big Data 19 min read

NetEase Game Streaming ETL Architecture and Practices Based on Flink

This article presents NetEase Game's Flink‑based streaming ETL system, detailing business background, log classifications, specialized and generic ETL services, Python UDF integration, runtime optimizations, HDFS write tuning, SLA metrics, fault‑tolerance mechanisms, and future roadmap for unified data lakes and PyFlink support.

IT Architects Alliance

May 30, 2021

NetEase Game Streaming ETL Architecture and Practices Based on Flink

NetEase Game collects massive, often semi‑structured logs from clients, servers, and infrastructure, feeding them into a unified Kafka pipeline before ETL processes load the data into Hive (offline) or Kafka (real‑time) warehouses for downstream SQL analytics.

The platform distinguishes three log categories—operational logs with fixed schemas, business logs with diverse formats, and program logs for debugging—each requiring tailored ETL handling.

Three ETL services are offered: a dedicated operational‑log ETL, the generic EntryX ETL for all other text logs, and ad‑hoc ETL jobs for special cases such as encrypted data.

Operational‑log ETL evolved from Hadoop Streaming (2013) to Spark Streaming (2017) and finally to Flink DataStream (2018), preserving existing Python UDF scripts by embedding a Jython runner within Flink ProcessFunctions to execute pre‑ and post‑processing UDFs.

At runtime, Flink jobs are generated from a common JAR and configuration pulled from a ConfigServer; Python modules are packaged via YARN and hot‑updated through a background thread to avoid full job restarts for lightweight changes.

EntryX defines Source, StreamingTable, and Sink modules, automatically generating Flink jobs that parse JSON/Regex/CSV, extend data via UDFs, format fields, and load into Hive or Kafka using configurable schemas, supporting unified real‑time/offline schemas.

Optimization practices include consolidating multiple stream tables from the same Kafka topic into a single Flink job to reduce source reads, pre‑partitioning data by Hive table keys to mitigate small‑file explosion on HDFS, and implementing OperatorState‑based SLA‑Utils for robust metric collection.

Fault tolerance is achieved by routing erroneous records to side‑output streams, storing them in HDFS for later inspection, and providing offline repair jobs or schema updates with partition location swaps to recover corrupted data.

Future plans aim to add data‑lake support for update/delete workloads, richer real‑time features such as deduplication and automatic small‑file merging, and full PyFlink integration for end‑to‑end Python pipelines.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

Optimization Big Data Flink Streaming ETL Data Integration

Written by

IT Architects Alliance

Discussion and exchange on system, internet, large‑scale distributed, high‑availability, and high‑performance architectures, as well as big data, machine learning, AI, and architecture adjustments with internet technologies. Includes real‑world large‑scale architecture case studies. Open to architects who have ideas and enjoy sharing.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.