How NetEase Games Built a Scalable Flink‑Based Streaming ETL Platform
This article explains how NetEase Games engineers designed and operated a Flink‑driven streaming ETL system, covering business background, log classification, dedicated and generic ETL services, architecture evolution, Python UDF integration, runtime optimizations, tuning practices, fault‑tolerance mechanisms, and future roadmap.
Preface
NetEase Game senior development engineer Lin Xiaobo introduces the Flink‑based streaming ETL construction for NetEase games. The article covers business background, dedicated ETL, the generic EntryX ETL, optimization practices, and future plans.
Business Background
NetEase Game collects logs from client, server and infrastructure (e.g., Nginx, DB) via Kafka. Logs are semi‑structured or unstructured and must be transformed before loading into real‑time (Flink SQL) or offline (Hive/Spark) warehouses.
Streaming ETL Requirements
Game logs differ from typical relational data: they often use schema‑free stores like MongoDB, leading to heterogeneous fields; data is heavily nested due to denormalized design; real‑time warehouse demands dual loading; and frequent log type changes (thousands of types, bi‑weekly releases) require robust error handling.
Log Classification
Logs are divided into three categories:
Operational logs – fixed header+JSON, used for reporting, analysis, and in‑game recommendations.
Business logs – arbitrary formats (binary or text) such as Nginx access logs, used for richer custom analytics.
Program logs – INFO/ERROR messages for troubleshooting, usually stored in Elasticsearch.
ETL Service Overview
Three ETL services are provided:
Dedicated operational‑log ETL, customized for the fixed schema.
EntryX generic ETL for all other text logs.
Ad‑hoc ETL jobs for special cases (encryption, custom transformations).
Operational‑Log ETL Evolution
2013: Hadoop Streaming + Python scripts (offline). 2017: Spark Streaming prototype (not production). 2018: Flink DataStream version with Python UDF support.
Operational‑Log ETL Architecture
Early Hadoop version read from HDFS, invoked Python mapper for pre‑process, parse/transform, post‑process, then wrote to Hive. The Flink rewrite replaces HDFS source with Kafka, uses Flink Source/Sink operators, rewrites core modules in Java, and runs Python UDFs via Jython.
Python UDF Implementation
Flink ProcessFunction wraps a Runner layer that executes Jython code. Runner initializes in open(), adds dependencies to PYTHONPATH, and invokes UDFs via reflection. Pre‑process UDF receives strings (PyUnicode), post‑process receives maps (PyDictionary), returns PyObject converted back to Java.
Operational‑Log Runtime
ETL jobs are built from a common Flink jar and configuration fetched from ConfigServer. Python modules are packaged and distributed via YARN. Lightweight configuration changes trigger hot‑updates via a background thread in each TaskManager.
EntryX Generic ETL
EntryX supports any text‑based log format and serves both data analysts and developers. Users configure Source (Kafka topic), StreamingTable (schema, transformations), and Sink (Hive/Kafka). The system generates Flink jobs automatically.
EntryX Pipeline
Data flows: Source → Filter (keyword filtering) → Parser (JSON/Regex/CSV) → Extender (UDF‑driven derived fields) → Formatter (type conversion) → Splitter (routing to tables) → Loader (write to Hive or Kafka).
Unified Real‑Time/Offline Schema
EntryX maintains a logical streaming table schema independent of storage. The same schema can generate physical DDL for Hive or Kafka connectors, enabling consistent data definitions across warehouses.
Real‑Time Warehouse Integration
EntryX registers logical schemas with Avatar’s schema center, which also stores physical Avro schemas. A custom KafkaCatalog reads these schemas to create Flink TableSource/TableSink. DataStream users can also query the physical schema directly.
EntryX Runtime Optimizations
Multiple stream tables sharing a Kafka source are merged into a single Flink job to avoid redundant reads. When Hive write back‑pressure threatens Kafka throughput, the pipeline is split into separate real‑time and offline jobs based on SLA.
HDFS Write Tuning
Pre‑partitioning the data stream by target Hive partition reduces the number of files written (e.g., from n × m to s × m). Adding a salt key prevents data skew.
SLA Statistics
SLA‑Utils tracks recordsIn, recordsOut, recordsDropped, recordsErrored, plus custom metrics and TTL metrics via OperatorState snapshots. Metrics are exposed through Flink Accumulators for UI consumption.
Fault Tolerance and Recovery
SideOutput collects erroneous records with error codes and original payloads; they are written to HDFS for monitoring. Recovery is performed either by offline batch jobs (format fixes) or by targeted back‑fill jobs that rewrite corrupted Hive partitions.
Future Plans
Support for data lake storage to handle update/delete workloads.
Features such as real‑time deduplication and automatic small‑file merging.
Full PyFlink support for Python‑based data‑warehouse processing.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
Java High-Performance Architecture
Sharing Java development articles and resources, including SSM architecture and the Spring ecosystem (Spring Boot, Spring Cloud, MyBatis, Dubbo, Docker), Zookeeper, Redis, architecture design, microservices, message queues, Git, etc.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
