How NetEase Games Built a Scalable Flink‑Based Streaming ETL Platform

This article explains how NetEase Games engineers designed and operated a Flink‑driven streaming ETL system, covering business background, log classification, dedicated and generic ETL services, architecture evolution, Python UDF integration, runtime optimizations, tuning practices, fault‑tolerance mechanisms, and future roadmap.

Java High-Performance Architecture

Jun 14, 2021

Preface

Lin Xiaobo, a senior development engineer at NetEase Games, describes how the company built its Flink‑based streaming ETL platform. The article covers the business background, the dedicated operational‑log ETL, the generic EntryX ETL, optimization practices, and future plans.

Business Background

NetEase Games collects logs from game clients, game servers, and infrastructure components (e.g., Nginx, databases) through Kafka. The logs are semi‑structured or unstructured and must be transformed before they are loaded into the real‑time warehouse (Flink SQL) or the offline warehouse (Hive/Spark).

Streaming ETL Requirements

Game logs differ from typical relational data in four ways: game services often use schema‑free stores such as MongoDB, so fields are heterogeneous; the data is heavily nested because of denormalized design; the same data must be loaded into both the real‑time and the offline warehouse; and log types change frequently (thousands of types, bi‑weekly releases), which calls for robust error handling.

Log Classification

Logs are divided into three categories:

Operational logs – fixed header+JSON, used for reporting, analysis, and in‑game recommendations.

Business logs – arbitrary formats (binary or text) such as Nginx access logs, used for richer custom analytics.

Program logs – INFO/ERROR messages for troubleshooting, usually stored in Elasticsearch.

ETL Service Overview

Three ETL services are provided:

Dedicated operational‑log ETL, customized for the fixed schema.

EntryX generic ETL for all other text logs.

Ad‑hoc ETL jobs for special cases (encryption, custom transformations).

Operational‑Log ETL Evolution

2013: Hadoop Streaming + Python scripts (offline).

2017: Spark Streaming prototype (not production).

2018: Flink DataStream version with Python UDF support.

Operational‑Log ETL Architecture

The early Hadoop version read from HDFS and invoked a Python mapper that ran three steps (pre‑process, parse/transform, post‑process) before writing to Hive. The Flink rewrite replaces the HDFS source with Kafka, uses Flink Source/Sink operators, reimplements the core modules in Java, and runs Python UDFs through Jython.

Python UDF Implementation

A Flink ProcessFunction wraps a Runner layer that executes Jython code. The Runner initializes in open(), adds dependencies to PYTHONPATH, and invokes UDFs via reflection. The pre‑process UDF receives strings (PyUnicode), the post‑process UDF receives maps (PyDictionary), and each returns a PyObject that is converted back to a Java object.
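The Runner's lifecycle can be illustrated with a minimal sketch in plain Python. This is not the Java/Jython implementation described above: the class name UdfRunner, the hook names pre_process and post_process, and the use of importlib are all assumptions made for illustration. It only shows the pattern of one-time setup in open(), extending the module search path, and looking UDFs up by name.

```python
import importlib
import sys
from typing import Any, Callable, Dict, Iterable, Optional


class UdfRunner:
    """Hypothetical runner: loads a user module once, then calls its hooks by name."""

    def __init__(self, dependency_dirs: Iterable[str], module_name: str):
        self._dependency_dirs = list(dependency_dirs)
        self._module_name = module_name
        self._module = None

    def open(self) -> None:
        # One-time setup, analogous to what the article says happens in open():
        # make the packaged dependencies importable, then load the UDF module.
        for path in self._dependency_dirs:
            if path not in sys.path:
                sys.path.insert(0, path)
        self._module = importlib.import_module(self._module_name)

    def _lookup(self, name: str) -> Optional[Callable[..., Any]]:
        # Name-based lookup stands in for the reflection step.
        func = getattr(self._module, name, None)
        return func if callable(func) else None

    def pre_process(self, line: str) -> str:
        # Pre-process hooks work on the raw log line; pass through if undefined.
        func = self._lookup("pre_process")
        return func(line) if func else line

    def post_process(self, record: Dict[str, Any]) -> Dict[str, Any]:
        # Post-process hooks work on the parsed record; pass through if undefined.
        func = self._lookup("post_process")
        return func(record) if func else record
```

A user module only needs to define the hooks it cares about; missing hooks fall back to pass-through, which keeps the common job jar independent of any particular game's scripts.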

Operational‑Log Runtime

ETL jobs are built from a common Flink jar and configuration fetched from ConfigServer. Python modules are packaged and distributed via YARN. Lightweight configuration changes trigger hot‑updates via a background thread in each TaskManager.
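The hot-update mechanism can be sketched as a polling thread that swaps in a new configuration when its version changes. The class name HotConfig, the version field, and the fetch callable are hypothetical; the article does not describe ConfigServer's API, so the fetch function stands in for whatever call retrieves the configuration.

```python
import threading
from typing import Any, Callable, Dict


class HotConfig:
    """Hypothetical holder that refreshes configuration in a background thread."""

    def __init__(self, fetch: Callable[[], Dict[str, Any]], interval_seconds: float = 30.0):
        self._fetch = fetch
        self._interval = interval_seconds
        self._lock = threading.Lock()
        self._config = fetch()  # load once at startup
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._poll, daemon=True)

    def start(self) -> None:
        self._thread.start()

    def stop(self) -> None:
        self._stop.set()
        if self._thread.is_alive():
            self._thread.join()

    def current(self) -> Dict[str, Any]:
        with self._lock:
            return self._config

    def refresh(self) -> bool:
        """Fetch once; swap the config only if the version changed."""
        latest = self._fetch()
        with self._lock:
            if latest.get("version") == self._config.get("version"):
                return False
            self._config = latest
            return True

    def _poll(self) -> None:
        # Event.wait returns False on timeout and True once stop() is called.
        while not self._stop.wait(self._interval):
            try:
                self.refresh()
            except Exception:
                # Keep serving the last good configuration if a fetch fails.
                pass
```

Operators read the configuration through current() on every record, so a change takes effect without restarting the job. Heavier changes, such as new Python modules, would still need a redeploy under this sketch.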

EntryX Generic ETL

EntryX supports any text‑based log format and serves both data analysts and developers. Users configure Source (Kafka topic), StreamingTable (schema, transformations), and Sink (Hive/Kafka). The system generates Flink jobs automatically.

EntryX Pipeline

Data flows: Source → Filter (keyword filtering) → Parser (JSON/Regex/CSV) → Extender (UDF‑driven derived fields) → Formatter (type conversion) → Splitter (routing to tables) → Loader (write to Hive or Kafka).
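The stage order above can be sketched as a chain of small functions over an in-memory list of lines. Everything concrete here is an assumption for illustration: the blocked keyword, the field names, the derived os field, and the events_ table prefix are invented, and the real system generates Flink operators rather than Python functions.

```python
import json
from collections import defaultdict
from typing import Any, Dict, Iterable, List

Record = Dict[str, Any]

# Hypothetical user configuration for one streaming table.
BLOCKED_KEYWORDS = ("heartbeat",)
FIELD_TYPES = {"user_id": int, "amount": float}


def keep(line: str) -> bool:
    """Filter: drop lines containing any blocked keyword."""
    return not any(keyword in line for keyword in BLOCKED_KEYWORDS)


def parse(line: str) -> Record:
    """Parser: JSON only here; Regex and CSV parsers would plug in the same way."""
    return json.loads(line)


def extend(record: Record) -> Record:
    """Extender: add a derived field computed from existing ones."""
    out = dict(record)
    out["os"] = str(record.get("device", "")).split("/")[0].lower()
    return out


def format_types(record: Record) -> Record:
    """Formatter: cast configured fields to their target types."""
    out = dict(record)
    for field, cast in FIELD_TYPES.items():
        if field in out:
            out[field] = cast(out[field])
    return out


def route(record: Record) -> str:
    """Splitter: choose the target table from the event name."""
    return "events_" + record["event"]


def run_pipeline(lines: Iterable[str]) -> Dict[str, List[Record]]:
    """Loader stand-in: collect rows per table instead of writing to Hive or Kafka."""
    tables: Dict[str, List[Record]] = defaultdict(list)
    for line in lines:
        if not keep(line):
            continue
        record = format_types(extend(parse(line)))
        tables[route(record)].append(record)
    return dict(tables)
```

Filtering on the raw line before parsing mirrors the stage order in the article and avoids paying the parsing cost for records that would be discarded anyway.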

Unified Real‑Time/Offline Schema

EntryX maintains a logical streaming table schema independent of storage. The same schema can generate physical DDL for Hive or Kafka connectors, enabling consistent data definitions across warehouses.
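As a rough sketch of the idea, one logical field list can be rendered into two physical definitions. The logical type names, the dt partition column, the ORC storage format, and the Avro format are assumptions; the Kafka options follow the option style of the Flink SQL Kafka connector in Flink 1.11 and later, which may differ from what EntryX actually emits.

```python
from typing import Dict, List, Tuple

Field = Tuple[str, str]  # (column name, logical type)

# Hypothetical logical types mapped to each engine's SQL type names.
HIVE_TYPES = {"string": "STRING", "long": "BIGINT", "double": "DOUBLE"}
FLINK_TYPES = {"string": "STRING", "long": "BIGINT", "double": "DOUBLE"}


def _columns(fields: List[Field], type_map: Dict[str, str]) -> str:
    return ",\n".join(f"  `{name}` {type_map[kind]}" for name, kind in fields)


def hive_ddl(table: str, fields: List[Field], partition_column: str = "dt") -> str:
    """Render the logical schema as a partitioned Hive table."""
    return (
        f"CREATE TABLE IF NOT EXISTS `{table}` (\n"
        f"{_columns(fields, HIVE_TYPES)}\n"
        ")\n"
        f"PARTITIONED BY (`{partition_column}` STRING)\n"
        "STORED AS ORC"
    )


def flink_kafka_ddl(table: str, fields: List[Field], topic: str, brokers: str) -> str:
    """Render the same logical schema as a Flink SQL table backed by Kafka."""
    return (
        f"CREATE TABLE `{table}` (\n"
        f"{_columns(fields, FLINK_TYPES)}\n"
        ") WITH (\n"
        "  'connector' = 'kafka',\n"
        f"  'topic' = '{topic}',\n"
        f"  'properties.bootstrap.servers' = '{brokers}',\n"
        "  'format' = 'avro'\n"
        ")"
    )
```

Because both renderers consume the same field list, a column added to the logical schema appears in both warehouses without being declared twice.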

Real‑Time Warehouse Integration

EntryX registers logical schemas with Avatar’s schema center, which also stores physical Avro schemas. A custom KafkaCatalog reads these schemas to create Flink TableSource/TableSink. DataStream users can also query the physical schema directly.

EntryX Runtime Optimizations

Multiple streaming tables that share a Kafka source are merged into a single Flink job to avoid redundant reads. When back‑pressure from Hive writes threatens Kafka throughput, the pipeline is split into separate real‑time and offline jobs based on SLA.
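The job-planning rule can be sketched as a grouping step. The table dictionaries, the split_by_tier flag, and the rule that Kafka sinks count as real-time and Hive sinks as offline are assumptions for illustration; the article does not say how EntryX decides when to split.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def plan_jobs(tables: List[dict], split_by_tier: bool = False) -> Dict[Tuple[str, ...], List[str]]:
    """Group streaming tables into jobs.

    By default, tables reading the same Kafka topic share one job so the topic
    is consumed once. With split_by_tier=True, tables on the same topic are
    further separated into real-time and offline jobs, so a slow Hive writer
    cannot back-pressure the Kafka sink.
    """
    jobs: Dict[Tuple[str, ...], List[str]] = defaultdict(list)
    for table in tables:
        if split_by_tier:
            tier = "realtime" if table["sink"] == "kafka" else "offline"
            key: Tuple[str, ...] = (table["topic"], tier)
        else:
            key = (table["topic"],)
        jobs[key].append(table["name"])
    return dict(jobs)
```

The trade-off is that splitting reads the topic twice, which is the cost the merged layout was designed to avoid, so it only pays off when the SLA of the real-time path is at risk.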

HDFS Write Tuning

Pre‑partitioning the data stream by target Hive partition reduces the number of files written (e.g., from n × m to s × m). Adding a salt key prevents data skew.
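The effect on file counts can be checked with a small simulation. It assumes that n is the writer parallelism, m the number of Hive partitions being written, and s the number of salt values, which is one plausible reading of the symbols above. The CRC32-based hashing is a stand-in for illustration and is not how Flink's keyBy assigns keys to subtasks.

```python
import zlib
from typing import List, Tuple

Row = Tuple[str, str]  # (target Hive partition, record id)


def stable_hash(text: str) -> int:
    """Deterministic hash (Python's built-in str hash is randomized per process)."""
    return zlib.crc32(text.encode("utf-8"))


def round_robin_files(records: List[Row], parallelism: int) -> int:
    """No pre-partitioning: every subtask may see every partition, so up to n x m files."""
    pairs = {(index % parallelism, partition) for index, (partition, _) in enumerate(records)}
    return len(pairs)


def salted_files(records: List[Row], parallelism: int, salt_buckets: int) -> int:
    """Key by (partition, salt): each key lands on one subtask, so at most s x m files."""
    pairs = set()
    for partition, record_id in records:
        salt = stable_hash(record_id) % salt_buckets  # spreads a hot partition over s keys
        subtask = stable_hash(f"{partition}#{salt}") % parallelism
        pairs.add((subtask, partition))
    return len(pairs)


# 1000 records spread over 3 daily partitions, written by 8 subtasks.
records = [(f"dt=2021-06-{14 + i % 3}", f"r{i}") for i in range(1000)]
baseline = round_robin_files(records, parallelism=8)
salted = salted_files(records, parallelism=8, salt_buckets=2)
```

With 8 subtasks and 3 partitions the round-robin baseline opens 24 files, while two salt buckets cap the salted layout at 6. Keying by partition alone would reduce the count further but would send each partition's entire volume to a single subtask, which is the skew the salt is there to prevent.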

SLA Statistics

SLA‑Utils tracks recordsIn, recordsOut, recordsDropped, recordsErrored, plus custom metrics and TTL metrics via OperatorState snapshots. Metrics are exposed through Flink Accumulators for UI consumption.
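A minimal sketch of such counters, with snapshot and restore methods standing in for OperatorState, might look as follows. The class and method names are hypothetical, and the unaccounted() check assumes that every input record ends in exactly one of the three outcomes, which the article does not state.

```python
from dataclasses import asdict, dataclass
from typing import Dict


@dataclass
class SlaCounters:
    """Hypothetical per-subtask counters mirroring the four metrics named above."""

    records_in: int = 0
    records_out: int = 0
    records_dropped: int = 0
    records_errored: int = 0

    def snapshot(self) -> Dict[str, int]:
        """Serialize the counters, as a checkpoint would."""
        return asdict(self)

    @classmethod
    def restore(cls, state: Dict[str, int]) -> "SlaCounters":
        """Rebuild the counters from a snapshot after a restart."""
        return cls(**state)

    def merge(self, other: "SlaCounters") -> "SlaCounters":
        """Combine counters from two subtasks into a job-level total."""
        return SlaCounters(
            self.records_in + other.records_in,
            self.records_out + other.records_out,
            self.records_dropped + other.records_dropped,
            self.records_errored + other.records_errored,
        )

    def unaccounted(self) -> int:
        """Records that entered but have no recorded outcome (assumed 1:1 mapping)."""
        return self.records_in - (
            self.records_out + self.records_dropped + self.records_errored
        )
```

Restoring counters from the snapshot is what keeps the totals consistent across failures; counters held only in memory would restart from zero after each recovery.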

Fault Tolerance and Recovery

SideOutput collects erroneous records with error codes and original payloads; they are written to HDFS for monitoring. Recovery is performed either by offline batch jobs (format fixes) or by targeted back‑fill jobs that rewrite corrupted Hive partitions.
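The routing of bad records can be sketched without Flink as a function that returns a main output and a side output. The error codes PARSE_ERROR and MISSING_FIELD and the required event field are invented for illustration; what the sketch preserves from the description above is that each error entry carries a code and the untouched original payload.

```python
import json
from typing import Any, Dict, Iterable, List, Tuple


def route(lines: Iterable[str]) -> Tuple[List[Dict[str, Any]], List[Dict[str, str]]]:
    """Split input into parsed records (main) and error entries (side output)."""
    main: List[Dict[str, Any]] = []
    side: List[Dict[str, str]] = []
    for raw in lines:
        try:
            record = json.loads(raw)
        except ValueError:  # json.JSONDecodeError is a subclass of ValueError
            side.append({"code": "PARSE_ERROR", "payload": raw})
            continue
        if not isinstance(record, dict) or "event" not in record:
            side.append({"code": "MISSING_FIELD", "payload": raw})
            continue
        main.append(record)
    return main, side
```

Keeping the raw payload, rather than a partially parsed record, is what makes later repair possible: a back-fill job can re-run the corrected parser over exactly the bytes that failed.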

Future Plans

Support for data lake storage to handle update/delete workloads.

Features such as real‑time deduplication and automatic small‑file merging.

Full PyFlink support for Python‑based data‑warehouse processing.
