How NetEase Game Teams Built a Scalable Flink‑Based Streaming ETL Platform
This article explains how NetEase games collect heterogeneous logs, design a Flink‑driven streaming ETL pipeline, handle schema‑free sources, implement Python UDFs with Jython, optimize HDFS writes, manage real‑time and offline warehouses, and share practical tuning and fault‑tolerance techniques.
Business Background
NetEase games collect raw logs from clients, servers, and infrastructure (e.g., Nginx, DB logs) via a unified Kafka pipeline. The logs are semi‑structured or unstructured and must be processed by an ETL service before being stored in real‑time (Flink SQL) or offline (Hive/Spark) warehouses, where analysts can run SQL queries.
Streaming ETL Requirements
The game industry often uses schema‑free document stores such as MongoDB, so there is no stable online schema. Log fields and formats vary widely, and database designs favor denormalized, deeply nested structures. Over 1,000 log types change bi‑weekly, requiring robust error handling and rapid schema evolution.
Log Classification
Operational logs : player actions (login, reward claim) with a fixed header + JSON format, used for reporting, analytics, and in‑game recommendations.
Business logs : heterogeneous data such as Nginx access logs or CDN logs, with no fixed format (text or binary), serving similar analytical purposes but requiring custom handling.
Program logs : INFO/ERROR logs from application code, primarily for troubleshooting, usually written to Elasticsearch but sometimes also to the data warehouse.
ETL Service Overview
Dedicated operational‑log ETL, customized for the fixed operational format.
EntryX, a generic text‑log ETL that serves all non‑operational logs.
Ad‑hoc ETL jobs for special cases (encryption, custom transformations).
Operational Log ETL Evolution
2013: Hadoop Streaming + Python scripts for offline ETL.
2017: Prototype Spark Streaming version (POC, not production‑ready).
2018: Flink DataStream version supporting seamless migration of existing Python UDFs.
Operational Log ETL Architecture
The early Hadoop version dumped data to HDFS, invoked Python scripts via Mapper, and wrote results to Hive. The Flink rewrite replaces HDFS with direct Kafka sources, uses Flink Source/Sink operators, rewrites the common module in Java, and retains Python UDFs for pre‑ and post‑processing.
Python UDF Implementation
A Runner layer built on Jython executes Python code inside the JVM, avoiding the overhead of launching external Python processes. Jython’s limitation (no C‑based libraries like pandas) is acceptable for the required UDFs.
The ProcessFunction lazily creates the Runner in its open method (Jython is not serializable). The Runner prepares the Python environment, loads modules, and invokes the configured UDF via reflection.
During execution, pre‑processing UDFs receive strings converted to PyUnicode, while post‑processing UDFs receive parsed Map objects converted to PyDictionary. Results are returned as PyObject, then transformed back to Java String or Map for Flink output.
Runtime and Configuration
When an ETL job is submitted, a Flink JobGraph is built from a common JAR and configuration pulled from a ConfigServer. Python modules listed in the configuration are packaged as resources and distributed via YARN to TaskManagers.
Lightweight configuration changes (e.g., adding a field or adjusting a filter) trigger hot‑updates: each TaskManager runs a background thread that periodically polls the ConfigServer and applies changes without restarting the job.
EntryX Generic ETL
EntryX targets two dimensions of generality: it accepts any text‑based log format (JSON, CSV, regex) and serves both data‑analysis developers and non‑technical product teams.
Core Concepts
Source : Input Kafka topics (raw logs) or filtered topics.
StreamingTable : Defines the logical schema (field names, types, constraints) using a SQL‑like type system, including extensions such as JSON.
Sink : Maps the StreamingTable to physical storage tables (Hive, Kafka) based on schema‑mapping rules.
ETL Pipeline
Data flow: Source → Filter (ensures uniform schema) → Extract → Transform → Load.
Transform consists of:
Parser : JSON / Regex / CSV parsing.
Extender : Applies built‑in functions or Python UDFs to derive new fields (e.g., flattening JSON).
Formatter : Converts logical types to physical Java types (e.g., BIGINT → long).
Load stage uses a Splitter to route records to target tables based on expressions, then a Loader writes to Hive (Text/Parquet/JSON) or Kafka (JSON/Avro).
Unified Real‑Time & Offline Schema
EntryX writes each record to both real‑time (Kafka) and offline (Hive) stores, sharing the same logical StreamingTable schema while using different connectors and formats. The logical schema serves as a single source of truth, enabling automatic DDL generation for new physical tables.
Real‑Time Warehouse Integration
Instead of relying on Hive Metastore for Kafka schema persistence (which adds heavy Hive dependencies), NetEase extends its Avatar schema registry to store both logical and physical schemas. A custom KafkaCatalog reads these schemas to create Flink TableSources/TableSinks, and DataStream API users can also consume the physical schema directly.
Runtime Optimizations
To avoid reading the same Kafka topic thousands of times, multiple StreamingTables from the same source are co‑executed in a single Flink job.
When Hive write latency causes back‑pressure on Kafka, the pipeline can be split into separate real‑time and offline jobs, respecting SLA and cluster performance.
Tuning Practices
HDFS Write Optimization
Small‑file explosion is mitigated by pre‑partitioning the data stream: a keyBy on the target Hive partition reduces the number of subtasks that write each partition, dramatically lowering the total file count.
OperatorState‑Based SLA Statistics
A lightweight SLA‑Utils library records metrics ( recordsIn, recordsOut, recordsDropped, recordsErrored) and custom metrics. Dynamic TTL metrics are supported by filtering them out during checkpoint snapshots, since OperatorState itself lacks TTL support.
Data Fault Tolerance & Recovery
Errors are captured via SideOutput streams, enriched with error codes and original payload, and written to HDFS for monitoring. Recovery paths include offline batch jobs to fix malformed records and dedicated back‑fill jobs that rewrite corrected data to temporary locations before swapping Hive partitions.
Future Plans
Upcoming work includes adding data‑lake support for update/delete workloads, automated small‑file merging and real‑time deduplication, and full PyFlink integration to extend Python support beyond the ETL stage.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
dbaplus Community
Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
