
Design, Evolution, and Optimization of NetEase's Log Collection and Transmission Service (Datastream‑NG)

This article presents a comprehensive overview of NetEase's log collection and transmission platform, detailing its evolution from 2011 to the current Datastream‑NG architecture, the system's design goals, core component optimizations, operational monitoring, and future plans for intelligent scaling and diagnostics.

DataFunTalk

Jul 31, 2022

Introduction – In internet applications, logs are a primary data source, so an efficient, stable log collection and transmission service is crucial for offline and real‑time data warehouses, search, recommendation, and APM. NetEase has been building such a service since 2011; it has gone through three major architectural phases, the most recent being the 2019 reconstruction.

Development Stages

1.0 (2011‑2014): fragmented open‑source agents (Flume, Logstash‑forwarder, Fluent) with no unified management, leading to chaos and difficulty tracing delays or loss.

2.0 (2015‑2018): introduction of a self‑developed tailfile client, an HDFS archiving handler, and Kafka as the message bus; task management was productized but remained tied to a data‑development platform.

3.0 (2019‑present): reconstruction into Datastream‑NG (DS‑NG) with a self‑developed DS Agent, Router service for unified data routing, independent task‑control platform, and Flink‑based sink replacing the stateful HDFS handler.

Overall Design of DS‑NG

The system consists of multiple data sources (Web/App clients, big‑data components, backend services, gateway logs) feeding into DS Agents, which register with a ZooKeeper‑based registry. Agents send data to DS Router clusters via HAProxy; the routers perform flow control, routing, and load balancing before writing to Kafka. Downstream consumers include offline warehouses (HDFS, Alluxio+S3), real‑time warehouses (Flink), and OLAP databases (ClickHouse, HBase). The design emphasizes high throughput, high availability, easy operations, and low cost.

Design Principles

Fast Flow: minimal parsing/repacking, custom protocol separating control and data, batch compression, and version‑compatible upgrades.

Statelessness: back‑pressure model, credit‑based flow control, ACK/Checkpoint mechanisms, at‑least‑once delivery guarantees.

Self‑Adaptivity: Router memory pool management, automatic traffic migration between routers and producers, and dynamic Kafka partitioning strategies.
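The credit‑based flow control and ACK mechanisms listed under Statelessness can be illustrated with a minimal sketch. This is not the DS‑NG protocol: the class name CreditSender, its methods, and the one‑credit‑per‑ACK default are assumptions made for illustration. The idea shown is that a sender may only keep a bounded number of unacknowledged batches in flight, and anything unacknowledged is kept for resending, which yields at‑least‑once delivery.

```python
from collections import OrderedDict
from typing import List, Optional, Tuple


class CreditSender:
    """Hypothetical sender limited to `credits` unacknowledged batches."""

    def __init__(self, initial_credits: int) -> None:
        self.credits = initial_credits
        self.next_seq = 0
        # Batches that were sent but not yet acknowledged, in send order.
        self.in_flight: "OrderedDict[int, bytes]" = OrderedDict()

    def try_send(self, batch: bytes) -> Optional[int]:
        """Return the batch's sequence number, or None when out of credits."""
        if self.credits == 0:
            return None  # back-pressure: the caller has to wait for an ACK
        seq = self.next_seq
        self.next_seq += 1
        self.credits -= 1
        self.in_flight[seq] = batch  # kept until the receiver confirms it
        return seq

    def on_ack(self, seq: int, granted: int = 1) -> None:
        """Drop an acknowledged batch and add the credits the receiver granted."""
        if self.in_flight.pop(seq, None) is not None:
            self.credits += granted  # duplicate ACKs are ignored

    def unacked(self) -> List[Tuple[int, bytes]]:
        """Batches to resend after a reconnect (source of at-least-once)."""
        return list(self.in_flight.items())
```

In this sketch the receiver controls the sender's rate simply by delaying ACKs or granting fewer credits, so no per‑connection state beyond the in‑flight window has to survive a restart.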

Core Component Optimizations

DS Agent – Re‑designed job model classifies log files as fast, slow, or inactive, applying different collection strategies (polling, inotify, timed checks) and memory pre‑allocation to reduce CPU and I/O pressure.
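A rough sketch of how such a fast/slow/inactive classification might look is given below. The thresholds, the FileStats fields, and the mapping from class to collection strategy are all assumptions for illustration; the source names the three classes and the three strategies but does not state which strategy serves which class or what the cut‑offs are.

```python
from dataclasses import dataclass

# Hypothetical thresholds; real values would be tuned per deployment.
FAST_BYTES_PER_SEC = 64 * 1024
INACTIVE_AFTER_SEC = 3600


@dataclass
class FileStats:
    bytes_last_window: int  # bytes appended during the sampling window
    window_sec: float       # length of the sampling window (must be > 0)
    idle_sec: float         # seconds since the last write


def classify(stats: FileStats) -> str:
    """Label a log file as 'fast', 'slow', or 'inactive'."""
    if stats.idle_sec >= INACTIVE_AFTER_SEC:
        return "inactive"
    rate = stats.bytes_last_window / stats.window_sec
    return "fast" if rate >= FAST_BYTES_PER_SEC else "slow"


# Assumed mapping from class to collection strategy.
STRATEGY = {"fast": "polling", "slow": "inotify", "inactive": "timed check"}
```

The point of the split is that only busy files pay for frequent reads, while quiet files cost the agent almost nothing until they become active again.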

HDFS Sink – Flink‑based exactly‑once job that uses custom partitioning to keep related logs together, batch buffering, and checkpointing, achieving roughly 1:6 storage compression.
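The idea of partitioning so that related logs land together can be sketched in plain Python, independent of Flink. The key format and function names here are hypothetical; the only technique shown is a stable hash of a grouping key, so every record with the same key is routed to the same writer and similar lines are buffered side by side.

```python
import zlib
from typing import Dict, Iterable, List, Tuple


def partition_for(log_key: str, num_partitions: int) -> int:
    """Map a grouping key to a partition with a process-independent hash."""
    # zlib.crc32 is used instead of the built-in hash(), which is salted
    # per process for strings and would not be stable across workers.
    return zlib.crc32(log_key.encode("utf-8")) % num_partitions


def group_by_partition(
    records: Iterable[Tuple[str, str]], num_partitions: int
) -> Dict[int, List[str]]:
    """Buffer (key, line) records per partition, preserving arrival order."""
    groups: Dict[int, List[str]] = {}
    for key, line in records:
        groups.setdefault(partition_for(key, num_partitions), []).append(line)
    return groups
```

Keeping similar lines in the same buffer is one plausible reason a columnar or block compressor would reach a better ratio, though the source does not say which mechanism contributes most.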

Monitoring & Alerting – End‑to‑end metrics stored in Redis, NTSDB, and a custom time‑series DB, with user‑configurable alerts for latency, backlog, and task failures.
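User‑configurable alerts of this kind reduce to comparing collected metrics against per‑task thresholds. The sketch below assumes hypothetical metric names (latency_ms, backlog_msgs, failed_tasks) and a simple "greater than" rule; it does not reflect the actual DS‑NG rule format.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class AlertRule:
    metric: str       # hypothetical metric name, e.g. "latency_ms"
    threshold: float  # alert fires when the metric exceeds this value


def fired_alerts(metrics: Dict[str, float], rules: List[AlertRule]) -> List[str]:
    """Return the names of metrics that exceed their configured threshold."""
    # Metrics missing from the sample are skipped rather than treated as zero.
    return [
        rule.metric
        for rule in rules
        if rule.metric in metrics and metrics[rule.metric] > rule.threshold
    ]


rules = [
    AlertRule("latency_ms", 500),
    AlertRule("backlog_msgs", 100_000),
    AlertRule("failed_tasks", 0),
]
sample = {"latency_ms": 850, "backlog_msgs": 1_200, "failed_tasks": 0}
print(fired_alerts(sample, rules))  # ['latency_ms']
```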

Application Effects & Future Plans

Currently >20,000 DS Agents are deployed across K8s and VMs, handling ~5,000 tasks and processing 5 × 10¹² log entries daily (~600 TB). Per‑person operational efficiency has improved by 200% compared with version 2.0.
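To put the daily figures in perspective, a quick back‑of‑the‑envelope calculation follows. It assumes decimal terabytes (10¹² bytes) and a perfectly uniform load over 24 hours, neither of which the source states; real peak rates would be higher than these averages.

```python
ENTRIES_PER_DAY = 5 * 10**12
BYTES_PER_DAY = 600 * 10**12   # ~600 TB, assuming 1 TB = 10**12 bytes
SECONDS_PER_DAY = 86_400

avg_entry_bytes = BYTES_PER_DAY / ENTRIES_PER_DAY
entries_per_sec = ENTRIES_PER_DAY / SECONDS_PER_DAY
gb_per_sec = BYTES_PER_DAY / SECONDS_PER_DAY / 1e9

print(f"average entry size: {avg_entry_bytes:.0f} bytes")           # 120 bytes
print(f"average rate: {entries_per_sec / 1e6:.1f} million entries/s")  # 57.9
print(f"average throughput: {gb_per_sec:.2f} GB/s")                 # 6.94
```

The implied average of about 120 bytes per entry is simply the ratio of the two published numbers; whether the 600 TB figure is measured before or after compression is not specified.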

Future work focuses on intelligent operations: automatic elastic scaling of routers and Flink jobs, rapid root‑cause diagnosis, enhanced compression for cross‑region transfer, and better data‑validation services.

Q&A Highlights

Team size: ~3 engineers plus 1‑2 ops staff.

SDKs are business‑specific; agents are deployed by the business side.

Comparison with open‑source Loggie: Loggie is a K8s‑focused agent and does not include the rest of the pipeline that DS‑NG provides.

End‑to‑end latency under normal conditions is 200‑300 ms.


Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact us and we will review it promptly.


