Design, Evolution, and Optimization of NetEase's Log Collection and Transmission Service (Datastream‑NG)
This article presents a comprehensive overview of NetEase's log collection and transmission platform, detailing its evolution from 2011 to the current Datastream‑NG architecture, the system's design goals, core component optimizations, operational monitoring, and future plans for intelligent scaling and diagnostics.
Introduction – In internet applications, logs are a primary data source, so an efficient, stable log collection and transmission service is crucial for offline and real‑time data warehouses, search, recommendation, and APM. NetEase has been building such a service since 2011. It has gone through three major architectural phases, the latest being a full reconstruction in 2019.
Development Stages
1.0 (2011‑2014): fragmented open‑source agents (Flume, Logstash‑forwarder, Fluentd) with no unified management, which made operations chaotic and made delays or data loss hard to trace.
2.0 (2015‑2018): introduction of a self‑developed tailfile client, HDFS archiving handler, and Kafka as the message bus; task management was product‑level but still tied to a data‑dev platform.
3.0 (2019‑present): reconstruction into Datastream‑NG (DS‑NG) with a self‑developed DS Agent, Router service for unified data routing, independent task‑control platform, and Flink‑based sink replacing the stateful HDFS handler.
Overall Design of DS‑NG
The system consists of multiple data sources (Web/App clients, big‑data components, backend services, gateway logs) feeding into DS Agents, which register with a Zookeeper‑based registry. Agents send data to DS Router clusters via HAProxy; routers perform flow control, routing, and load balancing before writing to Kafka. Downstream consumers include offline warehouses (HDFS, Alluxio+S3), real‑time warehouses (Flink), and OLAP databases (ClickHouse, HBase). The design emphasizes high throughput, high availability, easy operations, and low cost.
Design Principles
Fast Flow: minimal parsing/repacking, custom protocol separating control and data, batch compression, and version‑compatible upgrades.
Statelessness: back‑pressure model, credit‑based flow control, ACK/Checkpoint mechanisms, at‑least‑once delivery guarantees.
Self‑Adaptivity: Router memory pool management, automatic traffic migration between routers and producers, and dynamic Kafka partitioning strategies.
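The credit‑based flow control and ACK/Checkpoint ideas listed above can be illustrated with a minimal, in‑process sketch. It is not the DS‑NG protocol: the class names `Receiver` and `Sender`, the per‑batch sequence numbers, and the "one credit per free buffer slot" rule are all assumptions made for illustration. The point it shows is that a sender only transmits while it holds credits (back‑pressure), and that anything sent but not yet acknowledged is replayed after a failure, which gives at‑least‑once delivery.

```python
from collections import deque


class Receiver:
    """Hypothetical router side: one credit per free buffer slot."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.buffer = deque()

    def grant(self):
        # Credits offered to the sender equal the free slots right now.
        return self.capacity - len(self.buffer)

    def accept(self, seq, batch):
        if len(self.buffer) >= self.capacity:
            raise RuntimeError("sender exceeded its credits")
        self.buffer.append((seq, batch))

    def flush(self):
        # Pretend the buffered batches were written downstream,
        # then acknowledge their sequence numbers.
        acked = [seq for seq, _ in self.buffer]
        self.buffer.clear()
        return acked


class Sender:
    """Hypothetical agent side: sends only while it holds credits."""

    def __init__(self, batches):
        self.pending = deque(enumerate(batches))  # not yet sent
        self.in_flight = {}                       # sent, not yet acked
        self.acked = set()                        # acked, above checkpoint
        self.credits = 0
        self.checkpoint = -1  # every seq <= checkpoint has been acked

    def send(self, receiver):
        sent = 0
        while self.credits > 0 and self.pending:
            seq, batch = self.pending.popleft()
            receiver.accept(seq, batch)
            self.in_flight[seq] = batch
            self.credits -= 1
            sent += 1
        return sent

    def on_ack(self, seqs):
        for seq in seqs:
            if self.in_flight.pop(seq, None) is not None:
                self.acked.add(seq)
        # Advance the checkpoint over the contiguous acked prefix.
        while self.checkpoint + 1 in self.acked:
            self.checkpoint += 1
            self.acked.remove(self.checkpoint)

    def on_disconnect(self):
        # At-least-once: requeue everything unacknowledged, in order.
        for seq in sorted(self.in_flight, reverse=True):
            self.pending.appendleft((seq, self.in_flight.pop(seq)))
        self.credits = 0
```

In this sketch the sender keeps no state beyond the checkpoint and its unacknowledged batches, so a restart can resume from `checkpoint + 1`. Batches may be delivered twice after a disconnect, which is the usual cost of at‑least‑once delivery.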
Core Component Optimizations
DS Agent – Re‑designed job model classifies log files as fast, slow, or inactive, applying different collection strategies (polling, inotify, timed checks) and memory pre‑allocation to reduce CPU and I/O pressure.
HDFS Sink – A Flink‑based exactly‑once job that uses custom partitioning to keep related logs together, batch buffering, and checkpointing; the resulting files reach roughly 1:6 storage compression.
Monitoring & Alerting – End‑to‑end metrics stored in Redis, NTSDB, and a custom time‑series DB, with user‑configurable alerts for latency, backlog, and task failures.
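As a rough illustration of the DS Agent job model described above, the sketch below classifies a file as fast, slow, or inactive and picks a collection strategy for it. The thresholds, the `FileStats` fields, and the mapping of classes to strategies (fast to polling, slow to inotify, inactive to timed checks) are assumptions for illustration. The talk lists the three classes and the three strategies but does not give the exact rules.

```python
from dataclasses import dataclass

# Hypothetical thresholds; the real agent's values are not given in the talk.
FAST_BYTES_PER_SEC = 1_000_000
INACTIVE_AFTER_SEC = 600

# Assumed mapping from file class to collection strategy.
STRATEGY = {
    "fast": "polling",
    "slow": "inotify",
    "inactive": "timed check",
}


@dataclass
class FileStats:
    bytes_last_window: int  # bytes appended during the sampling window
    window_sec: float       # length of the sampling window in seconds
    idle_sec: float         # seconds since the last append


def classify(stats: FileStats) -> str:
    """Return 'fast', 'slow', or 'inactive' for one log file."""
    if stats.idle_sec >= INACTIVE_AFTER_SEC:
        return "inactive"
    rate = stats.bytes_last_window / stats.window_sec
    return "fast" if rate >= FAST_BYTES_PER_SEC else "slow"


def strategy_for(stats: FileStats) -> str:
    """Pick the collection strategy for a file based on its class."""
    return STRATEGY[classify(stats)]
```

Separating classification from strategy lookup means a file can be reclassified each sampling window without touching the collectors themselves. That is one plausible way to keep CPU and I/O spent on quiet files low.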
Application Effects & Future Plans
More than 20,000 DS Agents are currently deployed across K8s and VMs, running ~5,000 tasks and processing 5 × 10¹² log entries per day (~600 TB). Efficiency per unit of human effort has improved by 200% compared with version 2.0.
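To put the daily figures in perspective, the short calculation below derives average rates from them. It assumes decimal terabytes (1 TB = 10¹² bytes) and a perfectly even load over 24 hours. Neither assumption is stated in the talk, and real traffic will have peaks well above the average.

```python
ENTRIES_PER_DAY = 5 * 10**12   # from the reported figure
BYTES_PER_DAY = 600 * 10**12   # ~600 TB, assuming decimal terabytes
SECONDS_PER_DAY = 86_400

avg_entry_bytes = BYTES_PER_DAY / ENTRIES_PER_DAY    # 120.0 bytes per entry
entries_per_sec = ENTRIES_PER_DAY / SECONDS_PER_DAY  # about 57.9 million
bytes_per_sec = BYTES_PER_DAY / SECONDS_PER_DAY      # about 6.94 GB/s

print(f"average entry size: {avg_entry_bytes:.0f} bytes")
print(f"average rate: {entries_per_sec / 1e6:.1f} M entries/s")
print(f"average throughput: {bytes_per_sec / 1e9:.2f} GB/s")
```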
Future work focuses on intelligent operations: automatic elastic scaling of routers and Flink jobs, rapid root‑cause diagnosis, enhanced compression for cross‑region transfer, and better data‑validation services.
Q&A Highlights
Team size: ~3 engineers plus 1‑2 ops staff.
SDKs are business‑specific; agents are deployed by the business side.
Comparison with open‑source Loggie: Loggie is a K8s‑focused agent without the extensive pipeline.
End‑to‑end latency under normal conditions is 200‑300 ms.
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.