Practical Operations of NetEase Big Data Platform: Architecture, EasyOps, Monitoring, and Experience Sharing

The presentation details NetEase's big data platform operations, covering current usage, the internally built EasyOps control system, a generic service‑operation framework based on Ansible, Prometheus‑Grafana monitoring, configuration management, network and storage optimizations, and lessons learned from cloud migration.

Big Data Technology Architecture

Jun 2, 2021

The speaker, Jin Chuan, a senior SRE at the NetEase Hangzhou Research Institute, opened with the agenda: an overview of NetEase's big data applications, the status of the internal EasyOps platform, a generic big‑data service operation framework, Prometheus‑based monitoring and alerting, and practical operational experience.

The big data platform supports major NetEase products such as Cloud Music and Yanxuan. It is built on a Hadoop ecosystem of more than 22 components plus a self‑developed data middle platform, "YouShu", which comprises about 27 services. Together they run on six offline clusters and two streaming clusters (Spark Streaming, Flink).

EasyOps was built to replace Ambari because of its limitations, which this summary does not enumerate. It provides a unified UI for service instances, host management, configuration history, and deployment actions. The front‑end and back‑end technology stacks are not detailed here; the platform integrates Grafana dashboards so that monitoring is available in the same place.

The generic operation framework is built around Ansible Runner, with playbooks and roles laid out in the standard role directories (defaults, tasks, templates, vars). It supports service installation, configuration changes, and custom operations such as HDFS data migration and YARN queue management.
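To make the directory layout described above concrete, here is a minimal sketch in Python that scaffolds a role with the four directories named in the talk, inside the `project/` and `inventory/` layout that Ansible Runner expects under its private data directory. The role name `hdfs`, the playbook name `install.yml`, and the helper `scaffold_role` are hypothetical and chosen only for illustration; the talk does not describe EasyOps's actual file layout.

```python
import tempfile
from pathlib import Path

# Role sub-directories named in the talk.
ROLE_DIRS = ("defaults", "tasks", "templates", "vars")


def scaffold_role(private_data_dir: Path, role: str) -> Path:
    """Create an Ansible Runner style tree with one role and one playbook.

    Returns the path of the generated playbook. All names are illustrative.
    """
    project = private_data_dir / "project"
    role_root = project / "roles" / role
    for name in ROLE_DIRS:
        (role_root / name).mkdir(parents=True, exist_ok=True)
    # Ansible loads main.yml from defaults/, tasks/ and vars/;
    # templates/ holds Jinja2 files and needs no entry point.
    for name in ("defaults", "tasks", "vars"):
        main = role_root / name / "main.yml"
        if not main.exists():
            main.write_text("---\n")  # placeholder, to be filled per service
    (private_data_dir / "inventory").mkdir(exist_ok=True)
    playbook = project / "install.yml"  # hypothetical playbook name
    playbook.write_text(f"---\n- hosts: all\n  roles:\n    - {role}\n")
    return playbook


demo_root = Path(tempfile.mkdtemp())
playbook = scaffold_role(demo_root, "hdfs")
layout = sorted(p.relative_to(demo_root).as_posix() for p in demo_root.rglob("*"))
for entry in layout:
    print(entry)
```

With the `ansible-runner` package installed and a real inventory in place, a tree like this could be executed with `ansible_runner.run(private_data_dir=str(demo_root), playbook="install.yml")`. Whether EasyOps invokes Ansible Runner this way is an assumption.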

Monitoring relies on a high‑availability Prometheus architecture, supplemented by Micrometer for JVM metrics. Logs are collected through a custom agent, DSAgent, which feeds Kafka; the data is then stored in Elasticsearch, NTSDB, or MySQL for visualization and alerting. Grafana alerting and a customized Prometheus Alertmanager allow flexible alert rules.
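As a rough illustration of how a threshold alert rule behaves, the sketch below models the "pending, then firing" lifecycle that Prometheus alerting rules follow when a `for` duration is set: the condition must hold continuously for that long before the alert fires. The class `ThresholdRule`, the rule name `HdfsCapacityHigh`, and the sample values are hypothetical; this is a simplified model in Python, not NetEase's rules and not Prometheus's implementation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ThresholdRule:
    """Simplified model of an alert rule with a 'for' hold duration."""
    name: str
    threshold: float
    for_seconds: float
    active_since: Optional[float] = None  # when the condition first became true

    def evaluate(self, value: float, now: float) -> str:
        """Return 'inactive', 'pending' or 'firing' for one evaluation."""
        if value <= self.threshold:
            # Condition no longer holds: reset the hold timer.
            self.active_since = None
            return "inactive"
        if self.active_since is None:
            self.active_since = now
        if now - self.active_since >= self.for_seconds:
            return "firing"
        return "pending"


# Hypothetical rule: disk usage ratio above 0.85 for at least 120 seconds.
rule = ThresholdRule(name="HdfsCapacityHigh", threshold=0.85, for_seconds=120)
samples = [(0, 0.90), (60, 0.91), (120, 0.92), (180, 0.50)]  # (timestamp, value)
states = [rule.evaluate(value, now) for now, value in samples]
print(states)
```

The hold duration is what keeps a single noisy sample from paging anyone: only a condition that persists across evaluations reaches the firing state.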

Operational lessons fall into three areas. For networking, a spine‑leaf architecture handles the heavy east‑west traffic between nodes. For storage, storage‑compute separation built on HDFS Router/Federation and YARN node labels reduces costs by at least 20%. For cloud migration, Alluxio serves as an abstraction layer over different object stores (S3, OBS, OSS).
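The idea of an abstraction layer over several object stores can be sketched as a mount table: jobs address one unified namespace, and each path is resolved to whichever backing store is mounted at the longest matching prefix. The Python below is a minimal sketch of that lookup only; the function `resolve`, the mount points, and the bucket names are hypothetical, and it does not reproduce Alluxio's own API or caching behavior.

```python
from typing import Dict


def resolve(path: str, mounts: Dict[str, str]) -> str:
    """Map a unified-namespace path to a backing-store URI.

    The longest mount point that is a prefix of `path` wins.
    """
    best = None
    for mount_point in mounts:
        prefix = mount_point.rstrip("/")
        # Match whole path components only, so /logs does not match /logs2.
        if path == prefix or path.startswith(prefix + "/"):
            if best is None or len(prefix) > len(best.rstrip("/")):
                best = mount_point
    if best is None:
        raise KeyError(f"no mount covers {path}")
    prefix = best.rstrip("/")
    return mounts[best].rstrip("/") + path[len(prefix):]


# Hypothetical mount table spanning three object stores.
mounts = {
    "/": "obs://demo-root",
    "/warehouse": "s3://demo-warehouse/data",
    "/logs": "oss://demo-logs",
}

print(resolve("/warehouse/music/part-0", mounts))
print(resolve("/logs/2021/06/02.log", mounts))
print(resolve("/tmp/scratch", mounts))
```

Because callers only see the unified path, moving a dataset from one store to another becomes a change to the mount table rather than to every job that reads it.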

The talk closed with a summary of performance optimization principles: cost‑effective architecture, storage‑compute separation, and incremental tuning. It ended with a Q&A and an invitation to follow NetEase YouShu's public account for further technical articles.



