NetEase Big Data Platform: HDFS Optimization and Practices
NetEase’s senior big‑data engineer shares how the company’s large‑scale data platform leverages Hadoop, HDFS, YARN and related technologies, detailing multi‑layer architecture, cross‑cloud deployment, storage optimizations, NameNode performance enhancements, RPC prioritization, and practical lessons from operating petabyte‑scale clusters.
In this talk, NetEase senior big‑data engineer Zhu Jianghua introduces the architecture and operational experience of NetEase’s big data platform, focusing on HDFS optimization and practical deployment at petabyte scale.
The platform is organized into six logical layers: the application development layer (visual data‑development tools), the application scenario layer (data‑related services), the data computation layer (Hive, Spark, Flink, etc.), the data management layer (YARN scheduling), the data storage layer (HDFS, optional HBase for high‑throughput workloads), and the data source layer (structured, semi‑structured, and unstructured data).
Key features include cross‑cloud deployment, multi‑region clusters for high availability, and a clear separation of compute and storage to reduce resource waste. The storage layer relies primarily on HDFS, with HBase used where low‑latency access is required.
HDFS is deployed with active‑standby NameNodes and a large fleet of DataNodes. High availability comes from automatic NameNode failover and rack‑aware replica placement, and the cluster can expand to hundreds or thousands of nodes without service interruption.
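The rack‑aware placement mentioned above roughly follows HDFS's default policy: first replica near the writer, second on a different rack, third on the same rack as the second but a different node. A minimal sketch of that rule, with hypothetical node/rack names (this is an illustration, not Hadoop's actual `BlockPlacementPolicyDefault` code):

```python
import random

def place_replicas(datanodes_by_rack, writer_rack=None, replication=3):
    """Pick DataNodes for a block roughly like HDFS's default policy:
    1st replica on the writer's rack, 2nd on a different rack,
    3rd on the same rack as the 2nd but on a different node."""
    racks = list(datanodes_by_rack)
    first_rack = writer_rack if writer_rack in datanodes_by_rack else random.choice(racks)
    first = random.choice(datanodes_by_rack[first_rack])

    # second replica goes off-rack to survive a rack failure
    other_racks = [r for r in racks if r != first_rack]
    second_rack = random.choice(other_racks)
    second = random.choice(datanodes_by_rack[second_rack])

    # third replica: same rack as the second, different node
    candidates = [n for n in datanodes_by_rack[second_rack] if n != second]
    third = random.choice(candidates) if candidates else random.choice(datanodes_by_rack[first_rack])

    return [first, second, third][:replication]
```

This keeps one replica local for write throughput while guaranteeing that a single rack failure cannot lose all copies of a block.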
Operational challenges arise as the cluster grows: larger metadata, longer NameNode restart times, and RPC bottlenecks. NetEase addressed these by parallelizing FSImage loading and inode verification and by multithreading metadata parsing, cutting NameNode startup time by 60‑70%.
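The startup speed‑up comes from loading independent FSImage sections concurrently instead of sequentially. A minimal sketch of that idea, with a hypothetical in‑memory "section" format (real FSImage parsing is protobuf‑based and far more involved):

```python
from concurrent.futures import ThreadPoolExecutor

def load_section(section):
    """Parse one FSImage section into an id -> inode map.
    Here a 'section' is simply a list of entry dicts (hypothetical)."""
    return {entry["id"]: entry for entry in section}

def load_fsimage_parallel(sections, workers=4):
    """Load independent FSImage sections concurrently and merge the
    partial inode maps, mimicking parallel image loading at startup."""
    inodes = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves section order; merging dicts is safe because
        # inode ids are unique across sections
        for partial in pool.map(load_section, sections):
            inodes.update(partial)
    return inodes
```

Since sections do not reference each other during the parse phase, the work parallelizes cleanly; the merge step is the only serial part.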
Further optimizations include extending block‑report lease periods so DataNode full block reports are not needlessly repeated, reducing duplicate report uploads, and adding RPC priority queues that shield critical workloads from noisy neighbors. A Router‑Based Federation (RBF) layer isolates namespaces, balances load, and enables seamless migration between sub‑clusters.
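The RPC prioritization resembles Hadoop's FairCallQueue: callers that have issued many recent requests are demoted to lower‑priority levels, so light or critical callers are served first. A toy sketch of that demotion scheme (not the real Hadoop implementation, which uses a decaying scheduler and per‑port configuration):

```python
import heapq
import itertools
from collections import defaultdict

class ToyFairCallQueue:
    """Toy priority call queue: heavy callers are demoted to worse
    priority levels so light/critical workloads are dequeued first."""

    def __init__(self, levels=4, calls_per_level=10):
        self.levels = levels
        self.calls_per_level = calls_per_level
        self.counts = defaultdict(int)   # calls seen per user so far
        self.heap = []                   # (level, seq, user, call)
        self.seq = itertools.count()     # FIFO tie-breaker within a level

    def priority_of(self, user):
        # more past calls -> higher (i.e. worse) priority level
        return min(self.counts[user] // self.calls_per_level, self.levels - 1)

    def put(self, user, call):
        level = self.priority_of(user)
        self.counts[user] += 1
        heapq.heappush(self.heap, (level, next(self.seq), user, call))

    def take(self):
        _, _, user, call = heapq.heappop(self.heap)
        return user, call
```

If a batch job floods the queue, a later call from an interactive user still lands at level 0 and jumps ahead of the flood's demoted backlog, which is the load‑isolation effect described above.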
Monitoring is performed with Matrix metrics and custom scripts, tracking RPC latency, DataNode heartbeats, CPU, memory, and I/O usage to quickly detect anomalies.
Data management adopts tiered storage: hot clusters with full replication for frequently accessed data, and cold clusters using RS(6,3) erasure coding for infrequently accessed data, reducing storage costs by 1.3‑1.4×.
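The raw‑capacity arithmetic behind the savings: 3‑way replication consumes 3.0× the logical data size, while RS(6,3) consumes (6+3)/6 = 1.5×, so data moved to the cold tier costs up to 2× less in raw space. The overall 1.3‑1.4× figure is plausible when only a fraction of the total data is cold enough to migrate. A worked calculation:

```python
def raw_bytes(logical_bytes, data_units, parity_units):
    """Raw capacity consumed by an erasure-coded file:
    overhead factor = (data + parity) / data."""
    return logical_bytes * (data_units + parity_units) / data_units

PB = 10 ** 15                         # 1 PB of logical (user) data
replicated = 3 * PB                   # 3-way replication: 3.0x raw
ec = raw_bytes(PB, 6, 3)              # RS(6,3): 1.5x raw
ceiling_savings = replicated / ec     # 2.0x, if everything were cold
```

In practice the savings factor for the whole platform sits between 1× (nothing migrated) and this 2× ceiling, depending on the hot/cold split.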
Compute resources are decoupled from storage via YARN, allowing better CPU and memory utilization. High‑density hardware such as NVMe SSDs further improves performance by over 20%.
The talk concludes with a real‑world case where a complex data‑warehouse workload processes tens of petabytes daily, requiring 24‑hour availability. Through the described optimizations, NetEase achieved stable, high‑performance operation.
Thank you for listening.
DataFunSummit
Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.