SparkSQL on Kubernetes: NetEase Media's Cloud-Native Big Data Infrastructure Practice
NetEase Media migrated SparkSQL to Kubernetes in 2021, using storage‑compute decoupling, hybrid deployment, custom scripts, Kyuubi failover, and extensive monitoring and resource governance, which cut cluster size by over 30% while keeping CPU utilization above 80% and GC throughput above 95%.
With the development of cloud-native technology, embracing cloud-native architecture has become a major trend in big data infrastructure. NetEase Media successfully deployed SparkSQL to Kubernetes (K8S) clusters in 2021 and achieved hybrid deployment with online services, running stably for over a year. This article summarizes the optimization practices implemented during this migration.
Migration Benefits:
Storage-compute decoupling enables independent elastic scaling of storage and compute resources
Unified resource pooling allows hybrid deployment with online services for peak-valley complementarity, improving resource utilization and reducing costs
Migration Strategy:
Tasks were migrated gradually, starting with non-core downstream tasks and moving up to upstream tasks
Custom script tasks were migrated first, with Kyuubi enabling easy failover to the Yarn cluster
Monitoring Solutions:
Deployed monitoring programs on the task scheduler machines; because jobs run in client mode, these programs query the Spark driver's HTTP API
Built the Hygieia task monitoring service and Grafana dashboards for resource monitoring
Monitored real-time task lists, resource allocation (CPU/memory), task status, and cluster metrics
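A minimal sketch of how such a monitoring program might read resource allocation from a driver. It assumes the driver's UI port is reachable from the scheduler machine and uses Spark's REST endpoint `/api/v1/applications/{app_id}/executors`. The function names and the choice of aggregated fields are illustrative assumptions, not the team's actual implementation.

```python
import json
from urllib.request import urlopen


def fetch_executors(driver_url, app_id):
    """Fetch executor summaries from a Spark driver's REST API.

    driver_url is something like "http://localhost:4040" (an assumption:
    in client mode the driver runs on the scheduler machine itself).
    """
    url = f"{driver_url}/api/v1/applications/{app_id}/executors"
    with urlopen(url, timeout=10) as resp:
        return json.load(resp)


def summarize_executors(executors):
    """Aggregate allocation figures from the REST API's executor list."""
    # The API lists the driver alongside executors; keep it out of the totals.
    workers = [e for e in executors if e.get("id") != "driver"]
    return {
        "executors": len(workers),
        "total_cores": sum(e.get("totalCores", 0) for e in workers),
        "active_tasks": sum(e.get("activeTasks", 0) for e in workers),
        # maxMemory is the storage memory available, as reported by the API.
        "max_memory_bytes": sum(e.get("maxMemory", 0) for e in workers),
    }
```

A collector could call `summarize_executors(fetch_executors(url, app_id))` on a timer and push the resulting dictionary to whatever backend feeds the dashboards.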
Resource Governance:
CPU: Enabled CPU oversubscription by setting spark.kubernetes.executor.request.cores=1 with spark.executor.cores=4, so each executor requests one core from Kubernetes while running up to four tasks in parallel
Memory: Implemented automated memory adjustment based on historical executor heap and off-heap memory usage, with health monitoring using GC throughput metrics
Disk: Migrated to SSD storage; adopted zstd compression for shuffle data (nearly 2x better compression ratio than lz4); scheduled large shuffle tasks to the Yarn cluster
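The governance rules above can be sketched as a small helper that builds the executor settings and sizes memory from history. The two CPU settings come from the text. `spark.executor.memory`, `spark.executor.memoryOverhead`, and `spark.io.compression.codec` are standard Spark options, but applying them this way is an assumption. The sizing rule (peak usage times a headroom factor, rounded up to a step) and the definition of GC throughput as the fraction of run time not spent in garbage collection are also assumptions, since the article does not spell them out.

```python
import math


def executor_conf(heap_mb, overhead_mb):
    """Spark settings reflecting the CPU, memory, and disk rules above."""
    return {
        # Request 1 core from Kubernetes but run 4 task slots per executor.
        "spark.kubernetes.executor.request.cores": "1",
        "spark.executor.cores": "4",
        "spark.executor.memory": f"{heap_mb}m",
        "spark.executor.memoryOverhead": f"{overhead_mb}m",
        # zstd for shuffle and other internal data.
        "spark.io.compression.codec": "zstd",
    }


def recommend_mb(peak_history_mb, headroom=1.2, step_mb=256, floor_mb=512):
    """Hypothetical sizing rule: max historical peak * headroom, rounded up."""
    if not peak_history_mb:
        return floor_mb
    target = max(peak_history_mb) * headroom
    return max(floor_mb, math.ceil(target / step_mb) * step_mb)


def gc_throughput(run_time_ms, gc_time_ms):
    """Fraction of run time not spent in garbage collection."""
    if run_time_ms <= 0:
        raise ValueError("run_time_ms must be positive")
    return 1.0 - gc_time_ms / run_time_ms


def is_healthy(run_time_ms, gc_time_ms, threshold=0.95):
    """Flag executors whose GC throughput falls below the threshold."""
    return gc_throughput(run_time_ms, gc_time_ms) >= threshold
```

For example, `executor_conf(recommend_mb(heap_peaks), recommend_mb(offheap_peaks))` yields a settings dictionary that could be passed to a job as `--conf key=value` pairs, while `is_healthy` gives a simple check on whether a memory reduction went too far.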
Scheduling Optimizations:
Created a dedicated scheduler for SparkSQL to isolate scheduling load
Implemented anti-affinity scheduling to distribute executor pods across different nodes
Enabled priority scheduling with PriorityClass to guarantee resources for SLA-critical tasks
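One way these three scheduling settings could be expressed is through an executor pod template, sketched here as a Python dictionary. The scheduler name `sparksql-scheduler` and priority class `sparksql-sla-high` are hypothetical. Using a soft (preferred) anti-affinity rule keyed on the `spark-role: executor` label that Spark puts on executor pods is an assumption about how the spreading might be configured, not the team's confirmed setup.

```python
import json


def build_executor_pod_template(scheduler_name, priority_class):
    """Build an executor pod template covering the three scheduling rules."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "spec": {
            # Route these pods to the dedicated scheduler.
            "schedulerName": scheduler_name,
            # The PriorityClass must already exist in the cluster.
            "priorityClassName": priority_class,
            "affinity": {
                "podAntiAffinity": {
                    # Soft rule: prefer nodes without other executor pods.
                    "preferredDuringSchedulingIgnoredDuringExecution": [
                        {
                            "weight": 100,
                            "podAffinityTerm": {
                                "labelSelector": {
                                    "matchLabels": {"spark-role": "executor"}
                                },
                                "topologyKey": "kubernetes.io/hostname",
                            },
                        }
                    ]
                }
            },
        },
    }


if __name__ == "__main__":
    # Hypothetical names, for illustration only.
    template = build_executor_pod_template("sparksql-scheduler", "sparksql-sla-high")
    print(json.dumps(template, indent=2))
```

The printed JSON could be saved to a file and referenced through `spark.kubernetes.executor.podTemplateFile`. A hard (`required...`) rule would enforce stricter spreading but can leave executors pending when nodes run short, which is why the sketch uses the preferred form.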
Results: Cluster scale was reduced by more than 30% while baseline output stayed stable or improved, with CPU utilization above 80% between midnight and 7 AM and GC throughput above 95%.
Future Work: Expand the scale of SparkSQL on K8S, explore hybrid deployment with Flink on K8S, and investigate cloud-native storage solutions including object storage and RSS (Remote Shuffle Service).
NetEase Media Technology Team