
SparkSQL on Kubernetes: NetEase Media's Cloud-Native Big Data Infrastructure Practice

NetEase Media migrated SparkSQL to Kubernetes in 2021, relying on storage-compute decoupling, hybrid deployment with online services, Kyuubi-based failover, and extensive monitoring and resource governance. The migration cut cluster size by over 30% while keeping CPU utilization above 80% during the 0-7 AM window and GC throughput above 95%.

NetEase Media Technology Team

With the development of cloud-native technology, embracing cloud-native architecture has become a major trend in big data infrastructure. NetEase Media successfully deployed SparkSQL to Kubernetes (K8S) clusters in 2021 and achieved hybrid deployment with online services, running stably for over a year. This article summarizes the optimization practices implemented during this migration.

Migration Benefits:

Storage-compute decoupling enables independent elastic scaling of storage and compute resources

Unified resource pooling allows hybrid deployment with online services for peak-valley complementarity, improving resource utilization and reducing costs

Migration Strategy:

Tasks were migrated gradually, starting with non-core downstream tasks and moving toward upstream tasks

Custom script tasks were migrated first, with Kyuubi making it easy to fail over to the Yarn cluster
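The failover idea in the strategy above can be illustrated with a minimal Python sketch. It is not the team's implementation and does not use any Kyuubi API: the `Backend` type, `run_with_failover`, and the two stand-in submitters are hypothetical names introduced here, and the only assumption is that each cluster can be reached through some callable that raises on failure.

```python
from typing import Callable, Sequence, Tuple

# Hypothetical type: a backend is a (name, run) pair, where run(sql) submits
# the statement through some client bound to one cluster and raises on failure.
Backend = Tuple[str, Callable[[str], object]]


def run_with_failover(sql: str, backends: Sequence[Backend]):
    """Try each backend in order; return (backend_name, result) of the first success."""
    errors = []
    for name, run in backends:
        try:
            return name, run(sql)
        except Exception as exc:  # broad on purpose: any failure triggers failover
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all backends failed: " + "; ".join(errors))


# Stand-ins for real submitters, used only to demonstrate the control flow.
def k8s_submit(sql: str):
    raise ConnectionError("K8S engine unavailable")


def yarn_submit(sql: str):
    return f"ran on yarn: {sql}"
```

Calling `run_with_failover("SELECT 1", [("k8s", k8s_submit), ("yarn", yarn_submit)])` tries the K8S stand-in first and, because it raises, returns the result of the Yarn stand-in. Keeping the ordered backend list outside the task scripts is what would make switching clusters a configuration change rather than a code change.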

Monitoring Solutions:

Deployed monitoring programs on the task scheduler machines that query the Spark driver's HTTP API; because tasks run in client mode, the driver runs on those same machines

Built the Hygieia task monitoring service and Grafana dashboards for resource monitoring

Monitored the real-time task list, resource allocation (CPU/memory), task status, and cluster metrics
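As a rough illustration of the monitoring approach above, the sketch below reads executor summaries from the Spark driver's REST API (`/api/v1/applications/<app-id>/executors`) and aggregates them. The function names and the choice of aggregated fields are assumptions made for this example; how Hygieia actually collects and stores its metrics is not described in the source.

```python
import json
from urllib.request import urlopen


def fetch_executors(driver_url: str, app_id: str) -> list:
    """Read executor summaries from the driver's REST API (served on the driver UI port)."""
    url = f"{driver_url}/api/v1/applications/{app_id}/executors"
    with urlopen(url, timeout=5) as resp:
        return json.load(resp)


def summarize(executors: list) -> dict:
    """Aggregate allocation and load across executors, skipping the driver entry."""
    workers = [e for e in executors if e["id"] != "driver"]
    return {
        "executors": len(workers),
        "total_cores": sum(e["totalCores"] for e in workers),
        "active_tasks": sum(e["activeTasks"] for e in workers),
        # memoryUsed / maxMemory report storage memory in bytes
        "storage_memory_used_bytes": sum(e["memoryUsed"] for e in workers),
        "storage_memory_max_bytes": sum(e["maxMemory"] for e in workers),
    }


# Hand-written sample in the shape of the API response, for illustration only.
sample = [
    {"id": "driver", "totalCores": 0, "activeTasks": 0, "memoryUsed": 0, "maxMemory": 100},
    {"id": "1", "totalCores": 4, "activeTasks": 3, "memoryUsed": 10, "maxMemory": 200},
    {"id": "2", "totalCores": 4, "activeTasks": 1, "memoryUsed": 30, "maxMemory": 200},
]
summary = summarize(sample)
```

In a real deployment, `summarize(fetch_executors("http://localhost:4040", app_id))` could be polled periodically and the result pushed to whatever store backs the dashboards; the URL and polling loop are left out because they depend on the environment.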

Resource Governance:

CPU: Enabled CPU oversubscription by setting spark.kubernetes.executor.request.cores=1 with spark.executor.cores=4, so each executor requests one core from K8S but runs up to four tasks concurrently to maximize parallelism

Memory: Implemented automated memory adjustment based on historical executor heap and off-heap memory usage, with GC throughput (the share of run time not spent in garbage collection) as the health metric

Disk: Migrated to SSD storage; adopted zstd compression for shuffle data (nearly 2x the compression ratio of lz4); routed tasks with large shuffles to the Yarn cluster
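The three governance measures above can be sketched together in Python. The two CPU settings come from the text; using `spark.io.compression.codec=zstd` for the shuffle codec is an assumption about how zstd was enabled. The helper names, the 20% headroom, the 1024 MB floor, and the 500 GB shuffle limit are all hypothetical values chosen for illustration, not figures from the source.

```python
def executor_conf(task_slots: int = 4, requested_cores: int = 1) -> dict:
    """Spark settings for CPU oversubscription and zstd compression."""
    return {
        "spark.executor.cores": str(task_slots),
        "spark.kubernetes.executor.request.cores": str(requested_cores),
        "spark.io.compression.codec": "zstd",  # assumed way of enabling zstd
    }


def gc_throughput(run_time_ms: int, gc_time_ms: int) -> float:
    """Share of run time not spent in garbage collection."""
    return 1.0 - gc_time_ms / run_time_ms


def recommend_memory_mb(peak_usage_mb: list, headroom_pct: int = 20,
                        floor_mb: int = 1024) -> int:
    """Size memory from the historical peak plus headroom (hypothetical policy)."""
    peak = max(peak_usage_mb)
    # Integer ceiling of peak * (1 + headroom_pct / 100), avoiding float rounding.
    with_headroom = (peak * (100 + headroom_pct) + 99) // 100
    return max(floor_mb, with_headroom)


def pick_cluster(expected_shuffle_gb: float, limit_gb: float = 500.0) -> str:
    """Route shuffle-heavy tasks to Yarn, everything else to K8S (hypothetical limit)."""
    return "yarn" if expected_shuffle_gb > limit_gb else "k8s"


conf = executor_conf()
oversubscription = int(conf["spark.executor.cores"]) // int(
    conf["spark.kubernetes.executor.request.cores"]
)
```

With the defaults, `oversubscription` is 4, meaning four task slots share each requested core. `recommend_memory_mb([3000, 3500, 3200])` yields 4200, and a task that spent 4,000 ms of a 100,000 ms run in GC has a throughput of about 0.96, just above the 95% level the article reports.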

Scheduling Optimizations:

Created a dedicated scheduler for SparkSQL to isolate its scheduling load

Implemented anti-affinity scheduling to distribute executor pods across different nodes

Enabled priority scheduling with PriorityClass to guarantee resources for SLA-critical tasks
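One possible way to express the three scheduling settings above is an executor pod template, which Spark can load through `spark.kubernetes.executor.podTemplateFile`. The sketch below builds such a template as a Python dictionary. The scheduler name `sparksql-scheduler` and the PriorityClass name `sparksql-sla-high` are hypothetical, and the source does not say whether the team used pod templates or another mechanism.

```python
import json


def executor_pod_template(scheduler_name: str, priority_class: str) -> dict:
    """Build an executor pod template with scheduler, priority, and anti-affinity."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "spec": {
            # Dedicated scheduler that handles SparkSQL pods
            "schedulerName": scheduler_name,
            # PriorityClass that must already exist in the cluster
            "priorityClassName": priority_class,
            "affinity": {
                "podAntiAffinity": {
                    # Soft rule: prefer nodes without other Spark executor pods
                    "preferredDuringSchedulingIgnoredDuringExecution": [
                        {
                            "weight": 100,
                            "podAffinityTerm": {
                                "labelSelector": {
                                    "matchLabels": {"spark-role": "executor"}
                                },
                                "topologyKey": "kubernetes.io/hostname",
                            },
                        }
                    ]
                }
            },
        },
    }


template = executor_pod_template("sparksql-scheduler", "sparksql-sla-high")
rendered = json.dumps(template, indent=2)
```

The anti-affinity rule here is a preference rather than a hard requirement, so executors still schedule when the cluster is too full to spread them out. Because YAML 1.2 is a superset of JSON, `rendered` can typically be written to a file and used where a YAML pod template is expected.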

Results: Cluster size was reduced by more than 30% while baseline output stayed stable or improved. CPU utilization reached 80%+ between 0:00 and 7:00 AM, and GC throughput stayed above 95%.

Future Work: Expand the scale of SparkSQL on K8S, explore hybrid deployment with Flink on K8S, and investigate cloud-native storage solutions, including object storage and RSS (Remote Shuffle Service).

Tags: cloud-native, Big Data, SparkSQL, Kubernetes, resource optimization, Infrastructure, Spark on K8S, K8S Migration