Big Data 24 min read

Scaling Spark on Kubernetes: Elastic Compute, Cost Savings, and Storage Decoupling

MiHoYo’s data platform team details their migration of Spark workloads to Alibaba Cloud’s ACK Kubernetes service, describing how the Spark‑on‑K8s + OSS‑HDFS architecture delivers elastic compute, up to 50% cost reduction, and true compute‑storage separation, while addressing operational challenges through custom operators, Celeborn, and robust monitoring.

Alibaba Cloud Native
Alibaba Cloud Native
Alibaba Cloud Native
Scaling Spark on Kubernetes: Elastic Compute, Cost Savings, and Storage Decoupling

Background

With the rapid growth of MiHoYo’s business, offline data volume and Spark job count increased dramatically, making the original YARN‑based big‑data architecture insufficient for new scenarios.

Motivation and Goals

To overcome lack of elasticity, complex operations, and low resource utilization, the team evaluated cloud‑native solutions in late 2022 and deployed a Spark‑on‑K8s + OSS‑HDFS architecture on Alibaba Cloud, which has been stable in production for about a year.

Key Benefits

Elastic Compute : Kubernetes’ native scaling handles peak loads that can be dozens to hundreds of times the normal level, smoothing resource spikes during game updates, events, and new releases.

Cost Savings : Using ACK’s elastic resources, Spot ECI instances, and custom Spark component tweaks reduced compute costs by roughly 50% for comparable workloads.

Compute‑Storage Decoupling : Spark runs entirely on K8s compute nodes while data resides on OSS‑HDFS; shuffle data is handled by Celeborn, achieving clear separation of compute and storage.

Architecture Evolution

Initially, Spark on K8s mixed online and offline workloads in a single cluster, which improved utilization but introduced complexity, poor isolation, and deviated from cloud‑native principles. The final design adopts OSS‑HDFS (JindoFS) for storage, ACK for container orchestration, and Spark 3.2.3 for stability.

Spark on Kubernetes Basics

Each Spark driver and executor runs in its own Pod with a unique IP. The driver pod starts first, then requests executor pods from the API server. After job completion, the driver cleans up all executor pods.

Execution Flow

Developers submit Spark jobs via a custom launcher that invokes spark‑k8s‑cli, which ultimately calls spark‑submit on the K8s cluster. The driver pod schedules executor pods, which interact with external services (Hive, Iceberg, OLAP, OSS‑HDFS) and use Celeborn for shuffle data.

Task Submission Methods

Native spark‑submit – simple but lacks job tracking, UI services, and automatic cleanup.

Spark‑on‑K8s‑operator – requires pre‑installed operator; adds management, service/ingress creation, and monitoring but raises integration complexity. spark‑k8s‑cli – combines advantages of both, integrates with the operator, supports interactive shells, and uses the same syntax as native spark‑submit.

Enhancements to spark‑k8s‑cli

Support multi‑region cluster submission for load balancing and failover.

Automatic queuing when K8s quota is exhausted.

Network exception handling and retry logic for pod creation failures.

Department‑level throttling for large‑scale back‑fill jobs.

Built‑in alerts for submission, container creation, and runtime timeouts.

Logging and Visualization

Driver and executor logs are redirected to OSS via a shutdown hook, enabling users to view logs directly in Spark UI or JobHistory without leaving the cluster. A custom Spark Running Web UI also exposes services and ingresses generated by the operator.

Elasticity and Cost Reduction

ACK’s auto‑scaling combined with Spot ECI (serverless containers) provides second‑level provisioning and billing, ideal for Spark’s bursty workloads. The system prefers Spot ECI instances and falls back to regular ECI when Spot capacity is unavailable.

Celeborn Integration

To avoid disk‑size constraints on K8s nodes, the team adopted Alibaba’s open‑source Celeborn as an external shuffle service, using a push‑shuffle model that writes sequentially and reads efficiently, with internal enhancements for network transmission, metrics, and stability.

Kyuubi on K8s

Kyuubi provides a multi‑tenant gateway for Spark, Flink, and Trino SQL queries. The team containerized Kyuubi, redirected its logs to OSS, and integrated it with the Spark operator for seamless ad‑hoc query execution on K8s.

K8s Manager

A lightweight Spring Boot service monitors pod, quota, service, configmap, ingress, and role resources across clusters, aggregates custom metrics (CPU/Memory usage, Spark task counts, ECI distribution, etc.), and triggers alerts.

Other Operational Issues and Solutions

Slow Elastic NIC Release : Upgraded Alibaba Cloud networking stack to accelerate NIC release.

Watcher Failure : Reset watchers on driver failure; contributed fix to Spark upstream.

Task Stalling : Combined Watch and extended List (5‑minute interval) with cached pod snapshots to reduce API server pressure.

Celeborn Read/Write Timeouts : Optimized Celeborn code, tuned master/worker parameters, and upgraded ECI kernel.

Quota Lock Contention : Added retry logic for driver pod creation on quota conflicts; executor retries are handled automatically.

UnknownHost Errors on Bulk Submissions : Assigned a Terway pod per ECS node and enabled Terway IP cache to ensure network readiness.

Cross‑AZ Packet Loss : Configured ECI scheduling to VSwitchOrdered, keeping executors within a single AZ to avoid inter‑AZ packet loss.

Summary and Outlook

The cloud‑native Spark architecture has been stable in production for nearly a year. Future work will focus on further optimizing the overall solution for higher availability, containerizing more big‑data components, and achieving finer‑grained resource management and precise cost control.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Sign in to view source
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactadmin@besthub.devand we will review it promptly.

Big DataKubernetesCost OptimizationSparkStorage Decoupling
Alibaba Cloud Native
Written by

Alibaba Cloud Native

We publish cloud-native tech news, curate in-depth content, host regular events and live streams, and share Alibaba product and user case studies. Join us to explore and share the cloud-native insights you need.

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.