Big Data 25 min read

How We Transformed Big Data Workloads with Spark on Kubernetes and OSS‑HDFS

Facing rapid growth in offline data and compute demands, we migrated our big‑data platform to a cloud‑native architecture using Spark 3.2.3 on Kubernetes with OSS‑HDFS storage, achieving elastic scaling, cost reduction, and compute‑storage separation while detailing implementation, challenges, and operational insights.

Alibaba Cloud Big Data AI Platform

Nov 10, 2023

How We Transformed Big Data Workloads with Spark on Kubernetes and OSS‑HDFS

Background

With the maturation of cloud‑native technologies such as containers, micro‑services, and Kubernetes, many companies are moving their enterprise applications to cloud‑native environments. Rapid business growth at MiHoYo caused a sharp increase in offline data volume and compute tasks, making the original big‑data architecture insufficient.

Motivation and Benefits

To address the lack of elasticity, complex operations, and low resource utilization of the legacy system, we redesigned the architecture in late 2022 using Alibaba Cloud EMR on ACK, ECI, and cloud storage. The new solution—Spark on K8s + OSS‑HDFS—delivers three major benefits:

Elastic Computing: Kubernetes’ native elasticity allows Spark jobs to scale up during peak game‑related offline workloads, which can be dozens to hundreds of times higher than normal.

Cost Savings: By leveraging ACK’s elastic resources, customized Spark components, and Spot ECI instances, we achieve significant cost reductions while maintaining the same compute capacity.

Compute‑Storage Separation: Spark runs on K8s using compute resources, while data is gradually migrated from HDFS to OSS‑HDFS; shuffle data is handled by Celeborn, decoupling storage from compute.

Architecture Evolution

Initially, Spark could run on various resource managers (Yarn, K8s, Mesos). In China most companies still use Yarn, and Spark only gained stable K8s support in version 3.1 (GA). Despite its later start, Spark on K8s offers strong elasticity and cost benefits, prompting continuous evolution.

Offline Mixed Deployment

Many enterprises deploy Spark on K8s alongside online services in a mixed fashion, assigning offline nodes to the same K8s namespace during off‑peak hours. This improves resource utilization but introduces complexity, poor isolation, and deviates from cloud‑native principles.

Spark on K8s + OSS‑HDFS Design

To overcome mixed‑deployment drawbacks, we adopted a fully cloud‑native stack: OSS‑HDFS (JindoFS) for storage, Alibaba Cloud ACK for the container service, and Spark 3.2.3 for processing. OSS‑HDFS is HDFS‑compatible, offers unlimited capacity, cold‑hot storage tiers, atomic directory operations, and millisecond‑level rename, making it a drop‑in replacement for HDFS.

ACK provides scalable Kubernetes management, while ECI (elastic container instances) offers serverless, per‑second billing, ideal for bursty Spark workloads.

Basic Principles of Spark on K8s

In Kubernetes, a Pod is the smallest scheduling unit. Spark Driver and each Executor run in separate Pods, each with a unique IP address. When a Spark job is submitted, the Driver Pod starts first, then requests Executor Pods from the API server. After job completion, the Driver cleans up all Executor Pods.

Execution Flow

Developers submit Spark jobs, which are scheduled by a custom Launcher middleware that invokes spark‑k8s‑cli. The CLI submits the job to the K8s cluster. The Driver Pod requests Executor Pods, which interact with Hive, Iceberg, OLAP databases, OSS‑HDFS, etc. Shuffle data is transferred via Celeborn.

Task Submission Methods

Native spark‑submit: Simple but lacks job tracking, Service/Ingress creation, and automatic cleanup.

spark‑on‑k8s‑operator: Requires installing the Spark operator; provides management, Service/Ingress handling, and monitoring, but integration with existing schedulers is limited.

spark‑k8s‑cli: Our production‑grade solution that combines spark‑submit and operator features, supports interactive shells, and uses the same syntax as native spark‑submit.

Initially, all Spark Submit JVM processes ran in a shared Gateway Pod, causing failures when the Gateway crashed. We switched to a per‑task Submit Pod model, where each Submit Pod launches its own Driver on a fixed ECS node, ensuring isolation and automatic resource release.

Log Collection and Visualization

K8s does not provide Yarn‑like log aggregation. We instrumented Spark to upload Driver and Executor logs to OSS via a shutdown hook. Logs become accessible directly from Spark UI and JobHistory, and a custom Spark Running Web UI shows real‑time logs via K8s API.

Elasticity and Cost Reduction

By exploiting K8s autoscaling and ECI’s per‑second billing, Spark jobs on K8s cost significantly less than on a fixed Yarn cluster while achieving higher resource utilization. ECI offers both regular and Spot instances; Spot instances provide low‑cost, pre‑emptible compute for batch workloads.

Celeborn Shuffle Service

Because K8s nodes have limited local disk, we adopted Alibaba’s open‑source Celeborn as an external shuffle service, removing the dependency on local disks and improving shuffle read/write performance via a push‑shuffle model.

Kyuubi on K8s

Kyuubi provides a multi‑tenant gateway for Spark, Flink, and Trino. We migrated Kyuubi to K8s to handle ad‑hoc Spark SQL queries when Yarn resources are scarce, ensuring consistent user experience.

K8s Manager

We built a lightweight Spring Boot service that watches Pods, Quotas, Services, ConfigMaps, Ingresses, Roles, etc., aggregates metrics, and generates custom alerts for CPU/Memory usage, Spark task counts, top‑resource consumers, and ECI distribution.

Other Work

Automatic scheduler switching between Yarn, K8s, and Auto based on Yarn queue utilization.

Multi‑AZ and multi‑switch support for ECI to ensure IP availability and instance inventory.

Cost calculation by correlating Executor resource usage with unit prices.

Optimization of Spark‑operator performance via increased coroutine count and batch event processing.

Upgrade of Spark K8s client to version 6.2.0 for bulk pod deletion, reducing API server pressure.

Problems and Solutions

Elastic NIC release latency

Slow release caused IP exhaustion; Alibaba Cloud upgraded the underlying service to speed up NIC release.

Watcher failure

Long‑running Spark jobs sometimes lost their Executor watcher; we added automatic watcher reset and contributed a fix to the Spark community.

Task deadlock

Exclusive reliance on Watch caused missed events; we re‑enabled List with a longer interval (5 min) and cached pod snapshots to reduce API load.

Celeborn read/write timeout

We enhanced Celeborn’s network code, tuned Master/Worker parameters, and upgraded the underlying ECI image to fix kernel bugs.

Quota lock contention

During massive concurrent submissions, Quota locks caused driver creation failures; we added configurable retry logic for driver pod creation.

UnknownHost errors

Simultaneous pod launches sometimes left elastic NICs unready, leading to DNS failures; we pre‑allocated NICs via Terway caching and assigned a dedicated Terway pod per ECS node.

Cross‑AZ packet loss

Network instability across zones affected task runtime; we set ECI scheduling to VSwitchOrdered to keep executors within a single zone.

Summary and Outlook

The cloud‑native big‑data architecture has been stable in production for about a year. Future work will focus on further optimizing the overall solution, containerizing more big‑data components, and achieving finer‑grained resource management and cost control.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

cloud-native Spark elastic computing

Written by

Alibaba Cloud Big Data AI Platform

The Alibaba Cloud Big Data AI Platform builds on Alibaba’s leading cloud infrastructure, big‑data and AI engineering capabilities, scenario algorithms, and extensive industry experience to offer enterprises and developers a one‑stop, cloud‑native big‑data and AI capability suite. It boosts AI development efficiency, enables large‑scale AI deployment across industries, and drives business value.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.