Spark on Kubernetes: Practices and Optimizations at Eggplant Technology

This article explains how Spark can be effectively deployed on Kubernetes, covering its advantages over traditional Hadoop clusters, the principles of Spark on K8s, dynamic allocation, enhancements to PVC reuse, scheduling optimizations, and performance results from Eggplant Technology's production use.

DataFunSummit

Apr 10, 2023

Spark, a leading open‑source big‑data engine, faces new challenges when moving to cloud‑native environments. Traditional Hadoop clusters suffer from high cost, low flexibility, and tight storage‑compute coupling, while public clouds offer cheap object storage, elastic compute, and spot instances.

Deploying Spark on Kubernetes replaces YARN with K8s as the resource manager and scheduler, providing simple deployment, containerized execution, and high resource utilization. Spark on K8s can also run on managed Kubernetes services (EKS, CCE, GKE), so the team does not have to maintain control‑plane nodes.
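To make the "K8s instead of YARN" idea concrete, here is a minimal sketch of how a cluster‑mode submission to Kubernetes might be assembled. The API server address, image name, namespace, and jar path are placeholders invented for illustration, not values from the talk; the `--master k8s://...` form and the `spark.kubernetes.*` keys are standard Spark settings.

```python
from typing import Dict, List


def build_spark_submit(api_server: str, image: str, app_jar: str,
                       main_class: str, conf: Dict[str, str]) -> List[str]:
    """Assemble a spark-submit argv for cluster mode on Kubernetes."""
    argv = [
        "spark-submit",
        # The k8s:// prefix tells Spark to use Kubernetes as the cluster manager.
        "--master", f"k8s://{api_server}",
        "--deploy-mode", "cluster",
        "--class", main_class,
        "--conf", f"spark.kubernetes.container.image={image}",
    ]
    # Emit extra settings in sorted order so the command is reproducible.
    for key in sorted(conf):
        argv += ["--conf", f"{key}={conf[key]}"]
    argv.append(app_jar)
    return argv


# All concrete values below are hypothetical placeholders.
cmd = build_spark_submit(
    api_server="https://k8s.example.internal:6443",
    image="registry.example.internal/spark:3.2.2",
    app_jar="local:///opt/spark/examples/jars/spark-examples.jar",
    main_class="org.apache.spark.examples.SparkPi",
    conf={
        "spark.kubernetes.namespace": "spark-jobs",
        "spark.executor.instances": "4",
    },
)
print(" ".join(cmd))
```

In cluster mode the driver itself runs as a pod, and it then asks the API server for executor pods, which is why no YARN ResourceManager or NodeManager is involved.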

The article details Spark‑on‑K8s fundamentals, including the four cluster managers Spark supports (Standalone, YARN, Mesos, Kubernetes) and the driver‑executor lifecycle. It explains dynamic allocation, shuffle tracking, and the security implications of using plaintext access‑key/secret‑key (AK/SK) credentials.
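As an illustration of dynamic allocation with shuffle tracking, the sketch below builds the relevant Spark settings. Spark on K8s has no external shuffle service, so shuffle tracking is what lets the driver know which executors still hold shuffle data before releasing them. The executor counts and timeout are arbitrary example values, not the ones used in production at Eggplant Technology.

```python
from typing import Dict


def dynamic_allocation_conf(min_executors: int, max_executors: int,
                            idle_timeout_s: int = 60) -> Dict[str, str]:
    """Return Spark settings enabling dynamic allocation with shuffle tracking."""
    if not 0 <= min_executors <= max_executors:
        raise ValueError("need 0 <= min_executors <= max_executors")
    return {
        "spark.dynamicAllocation.enabled": "true",
        # Track shuffle files on executors instead of relying on an
        # external shuffle service.
        "spark.dynamicAllocation.shuffleTracking.enabled": "true",
        "spark.dynamicAllocation.minExecutors": str(min_executors),
        "spark.dynamicAllocation.maxExecutors": str(max_executors),
        # Executors idle for this long (and holding no tracked shuffle data)
        # become candidates for removal.
        "spark.dynamicAllocation.executorIdleTimeout": f"{idle_timeout_s}s",
    }


conf = dynamic_allocation_conf(min_executors=2, max_executors=50)
for key, value in conf.items():
    print(f"--conf {key}={value}")
```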

Eggplant Technology’s production experience is described: using Contour to expose Spark UI, applying node‑selector and affinity/anti‑affinity rules to keep drivers on dedicated nodes, and running executors on spot instances with soft and hard affinity policies to reduce cross‑AZ traffic and improve cost efficiency.
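The scheduling rules described above could be expressed through Spark pod templates (`spark.kubernetes.driver.podTemplateFile` and `spark.kubernetes.executor.podTemplateFile`). The following is only a sketch of that idea: the label keys `node-pool` and `capacity-type` and the zone name are hypothetical, and the summary does not say which labels or mechanism Eggplant Technology actually uses. The hard rule uses `requiredDuringSchedulingIgnoredDuringExecution`; the soft rule uses `preferredDuringSchedulingIgnoredDuringExecution`.

```python
import json
from typing import Dict, List


def match_in(key: str, values: List[str]) -> Dict:
    """One node selector term: the node label `key` must be one of `values`."""
    return {"matchExpressions": [{"key": key, "operator": "In", "values": values}]}


def driver_pod_template(selector: Dict[str, str]) -> Dict:
    """Pin the driver to dedicated nodes with a plain nodeSelector."""
    return {"apiVersion": "v1", "kind": "Pod", "spec": {"nodeSelector": dict(selector)}}


def executor_pod_template(hard_key: str, hard_values: List[str],
                          soft_key: str, soft_values: List[str],
                          weight: int = 100) -> Dict:
    """Hard affinity (must match) plus soft affinity (preferred) for executors."""
    if not 1 <= weight <= 100:
        raise ValueError("Kubernetes requires a weight between 1 and 100")
    node_affinity = {
        # Hard rule: the pod stays Pending if no node matches.
        "requiredDuringSchedulingIgnoredDuringExecution": {
            "nodeSelectorTerms": [match_in(hard_key, hard_values)]
        },
        # Soft rule: the scheduler prefers matching nodes but may fall back.
        "preferredDuringSchedulingIgnoredDuringExecution": [
            {"weight": weight, "preference": match_in(soft_key, soft_values)}
        ],
    }
    return {"apiVersion": "v1", "kind": "Pod",
            "spec": {"affinity": {"nodeAffinity": node_affinity}}}


# Hypothetical labels: on-demand driver pool, spot executors, one preferred zone.
driver = driver_pod_template({"node-pool": "spark-driver"})
executor = executor_pod_template(
    hard_key="capacity-type", hard_values=["spot"],
    soft_key="topology.kubernetes.io/zone", soft_values=["zone-a"],
)
print(json.dumps(executor, indent=2))
```

Keeping the zone preference soft is one way to reduce cross‑AZ traffic without blocking executors when spot capacity in the preferred zone runs out.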

A major focus is the reuse‑PVC feature introduced in Spark 3.2, which preserves shuffle data on persistent volumes when executors are lost. The open‑source implementation has limitations (inaccurate recovery, delayed metadata reporting, path mismatches). Eggplant’s custom enhancements include a PVC state set, early shuffle metadata reporting, correct path handling, and modified executor‑lost handling to avoid full task recomputation.
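For reference, the stock (open‑source) PVC reuse feature is switched on through configuration along the following lines. This sketch shows only the upstream settings, not Eggplant's custom enhancements. The storage class, size, and mount path are assumed example values, and the shuffle I/O plugin class should be checked against the Spark version in use.

```python
from typing import Dict


def pvc_reuse_conf(storage_class: str, size: str, mount_path: str) -> Dict[str, str]:
    """Settings for on-demand executor PVCs that the driver owns and reuses."""
    # A volume name starting with "spark-local-dir-" makes Spark use the
    # volume as local (shuffle/spill) storage.
    vol = "spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1"
    return {
        # The driver owns the PVCs, so they outlive a lost executor pod.
        "spark.kubernetes.driver.ownPersistentVolumeClaim": "true",
        # New executors may pick up PVCs left behind by dead ones.
        "spark.kubernetes.driver.reusePersistentVolumeClaim": "true",
        # "OnDemand" asks Spark to create one PVC per executor.
        f"{vol}.options.claimName": "OnDemand",
        f"{vol}.options.storageClass": storage_class,
        f"{vol}.options.sizeLimit": size,
        f"{vol}.mount.path": mount_path,
        f"{vol}.mount.readOnly": "false",
        # Plugin that tries to recover shuffle blocks found on a reused volume.
        "spark.shuffle.sort.io.plugin.class":
            "org.apache.spark.shuffle.KubernetesLocalDiskShuffleDataIO",
    }


# Example values are assumptions, not taken from the talk.
pvc_conf = pvc_reuse_conf(storage_class="gp2", size="100Gi", mount_path="/data")
for key, value in pvc_conf.items():
    print(f"--conf {key}={value}")
```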

Performance tests show a clear improvement of Spark 3.2.2 over 3.0.1 after the PVC redesign. Additional work includes automatic driver pod cleanup, ELK‑based log collection, DNS configuration tweaks, predicate push‑down, multi‑dimensional analysis optimizations, and cloud‑cost accounting.
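The summary does not describe how the automatic driver pod cleanup works. One plausible shape, sketched here purely as an assumption, is a periodic job that selects finished driver pods older than a retention window. The pod snapshot dictionaries and the 24‑hour TTL are hypothetical; `spark-role=driver` is the label Spark puts on driver pods, and `Succeeded`/`Failed` are the terminal Kubernetes pod phases.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, List


def drivers_to_delete(pods: List[Dict], now: datetime, ttl: timedelta) -> List[str]:
    """Return names of finished Spark driver pods older than `ttl`."""
    doomed = []
    for pod in pods:
        if pod["labels"].get("spark-role") != "driver":
            continue  # executors and unrelated pods are left alone
        if pod["phase"] not in ("Succeeded", "Failed"):
            continue  # never touch a running or pending driver
        if now - pod["finished_at"] >= ttl:
            doomed.append(pod["name"])
    return doomed


# Hypothetical snapshot of pod state, e.g. gathered from the Kubernetes API.
now = datetime(2023, 4, 10, 12, 0, tzinfo=timezone.utc)
pods = [
    {"name": "etl-a-driver", "labels": {"spark-role": "driver"},
     "phase": "Succeeded", "finished_at": now - timedelta(hours=30)},
    {"name": "etl-b-driver", "labels": {"spark-role": "driver"},
     "phase": "Succeeded", "finished_at": now - timedelta(hours=2)},
    {"name": "etl-c-driver", "labels": {"spark-role": "driver"},
     "phase": "Running", "finished_at": now},
    {"name": "etl-a-exec-1", "labels": {"spark-role": "executor"},
     "phase": "Failed", "finished_at": now - timedelta(hours=40)},
]
stale = drivers_to_delete(pods, now, ttl=timedelta(hours=24))
print(stale)
```

Retaining recently finished drivers for a while keeps their logs and status available for debugging before the pod is removed.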

Future directions involve solving shuffle reuse fundamentally via Remote Shuffle Service, improving interactive Spark job submission, and building a lightweight, pod‑aware log‑viewing system.

