
Offline Mixed Deployment of Spark Tasks on Kubernetes: Containerization, Scheduling, and Elastic Resource Management

The article explains how the vivo Internet Big Data team containerized offline Spark jobs and deployed them with the Spark Operator on a mixed online‑offline Kubernetes cluster, using elastic scheduling and resource‑over‑subscription to boost CPU utilization by 30‑40% and handle over 100,000 daily tasks.

vivo Internet Technology

This article, authored by the vivo Internet Big Data team, describes how offline Spark jobs can be containerized and smoothly submitted to a mixed (online‑offline) Kubernetes cluster, thereby improving overall resource utilization.

It first explains the difference between online services (long‑running, latency‑critical) and offline tasks (short‑running, tolerant of latency). Because online and offline workloads have opposite peak periods, mixing them on the same machines can raise CPU utilization by 30‑40%.

The core of the solution is containerizing offline tasks. Two main architectures are compared:

Spark on K8s – Spark directly creates Driver and Executor Pods via the Kubernetes API.

Yarn on K8s – Yarn creates ResourceManager (RM) and NodeManager (NM) Pods, and Spark runs inside the NM Pods.

Both have pros and cons; the team ultimately chose Spark on K8s with the Spark Operator because it offers finer‑grained resource control and better integration with the mixed‑deployment environment.

The article details the Spark Operator workflow, including the declarative YAML submission model and the additional controller components (SparkApplication controller, Submission Runner, Spark Pod Monitor). It also discusses why the traditional spark-submit approach is less suitable for large‑scale mixed deployment.
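To make the declarative submission model concrete, the sketch below shows a minimal `SparkApplication` manifest of the kind the Spark Operator reconciles. The image name, namespace, and resource figures are illustrative placeholders, not values from the article:

```yaml
apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: example-etl-job          # illustrative name
  namespace: spark-offline       # illustrative namespace
spec:
  type: Scala
  mode: cluster
  image: registry.example.com/spark:3.x-jdk8   # placeholder image
  mainClass: com.example.MainClass             # placeholder main class
  mainApplicationFile: local:///opt/spark/jars/job.jar
  sparkVersion: "3.1.1"
  driver:
    cores: 1
    memory: "2g"
    serviceAccount: spark
  executor:
    instances: 4
    cores: 2
    memory: "4g"
```

The SparkApplication controller watches objects like this, the Submission Runner turns them into actual spark-submit invocations, and the Spark Pod Monitor reports Pod status back into the object, which is what makes the model easier to automate at scale than hand-run spark-submit.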

Final architecture choice: Spark on K8s with the Spark Operator.

Spark image construction must provide a complete Spark runtime environment, a matching JDK version (the jobs require JDK 1.8, while the default base image ships JDK 11), and the proper environment variables.
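A minimal illustrative Dockerfile sketch of such an image, assuming a Temurin JDK 8 base; the base image, Spark distribution path, and version are placeholders rather than the team's actual build:

```dockerfile
# Jobs require JDK 1.8, while default Spark base images ship JDK 11,
# so build on a JDK 8 base instead.
FROM eclipse-temurin:8-jdk

# Lay down a complete Spark distribution (placeholder path/version).
ENV SPARK_HOME=/opt/spark
ENV PATH=$SPARK_HOME/bin:$PATH
COPY spark-3.x-bin-hadoop3/ $SPARK_HOME/
```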

Typical command lines for the two job types are:

SQL task: driver --class org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver -f {sql_file}

JAR task: driver --class {jar_main_class} {jar_file} {args}

Initially the platform used initContainers and sidecar containers for config and log collection, but these were removed to reduce ETCD pressure and pod startup time.

Further optimizations include:

Removing initContainers and sidecars.

Adjusting overheadMemory and adding a fixed 100 MiB pod overhead.

Applying CPU over‑subscription, requesting fewer cores than a task nominally needs (e.g., a task sized at 1 core is submitted with a 0.8‑core request), to improve utilization.
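The memory and CPU settings above translate into pod resource specs roughly like the following sketch; the concrete numbers are illustrative, assuming a task sized at 1 core and 4 GiB:

```yaml
resources:
  requests:
    cpu: "800m"        # over-subscribed: 0.8 core for a 1-core task
    memory: "4196Mi"   # heap + overheadMemory + fixed 100 MiB pod overhead
  limits:
    cpu: "1"
    memory: "4196Mi"
```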

The elastic scheduling system decides whether a task can be submitted to the mixed cluster based on resource water‑level thresholds, priority water‑level, and remaining capacity. It supports multiple clusters, time‑based controls, namespace spreading, and fallback to Yarn on failure.
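The admission decision can be sketched as follows. This is a hypothetical reconstruction of the logic described above, not the team's code; the thresholds, field names, and `select_target` helper are all illustrative:

```python
# Sketch of elastic-scheduling admission: a task enters a mixed cluster only
# if the cluster's CPU water level is below the threshold for the task's
# priority and enough capacity remains; otherwise it falls back to Yarn.
from dataclasses import dataclass

# Illustrative priority water levels: higher-priority tasks are still
# admitted when the cluster is fuller.
WATER_LEVEL = {"high": 0.9, "low": 0.7}

@dataclass
class Cluster:
    name: str
    cpu_usage: float   # current CPU water level, 0.0-1.0
    free_cores: int
    free_mem_gb: int

def select_target(task_priority, need_cores, need_mem_gb, clusters):
    """Return the first mixed cluster that can admit the task, else 'yarn'."""
    limit = WATER_LEVEL[task_priority]
    for c in clusters:
        if c.cpu_usage >= limit:
            continue  # water level too high for this priority
        if c.free_cores >= need_cores and c.free_mem_gb >= need_mem_gb:
            return c.name
    return "yarn"  # fallback to Yarn when no mixed cluster qualifies
```

A low-priority task is rejected at a water level a high-priority task would still pass, which matches the priority-water-level idea in the article.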

Scheduling stability is enhanced by:

Short‑term submission limits.

Delayed, randomized submission to spread spikes.

Dynamic feedback control based on pending pod counts.
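The three mechanisms above can be sketched as two small functions. This is an illustrative reconstruction, assuming hypothetical parameters (`max_pending`, `short_term_limit`, jitter bounds) rather than the team's actual implementation:

```python
import random

def submission_delay(base_s: float, jitter_s: float) -> float:
    """Spread submission spikes: delay each submission by a base interval
    plus a random jitter so bursts do not hit the API server at once."""
    return base_s + random.uniform(0.0, jitter_s)

def may_submit(pending_pods: int, max_pending: int,
               recent_submissions: int, short_term_limit: int) -> bool:
    """Feedback gate: refuse submission when too many submissions landed in
    the recent window, or when the cluster already has too many pending pods."""
    if recent_submissions >= short_term_limit:
        return False  # short-term submission limit hit
    return pending_pods < max_pending  # back off when pods pile up pending
```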

Cluster selection strategies evolved from simple resource‑sorted queues to weighted random queues and finally to a priority‑plus‑weighted‑random scheme, ensuring both large and small clusters receive appropriate workloads.
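A sketch of the final priority-plus-weighted-random scheme, under the assumption that weight is proportional to something like free capacity; the function and field names are illustrative:

```python
import random

def choose_cluster(clusters):
    """clusters: list of (name, priority, weight) tuples.
    Restrict the draw to the highest-priority clusters, then pick one at
    random with probability proportional to its weight."""
    top = max(priority for _, priority, _ in clusters)
    candidates = [(name, weight) for name, priority, weight in clusters
                  if priority == top]
    # Weighted random draw over the top-priority candidates.
    total = sum(weight for _, weight in candidates)
    r = random.uniform(0.0, total)
    for name, weight in candidates:
        r -= weight
        if r <= 0:
            return name
    return candidates[-1][0]  # guard against floating-point rounding
```

Compared with a plain resource-sorted queue, the weighted draw keeps small clusters from being starved while still sending most load to large ones.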

Results show that the mixed deployment now handles over 100,000 task submissions daily, adds hundreds of TB of memory capacity during off‑peak hours, and raises the CPU utilization of mixed clusters to around 30%.

Future work includes expanding mixed deployment to non‑standard script tasks, further increasing CPU utilization, and encouraging earlier offline‑task execution to maximize resource sharing.

Tags: Big Data, Kubernetes, resource management, Containerization, Spark, elastic scheduling