
Offline Mixed Deployment of Spark Tasks on Kubernetes: Containerization, Scheduling, and Elastic Resource Management

The article explains how the vivo Internet Big Data team containerized offline Spark jobs and deployed them with the Spark Operator on a mixed online‑offline Kubernetes cluster, using elastic scheduling and resource‑over‑subscription to boost CPU utilization by 30‑40% and handle over 100,000 daily tasks.

vivo Internet Technology

This article, authored by the vivo Internet Big Data team, describes how offline Spark jobs can be containerized and smoothly submitted to a mixed (online‑offline) Kubernetes cluster, thereby improving overall resource utilization.

It first explains the difference between online services (long‑running, latency‑critical) and offline tasks (short‑running, tolerant of latency). Because online and offline workloads have opposite peak periods, mixing them on the same machines can raise CPU utilization by 30‑40%.

The core of the solution is containerizing offline tasks. Two main architectures are compared:

Spark on K8s – Spark directly creates Driver and Executor Pods via the Kubernetes API.

Yarn on K8s – Yarn creates ResourceManager (RM) and NodeManager (NM) Pods, and Spark runs inside the NM Pods.

Both have pros and cons; the team ultimately chose Spark on K8s with the Spark Operator because it offers finer‑grained resource control and better integration with the mixed‑deployment environment.

The article details the Spark Operator workflow, including the declarative YAML submission model and the additional controller components (SparkApplication controller, Submission Runner, Spark Pod Monitor). It also discusses why the traditional spark-submit approach is less suitable for large‑scale mixed deployment.
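To make the declarative submission model concrete, the sketch below shows a minimal `SparkApplication` manifest of the kind the Spark Operator reconciles. The image name, namespace, and resource figures are illustrative placeholders, not values from the article:

```yaml
apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: example-etl-job          # illustrative name
  namespace: spark-offline       # illustrative namespace
spec:
  type: Scala
  mode: cluster
  image: registry.example.com/spark:3.x-jdk8   # placeholder image
  mainClass: com.example.MainClass             # placeholder main class
  mainApplicationFile: local:///opt/spark/jars/job.jar
  sparkVersion: "3.1.1"
  driver:
    cores: 1
    memory: "2g"
    serviceAccount: spark
  executor:
    instances: 4
    cores: 2
    memory: "4g"
```

The SparkApplication controller watches objects like this, the Submission Runner turns them into actual spark-submit invocations, and the Spark Pod Monitor reports Pod status back into the object, which is what makes the model easier to automate at scale than hand-run spark-submit.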

Final architecture choice: Spark on K8s with the Spark Operator.

Spark image construction must provide a complete Spark runtime environment, a matching JDK version (the jobs require JDK 1.8, while the default base image ships JDK 11), and the proper environment variables.
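A minimal illustrative Dockerfile sketch of such an image, assuming a Temurin JDK 8 base; the base image, Spark distribution path, and version are placeholders rather than the team's actual build:

```dockerfile
# Jobs require JDK 1.8, while default Spark base images ship JDK 11,
# so build on a JDK 8 base instead.
FROM eclipse-temurin:8-jdk

# Lay down a complete Spark distribution (placeholder path/version).
ENV SPARK_HOME=/opt/spark
ENV PATH=$SPARK_HOME/bin:$PATH
COPY spark-3.x-bin-hadoop3/ $SPARK_HOME/
```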

Typical command lines for the two job types are:

SQL task: driver --class org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver -f {sql_file}

JAR task: driver --class {jar_main_class} {jar_file} {args}

Initially the platform used initContainers and sidecar containers for config and log collection, but these were removed to reduce ETCD pressure and pod startup time.

Further optimizations include:

Removing initContainers and sidecars.

Adjusting overheadMemory and adding a fixed 100 MiB pod overhead.

Applying CPU over‑subscription, requesting fewer cores than a task nominally needs (e.g., a task sized at 1 core is submitted with a 0.8‑core request), to improve utilization.
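The memory and CPU settings above translate into pod resource specs roughly like the following sketch; the concrete numbers are illustrative, assuming a task sized at 1 core and 4 GiB:

```yaml
resources:
  requests:
    cpu: "800m"        # over-subscribed: 0.8 core for a 1-core task
    memory: "4196Mi"   # heap + overheadMemory + fixed 100 MiB pod overhead
  limits:
    cpu: "1"
    memory: "4196Mi"
```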

The elastic scheduling system decides whether a task can be submitted to the mixed cluster based on resource water‑level thresholds, priority water‑level, and remaining capacity. It supports multiple clusters, time‑based controls, namespace spreading, and fallback to Yarn on failure.
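The admission decision can be sketched as follows. This is a hypothetical reconstruction of the logic described above, not the team's code; the thresholds, field names, and `select_target` helper are all illustrative:

```python
# Sketch of elastic-scheduling admission: a task enters a mixed cluster only
# if the cluster's CPU water level is below the threshold for the task's
# priority and enough capacity remains; otherwise it falls back to Yarn.
from dataclasses import dataclass

# Illustrative priority water levels: higher-priority tasks are still
# admitted when the cluster is fuller.
WATER_LEVEL = {"high": 0.9, "low": 0.7}

@dataclass
class Cluster:
    name: str
    cpu_usage: float   # current CPU water level, 0.0-1.0
    free_cores: int
    free_mem_gb: int

def select_target(task_priority, need_cores, need_mem_gb, clusters):
    """Return the first mixed cluster that can admit the task, else 'yarn'."""
    limit = WATER_LEVEL[task_priority]
    for c in clusters:
        if c.cpu_usage >= limit:
            continue  # water level too high for this priority
        if c.free_cores >= need_cores and c.free_mem_gb >= need_mem_gb:
            return c.name
    return "yarn"  # fallback to Yarn when no mixed cluster qualifies
```

A low-priority task is rejected at a water level a high-priority task would still pass, which matches the priority-water-level idea in the article.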

Scheduling stability is enhanced by:

Short‑term submission limits.

Delayed, randomized submission to spread spikes.

Dynamic feedback control based on pending pod counts.
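The three mechanisms above can be sketched as two small functions. This is an illustrative reconstruction, assuming hypothetical parameters (`max_pending`, `short_term_limit`, jitter bounds) rather than the team's actual implementation:

```python
import random

def submission_delay(base_s: float, jitter_s: float) -> float:
    """Spread submission spikes: delay each submission by a base interval
    plus a random jitter so bursts do not hit the API server at once."""
    return base_s + random.uniform(0.0, jitter_s)

def may_submit(pending_pods: int, max_pending: int,
               recent_submissions: int, short_term_limit: int) -> bool:
    """Feedback gate: refuse submission when too many submissions landed in
    the recent window, or when the cluster already has too many pending pods."""
    if recent_submissions >= short_term_limit:
        return False  # short-term submission limit hit
    return pending_pods < max_pending  # back off when pods pile up pending
```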

Cluster selection strategies evolved from simple resource‑sorted queues to weighted random queues and finally to a priority‑plus‑weighted‑random scheme, ensuring both large and small clusters receive appropriate workloads.
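A sketch of the final priority-plus-weighted-random scheme, under the assumption that weight is proportional to something like free capacity; the function and field names are illustrative:

```python
import random

def choose_cluster(clusters):
    """clusters: list of (name, priority, weight) tuples.
    Restrict the draw to the highest-priority clusters, then pick one at
    random with probability proportional to its weight."""
    top = max(priority for _, priority, _ in clusters)
    candidates = [(name, weight) for name, priority, weight in clusters
                  if priority == top]
    # Weighted random draw over the top-priority candidates.
    total = sum(weight for _, weight in candidates)
    r = random.uniform(0.0, total)
    for name, weight in candidates:
        r -= weight
        if r <= 0:
            return name
    return candidates[-1][0]  # guard against floating-point rounding
```

Compared with a plain resource-sorted queue, the weighted draw keeps small clusters from being starved while still sending most load to large ones.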

Results show that the mixed deployment now handles over 100,000 task submissions daily, adds hundreds of TB of memory capacity during off‑peak hours, and raises the CPU utilization of mixed clusters to around 30%.

Future work includes expanding mixed deployment to non‑standard script tasks, further increasing CPU utilization, and encouraging earlier offline‑task execution to maximize resource sharing.

Tags: Big Data, Kubernetes, resource management, Containerization, Spark, elastic scheduling