How Hulu Scales Spark on Kubernetes: Cloud‑Native Big Data at Disney‑Scale
Hulu’s data platform team describes migrating large‑scale Spark workloads from YARN to native Spark on Kubernetes. Built on AWS services such as EKS and S3 plus custom operators, the new platform delivers dynamic scaling, unified monitoring, cost‑effective resource management, and improved stability for search, recommendation, and advertising pipelines.
Background
Hulu’s search, recommendation, and advertising businesses require large‑scale data processing. Hulu built a suite of Hadoop‑ecosystem components deployed across its data centers, providing offline and real‑time processing via Spark and Flink on YARN.
Integration with Disney streaming and advertising increased data volume, prompting a redesign of the big‑data infrastructure to improve elasticity, reduce maintenance cost, and enhance stability from a cloud‑native perspective.
Why Spark on Kubernetes?
Spark has shipped native support for submitting jobs to Kubernetes since version 2.3. Compared with YARN, Kubernetes offers better resource isolation, management, and elastic scaling, making Spark on Kubernetes a mainstream deployment trend.
Cloud‑Native Stack
Hulu uses AWS services such as EKS (managed Kubernetes) and S3 (object storage) to reduce operational overhead and align with Disney’s AWS‑based infrastructure.
Spark on Kubernetes Fundamentals
Native Spark on Kubernetes lacks application lifecycle management, so operators are used to abstract Spark applications as custom Kubernetes resources.
The Driver interacts with the API server (which acts as the cluster manager) to create Executor pods, and the Executors register back with the Driver. After the job completes, the Executor pods are cleaned up while the Driver pod remains in the Completed state.
1. spark‑submit sends the application; the API server creates the Driver pod.
2. The scheduler assigns the Driver pod to a node.
3. The Driver requests Executor pods from the API server.
4. Executors start, register with the Driver, and run tasks.
5. On finish, the Executor pods are removed; the Driver stays in the Completed state.
Running spark‑submit requires pod create/delete permissions, and the master address points to the API server. Spark on K8s supports client and cluster modes.
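The submission described above can be sketched as a spark‑submit invocation in cluster mode. The API server address, image name, and jar path below are illustrative placeholders, not values from the article; the `spark.kubernetes.*` keys are standard Spark configuration:

```shell
spark-submit \
  --master k8s://https://<api-server-host>:6443 \
  --deploy-mode cluster \
  --name spark-pi \
  --class org.apache.spark.examples.SparkPi \
  --conf spark.executor.instances=3 \
  --conf spark.kubernetes.container.image=<registry>/spark:3.x \
  --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
  local:///opt/spark/examples/jars/spark-examples_2.12-3.x.x.jar
```

The service account named here must carry the pod create/delete permissions mentioned above, typically granted via a Role and RoleBinding in the job namespace.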
Operator Deployment
Using a Spark Operator (e.g., GCP’s Spark on Kubernetes Operator), SparkApplication CRDs allow users to describe jobs in YAML. The operator submits the job, creates UI services, patches Executor pods, and cleans up resources.
1. The operator submits the Spark job and creates the UI service/Ingress.
2. The Driver pod requests Executor pods.
3. A MutatingAdmissionWebhook patches the Executor pods.
4. Executors register with the Driver.
5. The Driver runs user code.
6. The Driver finishes; the SparkContext ends.
7. The operator cleans up pods and services.
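With the operator, the same job becomes a declarative SparkApplication manifest. This is a representative example for the GCP operator's v1beta2 API; the image, namespace, and resource values are illustrative, not Hulu's actual settings:

```yaml
apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: spark-pi
  namespace: spark-jobs          # illustrative namespace
spec:
  type: Scala
  mode: cluster
  image: <registry>/spark:3.x    # illustrative image
  mainClass: org.apache.spark.examples.SparkPi
  mainApplicationFile: local:///opt/spark/examples/jars/spark-examples_2.12-3.x.x.jar
  sparkVersion: "3.1.1"
  driver:
    cores: 1
    memory: 2g
    serviceAccount: spark
  executor:
    instances: 3
    cores: 2
    memory: 4g
```

Applying this with kubectl replaces the manual spark‑submit call; the operator watches the resource and handles submission, retries, and cleanup.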
Job Submission Interface
A custom spark‑submitter service receives job submissions and status queries, exposing a CLI‑like interface and an Airflow operator for scheduled jobs. Artifacts are uploaded to S3 for execution.
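The submitter's API is internal and not documented in the article; as a purely hypothetical illustration, a submission payload might carry the artifact's S3 location, the target queue, and Spark configuration overrides:

```json
{
  "name": "ads-attribution-daily",
  "queue": "batch",
  "mainApplicationFile": "s3://<bucket>/jobs/ads-attribution.jar",
  "mainClass": "com.example.AdsAttribution",
  "sparkConf": {
    "spark.executor.instances": "20",
    "spark.executor.memory": "8g"
  }
}
```

Every field name above is an assumption for illustration; the real service's schema may differ.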
Resource Queues and Priorities
Kubernetes lacks YARN‑style hierarchical queues, so custom resource queues define usage limits, while task priority leverages the Kubernetes PriorityClass mechanism.
Different queues are used for batch, streaming, and ad‑hoc queries.
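A PriorityClass such queues might attach to latency‑sensitive pods could look like the following; the name, value, and description are illustrative, not taken from Hulu's setup:

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: spark-streaming-high   # hypothetical class name
value: 100000                  # pods with higher values are scheduled (and preempt) first
globalDefault: false
description: "Priority for latency-sensitive streaming jobs"
```

Driver and Executor pods opt in by setting `priorityClassName` in their pod spec, so streaming jobs can preempt ad‑hoc or batch work when the cluster is full.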
Monitoring and Logging
A Fluentd DaemonSet collects container stdout/stderr and forwards it to Elasticsearch. Cluster‑level event and audit logs are also synced to Elasticsearch via EKS configuration.
Datadog agents auto‑discover Spark UI metrics using a Spark configuration snippet.
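One way to wire this up, assuming Datadog's annotation‑based autodiscovery and Spark's `spark.kubernetes.driver.annotation.*` pass‑through (the cluster name and port are illustrative), is to annotate the Driver pod from Spark configuration:

```yaml
# Hypothetical sparkConf fragment; annotation keys follow Datadog's
# autodiscovery format, targeting the default driver container name.
sparkConf:
  "spark.kubernetes.driver.annotation.ad.datadoghq.com/spark-kubernetes-driver.check_names": '["spark"]'
  "spark.kubernetes.driver.annotation.ad.datadoghq.com/spark-kubernetes-driver.init_configs": '[{}]'
  "spark.kubernetes.driver.annotation.ad.datadoghq.com/spark-kubernetes-driver.instances": '[{"spark_url": "http://%%host%%:4040", "spark_cluster_mode": "spark_driver_mode", "cluster_name": "eks-spark"}]'
```

The node‑local Datadog agent then discovers the annotated Driver and scrapes job, stage, and executor metrics from the Spark UI endpoint.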
Dynamic Scaling
AWS Auto Scaling Groups add or remove EC2 nodes based on pending pod demand. Spot instances run non‑critical Spark tasks to reduce cost, with node selectors and tolerations directing workloads.
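One way to steer non‑critical executors onto spot capacity is an executor pod template supplied via `spark.kubernetes.executor.podTemplateFile`. The label and taint names below are assumptions, not values from the article:

```yaml
# executor-pod-template.yaml -- label/taint names are illustrative
spec:
  nodeSelector:
    node.kubernetes.io/lifecycle: spot   # assumed label on spot node groups
  tolerations:
    - key: "spot"                        # assumed taint applied to spot nodes
      operator: "Equal"
      value: "true"
      effect: "NoSchedule"
```

Tainting the spot node group keeps critical drivers and streaming jobs off interruptible capacity, while tolerating executors fill it opportunistically.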
Storage and Compute Layer
Glue provides metadata management; S3 stores the data warehouse. Users query via Athena/Presto or Spark Thrift Server, while ETL jobs run as Spark/Flink jobs.
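A common way to point Spark at the Glue catalog (an assumption here; the article does not state which client Hulu uses) is the Glue Hive‑metastore client factory in spark-defaults.conf:

```properties
# Use Hive-compatible catalog resolution, backed by AWS Glue
spark.sql.catalogImplementation                   hive
spark.hadoop.hive.metastore.client.factory.class  com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory
```

With this in place, Spark SQL, the Thrift Server, and Athena all resolve the same table definitions, so ETL output lands in tables that ad‑hoc users can query immediately.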
Data Quality
Quality checks include constraint validation, metric collection, alerting, and visualization, integrated via Airflow tasks and a quality‑check library that asserts DataFrame properties at runtime.
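The constraint‑validation idea can be sketched framework‑agnostically. Hulu's internal library asserts properties of Spark DataFrames; in this minimal sketch rows are plain dicts so the logic is easy to follow, and all function names are illustrative:

```python
# Minimal sketch of constraint validation: each check returns a result
# record that downstream alerting/visualization can consume.

def check_not_null(rows, column):
    """Count rows where `column` is null; pass only if there are none."""
    nulls = sum(1 for r in rows if r.get(column) is None)
    return {"check": f"not_null({column})", "passed": nulls == 0, "violations": nulls}

def check_unique(rows, column):
    """Count duplicate values in `column`; pass only if all are distinct."""
    values = [r.get(column) for r in rows]
    dupes = len(values) - len(set(values))
    return {"check": f"unique({column})", "passed": dupes == 0, "violations": dupes}

def run_checks(rows, checks):
    """Run every check and split out the failures for alerting."""
    results = [check(rows) for check in checks]
    failed = [r for r in results if not r["passed"]]
    return results, failed

rows = [
    {"user_id": 1, "ad_id": "a"},
    {"user_id": 2, "ad_id": "b"},
    {"user_id": 2, "ad_id": None},   # violates both checks below
]
results, failed = run_checks(rows, [
    lambda r: check_not_null(r, "ad_id"),
    lambda r: check_unique(r, "user_id"),
])
```

In an Airflow pipeline, a task wrapping `run_checks` would fail (and alert) when `failed` is non‑empty, blocking downstream consumers from reading bad data.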
Challenges and Future Work
Issues such as data locality on S3, dynamic allocation without external shuffle service, batch scheduling, and authentication across AWS IAM, Kubernetes RBAC, and Hulu’s internal accounts are discussed, with proposed solutions.
Conclusion
Combining big data with cloud‑native technologies yields scalable, maintainable solutions, though challenges remain.
Hulu Beijing