Big Data 27 min read

Migrating Youzan Offline Spark Platform to Kubernetes: Architecture, Optimizations, and Lessons Learned

This article details how Youzan's offline Spark computing platform was transformed for the cloud‑native era by migrating from YARN to Kubernetes, introducing containerization, storage‑compute separation, dynamic allocation, deployment optimizations, and a collection of practical lessons to reduce cost and improve resource utilization.

DataFunTalk

Feb 28, 2021

Migrating Youzan Offline Spark Platform to Kubernetes: Architecture, Optimizations, and Lessons Learned

1. Introduction

With rapid business growth, data platform costs have risen sharply; the 2020 goal was to build low‑cost, high‑efficiency underlying services. This article describes how Youzan's seven‑year‑old offline compute platform embraced cloud‑native technologies—containerization, elastic scaling, and staggered component deployment—to achieve cost reduction even as business volume multiplied.

Current offline compute status includes a new 10 Gbps network cluster, >90% of tasks running on Spark (migrated from Hive in 2019), and a storage‑compute mixed‑mode causing a “bucket effect” where expanding compute resources leaves storage under‑utilized.

New directions focus on storage‑compute separation, higher machine utilization via K8s mixing, and finer‑grained Docker resource isolation.

The remainder of the article covers the technical solution, Spark migration, deployment optimizations, and experience gained.

2. Technical Solution

Moving from YARN to K8s raises two challenges: loss of executor dynamic allocation and handling shuffle/sort data after storage‑compute separation.

2.1 Dynamic Allocation

Dynamic allocation improves resource utilization but requires an external shuffle service in YARN, which is absent in K8s. The team adopted SPARK‑27963 to enable allocation without an external service, tracking shuffle references and releasing executors only when their shuffle data is no longer needed.

Two possible shuffle‑service solutions for K8s are:

Deploy shuffle service as a DaemonSet (early Spark 2.2 experimental feature, requires hostPath volume and ties pods to specific hosts).

Remote storage for persisting shuffle data (SPARK‑25299), still under community development; Uber’s RemoteShuffle is being trialed.

2.2 Storage Volume (PV) Selection

In YARN, large local disks were shared for compute and storage. After separation, Spark’s shuffle and sort disk needs must be satisfied.

hostPath

The team evaluated cloud providers and chose to mount multiple cloud block disks (CBS) on compute nodes and reference them via hostPath in Spark pods. Drawbacks include pod‑host binding and limited I/O throughput per disk.

Ceph

Ceph offers high read/write performance, striping, and three storage modes (block, file, object). For general pods, Ceph RBD is preferred; for Spark pods with heavy shuffle, CephFS (ReadWriteMany) is explored, allowing local‑like shuffle access without an external service.

3. Spark Refactoring

Spark 2.3.3 does not officially support K8s, so several modifications were required.

3.1 Small File Merging

Too many small shuffle files degrade performance and stress the NameNode. A new merge operation replaces the previous MapReduce‑based approach, reducing NameNode memory pressure.

3.2 Log Collection Service

Logs disappear when executor pods terminate. A Filebeat‑to‑Kafka pipeline was added, tagging logs with app type, app ID, and executor ID for reliable collection and downstream analysis.

3.3 Remote Shuffle Service

Uber’s open‑source remote shuffle service was integrated, allowing executors to release resources promptly after tasks finish, improving stability.

3.4 Spark K8s Driver Pod Construction Order

Driver pods now wait for required ConfigMaps and volumes to be created before the pod itself is instantiated, preventing startup failures.

3.5 Distributing Local Resources

Two approaches were considered: bundling resources in the Spark image (high disk usage) or uploading them to HDFS and downloading at runtime. The latter was chosen.

3.6 Web UI Exposure

Ingress and Service resources expose the Spark UI to the office network, handling pod IP changes transparently.

3.7 Spark Pod Label Extension

Separate nodeSelectors for driver and executor pods enable driver stability while allowing executors to be scheduled on varied nodes.

3.8 Spark App State Management

spark‑submit now monitors driver pod exit codes and propagates them, aligning Airflow’s success detection with actual Spark job outcomes. A shutdown hook deletes the driver pod on SIGTERM to keep Airflow state consistent.

3.9 Dynamic Adjustment of dynamicAllocation.maxExecutors

Since the parameter cannot be changed at runtime, a custom mechanism was added to modify it without restarting the service during peak loads.

4. Deployment Optimizations

4.1 Staggered Co‑Location

By mixing offline Spark workloads with other services during off‑peak hours, resource utilization improves without affecting real‑time workloads.

4.2 Elastic Scaling

Using Tencent Cloud’s K8s auto‑scaling, the cluster expands during 00:00‑09:00 peak periods and contracts afterward, saving costs while meeting demand.

5. Pitfalls and Experience

5.1 K8s Accidentally Killing Executors

A containerd bug caused lingering shim processes that repeatedly sent KILL signals to PIDs, sometimes terminating unrelated Java threads. Upgrading containerd resolved the issue.

5.2 Linux Kernel Parameter Tuning

High executor connection counts filled the accept queue (min(somaxconn, backlog)), causing connection resets. Adjusting /proc/sys/net/core/somaxconn and spark.shuffle.io.backLog mitigated the problem.

5.3 Executor Loss Leading to Task Stalls

Changing executor removal logic from disableExecutor to removeExecutor prevented tasks from hanging after executor loss.

5.4 Memory Contention in Multi‑Task Executors

When executors host multiple cores, tasks compete for ExecutionMemoryPool, sometimes causing long waits. A timeout mechanism was added to fail tasks early and reschedule them.

while (true) {
  val numActiveTasks = memoryForTask.keys.size
  val curMem = memoryForTask(taskAttemptId)
  maybeGrowPool(numBytes - memoryFree)
  val maxPoolSize = computeMaxPoolSize()
  val maxMemoryPerTask = maxPoolSize / numActiveTasks
  val minMemoryPerTask = poolSize / (2 * numActiveTasks)
  val maxToGrant = math.min(numBytes, math.max(0, maxMemoryPerTask - curMem))
  val toGrant = math.min(maxToGrant, memoryFree)
  if (toGrant < numBytes && curMem + toGrant < minMemoryPerTask) {
    logInfo(s"TID $taskAttemptId waiting for at least 1/2N of $poolName pool to be free")
    lock.wait()
    // cause dead lock
  } else {
    memoryForTask(taskAttemptId) += toGrant
    return toGrant
  }
}

5.5 Corrupted Shuffle Data Blocks

Shuffle corruption errors were reduced by applying SPARK‑30225 and SPARK‑26089, which improve byte‑buffer handling and early detection of corrupted blocks.

5.6 Spark Configuration Loading Order

When resources are packaged in a fat JAR, --files may be ignored. Adding the executor’s user‑dir to the classpath resolves the issue.

5.7 Handling Resource Quota Exhaustion

When a namespace exceeds its quota, Spark jobs previously failed immediately. A retry‑with‑backoff strategy (e.g., 50 attempts with 10 s intervals) now allows jobs to wait for resources.

6. Conclusion

Youzan’s offline Spark workloads have successfully transitioned from YARN to Kubernetes, achieving cloud‑native benefits such as storage‑compute separation, containerization, and elastic scaling, while addressing numerous operational challenges and improving overall cost efficiency and reliability.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

Performance Optimization Big Data Kubernetes Resource Management Spark

Written by

DataFunTalk

Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.