
Offline Mixed Deployment with Kubernetes: Architecture, Implementation, and Performance Evaluation for Big Data Workloads

This article describes a cloud-native offline mixed-deployment solution that leverages Kubernetes to share resources between big-data clusters and business services. It outlines the implementation steps, presents detailed YARN-vs-Kubernetes performance comparisons on TPC-DS, Spark, and TeraSort workloads, and discusses production experience and future plans.

TAL Education Technology

Background: Kubernetes (k8s) and Alibaba Cloud Container Service (ACK) provide container orchestration, enabling deployment, scheduling, service discovery, and autoscaling for large‑scale clusters. Big‑data clusters consume substantial resources, often exhibiting a "tidal" usage pattern that can be offset by idle business‑service servers during off‑peak hours.

Objectives: 1) Enable resource sharing to mitigate tidal effects; 2) Accelerate cloud‑native adoption; 3) Achieve rapid elastic scaling of the cluster.

Offline Mixed-Deployment Scheme: The scheme exploits k8s autoscaling to make the YARN NodeManager (NM) elastic. It involves modifying the ResourceManager (RM) to recognize pod-based NM registrations, adapting CDH images to run as pods via startup scripts, and implementing YARN node-label logic so that containers can be scheduled onto k8s-hosted nodes.
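The node-labeling step can be sketched as a thin wrapper around Hadoop's standard `yarn rmadmin` node-label commands. Everything below is an illustrative assumption, not the article's actual configuration: the `k8s` label name, the pod-NM address, and the helper functions are hypothetical.

```python
import subprocess

K8S_LABEL = "k8s"  # assumed partition name for pod-based NodeManagers

def add_label_cmd(label: str) -> list[str]:
    """Build the command that registers a node label with the RM."""
    return ["yarn", "rmadmin", "-addToClusterNodeLabels",
            f"{label}(exclusive=false)"]

def label_node_cmd(node_addr: str, label: str) -> list[str]:
    """Build the command that attaches the label to one pod-NM."""
    return ["yarn", "rmadmin", "-replaceLabelsOnNode", f"{node_addr}={label}"]

def label_pod_nm(node_addr: str) -> None:
    """Label a newly registered pod-NM so jobs can target the k8s partition."""
    subprocess.run(label_node_cmd(node_addr, K8S_LABEL), check=True)

# Example with a hypothetical pod-NM address:
# label_pod_nm("nm-pod-0.yarn-nm.default.svc:8041")
```

Once pod-NMs carry the label, a capacity-scheduler queue with access to that partition can steer big-data jobs onto the k8s-hosted nodes.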

Peak-Shaving Scheduling: An enhanced version adds scheduled pod-NM launch and eviction, allowing big-data jobs to run on idle business-service resources during off-peak periods.
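The timed launch/eviction rule can be illustrated with a small scheduling helper. The window boundaries and replica count below are hypothetical, not the article's production values: pod-NMs scale up when business services go idle overnight and are evicted to zero before the business peak returns.

```python
from datetime import time

# Hypothetical off-peak window for business services (wraps past midnight).
OFF_PEAK_START = time(22, 0)   # launch pod-NMs at 22:00
OFF_PEAK_END = time(7, 0)      # evict them by 07:00
MAX_POD_NMS = 20               # illustrative capacity, not a production figure

def in_off_peak(now: time) -> bool:
    """True inside the off-peak window, handling the midnight wrap-around."""
    if OFF_PEAK_START <= OFF_PEAK_END:
        return OFF_PEAK_START <= now < OFF_PEAK_END
    return now >= OFF_PEAK_START or now < OFF_PEAK_END

def desired_pod_nm_replicas(now: time) -> int:
    """Scale pod-NMs up off-peak; evict all when business traffic returns."""
    return MAX_POD_NMS if in_off_peak(now) else 0
```

A cron-driven controller could apply this target with `kubectl scale` (or the Kubernetes API) against the workload that runs the pod-NMs.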

Performance Tests – YARN vs. k8s: TPC-DS (Hive) tests on 1 TB, 3 TB, and 10 TB datasets showed k8s slightly faster on 1 TB (by ≈29 s) but YARN ahead on the larger volumes (by up to 9 min). Spark tests indicated comparable runtimes, with k8s marginally faster (by 31 s to 263 s). On TeraSort (10 TB), YARN was faster by ~1.5 h thanks to short-circuit reads, while k8s outperformed YARN by ~52 min on the Spark-based runs. Overall, differences stayed within 1–1.5% in most scenarios.

Production Run: The offline mixed deployment is now in production with 1,000 vCores of k8s capacity, handling 600+ daily tasks. Container-eviction tests completed without disrupting jobs, and most task runtimes fall under 5 minutes.

Future Plans & Issues: Current limitations include fixed pod‑NM resources and rigid launch/eviction rules. Future work aims to achieve fully automated scaling, smarter scheduling, full‑stack monitoring, and separation of compute and storage to further enhance cloud‑native big‑data operations.

Tags: Cloud Native, big data, Kubernetes, performance testing, YARN, K8s, resource sharing
Written by

TAL Education Technology

TAL Education is a technology-driven education company committed to the mission of "making education better through love and technology." The TAL technology team has long been dedicated to educational technology research and innovation. This is the team's external platform, sharing weekly curated technical articles and recruitment information.
