
Migrating Big Data Workloads to Cloud‑Native Kubernetes: Challenges, Solutions, and Lessons from OPPO

This article describes how OPPO's big‑data team transitioned from traditional IDC and EMR environments to a cloud‑native Kubernetes architecture, detailing the motivations, design principles, elastic scaling challenges, custom solutions, and future directions for large‑scale data processing on the cloud.


Background: OPPO’s big‑data team operates both a self‑built IDC (internet data center) cluster and an overseas EMR (Elastic MapReduce) service, each with limitations in resource elasticity, cost control, and scaling. The IDC setup offers a full component stack but is bound by fixed hardware capacity, while EMR provides elastic compute yet still suffers from inefficient scaling and spot‑instance constraints.

Shift to Cloud‑Native Architecture: To overcome these limitations, the team migrated to a cloud‑native stack on the public cloud with Kubernetes (EKS) as the foundation, aiming for multi‑cloud portability, finer‑grained resource control, and lower operational cost. Key benefits include standardized component migration, improved flexibility, and reduced service fees compared with EMR.

Design Principles: The new architecture adopts Kubernetes for compute, supports multi‑cloud portability, implements a unified access layer, and separates storage from compute. It also containerizes data services (e.g., Livy, Spark Server) and leverages Kubernetes operators for custom elastic scaling.
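At its core, the operator pattern mentioned above is a reconcile loop: compare the desired state declared in a custom resource with the observed state, then apply a bounded correction each pass. The following is a minimal sketch of that decision logic only; the resource names, the `max_step` cap, and the functions are hypothetical, and a real operator would issue Kubernetes API calls where this sketch merely returns a delta.

```python
def reconcile(desired: int, observed: int, max_step: int = 4) -> int:
    """One reconcile pass: return a bounded replica delta to apply.

    Positive -> create pods, negative -> delete pods, 0 -> converged.
    Capping the step keeps a single pass from flooding the scheduler.
    """
    delta = desired - observed
    return max(-max_step, min(max_step, delta))


def converge(desired: int, observed: int, max_step: int = 4) -> int:
    """Drive observed replicas toward the desired count, one bounded
    reconcile pass at a time; returns how many passes were needed."""
    passes = 0
    while observed != desired:
        observed += reconcile(desired, observed, max_step)
        passes += 1
    return passes
```

For example, growing a pool from 0 to 10 replicas with a step cap of 4 takes three passes (4, then 4, then 2), which is the eventual-consistency behavior operators rely on.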

Extreme Elasticity Challenges & Solutions: The team identified problems such as over‑provisioning caused by scaling on pending tasks alone, the need for autoscalers that are aware of actual resource usage, and multi‑tenant resource isolation. Solutions include a custom Kubernetes Operator‑based scaler, load‑trend modeling, periodic scaling policies, differentiated scaling rates for driver versus worker pods, and delete‑cost‑driven safe node termination.
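The delete‑cost idea can be illustrated with a small sketch: assign each pod kind a cost reflecting the blast radius of killing it (terminating a Spark driver loses the whole job, while losing one executor is usually recoverable), score each node by the sum of its pods' costs, and only drain the cheapest nodes. The cost values, pod kinds, and function names below are illustrative assumptions, not OPPO's actual implementation.

```python
from typing import Dict, List

# Hypothetical per-pod delete costs: a driver pod is far more expensive
# to lose than a single executor; unknown pods default to a high cost.
DELETE_COST = {"driver": 100, "executor": 1, "shuffle": 10}


def node_delete_cost(pods: List[str]) -> int:
    """Sum the delete cost of every pod running on a node."""
    return sum(DELETE_COST.get(kind, 50) for kind in pods)


def pick_nodes_to_drain(nodes: Dict[str, List[str]], count: int,
                        max_cost: int = 20) -> List[str]:
    """Choose up to `count` nodes that are cheapest (and safe, i.e. under
    `max_cost`) to terminate during a scale-down."""
    ranked = sorted(nodes, key=lambda n: node_delete_cost(nodes[n]))
    return [n for n in ranked if node_delete_cost(nodes[n]) <= max_cost][:count]
```

Under this policy an empty node is drained first, an executor‑only node next, and a node hosting a driver is never touched unless its cost falls below the safety threshold.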

Resource Optimization: Efforts focus on maintaining 60‑70% cluster utilization, supporting both ARM and x86 architectures, classifying workloads for tailored machine types, and applying spot instances where appropriate. The system also introduces unified proxy services, standardized component deployment, and storage‑compute separation to improve elasticity.
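Holding utilization in a 60‑70% band reduces to simple arithmetic: size the pool so demand divided by capacity stays below the upper bound, and only shrink when utilization drops below the lower bound and a smaller pool would still fit. A minimal sketch under assumed numbers (the function names and CPU figures are illustrative):

```python
import math


def target_nodes(cpu_demand: float, cpu_per_node: float,
                 high: float = 0.70) -> int:
    """Smallest node count whose utilization does not exceed `high`."""
    return max(math.ceil(cpu_demand / (cpu_per_node * high)), 1)


def should_scale_down(cpu_demand: float, cpu_per_node: float,
                      current_nodes: int, low: float = 0.60) -> bool:
    """Shrink only when utilization has fallen under the lower bound
    and a smaller pool can still absorb the demand."""
    util = cpu_demand / (cpu_per_node * current_nodes)
    return util < low and target_nodes(cpu_demand, cpu_per_node) < current_nodes
```

For instance, 280 vCPUs of demand on 64‑vCPU nodes yields a target of 7 nodes (280 / 448 ≈ 62.5% utilization, inside the band), so a 10‑node pool would scale down while a 7‑node pool would hold steady.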

Future Outlook: Planned improvements involve automated instance‑type selection based on real‑time pricing and workload characteristics, finer security and network isolation for multi‑tenant scenarios, and continued enhancements to Kubernetes scheduling performance for large‑scale batch jobs.

Q&A Highlights: The team chose not to adopt Spark on Kubernetes directly due to stability concerns, highlighted multi‑tenant management challenges requiring third‑party schedulers, and explained why custom scaling solutions were preferred over the Spark Operator.

Tags: Cloud Native, Big Data, Kubernetes, Multi-Cloud, Elastic Scaling, Resource Scheduling, Spark
Written by

DataFunSummit

Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.
