OPPO Public Cloud Big Data Cost Optimization: Architecture, Challenges, and Results
This article presents OPPO's CloudCamel solution for big data cost optimization on public cloud, detailing domestic and overseas architectures, elastic redesigns, component innovations, migration considerations, performance metrics, and future technical and operational directions.
OPPO's public‑cloud big data cost‑optimization solution, named CloudCamel, is designed to achieve the lowest possible cost for big‑data computation in the cloud.
Domestic architecture: The top layer uses Oflow for resource scheduling, along with tools such as link identification and time prediction. The access layer provides self‑developed SQL profiling, a unified entry point, and multi‑engine routing. The engine layer runs Spark, Trino, and Flink for batch, interactive, and stream queries, accelerated by the self‑built Shuttle component for shuffle, single‑point sorting, and broadcast memory expansion. The metadata layer builds on WaggleDance, partitioned into groups for disaster recovery and acceleration. The cloud‑fusion scheduling layer introduces innovations such as resource throttling, overselling, and hierarchical task scheduling, while the lake‑warehouse layer includes the self‑developed Glacier plus automatic lake ingestion and indexing. Storage uses HDFS for hot data and CubeFS for cold data.
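The access layer's multi‑engine routing can be pictured as dispatching each query to Spark, Trino, or Flink by workload type. A minimal illustrative sketch, assuming simple rule‑based dispatch (the field names and rules are this sketch's assumptions, not OPPO's actual routing logic):

```python
from dataclasses import dataclass

@dataclass
class Query:
    sql: str
    is_streaming: bool = False
    interactive: bool = False  # ad-hoc, latency-sensitive query

def route(q: Query) -> str:
    """Toy router mirroring the access layer's multi-engine dispatch."""
    if q.is_streaming:
        return "flink"   # continuous stream processing
    if q.interactive:
        return "trino"   # low-latency interactive SQL
    return "spark"       # large batch ETL by default

print(route(Query("SELECT ...", interactive=True)))  # trino
```

A production router would also weigh SQL‑profiling signals such as predicted runtime and scanned data volume, as the article's mention of time prediction suggests.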
Overseas architecture: OPPO leverages EMR on AWS, with Oflow for orchestration and HiveServer for SQL execution. Tasks are routed to EMR clusters across three AZs in Singapore, using a mix of spot (low‑cost, preemptible) and on‑demand (stable) instances. Storage relies on S3's automatic tiering, moving infrequently accessed data to cold storage after three months.
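The "cold after three months" policy maps naturally onto S3 Intelligent‑Tiering's archive tier, which can be enabled per bucket. A hedged sketch of what such a configuration might look like via boto3 (the bucket name is hypothetical, and OPPO's exact tiering settings are not stated in the article):

```python
def archive_config(days: int = 90) -> dict:
    """S3 Intelligent-Tiering rule: move objects not accessed
    for `days` days into the archive access tier."""
    return {
        "Id": f"archive-after-{days}d",
        "Status": "Enabled",
        "Tierings": [{"Days": days, "AccessTier": "ARCHIVE_ACCESS"}],
    }

# Applying it requires AWS credentials; the bucket name below is made up:
# import boto3
# boto3.client("s3").put_bucket_intelligent_tiering_configuration(
#     Bucket="example-bigdata-sg",
#     Id=archive_config()["Id"],
#     IntelligentTieringConfiguration=archive_config(),
# )
print(archive_config()["Tierings"][0]["Days"])  # 90
```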
Architecture optimization: Most clusters have been migrated to a self‑built, EKS‑based elastic architecture that makes every component elastic and releases idle resources. Compute is fully spot‑based, scaling automatically with workload fluctuations. Manual data tiering was tested but proved less cost‑effective than S3's automatic tiering.
Elastic architecture design: An auto‑scaling management module continuously updates resource allocation, while a machine‑selection module chooses the most cost‑effective instances. The system retains YARN‑based operations for stability, leveraging YARN's higher peak‑period scheduling throughput relative to Kubernetes.
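A machine‑selection module on spot markets usually trades price per vCore against interruption risk. A hedged sketch of one plausible scoring rule (the instance types, prices, and the 10% interruption cutoff are invented for illustration, not OPPO's actual data):

```python
from dataclasses import dataclass

@dataclass
class Offer:
    instance_type: str
    vcores: int
    spot_price: float         # $/hour from the spot market
    interruption_rate: float  # recent reclaim frequency, 0..1

def pick(offers: list[Offer], max_interruption: float = 0.10) -> Offer:
    """Cheapest $/vCore among acceptably stable spot pools."""
    stable = [o for o in offers if o.interruption_rate <= max_interruption]
    return min(stable, key=lambda o: o.spot_price / o.vcores)

best = pick([
    Offer("m5.4xlarge", 16, 0.60, 0.05),
    Offer("r5.4xlarge", 16, 0.55, 0.20),  # cheap but frequently reclaimed
    Offer("c5.9xlarge", 36, 1.10, 0.08),
])
print(best.instance_type)  # c5.9xlarge
```

Filtering first on interruption rate matters because a slightly cheaper pool that is constantly reclaimed costs more in rerun work than it saves.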
Shuttle component: Shuttle acts as an engine accelerator, speeding up shuffle, single‑point sorting, and expanding Spark broadcast memory from ~10 MB to ~10 GB, enabling efficient large‑scale broadcasts.
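Shuttle's internals are not public, but the scale of the claim can be framed against stock Spark, whose broadcast‑join cutoff (`spark.sql.autoBroadcastJoinThreshold`) defaults to 10 MB. A small sketch putting the ~10 MB → ~10 GB figures side by side (the size arithmetic is standard; the mapping to Shuttle's mechanism is this sketch's assumption):

```python
def to_bytes(size: str) -> int:
    """Parse Spark-style size strings like '10m' or '10g' into bytes."""
    units = {"k": 1024, "m": 1024**2, "g": 1024**3}
    return int(size[:-1]) * units[size[-1].lower()]

# Stock Spark defaults to a 10 MB broadcast-join threshold; Shuttle
# reportedly lifts the practical ceiling to the ~10 GB range.
default_cap = to_bytes("10m")
shuttle_cap = to_bytes("10g")
print(shuttle_cap // default_cap)  # 1024, i.e. roughly a 1000x expansion
```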
Challenges of moving big data to the cloud: While cloud services are highly convenient, custom technical requirements can be hard to satisfy. Cost must be weighed across compute, personnel, and scale, with spot instances recommended for lowering CPU cost. Vendor support responds quickly to routine issues, but complex problems may require formal case escalation.
Cloud migration benefits and recommendations: Use S3 for storage, prefer EKS over EMR for compute, adopt cloud‑native self‑built solutions if the team has the development capability, and consider OLAP engines for small‑scale, high‑performance workloads.
Cost‑reduction metrics: A resource dashboard visualizes AZ cluster metrics such as node count, utilization, and task numbers, enabling algorithmic tuning. EMR resource charts show vCore and memory usage. Rapid scaling schemes improve expansion and contraction efficiency. Physical resource utilization averages above 75 %. Container heatmaps guide workload consolidation to avoid long‑tail node waste. A cost dashboard provides real‑time per‑minute cost tracking for precise cost control.
Outlook for CloudCamel: On the technical side, future work includes deeper engine optimization, extending acceleration support to Trino, QoS‑aware scheduling, online/offline workload colocation, comprehensive data governance, and a fully GUI‑driven ("white‑screen") cloud‑ops platform. Operationally, bill‑analysis diagnostics will be expanded, and the entire solution will be productized and platformized.
Q&A: The stack runs Hive 2.3, Spark 3.1.2, and Flink 1.6; because the team applies rapid in‑house engine optimizations, immediate upgrades to newer community releases are limited.
DataFunSummit
Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.