
Design and Implementation of OPPO's Big Data Diagnostic Platform (Compass)

This article presents the background, requirements, architecture, key modules, and practical impact of OPPO's non‑intrusive big‑data diagnostic platform—named Compass—designed to quickly locate issues, provide optimization suggestions, and achieve cost‑saving and efficiency gains for large‑scale Spark and Hadoop workloads.

DataFunTalk

On January 7, OPPO announced the release of the industry's first data‑intelligence knowledge map as part of a fifth‑anniversary live broadcast. The announcement also highlighted the upcoming launch of a big‑data diagnostic platform, code‑named "Compass," aimed at helping users quickly locate problems and receive optimization recommendations.

Background

OPPO's big‑data platform now comprises over 20 components, stores more than 1 EB of data, and serves around 1,000 daily active users. Users frequently encounter difficulty pinpointing issues due to the large number of components (scheduler, Livy client, Spark engine, Hadoop system), massive job logs, and diverse user roles. The main pain points are low problem‑localization efficiency, a wide variety of exception types without an effective knowledge base, and high resource waste caused by abnormal or mis‑configured tasks.

Industry Products

The team evaluated existing solutions such as the open‑source Dr. Elephant, which monitors Hadoop and Spark performance, collects metrics from schedulers (Airflow, Azkaban, Oozie), and generates diagnostic reports. While Dr. Elephant offers integration with multiple schedulers, performance metrics, and a rule‑based plugin system, it lacks support for newer Spark/Hadoop versions, comprehensive diagnostic indicators, log‑level analysis, and resource‑cost reduction guidance.

Technical Solution

Given the identified gaps, OPPO decided to develop its own diagnostic system with a non‑intrusive design. The architecture consists of three layers:

Layer 1 – Integration: connects to external systems (schedulers, Yarn, HistoryServer, HDFS) to synchronize metadata, cluster status, environment status, and logs.

Layer 2 – Core Architecture: includes data collection, metadata association & model standardization, anomaly detection, and a portal for visualization.

Layer 3 – Infrastructure: provides foundational components such as MySQL, Elasticsearch, Kafka, and Redis.
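Because the design is non‑intrusive, the integration layer pulls data over the standard REST APIs exposed by Yarn's ResourceManager and the Spark HistoryServer rather than instrumenting the engines. A minimal sketch of building those request URLs (the host names are placeholder assumptions; the API paths are the standard ones shipped with Hadoop and Spark):

```python
# Sketch of the integration layer's read-only REST endpoints.
# Host names below are placeholders, not OPPO's actual deployment.

def yarn_apps_url(rm_host: str, states: str = "FINISHED,FAILED,KILLED") -> str:
    """URL listing applications from the Yarn ResourceManager REST API."""
    return f"http://{rm_host}/ws/v1/cluster/apps?states={states}"

def spark_history_url(shs_host: str, app_id: str) -> str:
    """URL for one application's record on the Spark HistoryServer REST API."""
    return f"http://{shs_host}/api/v1/applications/{app_id}"

print(yarn_apps_url("rm.example.com:8088"))
print(spark_history_url("shs.example.com:18080",
                        "application_1650000000000_0001"))
```

Polling read‑only endpoints like these is what keeps the platform safe to attach to a production cluster: no agent or patched engine is required.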

The platform workflow is divided into four stages:

Data Collection: synchronizes workflow metadata, DAGs, job execution records, Yarn ResourceManager data, and Spark HistoryServer data.

Data Association & Model Standardization: links workflow and engine metadata via ApplicationID to build a unified data model (user, DAG, task, application, clusterConfig, timestamp).

Anomaly Detection (Workflow & Engine): applies heuristic rules and a knowledge base to detect anomalies in both workflow and engine layers, producing diagnostic results.

Business View: aggregates and presents diagnostics such as task overview, workflow‑level issues (failed tasks, loop tasks, baseline deviations), and engine‑level issues (long‑running tasks, resource waste, runtime errors).
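The association and detection stages above can be sketched as a join on ApplicationID followed by heuristic rules over the unified record. The field names and the single rule below are illustrative assumptions, not Compass's actual schema:

```python
# Join scheduler-side task records with engine-side application records on
# ApplicationID, then run a heuristic rule over the unified model.

def associate(tasks, apps):
    """tasks/apps: lists of dicts sharing an 'app_id' key."""
    by_id = {a["app_id"]: a for a in apps}
    # Merge engine-side fields into each scheduler-side task record.
    return [{**t, **by_id.get(t["app_id"], {})} for t in tasks]

def detect_long_running(record, threshold_s=3600):
    """One illustrative engine-level rule: flag runs over a time budget."""
    return record.get("elapsed_s", 0) > threshold_s

tasks = [{"app_id": "application_1_0001", "task": "etl_daily", "user": "alice"}]
apps = [{"app_id": "application_1_0001", "elapsed_s": 7200, "queue": "root.etl"}]
diagnosed = [r for r in associate(tasks, apps) if detect_long_running(r)]
print(diagnosed[0]["task"])  # etl_daily
```

In the real platform each rule would be one entry in the knowledge base, and the output of this stage is what the Business View aggregates.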

Practical Effects

The platform delivers four major benefits:

UI: visualizes engine‑level anomalies (CPU waste, data skew, long‑tail tasks, large table scans) with clear tags.

Efficiency Analysis: identifies long‑tail tasks, HDFS slowdown, excessive speculative execution, and global sort anomalies, providing root‑cause explanations and remediation suggestions.

Cost Analysis: quantifies CPU and memory waste, presents formulas for total CPU time, memory usage, and waste percentages, and recommends parameter tuning (e.g., reducing spark.executor.cores or spark.executor.memory).

Stability Analysis: detects full‑table scans, data skew, shuffle failures, memory overflow (both on‑heap and off‑heap), and common SQL errors, offering diagnostic steps and mitigation strategies such as increasing shuffle partitions, using broadcast joins, or adjusting executor resources.
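A common heuristic behind the data‑skew and long‑tail detection mentioned above is to compare the slowest task in a stage against the stage's median task duration. The 2x and 5x thresholds below are illustrative assumptions, not the platform's published values:

```python
from statistics import median

def diagnose_stage(task_durations_s, long_tail_ratio=2.0, skew_ratio=5.0):
    """Tag a stage whose slowest task dwarfs the median task duration."""
    med = median(task_durations_s)
    slowest = max(task_durations_s)
    tags = []
    if med > 0 and slowest / med >= skew_ratio:
        tags.append("data_skew")       # a few partitions carry most of the data
    elif med > 0 and slowest / med >= long_tail_ratio:
        tags.append("long_tail")       # stragglers, but not extreme skew
    return tags

print(diagnose_stage([10, 12, 11, 13, 60]))  # ['data_skew']
print(diagnose_stage([10, 12, 11, 13, 25]))  # ['long_tail']
```

Per‑task durations at this granularity are exactly what the Spark HistoryServer already records, which is why log‑ and metric‑level analysis can stay non‑intrusive.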

These analyses enable users to rank tasks by cost, department, or individual, driving data‑governance initiatives and continuous cost‑reduction.
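The cost analysis and ranking described above can be sketched with one plausible formulation of the waste percentage; the field names and formula are assumptions for illustration, standing in for the formulas the platform presents:

```python
def cpu_waste_pct(allocated_core_s, used_core_s):
    """Percentage of allocated CPU time that did no useful work."""
    return 100.0 * (allocated_core_s - used_core_s) / allocated_core_s

def rank_by_waste(apps):
    """Sort applications by wasted core-seconds, worst offender first."""
    return sorted(apps,
                  key=lambda a: a["allocated_core_s"] - a["used_core_s"],
                  reverse=True)

apps = [
    {"app": "job_a", "allocated_core_s": 4000, "used_core_s": 1000},
    {"app": "job_b", "allocated_core_s": 2000, "used_core_s": 1800},
]
print(cpu_waste_pct(4000, 1000))      # 75.0
print(rank_by_waste(apps)[0]["app"])  # job_a
```

Aggregating the same waste metric by department or owner is what turns per‑job diagnostics into the governance‑level rankings the article describes.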

Summary & Planning

OPPO's diagnostic platform focuses on offline scheduled tasks and compute engines, leveraging a rich knowledge base to provide optimization suggestions and achieve cost‑saving goals. The non‑intrusive approach ensures safety for integrated systems. Future work includes expanding the knowledge base with data‑mining algorithms for deeper detection, adding support for Flink tasks, and open‑sourcing the platform.

Authors

Bob Zhuang – Senior Data Platform Engineer at OPPO (formerly at Kingsoft).

Xiaoyou Wang – Data Platform Engineer at OPPO; joined in 2019 with extensive backend development experience.

Tags: performance optimization, Big Data, resource management, Spark, Hadoop, cost reduction, diagnostic platform
Written by

DataFunTalk

Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.
