Compass: An Open‑Source Big Data Task Diagnosis Platform for DolphinScheduler, Airflow and Spark

Compass is an open‑source big‑data diagnostic platform developed by OPPO. It provides non‑intrusive, real‑time monitoring and root‑cause analysis for offline and streaming tasks on schedulers such as DolphinScheduler and Airflow. It covers workflow‑level failures, Spark engine anomalies, and resource usage, and offers one‑click reports and extensible rule‑based diagnostics.

DataFunTalk

OPPO's big‑data platform hosts over 20 services, more than 1 EB of data, nearly a million offline tasks and thousands of real‑time tasks, creating complexity for users and operators who need rapid fault isolation and optimization suggestions.

The open‑source project Compass (https://github.com/cubefs/compass) was built to meet this need, delivering real‑time, non‑intrusive diagnostics for scheduling platforms such as DolphinScheduler and Airflow.

Core Features

Non‑intrusive, instant diagnosis without modifying existing schedulers.

Supports multiple mainstream schedulers (DolphinScheduler, Airflow, custom).

Handles Spark and Hadoop 2.x/3.x task logs.

Workflow‑level anomaly detection (failure types, baseline time deviations).

Engine‑level anomaly detection covering 14 anomaly types, including data skew, large table scans, and memory waste.

Customizable log‑matching rules and threshold adjustments.
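
To make the idea of customizable log‑matching rules concrete, here is a minimal Python sketch. It is not Compass's actual rule format: the `LogRule` class, the rule names, and the `diagnose` function are hypothetical. Only the exception strings are real JVM/Spark class names.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class LogRule:
    name: str     # hypothetical label reported when the rule fires
    pattern: str  # regular expression searched for in each log line


# Hypothetical rules for illustration only; not Compass's built-in rule set.
RULES = [
    LogRule("oom", r"java\.lang\.OutOfMemoryError"),
    LogRule("shuffle_fetch_failed", r"FetchFailedException"),
    LogRule("file_not_found", r"FileNotFoundException"),
]


def diagnose(log_lines, rules=RULES):
    """Return the names of rules that match at least one log line, in rule order."""
    hits = []
    for rule in rules:
        regex = re.compile(rule.pattern)
        if any(regex.search(line) for line in log_lines):
            hits.append(rule.name)
    return hits


sample_log = [
    "24/01/01 10:00:00 INFO Executor: Running task 0.0 in stage 1.0",
    "24/01/01 10:00:05 ERROR Executor: Exception in task 0.0 in stage 1.0",
    "java.lang.OutOfMemoryError: Java heap space",
]
print(diagnose(sample_log))  # ['oom']
```

Keeping rules as data rather than code means a new failure category can be added by appending one entry to the rule list.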

Diagnostics cover:

Failure analysis (first‑failure, final‑failure, long‑term failure) with root‑cause tracing.

Abnormal timing (finish time deviating from the baseline, duration deviating from the baseline, excessive runtime).

Spark engine issues (SQL failures, shuffle problems, OOM, large scans, data skew, task tail, global sort, OOM warnings, job/stage latency, HDFS stalls, speculative tasks).

Resource usage visualization (CPU, memory, GC logs).

One‑click diagnosis and comprehensive report overview.
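
As an illustration of one engine‑level check from the list above, the following Python sketch flags data skew in a single Spark stage by comparing the slowest task with the median task. The function name, the ratio threshold, and the minimum‑duration guard are assumptions chosen for the example, not Compass's actual rule or defaults.

```python
from statistics import median


def is_skewed(task_durations_ms, ratio_threshold=10.0, min_max_ms=60_000):
    """Flag a stage whose slowest task is far slower than its median task.

    ratio_threshold and min_max_ms are illustrative values, not Compass defaults.
    """
    if not task_durations_ms:
        return False
    longest = max(task_durations_ms)
    typical = median(task_durations_ms)
    # Ignore stages whose slowest task is still short; skew there is harmless.
    if longest < min_max_ms:
        return False
    if typical <= 0:
        return True
    return longest / typical >= ratio_threshold


# One straggler at 95 s among tasks of about 2 s: flagged as skewed.
print(is_skewed([2000, 2100, 1900, 2200, 95000]))  # True
# Uniformly slow tasks: long, but not skewed.
print(is_skewed([50000, 52000, 61000]))            # False
```

The median is used instead of the mean so that the straggler itself does not inflate the value it is compared against.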

Technical Architecture

Compass consists of modules that synchronize workflow metadata, synchronize Yarn/Spark application metadata, link workflow data with engine data, detect anomalies at both layers, and serve a portal for visualization.
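
One plausible way to link workflow data with engine data is to look for Yarn application IDs in a scheduler task's log and join them against the synchronized application metadata. The Python sketch below assumes that approach. The function name, the metadata fields, and the sample log text are hypothetical; only the `application_<timestamp>_<sequence>` ID format is standard Yarn.

```python
import re

# Yarn application IDs have the form application_<clusterTimestamp>_<sequence>.
APP_ID = re.compile(r"application_\d+_\d+")


def link_task_to_apps(task_log_text, yarn_apps):
    """Return metadata for each distinct Yarn application ID found in a task log.

    yarn_apps maps application ID -> metadata dict (hypothetical structure).
    """
    seen = []
    for app_id in APP_ID.findall(task_log_text):
        if app_id not in seen:
            seen.append(app_id)
    # IDs with no synchronized metadata yet are skipped.
    return [yarn_apps[app_id] for app_id in seen if app_id in yarn_apps]


task_log = (
    "Submitted application application_1700000000000_0042\n"
    "Tracking application_1700000000000_0042 until it finishes\n"
)
apps = {
    "application_1700000000000_0042": {"queue": "root.etl", "state": "FINISHED"},
}
print(link_task_to_apps(task_log, apps))
```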

The system is organized into three layers:

External integration layer (schedulers, Yarn, HistoryServer, HDFS) for metadata and log ingestion.

Architecture layer (data collection, model standardization, anomaly detection, portal).

Infrastructure layer (MySQL, Elasticsearch, Kafka, Redis).

Key processing stages include data collection, data association & model standardization, workflow & engine anomaly detection, and business view generation.
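
The workflow anomaly detection stage can be pictured with a small baseline‑duration check. This Python sketch assumes the baseline is the median of past successful runs and that a run deviating by more than a fixed fraction is abnormal. The function name, the 50% tolerance, and the returned labels are assumptions for illustration, not Compass's actual model.

```python
from statistics import median


def duration_deviation(history_s, current_s, tolerance=0.5):
    """Classify a run as 'long', 'short' or 'normal' against the median of past runs.

    tolerance is the allowed fractional deviation (illustrative value).
    """
    if not history_s:
        return "normal"  # no baseline yet, so nothing to compare against
    baseline = median(history_s)
    if current_s > baseline * (1 + tolerance):
        return "long"
    if current_s < baseline * (1 - tolerance):
        return "short"
    return "normal"


history = [600, 620, 580, 610, 590]       # past run durations in seconds
print(duration_deviation(history, 1000))  # long
print(duration_deviation(history, 250))   # short
print(duration_deviation(history, 640))   # normal
```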

Deployment & Usage with DolphinScheduler

Example deployment against DolphinScheduler 2.0.6. Build from source:

git clone https://github.com/cubefs/compass.git
cd compass
mvn package -DskipTests

Configure the environment:

cd dist/compass
# Edit data source and related settings
vim bin/compass_env.sh

Start all services:

./bin/start_all.sh

After deployment, access the UI at http://localhost:7075/compass/, log in with your DolphinScheduler credentials, and view real‑time diagnostic results for tasks.

Open‑Source Roadmap & Contribution

The current focus is on offline scheduled tasks and Spark engine diagnostics.

Future releases will add Flink task diagnostics and deeper algorithmic models for rule‑free intelligent anomaly detection.

Community contributions are welcomed via the GitHub repository (https://github.com/cubefs/compass).

For any issues or feature requests, submit an issue on GitHub; the OPPO AndesBrain team will respond promptly.
