OPPO Big Data Platform Operations and R&D Practices: Architecture, Scaling, and Monitoring

This article summarizes how OPPO's big‑data platform grew rapidly, covering its three‑layer architecture, the move from a Flume‑Kafka ingestion pipeline to NiFi, the upgrade of the OFlow task scheduler, monitoring of data, resources, and task SLAs, and InnerEye, a self‑service analytics tool, all aimed at stability, efficiency, and security.

DataFunTalk

Based on OPPO engineer Zhang Jun's talk at DataFun, this article outlines the operational and R&D practices behind OPPO's rapidly expanding big‑data platform, highlighting challenges and solutions.

The platform follows a three‑layer architecture: a foundational layer built on Hadoop ecosystem components for real‑time and batch processing, a middle layer of custom services for data ingestion, task scheduling, and internal analytics, and an upper layer for applications such as click‑through prediction, personalized recommendation, and user profiling.

Data ingestion originally chained together a Go‑based HTTP server, a Python tail‑like tool, Flume, and Kafka. The resulting pipeline was long and costly to maintain. Version 2.0 shortened the pipeline and unified it under the JVM‑based Apache NiFi, which offers a visual interface, flexible data transformation, buffering queues, and built‑in integration with Kafka and HDFS.

The task scheduling system also evolved from a shell‑script collection (1.0) to the visual OFlow platform (2.0), providing a graphical UI, task‑tree view, and Gantt chart for easier dependency management and performance diagnosis.
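The dependency management that a task‑tree view exposes can be illustrated with a small sketch. It is not OFlow code: the task names are hypothetical, and Python's standard `graphlib` module stands in for whatever resolver the scheduler actually uses. Tasks are grouped into "waves" whose members have no unmet dependencies and could run in parallel.

```python
from graphlib import TopologicalSorter

# Hypothetical task graph: each task maps to the set of tasks it depends on.
dependencies = {
    "ingest_logs": set(),
    "clean_logs": {"ingest_logs"},
    "build_user_profile": {"clean_logs"},
    "train_ctr_model": {"clean_logs"},
    "daily_report": {"build_user_profile", "train_ctr_model"},
}


def execution_waves(deps):
    """Group tasks into waves; every task in a wave can run in parallel."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # tasks whose dependencies are done
        waves.append(ready)
        sorter.done(*ready)  # mark the wave finished to unlock its successors
    return waves


waves = execution_waves(dependencies)
for index, wave in enumerate(waves):
    print(index, wave)
```

A Gantt chart of actual start and end times laid over these waves is one way to see which upstream task is holding back a late job.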

Monitoring improvements include fine‑grained Airflow metrics, real‑time SLA dashboards, and visualizations of data throughput, ingestion bottlenecks, downstream back‑pressure, and resource pool utilization (usage rate, shortage rate, overspend, job count, and consumption share).
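The talk names the resource pool metrics but does not define them, so the following sketch rests on assumed definitions: usage rate as used over quota, shortage rate as pending demand over quota, overspend as usage above quota, and consumption share as a pool's portion of total usage. The field names and sample numbers are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class PoolSample:
    name: str
    quota_vcores: int    # capacity assigned to the pool (assumed unit: vcores)
    used_vcores: int     # vcores currently in use
    pending_vcores: int  # vcores requested but not yet granted
    running_jobs: int


def pool_metrics(samples):
    """Compute per-pool metrics under the assumed definitions above."""
    total_used = sum(s.used_vcores for s in samples)
    report = {}
    for s in samples:
        report[s.name] = {
            "usage_rate": s.used_vcores / s.quota_vcores,
            "shortage_rate": s.pending_vcores / s.quota_vcores,
            "overspend_vcores": max(0, s.used_vcores - s.quota_vcores),
            "job_count": s.running_jobs,
            "consumption_share": s.used_vcores / total_used if total_used else 0.0,
        }
    return report


# Illustrative samples, not real measurements.
samples = [
    PoolSample("etl", quota_vcores=400, used_vcores=500, pending_vcores=100, running_jobs=40),
    PoolSample("adhoc", quota_vcores=200, used_vcores=100, pending_vcores=0, running_jobs=10),
]
report = pool_metrics(samples)
print(report["etl"])
```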

Task SLA management classifies core business jobs, tracks progress with a weighted formula, and alerts on delays via email or phone. Quality metrics such as on‑time rate and early‑completion score are visualized.
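The weighted formula itself is not given in the talk. One plausible reading, shown here purely as an assumption, is that each core job carries a weight and overall progress is the weighted share of finished jobs. The weights, the threshold, and the on‑time rate helper are all hypothetical.

```python
def weighted_progress(tasks):
    """tasks: list of (weight, finished) pairs -> completion in [0, 1]."""
    total = sum(weight for weight, _ in tasks)
    if total == 0:
        return 0.0
    done = sum(weight for weight, finished in tasks if finished)
    return done / total


def is_delayed(tasks, expected_progress):
    """True when actual progress is behind what is expected at this point."""
    return weighted_progress(tasks) < expected_progress


def on_time_rate(finish_minutes, deadline_minute):
    """Share of runs that finished at or before the deadline."""
    if not finish_minutes:
        return 0.0
    on_time = sum(1 for minute in finish_minutes if minute <= deadline_minute)
    return on_time / len(finish_minutes)


# Hypothetical core jobs: (weight, finished?)
core_jobs = [(5, True), (3, True), (2, False)]
progress = weighted_progress(core_jobs)
print(progress, is_delayed(core_jobs, expected_progress=0.9))
```

In a setup like this, a `True` result from `is_delayed` would be the trigger for the email or phone alert.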

To detect issues proactively, OPPO uses the internal tool “Dr. Elephant” (Big Elephant Doctor) to flag jobs with abnormal map sizes and over‑provisioned resources so that they can be remediated promptly.
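Rule‑based checks of this kind can be sketched in a few lines. This is not the tool's actual code or API: the job fields, thresholds, and finding names are assumptions chosen only to show the shape of a heuristic that flags unusual map input sizes and unused memory.

```python
MB = 1024 * 1024


def diagnose(job, small_map_bytes=16 * MB, large_map_bytes=1024 * MB,
             min_memory_ratio=0.5):
    """Return a list of findings for one finished job (thresholds are assumed)."""
    findings = []

    # Average input per map task: too small wastes scheduling overhead,
    # too large produces long-running stragglers.
    avg_input = job["map_input_bytes"] / job["map_tasks"]
    if avg_input < small_map_bytes:
        findings.append("MAP_INPUT_TOO_SMALL")
    elif avg_input > large_map_bytes:
        findings.append("MAP_INPUT_TOO_LARGE")

    # Peak memory far below the request suggests over-provisioning.
    memory_ratio = job["peak_memory_mb"] / job["requested_memory_mb"]
    if memory_ratio < min_memory_ratio:
        findings.append("MEMORY_OVER_PROVISIONED")

    return findings


# Hypothetical job statistics, not real measurements.
job = {
    "map_tasks": 2000,
    "map_input_bytes": 4000 * MB,  # about 2 MB per map task
    "peak_memory_mb": 900,
    "requested_memory_mb": 4096,
}
findings = diagnose(job)
print(findings)
```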

For analysts, OPPO built a self‑service platform named InnerEye. It features a unified GUI, Redis caching for sub‑10 ms latency, Elasticsearch‑backed tag updates, and Hive‑to‑Redis/ES data pipelines. Compared with commercial tools such as Tableau or the open‑source Hue, it is presented as more usable and more resource‑efficient.
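The role of the cache can be shown with a cache‑aside sketch. To keep it runnable without external services, an in‑memory TTL cache stands in for Redis and a plain function stands in for a slow Hive query. The class, key name, and returned value are hypothetical and do not reflect InnerEye's implementation.

```python
import time


class TTLCache:
    """In-memory stand-in for a key-value cache with expiry."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._store = {}

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            del self._store[key]  # expired: drop it and report a miss
            return None
        return value

    def set(self, key, value):
        self._store[key] = (value, self._clock() + self._ttl)


def cached_query(cache, key, run_query):
    """Cache-aside: serve from cache, otherwise run the query and store it.

    A result of None is treated as a miss and is never cached.
    """
    value = cache.get(key)
    if value is None:
        value = run_query()
        cache.set(key, value)
    return value


calls = []


def slow_warehouse_query():
    calls.append(1)  # count how often the expensive path runs
    return {"rows": 3}  # made-up result


cache = TTLCache(ttl_seconds=60)
first = cached_query(cache, "report:example", slow_warehouse_query)
second = cached_query(cache, "report:example", slow_warehouse_query)
print(first, second, len(calls))
```

Only the first call reaches the slow path. Repeated dashboard requests are answered from the cache until the entry expires.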

Security measures include unified account management, authorization controls, and audit logging, completing the vision of a stable, efficient, and secure data platform.
