Big Data 23 min read

Inside OPPO’s Real‑Time Computing Platform: Architecture, Practices, and Future Roadmap

This article details OPPO’s real‑time computing platform, covering its business scope, big‑data architecture built on Flink, Spark and Trino, the end‑to‑end job development lifecycle, SQL IDE features, diagnostic and monitoring mechanisms, link latency tracking, SLA guarantees, practical use cases, and upcoming lakehouse and cloud‑native evolution.

dbaplus Community

Feb 23, 2022

Inside OPPO’s Real‑Time Computing Platform: Architecture, Practices, and Future Roadmap

Background

OPPO’s data platform supports user services, app store, games, content and e‑commerce. It stores >600 PB of data and ingests billions of rows daily, processing several petabytes per day. The platform combines open‑source components (Flink, Spark, Trino, Yarn) with self‑developed modules for data ingestion, real‑time computation, offline batch, interactive analysis and data quality.

Architecture

The real‑time layer is built on Flink and offers both SQL and JAR job development.

Interactive UI – a one‑stop development page that includes a SQL IDE, a JAR IDE and job monitoring tools.

Data API / Open API – business‑logic services (Data API) and external service exposure (Open API).

Job Gateway – compiles jobs, selects a deployment plug‑in (Yarn or Kubernetes) and dispatches the job to the target cluster.

Backend – periodically reconciles job metadata with the actual runtime status on Yarn/K8s and triggers start/stop actions.

MetaData Service – stores all job definitions, table catalogs and version information.

Intelligent Monitoring – aggregates metrics and logs from every module for alerting and analysis.

Design Principles

Usability : a unified SQL IDE enables job creation, submission and lifecycle management.

Availability : all components run in HA mode; a single instance failure does not affect the service.

Scalability : the Gateway is abstracted with plug‑in deployers, supporting Yarn, Kubernetes and multiple Flink versions.

Real‑Time Development Workflow

Development – developers write SQL or JAR jobs in the IDE; the IDE sends the definition to the API, which validates it and persists it in MetaStore.

Debugging – the API validates SQL syntax and table permissions, then the Gateway compiles the job. Compilation results are returned to the IDE for iterative fixing.

Submission – after validation the API forwards the job to the Gateway; the Gateway selects the appropriate plug‑in (Yarn or K8s) and submits the job to the cluster.

Status Maintenance – the Backend periodically compares the logical status in MetaStore with the actual runtime status from Yarn/K8s, and automatically restarts or terminates jobs to keep the recorded state consistent.

SQL IDE

The IDE shows job metadata (available tables, libraries), provides a SQL editor with formatting, auto‑completion and version management, and displays debugging results. Over 3,000 jobs run on the platform, >80 % of which are developed via SQL.

Job Diagnosis System

The diagnosis subsystem collects metrics and logs throughout a job’s lifecycle, correlates them with metadata, and stores analysis results in a relational database and Elasticsearch for traceability.

When a job restarts, the system first queries Flink’s REST API for detailed exception information. If the REST endpoint is unavailable, it falls back to log data collected by LogAgent on each Yarn node. The diagnosis result (including root‑cause analysis and optimization suggestions) is written to the DB and indexed in ES for later retrieval.

Link Monitoring

Data flow: OBUS → Kafka → Flink. Four timestamps are recorded: server_time – OBUS receives the raw event. parse_time – OBUS finishes preprocessing. kafka_timestamp – Kafka stores the message. process_time – Flink consumes the record.

Custom metrics compute the latency of each stage and report them to the monitoring platform. Alerts can be configured per stage to enable rapid fault localization.

Real‑Time SLA Dashboard

The SLA dashboard combines business‑defined latency tolerances with the measured delays from link monitoring to calculate an on‑time rate for each job. Jobs that miss the SLA are highlighted for further diagnosis.

Application Practices

Real‑Time Data Warehouse

Source systems (applications, MySQL, Oracle, etc.) write change events to Kafka. Flink reads the streams, performs splitting, cleaning and aggregation to produce ODS, DWD and final business tables. This pattern is used across most internal services.

Real‑Time Dashboards for E‑commerce Events

Two typical pipelines are used for high‑traffic events (e.g., 618, Double‑11):

Classic pipeline : MySQL → Canal → Kafka → Flink → DB. This mature stack provides stable monitoring and alerting.

Flink CDC pipeline : Flink reads MySQL binlog directly, shortening the path and reducing latency, but it is still being stabilized for complex aggregations.

The classic pipeline is currently preferred for its proven reliability.

Future Plans

Lakehouse Integration with Apache Iceberg

OPPO plans to adopt Iceberg for a unified lake‑warehouse architecture, reducing data movement and storage costs. Kafka‑to‑Iceberg and CDC‑to‑Iceberg pipelines are already operational; work is in progress to enable Flink to read directly from Iceberg tables.

Cloud‑Native Scheduling on Kubernetes

The platform will evolve from Yarn‑only scheduling to Kubernetes, allowing elastic scaling and mixed‑workload scheduling alongside other online services. The Job Gateway will load a Kubernetes plug‑in and support per‑job SQL submissions similar to Yarn.

Q&A

Q1: How is Kafka table metadata managed? Two approaches are used: a legacy MySQL catalog for existing jobs and a Flink‑Hive (HMS) catalog for newer jobs. Both coexist during migration.

Q2: How to add a new field to a Kafka table? Update the table definition in the UI (for JSON format) and redeploy all dependent jobs so the new schema takes effect.

Q3: How are dimension tables joined with sharded MySQL tables? Dimension data is kept in a single table; if sharding is required, a UNION of the shards is performed before the join.

Q4: How is Kubernetes cloud‑native support being implemented? K8s is in the research phase. The plan is to adapt the Application mode to support per‑job SQL submissions, mirroring Yarn’s per‑job capability.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

data pipeline Flink Real-Time Computing big data platform job monitoring

Written by

dbaplus Community

Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.