Inside OPPO’s Real‑Time Computing Platform: Architecture, Practices, and Future Roadmap
This article details OPPO’s real‑time computing platform, covering its business scope, big‑data architecture built on Flink, Spark and Trino, the end‑to‑end job development lifecycle, SQL IDE features, diagnostic and monitoring mechanisms, link latency tracking, SLA guarantees, practical use cases, and upcoming lakehouse and cloud‑native evolution.
Background
OPPO’s data platform supports user services, app store, games, content and e‑commerce. It stores >600 PB of data and ingests billions of rows daily, processing several petabytes per day. The platform combines open‑source components (Flink, Spark, Trino, Yarn) with self‑developed modules for data ingestion, real‑time computation, offline batch, interactive analysis and data quality.
Architecture
The real‑time layer is built on Flink and offers both SQL and JAR job development.
Interactive UI – a one‑stop development page that includes a SQL IDE, a JAR IDE and job monitoring tools.
Data API / Open API – business‑logic services (Data API) and external service exposure (Open API).
Job Gateway – compiles jobs, selects a deployment plug‑in (Yarn or Kubernetes) and dispatches the job to the target cluster.
Backend – periodically reconciles job metadata with the actual runtime status on Yarn/K8s and triggers start/stop actions.
MetaData Service – stores all job definitions, table catalogs and version information.
Intelligent Monitoring – aggregates metrics and logs from every module for alerting and analysis.
Design Principles
Usability : a unified SQL IDE enables job creation, submission and lifecycle management.
Availability : all components run in HA mode; a single instance failure does not affect the service.
Scalability : the Gateway is abstracted with plug‑in deployers, supporting Yarn, Kubernetes and multiple Flink versions.
Real‑Time Development Workflow
Development – developers write SQL or JAR jobs in the IDE; the IDE sends the definition to the API, which validates it and persists it in MetaStore.
Debugging – the API validates SQL syntax and table permissions, then the Gateway compiles the job. Compilation results are returned to the IDE for iterative fixing.
Submission – after validation the API forwards the job to the Gateway; the Gateway selects the appropriate plug‑in (Yarn or K8s) and submits the job to the cluster.
Status Maintenance – the Backend periodically compares the logical status in MetaStore with the actual runtime status from Yarn/K8s, and automatically restarts or terminates jobs to keep the recorded state consistent.
SQL IDE
The IDE shows job metadata (available tables, libraries), provides a SQL editor with formatting, auto‑completion and version management, and displays debugging results. Over 3,000 jobs run on the platform, >80 % of which are developed via SQL.
Job Diagnosis System
The diagnosis subsystem collects metrics and logs throughout a job’s lifecycle, correlates them with metadata, and stores analysis results in a relational database and Elasticsearch for traceability.
When a job restarts, the system first queries Flink’s REST API for detailed exception information. If the REST endpoint is unavailable, it falls back to log data collected by LogAgent on each Yarn node. The diagnosis result (including root‑cause analysis and optimization suggestions) is written to the DB and indexed in ES for later retrieval.
Link Monitoring
Data flow: OBUS → Kafka → Flink. Four timestamps are recorded: server_time – OBUS receives the raw event. parse_time – OBUS finishes preprocessing. kafka_timestamp – Kafka stores the message. process_time – Flink consumes the record.
Custom metrics compute the latency of each stage and report them to the monitoring platform. Alerts can be configured per stage to enable rapid fault localization.
Real‑Time SLA Dashboard
The SLA dashboard combines business‑defined latency tolerances with the measured delays from link monitoring to calculate an on‑time rate for each job. Jobs that miss the SLA are highlighted for further diagnosis.
Application Practices
Real‑Time Data Warehouse
Source systems (applications, MySQL, Oracle, etc.) write change events to Kafka. Flink reads the streams, performs splitting, cleaning and aggregation to produce ODS, DWD and final business tables. This pattern is used across most internal services.
Real‑Time Dashboards for E‑commerce Events
Two typical pipelines are used for high‑traffic events (e.g., 618, Double‑11):
Classic pipeline : MySQL → Canal → Kafka → Flink → DB. This mature stack provides stable monitoring and alerting.
Flink CDC pipeline : Flink reads MySQL binlog directly, shortening the path and reducing latency, but it is still being stabilized for complex aggregations.
The classic pipeline is currently preferred for its proven reliability.
Future Plans
Lakehouse Integration with Apache Iceberg
OPPO plans to adopt Iceberg for a unified lake‑warehouse architecture, reducing data movement and storage costs. Kafka‑to‑Iceberg and CDC‑to‑Iceberg pipelines are already operational; work is in progress to enable Flink to read directly from Iceberg tables.
Cloud‑Native Scheduling on Kubernetes
The platform will evolve from Yarn‑only scheduling to Kubernetes, allowing elastic scaling and mixed‑workload scheduling alongside other online services. The Job Gateway will load a Kubernetes plug‑in and support per‑job SQL submissions similar to Yarn.
Q&A
Q1: How is Kafka table metadata managed? Two approaches are used: a legacy MySQL catalog for existing jobs and a Flink‑Hive (HMS) catalog for newer jobs. Both coexist during migration.
Q2: How to add a new field to a Kafka table? Update the table definition in the UI (for JSON format) and redeploy all dependent jobs so the new schema takes effect.
Q3: How are dimension tables joined with sharded MySQL tables? Dimension data is kept in a single table; if sharding is required, a UNION of the shards is performed before the join.
Q4: How is Kubernetes cloud‑native support being implemented? K8s is in the research phase. The plan is to adapt the Application mode to support per‑job SQL submissions, mirroring Yarn’s per‑job capability.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
dbaplus Community
Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
