Big Data 18 min read

How OPPO Built a Real‑Time Data Warehouse with Flink SQL

This article details{32-64 words} OPPO's evolution from an offline data warehouse to a real‑time platform, describing the business scale, data‑mid platform architecture, migration strategy using Flink SQL, extensions like AthenaX, and practical use cases such as real‑time ETL, CTR calculation, and tag import.

dbaplus Community

Jan 14, 2020

How OPPO Built a Real‑Time Data Warehouse with Flink SQL

Background and Data Platform

OPPO’s ecosystem (ColorOS, app store, browser, etc.) generates >100 PB of data and >200 TB per day. To manage this volume, OPPO built a four‑layer data‑mid platform: (1) a unified tool chain covering ingest‑govern‑develop‑consume; (2) a classic data‑warehouse with raw, detail, summary, and application layers; (3) a global data‑asset layer (ID‑mapping, user tags); and (4) scenario‑driven data products.

Offline Warehouse Architecture

Data from devices, logs, and databases is ingested by a high‑throughput Apache NiFi pipeline into HDFS (raw layer). Hourly Hive ETL jobs create the detail layer; daily Hive jobs aggregate to the summary layer. The application layer serves reports, user profiles, and APIs, while the detail layer also supports ad‑hoc queries via Presto.

Motivation for Real‑Time Warehouse

Batch‑oriented T+1 processing caused load spikes, delayed data‑quality monitoring, and could not satisfy real‑time reporting, tagging, and API needs. OPPO therefore pursued a real‑time warehouse.

Smooth Migration Strategy

The logical API (tables, SQL + UDF) remains unchanged; only the runtime swaps: Hive → Flink, HDFS → Kafka. By adapting the existing offline pipeline, a real‑time pipeline is obtained.

Why Flink SQL

Flink SQL provides low latency, high throughput, exactly‑once semantics, fault‑tolerant state, and native window/event‑time support. It offers ANSI‑SQL compliance, rich data types, custom source/sink via UDFs, and unified batch‑stream processing.

Metadata Management and Job Orchestration

OPPO built a web‑IDE similar to Hue and adopted Uber’s open‑source AthenaX as a REST‑based SQL manager. AthenaX abstracts job submission (JobStore ↔ YARN) and metadata injection (ExternalCatalog, TableCatalog) into Flink.

Table registration uses TableDescriptor (Connector, Format, Schema) which is matched to a TableFactory via Java SPI. Table metadata is stored in MySQL (basic info, connector, format, schema tables). A custom ExternalCatalog reads these entries, builds the required key‑value map, and registers tables with Flink.

Dimension Table Joins

UDF‑based approach: A TableFunction (e.g., UserDimFunc) loads the entire dimension from MySQL into an in‑memory cache and performs lookups in eval(). Users must rewrite SQL with LATERAL TABLE calls.

SQL‑rewrite approach: Before Flink compiles the query, a parser detects dimension tables and rewrites the join into a stream‑flatMap transformation. A RichFlatMapFunction loads and caches the dimension, enabling generic joins without user‑level UDF code.

Real‑Time Application Cases

Real‑Time ETL Splitting: Raw SDK logs (tens of TB per day) are filtered into business‑specific tables via Flink SQL. Flink optimizes the plan to read the Kafka source only once despite multiple downstream tables.

Real‑Time CTR Statistics: Tumbling windows compute impressions and clicks, derive CTR, and write results to MySQL for downstream dashboards.

Real‑Time Tag Import: User GPS data is aggregated (latest location per 5‑minute window) using an AggregateFunction and written to Elasticsearch for targeting.

Future Directions

End‑to‑end real‑time stream processing where a single SQL submission drives ingestion, computation, and assetization (e.g., auto‑creating Druid data sources and report templates).

Lineage tracing from source channels through real‑time tables/jobs to downstream products.

Unified management layer integrating offline and real‑time warehouses.

Overall, OPPO transitioned from a massive offline data lake to a Flink‑SQL‑driven real‑time data warehouse by preserving the table‑SQL API, replacing Hive with Flink, HDFS with Kafka, and extending Flink with custom metadata catalogs and dimension‑join mechanisms.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

Data Engineering Flink SQL Streaming ETL Real-Time Data Warehouse metadata management

Written by

dbaplus Community

Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.