How OPPO Built a Real‑Time Data Warehouse with Flink SQL
This article details{32-64 words} OPPO's evolution from an offline data warehouse to a real‑time platform, describing the business scale, data‑mid platform architecture, migration strategy using Flink SQL, extensions like AthenaX, and practical use cases such as real‑time ETL, CTR calculation, and tag import.
Background and Data Platform
OPPO’s ecosystem (ColorOS, app store, browser, etc.) generates >100 PB of data and >200 TB per day. To manage this volume, OPPO built a four‑layer data‑mid platform: (1) a unified tool chain covering ingest‑govern‑develop‑consume; (2) a classic data‑warehouse with raw, detail, summary, and application layers; (3) a global data‑asset layer (ID‑mapping, user tags); and (4) scenario‑driven data products.
Offline Warehouse Architecture
Data from devices, logs, and databases is ingested by a high‑throughput Apache NiFi pipeline into HDFS (raw layer). Hourly Hive ETL jobs create the detail layer; daily Hive jobs aggregate to the summary layer. The application layer serves reports, user profiles, and APIs, while the detail layer also supports ad‑hoc queries via Presto.
Motivation for Real‑Time Warehouse
Batch‑oriented T+1 processing caused load spikes, delayed data‑quality monitoring, and could not satisfy real‑time reporting, tagging, and API needs. OPPO therefore pursued a real‑time warehouse.
Smooth Migration Strategy
The logical API (tables, SQL + UDF) remains unchanged; only the runtime swaps: Hive → Flink, HDFS → Kafka. By adapting the existing offline pipeline, a real‑time pipeline is obtained.
Why Flink SQL
Flink SQL provides low latency, high throughput, exactly‑once semantics, fault‑tolerant state, and native window/event‑time support. It offers ANSI‑SQL compliance, rich data types, custom source/sink via UDFs, and unified batch‑stream processing.
Metadata Management and Job Orchestration
OPPO built a web‑IDE similar to Hue and adopted Uber’s open‑source AthenaX as a REST‑based SQL manager. AthenaX abstracts job submission (JobStore ↔ YARN) and metadata injection (ExternalCatalog, TableCatalog) into Flink.
Table registration uses TableDescriptor (Connector, Format, Schema) which is matched to a TableFactory via Java SPI. Table metadata is stored in MySQL (basic info, connector, format, schema tables). A custom ExternalCatalog reads these entries, builds the required key‑value map, and registers tables with Flink.
Dimension Table Joins
UDF‑based approach: A TableFunction (e.g., UserDimFunc) loads the entire dimension from MySQL into an in‑memory cache and performs lookups in eval(). Users must rewrite SQL with LATERAL TABLE calls.
SQL‑rewrite approach: Before Flink compiles the query, a parser detects dimension tables and rewrites the join into a stream‑flatMap transformation. A RichFlatMapFunction loads and caches the dimension, enabling generic joins without user‑level UDF code.
Real‑Time Application Cases
Real‑Time ETL Splitting: Raw SDK logs (tens of TB per day) are filtered into business‑specific tables via Flink SQL. Flink optimizes the plan to read the Kafka source only once despite multiple downstream tables.
Real‑Time CTR Statistics: Tumbling windows compute impressions and clicks, derive CTR, and write results to MySQL for downstream dashboards.
Real‑Time Tag Import: User GPS data is aggregated (latest location per 5‑minute window) using an AggregateFunction and written to Elasticsearch for targeting.
Future Directions
End‑to‑end real‑time stream processing where a single SQL submission drives ingestion, computation, and assetization (e.g., auto‑creating Druid data sources and report templates).
Lineage tracing from source channels through real‑time tables/jobs to downstream products.
Unified management layer integrating offline and real‑time warehouses.
Overall, OPPO transitioned from a massive offline data lake to a Flink‑SQL‑driven real‑time data warehouse by preserving the table‑SQL API, replacing Hive with Flink, HDFS with Kafka, and extending Flink with custom metadata catalogs and dimension‑join mechanisms.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
dbaplus Community
Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
