OPPO Commercial Data System Construction Practice: Platform, Ingestion, Development, Governance, and Analytics

This article presents OPPO's commercial data system construction practice, covering the data platform strategy, ingestion pipelines, development efficiency toolkits, data validation, visualization aids, UDF principles, warehouse architecture, metric systems, dimensional modeling, ETL optimization, governance metadata, quality management, monitoring, attribution services, analytics reporting, and a Q&A session.

DataFunSummit

May 29, 2022

Guest speaker Qiu Shengchang, who leads commercial data R&D at OPPO, shared how OPPO built its commercial data system. The talk is organized into six sections: data platform, data ingestion, data development, data governance, data application, and data analysis.

Data Platform : The commercial data team uses the company‑provided data platform as a foundation, following the principle “use if available, don’t wait; migrate if possible; contribute if you can”. The team prefers platform capabilities where they exist, builds its own solutions rather than waiting when they do not, migrates to the platform once equivalent capabilities become available, and contributes its own tools back where it can.

Development‑Efficiency Toolkit : High‑frequency developer actions (e.g., connecting to Hive/Spark, fetching max partitions, batch partition operations) are abstracted into tools that reduce manual steps, improve experience, prevent errors, and record metadata for audit and permission governance. The login‑assist tool provides a unified entry, auto‑fills credentials, and logs ownership information for each table.
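
As an illustration of the "fetch max partition" helper mentioned above, here is a minimal sketch in Python. It assumes the partition listing has already been fetched as strings of the form `dt=YYYYMMDD` (the shape Hive's `SHOW PARTITIONS` prints for a single date partition column); the `dt` column name and the function name `max_partition` are hypothetical and not taken from the talk.

```python
import re
from typing import Iterable, Optional

# Assumed partition format: a single date partition column named "dt".
_PARTITION_RE = re.compile(r"^dt=(\d{8})$")


def max_partition(partitions: Iterable[str]) -> Optional[str]:
    """Return the latest YYYYMMDD value among the partitions, or None if none match."""
    days = []
    for partition in partitions:
        match = _PARTITION_RE.match(partition.strip())
        if match:  # silently skip lines that are not dt=YYYYMMDD
            days.append(match.group(1))
    # YYYYMMDD strings sort chronologically, so max() on strings is enough.
    return max(days) if days else None
```

Wrapping this lookup in a tool, rather than having each developer type the query by hand, is what allows the toolkit to also record who asked for which table.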

Data‑Validation Toolkit : To ensure completeness, consistency, and correctness, validation tools compare tables, monitor fluctuations, and raise automatic alerts that name the responsible owner and the affected downstream data, so that issues can be located and resolved quickly.
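
A table comparison of the kind described above can be sketched as follows. This is a simplified, in-memory illustration, assuming each table has been loaded as a mapping from primary key to row values; the function name `compare_tables` and the result keys are hypothetical, and a production tool would run the comparison inside the warehouse engine instead.

```python
from typing import Any, Dict, Hashable, List


def compare_tables(
    source: Dict[Hashable, Any], target: Dict[Hashable, Any]
) -> Dict[str, List[Hashable]]:
    """Compare two keyed tables and report completeness and consistency gaps."""
    # Keys present in the source but absent from the target (completeness).
    missing = sorted(k for k in source if k not in target)
    # Keys present only in the target (unexpected rows).
    extra = sorted(k for k in target if k not in source)
    # Keys present in both whose values differ (consistency).
    mismatched = sorted(k for k in source if k in target and source[k] != target[k])
    return {"missing": missing, "extra": extra, "mismatched": mismatched}
```

An empty result in all three lists means the two tables agree on both keys and values.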

Visualization Toolkit : Email and the company’s instant‑messaging tool (TT) are used to push data to users. Public accounts for quality management, revenue monitoring, BI, and decision support are created, and robot accounts automatically post announcements or alerts in relevant TT groups.

UDF Development Principles : While the platform provides generic UDFs, custom needs are built in‑house. Principles include using temporary functions for quick deployment, avoiding “backend thinking” by returning NULL on data anomalies instead of throwing exceptions, and handling data quality in validation steps rather than in UDFs.
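
The "return NULL instead of throwing" principle can be illustrated with a small function. The sketch below uses plain Python as a stand-in for the UDF language, with `None` playing the role of SQL NULL; the function `parse_amount_cents` and its rules (non-negative, finite amounts) are hypothetical examples, not UDFs from the talk.

```python
from decimal import Decimal, InvalidOperation
from typing import Optional


def parse_amount_cents(raw: Optional[str]) -> Optional[int]:
    """Convert a currency string such as "12.34" to integer cents.

    Any anomaly (missing, unparsable, non-finite, or negative input) yields
    None rather than an exception, so one bad row cannot fail the whole job.
    """
    if raw is None:
        return None
    try:
        value = Decimal(raw.strip())
    except InvalidOperation:
        return None  # unparsable text becomes NULL
    if not value.is_finite() or value < 0:
        return None  # NaN, Infinity, and negative amounts become NULL
    return int((value * 100).to_integral_value())
```

Counting how many rows came back as NULL then belongs to a separate validation step, which keeps the UDF itself simple.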

Data Ingestion : The goal is a fully configurable, middleware‑based, stream‑batch unified ingestion layer. Key practices include mandatory volume checks at source ingestion, ingesting all fields to avoid repeated schema changes, retaining historical partitions in the buffer layer, and using a four‑type SRC/DST model for data movement.
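
The mandatory volume check at source ingestion might look like the sketch below. It assumes row counts for the source and the ingested partition are already available; the function name `volume_check` and the `tolerance` parameter are hypothetical, and the talk does not say whether any deviation is tolerated.

```python
def volume_check(source_rows: int, ingested_rows: int, tolerance: float = 0.0) -> bool:
    """Return True when the ingested row count is within tolerance of the source.

    tolerance is the allowed relative deviation (0.0 means counts must match).
    """
    if source_rows == 0:
        # An empty source must produce an empty partition.
        return ingested_rows == 0
    deviation = abs(source_rows - ingested_rows) / source_rows
    return deviation <= tolerance
```

For example, `volume_check(1000, 990)` fails under the default exact-match setting, while `volume_check(1000, 990, tolerance=0.02)` passes.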

Data Development :

The warehouse architecture is based on dimensional modeling and supports both offline data warehouses and real‑time systems for revenue monitoring, experiments, and ad serving.

Metric system construction varies by business domain; advertising metrics are built around conversion tracking, with core KPIs derived from the OSM (Objective, Strategy, Measurement) model.

Dimensional modeling follows a star schema, ensuring consistent dimensions across models for cross‑analysis.

ETL chain optimization includes three design patterns: fully independent report models (high efficiency, high complexity), unified model layer (simple structure, potential contention), and a hybrid decoupled approach that balances complexity and performance.

Data Governance :

Metadata is classified into business and technical categories to support various roles.

Technical metadata (Hive/DB) is used for schema checks and quality monitoring.

Scheduling metadata drives quality control of data pipelines.

Quality management combines platform tools with custom solutions, covering monitoring for data that has not been published, dependency checks, and fluctuation alerts.

Data Application :

Fluctuation monitoring addresses three pain points: alarm fatigue, low accuracy, and hard‑to‑configure thresholds. It uses a three‑condition method (ring ratio, i.e. period‑over‑period change, plus day‑over‑day and week‑over‑week change) together with thresholds tiered by amount.
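
One way to read the three-condition method is sketched below. It rests on several assumptions the talk does not confirm: an alert fires only when all three comparisons exceed the threshold, the tier is chosen by the largest of the compared amounts, and larger amounts get tighter thresholds. The tier values and all function names are hypothetical.

```python
from typing import Sequence, Tuple

# Hypothetical tiers: (minimum amount, allowed relative change), largest first.
TIERS: Sequence[Tuple[float, float]] = (
    (1_000_000.0, 0.05),
    (100_000.0, 0.10),
    (0.0, 0.30),
)


def threshold_for(amount: float, tiers: Sequence[Tuple[float, float]] = TIERS) -> float:
    """Pick the allowed relative change for the tier the amount falls into."""
    for floor, limit in tiers:
        if amount >= floor:
            return limit
    return tiers[-1][1]  # amounts below every floor use the loosest tier


def relative_change(current: float, baseline: float) -> float:
    """Absolute relative change of current against baseline."""
    if baseline == 0:
        return 0.0 if current == 0 else float("inf")
    return abs(current - baseline) / abs(baseline)


def should_alert(
    current: float, previous_period: float, yesterday: float, last_week: float
) -> bool:
    """Alert only if all three comparisons exceed the tiered threshold."""
    baselines = (previous_period, yesterday, last_week)
    limit = threshold_for(max(current, *baselines))
    return all(relative_change(current, b) > limit for b in baselines)
```

Requiring all three conditions to agree trades some sensitivity for fewer false alarms, which is one plausible way to address the alarm-fatigue problem described above.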

Ad attribution service links conversion events to ad clicks through a three‑layer architecture (data layer, strategy layer, service layer), handling ID mapping, standardization, anti‑fraud, and rule‑based attribution.
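
To make rule-based attribution concrete, the sketch below implements one common rule, last-click attribution within a lookback window. The talk does not state which rules OPPO applies, so the rule itself, the 7-day window, matching on a single `device_id`, and all class and function names are assumptions for illustration only.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Click:
    device_id: str
    ad_id: str
    ts: int  # event time, epoch seconds


@dataclass(frozen=True)
class Conversion:
    device_id: str
    ts: int  # event time, epoch seconds


def attribute_last_click(
    conversion: Conversion,
    clicks: Iterable[Click],
    window_seconds: int = 7 * 24 * 3600,  # assumed 7-day lookback window
) -> Optional[Click]:
    """Return the most recent click on the same device within the window."""
    candidates = [
        c
        for c in clicks
        if c.device_id == conversion.device_id
        and 0 <= conversion.ts - c.ts <= window_seconds
    ]
    if not candidates:
        return None  # no eligible click: the conversion stays unattributed
    return max(candidates, key=lambda c: c.ts)
```

In the three-layer architecture described above, ID mapping and anti-fraud filtering would run before a rule like this one, so that `clicks` contains only standardized, trusted events.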

Data Analysis :

Business users are trained to extract data themselves, which enables self‑service extraction and simple analysis.

Micro‑analysis reports provide concise daily dashboards with revenue, efficiency, and fluctuation insights for quick business decisions.

Q&A :

Metric naming must be unique and follow system‑provided conventions.

Stream‑batch integration currently exists only at the ingestion layer (MySQL → Hive) with a single engine.

Code scanning can be implemented via custom regex rules on Hive‑stored code fields.
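
The regex-based code scanning mentioned in the last answer could be sketched as follows, assuming the SQL text of each job has already been read out of its Hive code field as a string. The two rules shown and the names `RULES` and `scan_code` are hypothetical examples rather than rules from the talk.

```python
import re
from typing import Dict, List, Pattern

# Hypothetical rule set: rule name -> pattern that flags a violation.
RULES: Dict[str, Pattern[str]] = {
    "select_star": re.compile(r"\bselect\s+\*", re.IGNORECASE),
    "drop_table": re.compile(r"\bdrop\s+table\b", re.IGNORECASE),
}


def scan_code(code: str) -> List[str]:
    """Return the sorted names of all rules the given code violates."""
    return sorted(name for name, pattern in RULES.items() if pattern.search(code))
```

Adding a check is then just a matter of registering another named pattern, although regexes match text rather than parsed SQL and can misfire on comments or string literals.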

Thank you for listening.
