Data Engineering and Data Warehouse Design: Principles, Practices, and Governance

This article outlines data-engineering and warehouse-design principles: data collection (the four Ws and methods such as embedded tracking code, SDKs, and binlog), reporting strategies, source selection, modeling with fact, light-aggregation, dimension, and model tables, quality checks, and governance practices such as standardized SDKs, metric libraries, automated lineage, and cost optimization. The aim is to share experience that other organizations can adapt.

Tencent Cloud Developer

This article covers what data engineers do and how to design and build a good data warehouse, drawing on experience from content, e-commerce, and community e-commerce businesses.

Data Collection

Data originates from data collection and reporting. Collection covers the four W’s (When, Where, Who, What). When includes operation time, collection time, interaction time, and reporting time. Where is usually identified by IP or GPS. Who refers to user accounts and device IDs. What captures page views, exposures, clicks, and related business parameters.
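To make the four Ws concrete, here is a minimal sketch of an event record with one field group per W. The class name, field names, and the choice of epoch milliseconds are illustrative assumptions, not a schema taken from the article.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class TrackingEvent:
    """Hypothetical event record; all field names are assumptions."""

    # When: the four timestamps named in the text, as epoch milliseconds.
    operation_time_ms: int
    collection_time_ms: int
    interaction_time_ms: int
    report_time_ms: int
    # Who: a device ID, plus an optional account for signed-in users.
    device_id: str
    # What: the action type (page view, exposure, click) ...
    action: str
    user_id: Optional[str] = None
    # Where: IP and/or GPS coordinates as (latitude, longitude).
    ip: Optional[str] = None
    gps: Optional[Tuple[float, float]] = None
    # ... and the related business parameters.
    params: Dict[str, str] = field(default_factory=dict)


def report_delay_ms(event: TrackingEvent) -> int:
    """Gap between when the user acted and when the event was reported."""
    return event.report_time_ms - event.operation_time_ms
```

Keeping the timestamps separate makes it possible to measure reporting delay later, which is one reason to record more than a single "time" field.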

Collection methods include:

Embedded (tracking-point) collection: developers add collection code by hand at specific points in the application; low cost, but it can lead to inconsistent logic.

SDK collection: a unified SDK standardizes collection timing and parameters; higher upfront cost, but it ensures consistency.

BINLOG collection: captures every database change; no developer effort is needed, but downstream processing becomes complex.
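The consistency argument for SDK collection can be sketched as follows: callers pass only the action and its business parameters, while the SDK stamps the common fields the same way every time. `CollectorSDK`, `ALLOWED_ACTIONS`, and the field names are hypothetical and do not describe any particular SDK.

```python
import time
from typing import Any, Callable, Dict, List, Optional

# Assumed action vocabulary; a real SDK would define its own.
ALLOWED_ACTIONS = {"page_view", "exposure", "click"}


class CollectorSDK:
    """Hypothetical unified collector that standardizes common fields."""

    def __init__(
        self,
        device_id: str,
        user_id: Optional[str] = None,
        clock: Callable[[], float] = time.time,
    ) -> None:
        self.device_id = device_id
        self.user_id = user_id
        self._clock = clock  # injectable so tests can use a fixed time
        self.buffer: List[Dict[str, Any]] = []

    def track(self, action: str, **params: Any) -> Dict[str, Any]:
        # Reject unknown actions so every caller uses the same vocabulary.
        if action not in ALLOWED_ACTIONS:
            raise ValueError(f"unknown action: {action}")
        event = {
            "action": action,
            "device_id": self.device_id,
            "user_id": self.user_id,
            "collection_time_ms": int(self._clock() * 1000),
            "params": dict(params),
        }
        self.buffer.append(event)  # buffered for a later reporting step
        return event
```

With hand-written tracking code, each call site would build this dictionary itself, which is where inconsistent logic tends to creep in.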

Data Reporting

Reporting can be performed by the client (frontend) or the backend. Client reporting may lose data due to network issues or process termination; retry mechanisms can cause duplicates. Backend reporting is more reliable for critical actions.
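One common way to tolerate retry duplicates is to assign each event an ID once, at creation, and drop repeats on the receiving side. The sketch below assumes that approach; the article does not state which deduplication method it uses, and `new_event`, `deduplicate`, and the `event_id` field are illustrative names.

```python
import uuid
from typing import Any, Dict, Iterable, List


def new_event(action: str) -> Dict[str, Any]:
    # The ID is generated once; a retry resends the same event with the same ID.
    return {"event_id": uuid.uuid4().hex, "action": action}


def deduplicate(events: Iterable[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Keep the first occurrence of each event_id, preserving arrival order."""
    seen = set()
    unique = []
    for event in events:
        if event["event_id"] in seen:
            continue
        seen.add(event["event_id"])
        unique.append(event)
    return unique
```

An in-memory set only works for a bounded batch; a long-running pipeline would need a windowed or persistent store for the seen IDs.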

BINLOG reporting integrates collection and reporting, often sending data to a message queue such as Kafka.
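Part of the downstream complexity of change-log data is that consumers see a stream of changes rather than current rows, so they must replay the changes in order to recover state. The sketch below uses a simplified, hypothetical change record (`op`, `pk`, `after`); real binlog and message formats differ and carry more detail.

```python
from typing import Any, Dict, Iterable


def apply_changes(changes: Iterable[Dict[str, Any]]) -> Dict[Any, Dict[str, Any]]:
    """Replay ordered change events into the latest row state per primary key."""
    state: Dict[Any, Dict[str, Any]] = {}
    for change in changes:
        key = change["pk"]
        if change["op"] in ("insert", "update"):
            state[key] = change["after"]  # full row image after the change
        elif change["op"] == "delete":
            state.pop(key, None)
        else:
            raise ValueError(f"unknown op: {change['op']}")
    return state
```

The result is only correct if changes for the same key are applied in their original order, which is something the consumer has to guarantee.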

Data Source Selection

Raw business database data is not used directly in the warehouse; it first passes through collection and processing. For low-latency, high-accuracy scenarios, querying the business database directly may be preferable; otherwise, the warehouse provides richer analytics.

Data Modeling

Typical tables are classified into:

Fact tables (event-log style tables) that record atomic user actions.

Light aggregation tables that standardize metric calculations and retain key deduplication fields.

Dimension tables that store reference data such as user profiles.

Model tables for specific use‑cases like user models or funnel analysis.
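The role of the light aggregation layer can be illustrated with a small sketch: fact rows are rolled up to a (day, user, action) grain, and because the user ID is retained, distinct-user metrics can still be computed downstream. The row shapes and function names are assumptions for illustration only.

```python
from collections import Counter
from typing import Any, Dict, Iterable, List


def light_aggregate(facts: Iterable[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Roll fact rows up to (day, user_id, action), keeping user_id for dedup."""
    counts: Counter = Counter()
    for fact in facts:
        counts[(fact["day"], fact["user_id"], fact["action"])] += 1
    return [
        {"day": day, "user_id": user_id, "action": action, "cnt": cnt}
        for (day, user_id, action), cnt in sorted(counts.items())
    ]


def daily_metrics(agg_rows: Iterable[Dict[str, Any]], action: str) -> Dict[str, Dict[str, int]]:
    """Compute total events (pv) and distinct users (uv) per day for one action."""
    pv: Dict[str, int] = {}
    users: Dict[str, set] = {}
    for row in agg_rows:
        if row["action"] != action:
            continue
        pv[row["day"]] = pv.get(row["day"], 0) + row["cnt"]
        users.setdefault(row["day"], set()).add(row["user_id"])
    return {day: {"pv": pv[day], "uv": len(users[day])} for day in pv}
```

If the aggregation dropped `user_id` and stored only daily totals, the distinct-user count could no longer be derived without going back to the fact table.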

Data Quality

Quality is evaluated at the source and at the warehouse. Source quality follows a three‑level checklist: presence, format correctness, and business‑level correctness. Warehouse quality includes accuracy, timeliness, consistency, and usability.
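The three-level source checklist can be sketched as a validator that stops at the first failing level: presence, then format, then business-level correctness. The required fields, the device-ID pattern, and the business rule below are illustrative assumptions rather than rules from the article.

```python
import re
from typing import Any, Dict, List

# Assumed required fields and ID format, for illustration only.
REQUIRED_FIELDS = ("device_id", "action", "operation_time_ms", "report_time_ms")
DEVICE_ID_PATTERN = re.compile(r"^[0-9a-f]{8}$")


def check_event(event: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means the event passed."""
    # Level 1: presence. Every required field exists and is non-empty.
    missing = [
        f"missing:{name}"
        for name in REQUIRED_FIELDS
        if event.get(name) in (None, "")
    ]
    if missing:
        return missing

    # Level 2: format. Values have the expected type and shape.
    bad_format = []
    if not DEVICE_ID_PATTERN.match(str(event["device_id"])):
        bad_format.append("format:device_id")
    for name in ("operation_time_ms", "report_time_ms"):
        if not isinstance(event[name], int):
            bad_format.append(f"format:{name}")
    if bad_format:
        return bad_format

    # Level 3: business correctness. Here: reporting cannot precede the action.
    if event["report_time_ms"] < event["operation_time_ms"]:
        return ["business:report_before_operation"]
    return []
```

Checking the levels in order keeps error reports focused: a business-rule failure is only reported for events that are already complete and well-formed.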

Data Governance

Governance covers reporting governance, parameter governance, metric governance, process governance (DataOps), cost optimization, and value circulation. Key practices include standardizing collection SDKs, consolidating duplicated parameters, establishing a unified metric library, automating data lineage, monitoring anomalies, and pruning low‑usage tables.
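As one example of the cost-optimization practice, pruning low-usage tables can start from a simple candidate list built from table metadata. The metadata fields (`reads_last_90d`, `age_days`, `size_gb`) and the thresholds are hypothetical; the article does not specify how usage is measured.

```python
from typing import Any, Dict, Iterable, List


def pruning_candidates(
    tables: Iterable[Dict[str, Any]],
    max_reads: int = 0,
    min_age_days: int = 90,
) -> List[Dict[str, Any]]:
    """List rarely read, sufficiently old tables, largest first."""
    candidates = [
        table
        for table in tables
        if table["reads_last_90d"] <= max_reads and table["age_days"] >= min_age_days
    ]
    # Largest tables first, since they free the most storage if removed.
    return sorted(candidates, key=lambda table: table["size_gb"], reverse=True)
```

The output is a review list, not a deletion list: lineage information would normally be checked before any table is dropped.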

Conclusion

The author summarizes years of experience in data reporting, warehouse architecture, batch‑stream integration, and data industrialization, offering a methodology that can be adapted to any organization.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
