How to Build a Reusable, Well‑Designed Data Warehouse Model
This article analyzes why analysts and data engineers clash over non-reusable data models, presents metrics such as the cross-layer reference rate and the model reuse coefficient, and outlines a step-by-step framework covering ODS takeover, subject-domain mapping, dimension consistency, fact-table integration, development best practices, and tool support to transform siloed warehouses into a shared data platform.
Background and Problem
Most analysts in companies combine business needs with large-scale data analysis, but since many lack an engineering background they write poor-quality SQL (sometimes with more than five levels of nesting). Such queries consume excessive resources, block scheduling queues, and cause friction with data engineers, who may revoke raw-data access while analysts complain about incomplete data and long waits.
Root Cause
The conflict originates from data models that cannot be reused; data engineers work in a "chimney" style where each new requirement triggers recomputation from raw data, leading to high latency and duplicated effort.
Current State Analysis
Statistics derived from the metadata center illustrate the issue:
Table 1 shows that 2,547 of 6,049 tables (≈42 %) lack layer identification, making them essentially non-reusable.
Among tables with an identified layer, read tasks split across ODS : DWD : DWS : ADS as 1,072 : 545 : 187 : 433, so ODS tasks account for 47.9 % of all reads, indicating heavy reliance on raw data.
Table 2 reveals that 37.8 % of queries hit ODS directly, while DWD, DWS, and ADS layers are severely under‑utilized, causing larger scan volumes, longer query times, and lower user satisfaction.
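As a sketch of how such statistics can be produced, the query below counts read tasks per layer from a metadata store; the table and column names (meta_read_task, meta_table, layer) are assumptions rather than a real metadata-center schema.

```sql
-- Count read tasks per warehouse layer; tables without a layer tag are
-- bucketed as UNIDENTIFIED (schema names here are hypothetical).
SELECT COALESCE(t.layer, 'UNIDENTIFIED')  AS layer,
       COUNT(DISTINCT r.task_id)          AS read_tasks,
       ROUND(100.0 * COUNT(DISTINCT r.task_id)
             / SUM(COUNT(DISTINCT r.task_id)) OVER (), 1) AS pct_of_reads
FROM meta_read_task r
LEFT JOIN meta_table t
       ON r.table_name = t.table_name
GROUP BY COALESCE(t.layer, 'UNIDENTIFIED')
ORDER BY read_tasks DESC;
```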
Key Metrics for Model Quality
Three dimensions form the measurement system:
Completeness: measured by the proportion of ODS tables referenced directly by downstream DWS/ADS/DM layers (the cross-layer reference rate). A lower rate is better; ideally ODS tables are referenced only by DWD.
Reusability: captured by the model reuse coefficient, the average number of downstream models generated per upstream model. Values below 2 indicate poor reuse; above 3 is acceptable.
Standardization: assessed by naming consistency, subject-domain attribution, and dimension-attribute alignment. A sketch of how the first two metrics can be computed over lineage metadata follows this list.
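Assuming a lineage table lineage(upstream_table, downstream_table) and a catalog meta_table(table_name, layer), both hypothetical, the two metrics could be computed as:

```sql
-- Cross-layer reference rate: share of ODS tables read directly by
-- DWS/ADS/DM models (lower is better; ideally ODS feeds only DWD).
SELECT 1.0 * COUNT(DISTINCT CASE WHEN d.layer IN ('DWS', 'ADS', 'DM')
                                 THEN l.upstream_table END)
           / COUNT(DISTINCT l.upstream_table) AS cross_layer_reference_rate
FROM lineage l
JOIN meta_table u ON l.upstream_table   = u.table_name AND u.layer = 'ODS'
JOIN meta_table d ON l.downstream_table = d.table_name;

-- Model reuse coefficient: average number of downstream models per
-- upstream model (below 2 is poor; above 3 is acceptable).
SELECT AVG(downstream_cnt) AS model_reuse_coefficient
FROM (SELECT upstream_table,
             COUNT(DISTINCT downstream_table) AS downstream_cnt
      FROM lineage
      GROUP BY upstream_table) per_model;
```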
Design Principles and Step‑by‑Step Framework
Take over the ODS layer: control source data, enforce a naming convention such as ods_<system>_<table>, and ensure one-to-one mapping with source tables.
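As an illustration, a source table outbound_order in a hypothetical wms system would be taken over like this; the columns are made up, and the point is the enforced name and the 1:1 schema mapping:

```sql
-- ODS table for source system "wms", table "outbound_order" (both
-- hypothetical). The schema mirrors the source table one-to-one.
CREATE TABLE IF NOT EXISTS ods_wms_outbound_order (
  order_id     STRING COMMENT 'source primary key',
  warehouse_id STRING,
  product_id   STRING,
  quantity     BIGINT,
  created_at   TIMESTAMP
)
COMMENT 'Raw 1:1 copy of wms.outbound_order'
PARTITIONED BY (ds STRING COMMENT 'ingestion date');
```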
Define subject domains and build a bus matrix: abstract business processes into domains (e.g., warehousing, trading) and map the analysis dimensions that apply to each domain.
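A simplified bus-matrix excerpt might look like the following, with made-up processes on the rows and shared analysis dimensions on the columns ("x" marks a dimension that applies to the process):

```
Process \ Dimension      Date  Warehouse  Product  Customer
warehousing: inbound      x       x          x
warehousing: outbound     x       x          x         x
trading: order placed     x                  x         x
```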
Construct consistent dimension tables: create global dimension tables (e.g., dim_<domain>_<description>) to avoid duplicated attributes and ensure uniform naming across models.
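A minimal sketch of such a global dimension, assuming a warehousing domain; attribute names are illustrative:

```sql
-- Global warehouse dimension (dim_<domain>_<description>); every model
-- joins to this table instead of redefining warehouse attributes locally.
CREATE TABLE IF NOT EXISTS dim_warehouse_info (
  warehouse_id   STRING COMMENT 'key referenced by all fact tables',
  warehouse_name STRING,
  region         STRING,
  manager        STRING
)
COMMENT 'Single source of truth for warehouse attributes';
```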
Integrate fact tables with consistent grain: keep statistical granularity uniform within a fact table; do not mix product-level and warehouse-level granularity.
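For example, a daily fact at product-in-warehouse grain, with every measure stated at exactly that grain (names are illustrative); warehouse-level totals belong in a separate table:

```sql
-- Grain: one row per product, per warehouse, per day. Mixing in
-- warehouse-level measures would break this grain.
CREATE TABLE IF NOT EXISTS dws_warehousing_product_1d (
  ds           STRING COMMENT 'statistics date',
  warehouse_id STRING COMMENT 'FK to dim_warehouse_info',
  product_id   STRING COMMENT 'FK to a product dimension',
  inbound_qty  BIGINT COMMENT 'units received that day',
  outbound_qty BIGINT COMMENT 'units shipped that day'
)
COMMENT 'Daily product-in-warehouse fact with a single, uniform grain';
```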
Model development best practices, illustrated in the sketch after this sub-list:
Configure explicit task dependencies to avoid downstream jobs running on incomplete data.
Delete temporary tables at task end to free space.
Align task names with table names for traceability.
Manage lifecycle: retain full history for ODS/DWD, set 7‑30 day TTL for DWS/ADS/DM.
Compress DWD tables (e.g., LZO) to reduce storage.
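A rough Hive-flavored sketch of the lifecycle and compression items above; the hadoop-lzo input/output formats are the usual way to store LZO text in Hive, while the "lifecycle" TTL property is platform-specific, so check how your scheduler expresses retention:

```sql
-- DWD detail table stored as LZO-compressed text (requires hadoop-lzo).
CREATE TABLE IF NOT EXISTS dwd_trading_order_detail (
  order_id   STRING,
  product_id STRING,
  amount     DECIMAL(18, 2)
)
PARTITIONED BY (ds STRING)
STORED AS INPUTFORMAT 'com.hadoop.mapred.DeprecatedLzoTextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat';

-- 30-day retention on a DWS summary table; 'lifecycle' is a
-- scheduler-level table property here, not core Hive syntax.
ALTER TABLE dws_trading_order_1d SET TBLPROPERTIES ('lifecycle' = '30');
```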
Application migration: compare data between old and new tables to verify consistency before switching applications over, then retire the legacy tables; a minimal check is sketched below.
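A minimal pre-switch consistency check, comparing per-day row counts and a key aggregate between a legacy table and its replacement (both table names are placeholders); any rows returned are discrepancies to investigate:

```sql
-- Compare daily row counts and totals; rows returned indicate mismatches.
SELECT COALESCE(o.ds, n.ds)            AS ds,
       o.row_cnt                       AS legacy_rows,
       n.row_cnt                       AS new_rows,
       o.total_amount - n.total_amount AS amount_diff
FROM (SELECT ds, COUNT(*) AS row_cnt, SUM(amount) AS total_amount
      FROM legacy_order_summary GROUP BY ds) o
FULL OUTER JOIN
     (SELECT ds, COUNT(*) AS row_cnt, SUM(amount) AS total_amount
      FROM dws_trading_order_1d GROUP BY ds) n
  ON o.ds = n.ds
WHERE COALESCE(o.row_cnt, -1) <> COALESCE(n.row_cnt, -1)
   OR COALESCE(o.total_amount, -1) <> COALESCE(n.total_amount, -1);
```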
Tool Support
The EasyDesign modeling tool, built on metadata‑center APIs, automates the above workflow: it calculates completeness, reuse, and standardization metrics, manages dimension/measure dictionaries, and enforces approval processes for model changes.
Conclusion
Completeness, reusability, and standardization constitute a quantitative framework for evaluating data‑platform models. Proper dimension design is the foundation of a shared data middle platform, while fact‑table grain consistency prevents analytical errors. Building such a platform typically takes six months to a year, but it can shrink average data‑request delivery from a week to three days, dramatically boosting operational efficiency.