Master Data Warehouse Architecture: Layers, Naming Rules, and Lifecycle Tips
This guide outlines a comprehensive data model architecture for data warehouses, detailing layer definitions (ODS, CDM, DWD, DWS, ADS), naming conventions, data type standards, partitioning strategies, redundancy rules, and lifecycle management policies to ensure consistency, performance, and maintainability across big‑data environments.
1. Data Model Architecture Specification
The architecture is divided into five logical layers:
ODS (Operational Data Store) : Mirrors source system data, serves as a preparation zone for the DMP.
CDM (Common Data Model) : Includes DWD (detail) and DWS (summary) layers, responsible for data integration, dimension consistency, and reusable analytical fact tables.
DWD (Data Warehouse Detail) : Stores fine‑grained detail data.
DWS (Data Warehouse Summary) : Stores aggregated summary data.
ADS (Application Data Service) : The application‑level data layer.
2. Data Division and Namespace Convention
Names should combine business abbreviations with layer identifiers, providing clear references for projects, tables, and fields. Recommendations include naming by business, by data domain, and by business process (e.g., user_login for a login event).
3. Data Model Principles
High cohesion & low coupling : Group related data together and separate unrelated data.
Separate core and extension models : Core fields support common business; extensions handle specialized needs without polluting the core.
Encapsulate common logic in lower layers : Avoid exposing shared logic to the application layer.
Balance cost and performance : Limited redundancy can improve query speed, but excessive duplication should be avoided.
Rollback‑safe processing : Re‑running jobs must produce identical results.
Consistency : Identical fields across tables must share the same name.
Clear, understandable naming : Table names should be self‑explanatory for downstream users.
4. Public Standards
4.1 Layer Call Convention
Application layers must call data from the DW public layer, never bypass the middle layer to read directly from ODS. ODS data should be accessed via CDM views, and CDM tasks should be limited to ten layers deep. Typically, one task produces one output table.
4.2 Data Type Specification
ODS inherits source system types (e.g., MySQL conversion rules). CDM adopts ODS types for inherited fields. Derived fields follow these rules:
Monetary and decimal values → DOUBLE String values → STRING ID and integer values → BIGINT Time values → STRING (or DATETIME if required)
Status fields →
STRING4.3 Public Field Definition
Standard partition fields:
Daily partition: dt(YYYYMMDD) Hourly partition: hh(00-23) Minute partition: mi(00-59) Boolean flag: is_{business} with values Y / N (no nulls)
4.4 Data Redundancy
Redundant fields should be frequently accessed (≥3 downstream SQLs) and must not cause significant refresh latency. Duplicate fields with >60% overlap are discouraged; use joins instead.
4.5 Data Splitting
Horizontal and vertical splitting is based on access hot‑spots and sparsity of values. Core vs. extension fields are split vertically; high‑frequency columns stay together; low‑frequency columns are separated. Horizontal splitting can use secondary partitions to avoid excessive data copying.
4.6 Null‑Value Handling
Summary metrics: replace nulls with zero.
Dimension attributes: use -99 to indicate unknown.
5. Data Storage and Lifecycle Management
Retention policies are based on the maximum access span within three months:
≤4 days → keep 7 days
≤12 days → keep 15 days
≤30 days → keep 33 days
≤90 days → keep 93 days
≤180 days → keep 183 days
≤365 days → keep 368 days
6. CDM Public Dimension Layer Design
6.1 Design Principles
Maintain consistent field names, types, and content across all physical tables. Exceptions are allowed for aliasing when different business roles require distinct names.
6.2 Dimension Combination & Splitting
Combine highly related attributes in one dimension table.
Separate unrelated attributes into miscellaneous dimensions or keep them in fact tables.
Behavioral dimensions are derived from aggregated metrics and can be redundantly stored in dimension tables.
6.3 Naming Convention
{project}.dim{business/pub}{dimension}[_{custom}]7. CDM Detail Layer Design
7.1 Naming Convention
{project}.dwd{business}{domain}{process}[{custom}]{refresh}{i/f}where i = incremental, f = full.
7.2 Storage & Lifecycle
Detail tables are fact tables stored with daily partitions. Transactional fact tables are retained permanently; snapshot fact tables follow the same retention rules as in section 5.
7.3 Transactional Fact Table Guidelines
Design around specific business events.
Use event timestamp as partition field.
Redundant subsets reduce I/O for upstream queries.
Degenerate dimensions into the fact table to avoid costly joins.
7.4 Snapshot Fact Tables
Used for state or stock analysis; sampled at regular intervals.
7.5 Cumulative Snapshot Fact Tables
Combine multiple business processes (e.g., purchase order flow) to analyze intervals such as payment‑to‑shipment time.
8. CDM Summary Layer Design
8.1 Naming Convention
{project}.dws{business}{domain}{granularity}[{custom}]{period}{i/f}8.2 Storage & Lifecycle
Summary tables are fact tables with daily partitions, following the same retention policy as section 5.
9. Table Design Standards
Good table design reduces storage, computation, and maintenance costs. Key goals:
Lower storage cost by minimizing redundant data.
Lower compute cost by enabling partition pruning and reducing unnecessary reads.
Lower maintenance complexity through consistent naming and clear partitioning.
9.1 Design Steps
Identify project space and business process to decide table type.
Define table description, permissions, and owner.
Choose partitioned vs. non‑partitioned based on data volume and integration patterns.
Define fields and partition columns.
Create/convert the table.
Clarify data import scenarios (batch, streaming, incremental).
Set table and partition lifecycle policies.
9.2 Partition Design Rules
Support adding new partitions but not new partition columns.
Maximum of 60 000 partitions per table.
Multi‑level partitions require all level values to be specified on addition.
Do not rename partition columns; only values can change.
9.3 Partition Field Selection
Prefer time‑based fields (day, hour, minute) and high‑cardinality fields used frequently in WHERE, GROUP BY, or DISTINCT. Avoid fields with >10 000 distinct values that are rarely queried.
select … from ** table where id = 123 and ** …;10. Data Import Channels and Table Design
Supported channels include DataHub, DataX, DTS, Console upload, and SDK INSERT INTO statements. All channels must respect a commit size of roughly 64 MB to avoid small‑file issues.
11. Partition‑to‑Storage Mapping
Each primary partition contains secondary time‑based partitions; each secondary partition stores all columns. Guidelines include limiting the number of partitions, avoiding tiny partitions, and ensuring partition choices facilitate efficient query pruning.
12. General Table & Partition Naming Rules
Unified naming must indicate business type, table purpose (fact, dimension, log, etc.), and entity.
Fields with identical business meaning must share the same data type.
Partition design should prioritize query patterns, avoid skew, and keep partition counts reasonable.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
dbaplus Community
Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
