
Mastering Big Data Modeling: From ER and Dimensional to Data Vault and Alibaba’s OneData

This comprehensive guide explains why data modeling is essential for big‑data systems, compares relational and OLAP approaches, details ER, dimensional, Data Vault and Anchor methodologies, and walks through Alibaba’s multi‑stage data‑model practice, integration framework, dimension design, fact‑table strategies and aggregation techniques.

Data Thinking Notes

Chapter 1 Big Data Modeling Overview

1.1 Why data modeling is needed

Classifying, organizing and storing data in a structured way is a major challenge at big-data scale.

Data models emphasize reasonable storage from business, access and usage perspectives.

Modeling balances performance, cost and efficiency.

1.2 Relational Database Systems and Data Warehouses

1.3 Choosing a modeling methodology based on OLTP vs OLAP

OLTP focuses on random reads and writes and usually adopts 3NF entity-relationship models to reduce redundancy and keep transactional data consistent.

OLAP focuses on batch reads and writes. It cares about data integration and about performance for complex, large-scale queries, so it calls for different modeling methods.

1.4 Typical Data Warehouse Modeling Methodologies

1.4.1 ER Model

Goal: integrate data across systems, group it by business theme and keep it consistent for analysis. The resulting model is not directly usable for decision-making.

Typical example: Teradata’s FS‑LDM for financial services, abstracting business into ten themes.

1.4.2 Dimensional Model

Built from analysis needs, focuses on fast query performance; typical star schema and snowflake schema.

Design steps:

Select business processes for analysis (e.g., payment, refund, balance).

Choose grain (level of detail).

Identify dimension tables and attributes for grouping/filtering.

Select facts (metrics) to measure.
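The four design steps can be sketched as a tiny star schema. This is a minimal illustration using Python's built-in sqlite3 module. The table names, column names (dim_seller, fact_payment, pay_amount) and sample rows are hypothetical and are not taken from any Alibaba model.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Step 3: a dimension table with attributes used for grouping and filtering
CREATE TABLE dim_seller (
    seller_id   INTEGER PRIMARY KEY,
    seller_name TEXT,
    region      TEXT
);
-- Steps 1, 2, 4: business process = payment, grain = one row per paid order,
-- fact = pay_amount
CREATE TABLE fact_payment (
    order_id   INTEGER,
    seller_id  INTEGER REFERENCES dim_seller(seller_id),
    pay_date   TEXT,
    pay_amount REAL
);
INSERT INTO dim_seller VALUES
    (1, 'Shop A', 'East'), (2, 'Shop B', 'West'), (3, 'Shop C', 'East');
INSERT INTO fact_payment VALUES
    (100, 1, '2024-01-01', 20.0),
    (101, 2, '2024-01-01', 35.0),
    (102, 3, '2024-01-02', 15.0),
    (103, 1, '2024-01-02', 30.0);
""")

# A typical star-schema query: join the fact to a dimension, group by an attribute.
rows = conn.execute("""
    SELECT d.region, SUM(f.pay_amount)
    FROM fact_payment AS f
    JOIN dim_seller AS d ON f.seller_id = d.seller_id
    GROUP BY d.region
    ORDER BY d.region
""").fetchall()
pay_by_region = dict(rows)
```

Here pay_by_region maps each region to its total payment amount. A snowflake schema would further split region into its own table referenced from dim_seller.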

1.4.3 Data Vault Model

Emphasizes an auditable base layer, preserving history, traceability and atomicity without excessive consistency processing.

Organizes enterprise data by subject and adds further normalization for extensibility.

1.4.4 Anchor Model

Further normalizes Data Vault. It is designed for high extensibility: the model is extended by adding new structures rather than modifying existing ones, which pushes it to 6NF, essentially a key-value structure.
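The "add, do not modify" idea can be sketched as follows, assuming one table per attribute in the 6NF style. The table names (customer_anchor, customer_name, customer_email) are hypothetical, and the sketch uses Python's built-in sqlite3 module rather than any Anchor modeling tool.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- The anchor holds only the identity of the entity.
CREATE TABLE customer_anchor (customer_id INTEGER PRIMARY KEY);
-- Each attribute lives in its own table, keyed by entity and change time.
CREATE TABLE customer_name (
    customer_id INTEGER REFERENCES customer_anchor(customer_id),
    name        TEXT,
    valid_from  TEXT,
    PRIMARY KEY (customer_id, valid_from)
);
INSERT INTO customer_anchor VALUES (1);
INSERT INTO customer_name VALUES (1, 'Alice', '2024-01-01');
""")

def table_definitions(connection):
    """Return {table name: CREATE statement} for all tables."""
    query = "SELECT name, sql FROM sqlite_master WHERE type = 'table'"
    return dict(connection.execute(query).fetchall())

schema_before = table_definitions(conn)

# Extension: a new attribute becomes a new table; existing tables are untouched.
conn.executescript("""
CREATE TABLE customer_email (
    customer_id INTEGER REFERENCES customer_anchor(customer_id),
    email       TEXT,
    valid_from  TEXT,
    PRIMARY KEY (customer_id, valid_from)
);
INSERT INTO customer_email VALUES (1, 'alice@example.com', '2024-02-01');
""")
schema_after = table_definitions(conn)

# Reading the entity back requires one join per attribute.
customer = conn.execute("""
    SELECT n.name, e.email
    FROM customer_anchor AS a
    LEFT JOIN customer_name  AS n ON n.customer_id = a.customer_id
    LEFT JOIN customer_email AS e ON e.customer_id = a.customer_id
""").fetchall()
```

The trade-off is visible in the last query: extension is cheap, but every attribute read costs a join.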

1.5 Alibaba Data Model Practice Overview

Phase 1: Built on Oracle, data served reporting needs.

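Phase 2: Introduced a Greenplum MPP architecture with four layers: ODL (operational data layer), BDL (base data layer), IDL (interface data layer) and ADL (application data layer). ER modeling was attempted, but rapid business change, personnel turnover and incomplete domain knowledge made it risky.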

Phase 3: Adopted Hadoop and MaxCompute; used Kimball dimensional modeling as core, constructing a public‑layer data architecture.

Chapter 2 Alibaba Data Integration and Management System

Efficient modeling and systematic storage are required to handle explosive data growth, avoid duplication, ensure consistency and maintain standards.

2.1 Overview

Core: from business architecture design to model design, from data development to data service, achieving manageable, traceable and non‑redundant data.

2.1.1 Position and Value

Build a unified, standardized public data layer, made up of ODS (operational data store), DWD (detail) and DWS (summary), that supports Alibaba's big-data system through data services and data products.

Business partitions are defined by business attributes with minimal overlap.

Standard definitions provide a naming system used in model design.

Model design is based on dimensional theory, building consistent dimensions and facts on a dimensional bus architecture.

2.2 Standard Definitions

Based on dimensional modeling, a bus matrix defines data domains, business processes, dimensions, measures/atomic metrics, modifier types, modifier words and time cycles.

2.2.1 Terminology

Data domain (subject domain): abstract collection of business processes or dimensions for analysis. Common domains: user, channel, marketing, traffic, transaction, finance, product.

Business process: indivisible business event (order, payment, refund).

Time cycle: defines statistical period (last 30 days, natural week, current day).

Modifier type: abstract classification of modifier words (e.g., device type).

Modifier word: business scenario qualifier attached to a modifier type.

Measure/Atomic metric: indivisible metric with clear business meaning (e.g., payment amount).

Dimension: environment for measures, such as geographic or time dimensions.

Dimension attribute: column belonging to a dimension, used for grouping, filtering and labeling.

Derived metric = atomic metric + optional modifier words + time cycle + grain.
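The derived-metric formula can be read as a small specification that is applied to detail rows. The sketch below is a hypothetical illustration: the class, field names and sample rows are assumptions, not part of the OneData tool.

```python
from dataclasses import dataclass
from datetime import date, timedelta

@dataclass(frozen=True)
class DerivedMetric:
    atomic_metric: str     # column holding the atomic metric, e.g. "pay_amt"
    time_cycle_days: int   # time cycle, e.g. 30 for "last 30 days"
    grain: str             # column that defines the statistical grain
    modifiers: tuple = ()  # optional (column, value) pairs, e.g. (("device", "wireless"),)

def compute(metric, rows, as_of):
    """Sum the atomic metric per grain value, within the time cycle and modifiers."""
    start = as_of - timedelta(days=metric.time_cycle_days - 1)
    totals = {}
    for row in rows:
        if not (start <= row["date"] <= as_of):
            continue  # outside the time cycle
        if any(row[col] != value for col, value in metric.modifiers):
            continue  # does not match a modifier word
        key = row[metric.grain]
        totals[key] = totals.get(key, 0) + row[metric.atomic_metric]
    return totals

rows = [
    {"date": date(2024, 1, 31), "seller_id": "s1", "device": "wireless", "pay_amt": 10},
    {"date": date(2024, 1, 15), "seller_id": "s1", "device": "pc",       "pay_amt": 5},
    {"date": date(2024, 1, 10), "seller_id": "s2", "device": "wireless", "pay_amt": 7},
    {"date": date(2023, 12, 1), "seller_id": "s1", "device": "wireless", "pay_amt": 100},
]

# "Wireless payment amount per seller over the last 30 days"
wireless_pay_30d = DerivedMetric("pay_amt", 30, "seller_id", (("device", "wireless"),))
result = compute(wireless_pay_30d, rows, as_of=date(2024, 1, 31))
```

Dropping the modifier tuple gives the unqualified metric over the same time cycle and grain, which is why the modifier words are optional in the formula.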

2.3 Model Design

2.3.1 Guiding Theory

Dimension design follows dimensional modeling, building consistent dimensions and facts on a bus architecture.

2.3.2 Model Layers

ODS: stores operational data with minimal transformation.

CDM (Common Data Model, the public dimensional model layer): stores detailed fact data, dimension tables and aggregated metrics. CDM is split into DWD (detail) and DWS (summary) layers, using dimensional techniques and dimension degeneration to reduce joins and improve usability.

ADS (application data layer): stores personalized product metrics derived from CDM and ODS.

Principles applied: high cohesion and low coupling, cost-performance balance, data that can be rolled back and reprocessed, consistency, and clear, understandable naming.

2.4 Model Implementation

Four stages for dimensional modeling:

High‑level model design – define scope, produce high‑level diagram.

Detailed model design – add attributes and measures, define sources.

Review, redesign and validation.

Produce detailed design documents for ETL development.

OneData implementation follows an iterative spiral process: after architecture, iterate model design and review per data domain, using the OneData tool for metric definition and model design.

2.5 Dimension Design

2.5.1 Basic Concepts

Facts are measures; dimensions describe the environment for analysis.

Dimension attributes are columns used for constraints, grouping and labeling.

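Each dimension has a primary key that uniquely identifies its rows. Fact tables reference this key, which is the basis for referential integrity between facts and dimensions.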

2.5.2 Basic Design Method

Select or create a dimension, ensure uniqueness.

Identify the primary dimension table (usually an ODS table).

Identify related dimension tables and choose attributes.

Choose dimension attributes (first from primary table, then from related tables).

Guidelines: generate rich attributes, provide meaningful textual descriptions, distinguish numeric attributes from facts.

2.5.3 Consistent and Cross‑Exploratory Dimensions

Shared dimension tables (e.g., product, seller, buyer) enable cross‑exploration.

Consistent roll-up: one dimension's attributes are a subset of another's, so the two dimensions agree on every attribute they share.

Cross attributes allow partial overlap (e.g., category attribute in product and seller dimensions).

2.6 Advanced Dimension Topics

2.6.1 Dimension Integration

Unify naming, field types, code values, and consolidate tables with high cohesion and low coupling.

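Vertical integration merges source tables that cover the same data set but store different information about it. Horizontal integration merges source tables that cover different data sets, which may overlap, and therefore requires deduplication and handling of key conflicts.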

2.6.2 Horizontal Splitting

Design based on extensibility, performance, and usability.

Choose split strategy according to attribute variance and business correlation.

2.6.3 Vertical Splitting

Separate stable, early‑produced, high‑traffic attributes (master dimension) from fast‑changing, late‑produced, low‑traffic attributes (extension dimension).

2.6.4 Historical Archiving

Three strategies: replicate the front-end system's archiving logic in the warehouse, archive based on the front-end database's binlog, or define a custom archiving policy in the warehouse.

2.6.5 Slowly Changing Dimensions

Type 1, overwrite: keep only the latest value.

Type 2, insert new rows: preserve history.

Type 3, add new columns: keep both the old and the new value.
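The three strategies can be compared side by side in a minimal sketch. The row layout (key, start_date, end_date, a prev_ column prefix) is an assumption made for illustration, and surrogate keys are omitted for brevity.

```python
from datetime import date

OPEN_END = date(9999, 12, 31)  # sentinel meaning "still current"

def overwrite(row, attr, new_value):
    """Overwrite: the old value is lost."""
    return {**row, attr: new_value}

def insert_new_row(history, key, attr, new_value, change_date):
    """Insert a new row: close the current row and append a new current row.
    end_date is exclusive in this sketch."""
    result = []
    for row in history:
        if row["key"] == key and row["end_date"] == OPEN_END:
            result.append({**row, "end_date": change_date})
            result.append({**row, attr: new_value,
                           "start_date": change_date, "end_date": OPEN_END})
        else:
            result.append(row)
    return result

def add_new_column(row, attr, new_value):
    """Add a new column: keep the previous value next to the current one."""
    return {**row, f"prev_{attr}": row[attr], attr: new_value}

current = {"key": 1, "city": "Hangzhou"}
overwritten = overwrite(current, "city", "Beijing")
with_previous = add_new_column(current, "city", "Beijing")

history = [{"key": 1, "city": "Hangzhou",
            "start_date": date(2024, 1, 1), "end_date": OPEN_END}]
versioned = insert_new_row(history, 1, "city", "Beijing", date(2024, 6, 1))
```

Only the row-insertion strategy keeps full history, at the cost of more rows and a time-range condition when joining facts to the dimension.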

2.6.6 Snapshot Dimensions

Alibaba prefers daily full snapshots to surrogate keys or history link (zipper) tables, accepting the extra storage cost in exchange for simplicity.

2.6.7 Extreme Storage

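Extreme storage keeps history in link (zipper) tables, where each row carries a validity range instead of being repeated every day. A transparent layer hides these tables from downstream users, and building the link tables per month reduces overhead, but the approach has limitations.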
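The storage saving behind this approach can be sketched by compressing daily full snapshots into validity ranges. This is an illustrative sketch, not Alibaba's implementation; the function names and the ds date-string convention are assumptions.

```python
def compress_snapshots(snapshots):
    """snapshots: {ds: {key: value}} with ISO date strings as ds.
    Returns sorted (key, value, start_ds, end_ds) rows, end_ds inclusive."""
    ranges = []
    open_rows = {}  # key -> [value, start_ds, last_seen_ds]
    for ds in sorted(snapshots):
        snap = snapshots[ds]
        for key, value in snap.items():
            cur = open_rows.get(key)
            if cur is not None and cur[0] == value:
                cur[2] = ds  # unchanged: extend the open range
            else:
                if cur is not None:
                    ranges.append((key, cur[0], cur[1], cur[2]))
                open_rows[key] = [value, ds, ds]
        for key in list(open_rows):
            if key not in snap:  # key disappeared: close its range
                value, start, end = open_rows.pop(key)
                ranges.append((key, value, start, end))
    for key, (value, start, end) in open_rows.items():
        ranges.append((key, value, start, end))
    return sorted(ranges, key=lambda r: (r[0], r[2]))

def value_as_of(ranges, key, ds):
    """Look up the value that was valid for key on day ds."""
    for k, value, start, end in ranges:
        if k == key and start <= ds <= end:
            return value
    return None

snapshots = {
    "2024-01-01": {"p1": "red",  "p2": "S"},
    "2024-01-02": {"p1": "red",  "p2": "M"},
    "2024-01-03": {"p1": "blue", "p2": "M"},
}
ranges = compress_snapshots(snapshots)
```

Six snapshot rows become four range rows here. The saving grows when attributes rarely change, while lookups now need a range condition, which is the complexity a transparent layer is meant to hide.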

2.6.8 Micro‑Dimensions

Extract unstable, fast-changing attributes into a separate micro-dimension table. The technique is not widely used, because the attribute value combinations must be enumerated and because it adds ETL complexity.

2.6.9 Special Dimensions

Recursive hierarchies (balanced vs unbalanced) – flatten or bridge tables.

Behavior dimensions – derived from facts (e.g., last visit time, cumulative spend).

Multi‑value dimensions – handle many‑to‑many relationships via bridge tables or duplicated rows.

Miscellaneous (junk) dimensions – combine indicator and flag fields into a single dimension table.
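For recursive hierarchies, the two options named above (flattening and bridge tables) can be sketched on a small parent-pointer category tree. The category names, the fixed depth of 3 and the backfill rule for unbalanced branches are assumptions made for illustration.

```python
# Hypothetical category hierarchy stored as child -> parent (None marks a root).
parent = {
    "clothing": None,
    "womens": "clothing",
    "dresses": "womens",
    "books": None,
    "fiction": "books",
}

def path_from_root(node):
    """Walk the parent pointers and return the path root -> node."""
    path = []
    while node is not None:
        path.append(node)
        node = parent[node]
    return path[::-1]

def flatten(node, depth=3):
    """Flatten to fixed-level columns; short (unbalanced) paths are backfilled
    by repeating the lowest known level."""
    path = path_from_root(node)
    if len(path) > depth:
        raise ValueError("hierarchy is deeper than the flattened table allows")
    return path + [path[-1]] * (depth - len(path))

def bridge_rows():
    """Bridge table: one (ancestor, descendant, distance) row per pair,
    including each node paired with itself at distance 0."""
    rows = []
    for node in parent:
        for distance, ancestor in enumerate(reversed(path_from_root(node))):
            rows.append((ancestor, node, distance))
    return rows

flat_dresses = flatten("dresses")
flat_fiction = flatten("fiction")
bridge = bridge_rows()
```

Flattening keeps queries simple but fixes the maximum depth in the table layout. The bridge table handles any depth, but every roll-up needs an extra join.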

2.7 Fact Table Design

2.7.1 Fact Basics

Fact tables capture business processes via measures, reference dimensions, and define grain.

Facts can be additive, semi‑additive, or non‑additive.

Degenerate dimensions store dimension attributes directly in the fact table.

Three fact table types: transaction, periodic snapshot, and cumulative (accumulating) snapshot.

2.7.2 Design Principles

Include all relevant facts.

Exclude unrelated facts.

Decompose non‑additive facts into additive components.

Declare grain before choosing dimensions and facts.

Keep grain consistent within a fact table.

Maintain consistent units.

Handle NULLs: replace NULL measure values, usually with zero, so that filters and aggregations behave predictably.

Use degenerate dimensions to improve usability.
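The principle of decomposing non-additive facts can be illustrated with a conversion rate. The sketch below assumes hypothetical column names (paying_buyers, visitors) and shows why the ratio itself should not be stored and summed or averaged.

```python
import math

# Detail rows store the additive components, not the ratio.
shops = [
    {"shop": "A", "paying_buyers": 1,  "visitors": 10},
    {"shop": "B", "paying_buyers": 50, "visitors": 100},
]

# Wrong: averaging per-shop rates ignores how many visitors each shop had.
per_shop_rates = [s["paying_buyers"] / s["visitors"] for s in shops]
average_of_rates = sum(per_shop_rates) / len(per_shop_rates)

# Right: aggregate the additive components first, then divide once.
total_buyers = sum(s["paying_buyers"] for s in shops)
total_visitors = sum(s["visitors"] for s in shops)
overall_rate = total_buyers / total_visitors
```

With these sample rows the average of rates is 0.30, while the overall rate is 51/110, about 0.46. Storing numerator and denominator lets any downstream aggregation recompute the rate correctly.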

2.7.3 Transaction Fact Tables

Single‑transaction tables: one business process per table (e.g., order creation).

Multi‑transaction tables: multiple processes in one table, either separate columns per process or a process label column.

Alibaba’s practice shows both approaches; single‑transaction tables are clearer, multi‑transaction tables save storage but increase complexity.
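The two multi-transaction layouts can be sketched from the same event stream. The event tuples and column names (order_amt, pay_amt, process) are hypothetical, chosen only to show the shape of each layout.

```python
# Hypothetical events: (order_id, business process, amount)
events = [
    ("o1", "order", 100.0),
    ("o1", "pay",   100.0),
    ("o2", "order",  40.0),
]

# Layout 1: one row per event, with a label column naming the business process.
labeled_rows = [
    {"order_id": order_id, "process": process, "amount": amount}
    for order_id, process, amount in events
]

# Layout 2: one row per order, with a separate measure column per process.
# Processes that have not happened yet are filled with zero.
by_order = {}
for order_id, process, amount in events:
    row = by_order.setdefault(
        order_id, {"order_id": order_id, "order_amt": 0.0, "pay_amt": 0.0}
    )
    row[f"{process}_amt"] += amount
column_rows = list(by_order.values())

# Both layouts answer the same question, with different query shapes.
paid_total_labeled = sum(r["amount"] for r in labeled_rows if r["process"] == "pay")
paid_total_columns = sum(r["pay_amt"] for r in column_rows)
```

The labeled layout needs a filter on the process column for every metric, while the per-column layout needs zero-filling and a shared grain across processes. This is part of the added complexity compared with single-transaction tables.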

2.7.4 Periodic Snapshot Fact Tables

Capture state at regular intervals (daily, monthly).

Useful for status metrics (balance, rating), which are typically semi-additive or non-additive: they cannot be summed across time periods.

2.7.5 Cumulative Snapshot Fact Tables

Track lifecycle events and time intervals between processes (order creation → payment → shipment).

Store multiple date fields; rows are updated as the lifecycle progresses.
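The update-in-place behavior can be sketched with one row per order that gains dates as milestones are reached. The milestone column names and the in-memory dictionary are assumptions made for illustration. A warehouse would typically rewrite partitions rather than update rows.

```python
from datetime import date

MILESTONES = ("create_date", "pay_date", "ship_date")
snapshot = {}  # order_id -> one row that is revisited as the order progresses

def apply_event(order_id, milestone, event_date):
    """Record a milestone date on the order's single snapshot row."""
    if milestone not in MILESTONES:
        raise ValueError(f"unknown milestone: {milestone}")
    row = snapshot.setdefault(
        order_id, {"order_id": order_id, **{m: None for m in MILESTONES}}
    )
    row[milestone] = event_date
    return row

def days_between(row, start, end):
    """Interval between two milestones, or None if either has not happened."""
    if row[start] is None or row[end] is None:
        return None
    return (row[end] - row[start]).days

apply_event("o1", "create_date", date(2024, 1, 1))
apply_event("o1", "pay_date", date(2024, 1, 2))
apply_event("o1", "ship_date", date(2024, 1, 5))
apply_event("o2", "create_date", date(2024, 1, 3))

create_to_ship_o1 = days_between(snapshot["o1"], "create_date", "ship_date")
create_to_pay_o2 = days_between(snapshot["o2"], "create_date", "pay_date")
```

Unreached milestones stay empty, so interval metrics are only defined for orders that have reached both ends of the interval.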

2.7.6 Aggregated Fact Tables (DWS)

Public summary layer for pre‑aggregated metrics (e.g., seller 1‑day sales, N‑day sales, yearly sales).

Principles: consistency with detail data, avoid mixing aggregation levels in one table, allow different aggregation grains.
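A minimal sketch of the seller 1-day and N-day sales example, computed from the same detail rows so that the summary stays consistent with the detail layer. The tuple layout and function name are hypothetical.

```python
from collections import defaultdict
from datetime import date, timedelta

# Hypothetical DWD detail rows: (seller_id, pay_date, pay_amount)
detail = [
    ("s1", date(2024, 1, 7), 10.0),
    ("s1", date(2024, 1, 5), 5.0),
    ("s2", date(2024, 1, 7), 8.0),
    ("s1", date(2023, 12, 31), 99.0),
]

def seller_sales_nd(rows, as_of, n_days):
    """Total sales per seller over the n_days ending on as_of (inclusive)."""
    start = as_of - timedelta(days=n_days - 1)
    totals = defaultdict(float)
    for seller_id, pay_date, amount in rows:
        if start <= pay_date <= as_of:
            totals[seller_id] += amount
    return dict(totals)

as_of = date(2024, 1, 7)
sales_1d = seller_sales_nd(detail, as_of, 1)
sales_7d = seller_sales_nd(detail, as_of, 7)

# Consistency check: the 7-day figure equals the sum of seven 1-day figures.
rebuilt_7d = defaultdict(float)
for offset in range(7):
    day = as_of - timedelta(days=offset)
    for seller_id, amount in seller_sales_nd(detail, day, 1).items():
        rebuilt_7d[seller_id] += amount
```

Both windows share the seller grain here. The principle of not mixing aggregation levels in one table still applies when deciding where to store them.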
