Understanding Data Warehouses: Definitions, Differences, Architecture, Modeling, and Best Practices
This article explains what a data warehouse is, contrasts it with traditional databases, outlines how to design and build a warehouse—including model selection, subject‑area definition, bus matrix, layering, and data quality—while also covering related concepts such as data middle platforms, data lakes, metadata, and modeling techniques.
What is a data warehouse? A data warehouse is a subject‑oriented, integrated, relatively stable (non‑volatile) collection of historical (time‑variant) data designed to support management decision‑making. It integrates data from heterogeneous sources, reorganizes it by subject, and generally treats it as read‑only after loading.
Difference between a data warehouse and a database: Operational databases are transaction‑oriented, frequently updated, and typically follow strict normalization, whereas data warehouses are analysis‑oriented, retain historical data, often use denormalized designs, and serve reporting and decision‑support needs.
How to build a data warehouse: The process is flexible and includes business, requirement, and data research; defining subject domains; constructing a bus matrix that maps facts to dimensions; designing layered architecture; implementing models; and establishing data governance.
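To make the bus matrix concrete, here is a minimal sketch in Python. The business processes and dimension names are hypothetical and chosen only for illustration; the helper reports which dimensions are shared across processes and are therefore candidates for conformed dimensions.

```python
# Hypothetical bus matrix: business processes (facts) mapped to the
# dimensions they use. All names here are illustrative assumptions.
BUS_MATRIX = {
    "orders": {"date", "customer", "product", "store"},
    "payments": {"date", "customer", "payment_method"},
    "shipments": {"date", "customer", "product", "warehouse"},
}


def conformed_dimensions(matrix):
    """Return dimensions used by at least two business processes."""
    counts = {}
    for dims in matrix.values():
        for dim in dims:
            counts[dim] = counts.get(dim, 0) + 1
    return sorted(d for d, n in counts.items() if n >= 2)


def render(matrix):
    """Render the matrix as a plain-text grid, X marking a used dimension."""
    dims = sorted(set().union(*matrix.values()))
    lines = ["process".ljust(12) + " ".join(d.ljust(15) for d in dims)]
    for process, used in matrix.items():
        cells = " ".join(("X" if d in used else "-").ljust(15) for d in dims)
        lines.append(process.ljust(12) + cells)
    return "\n".join(lines)


if __name__ == "__main__":
    print(render(BUS_MATRIX))
    print("Shared dimensions:", conformed_dimensions(BUS_MATRIX))
```

Dimensions that appear in several rows are the ones worth defining once and reusing, which is the main reason to draw the matrix before designing individual fact tables.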
Data middle platform (data‑mid‑platform): It unifies data collection, computation, storage, and processing, providing standardized data assets that reduce duplicate development, lower the cost of data silos, and enable fast, reusable data services for the business.
Key distinctions among data platform, data warehouse, data middle platform, and data lake: The data platform offers compute and storage; the warehouse builds on it to store subject‑oriented tables; the middle platform packages both as productized services; the lake stores raw structured and unstructured data for downstream use.
Related systems: Data asset center, data quality center, indicator system, data map, and other components support governance, monitoring, and metadata management.
Steps to construct a data middle platform: Assess the current state of the business, data, and IT; define the business, technical, application, and organizational architectures; build unified data layers (raw, warehouse, tag, application); put the data to use in applications; and continuously operate and iterate.
Critical factors for a data warehouse: Effective data integration and high data quality are essential; without consistent integration and reliable data, analysis and decision‑making are compromised.
Modeling layers: Conceptual model (CDM) captures user‑level entities and relationships; logical model (LDM) refines entities, attributes, keys, and relationships; physical model (PDM) translates the logical design into actual tables, columns, and indexes.
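Slowly Changing Dimensions (SCD) handling: Common approaches include overwriting the old value (Type 1, no history), adding a new row with effective dates or a current‑row flag (Type 2, full history), and adding a column that holds the previous value alongside the current one (Type 3, limited history).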
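The three approaches can be sketched over an in-memory list of dimension rows. This is an illustrative sketch, not a production loader: the column names (`customer_id`, `valid_from`, `valid_to`, `is_current`) and the far-future end date are assumptions.

```python
from datetime import date

# Assumed convention: a far-future date marks the row that is still current.
OPEN_END = date(9999, 12, 31)


def scd_overwrite(rows, key, attr, new_value):
    """Overwrite in place: the old value is lost, no history is kept."""
    for row in rows:
        if row["customer_id"] == key:
            row[attr] = new_value


def scd_new_row(rows, key, attr, new_value, change_date):
    """Close the current row and append a new version with effective dates."""
    for row in rows:
        if row["customer_id"] == key and row["is_current"]:
            if row[attr] == new_value:
                return  # nothing changed, keep the current version
            new_row = {
                **row,
                attr: new_value,
                "valid_from": change_date,
                "valid_to": OPEN_END,
                "is_current": True,
            }
            row["valid_to"] = change_date
            row["is_current"] = False
            rows.append(new_row)
            return


def scd_previous_column(rows, key, attr, new_value):
    """Keep only the immediately preceding value in a sibling column."""
    for row in rows:
        if row["customer_id"] == key:
            row[f"previous_{attr}"] = row[attr]
            row[attr] = new_value


if __name__ == "__main__":
    dim = [{"customer_id": 1, "city": "Lyon", "valid_from": date(2020, 1, 1),
            "valid_to": OPEN_END, "is_current": True}]
    scd_new_row(dim, 1, "city", "Paris", date(2024, 6, 1))
    for r in dim:
        print(r)
```

The new-row approach is the only one of the three that lets a fact be joined to the dimension value that was valid when the fact occurred, at the cost of a larger dimension table.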
Metadata categories: Business metadata (data meaning, subject definitions, standards), technical metadata (source details, ETL processes, data structures), and management metadata (processes, roles, responsibilities).
Determining subject domains: Subjects are high‑level abstractions of data used for analysis; they can be defined by business processes, stakeholder needs, functional areas, or departmental boundaries.
Controlling data quality: Implement validation rules at load time, compare samples against the source, and periodically review full loads to detect and correct issues early.
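A few of these checks can be sketched as plain Python functions over lists of row dictionaries. The function names and the row layout are assumptions made for illustration; a real warehouse would usually run equivalent checks in SQL or in a data quality tool.

```python
from collections import Counter


def null_violations(rows, column):
    """Return the indexes of rows where a required column is missing or None."""
    return [i for i, row in enumerate(rows) if row.get(column) is None]


def duplicate_keys(rows, key):
    """Return key values that occur more than once, in first-seen order."""
    counts = Counter(row[key] for row in rows)
    return [k for k, n in counts.items() if n > 1]


def counts_reconcile(source_rows, target_rows):
    """Full-load check: the target should hold as many rows as the source."""
    return len(source_rows) == len(target_rows)


def sample_mismatches(source_rows, target_rows, key, columns, sample_keys):
    """Sample comparison: return sampled keys whose values differ or are missing."""
    src = {row[key]: row for row in source_rows}
    tgt = {row[key]: row for row in target_rows}
    bad = []
    for k in sample_keys:
        s, t = src.get(k), tgt.get(k)
        if s is None or t is None or any(s[c] != t[c] for c in columns):
            bad.append(k)
    return bad


if __name__ == "__main__":
    source = [{"id": 1, "amount": 10}, {"id": 2, "amount": 20}, {"id": 3, "amount": 30}]
    target = [{"id": 1, "amount": 10}, {"id": 2, "amount": 25}, {"id": 2, "amount": 25}]
    print("duplicate keys:", duplicate_keys(target, "id"))
    print("counts match:", counts_reconcile(source, target))
    print("sample mismatches:", sample_mismatches(source, target, "id", ["amount"], [1, 2, 3]))
```

Note that the row counts match in this example even though the target is wrong, which is why a count check alone is not enough and is usually paired with key-uniqueness and sample comparisons.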
Modeling philosophies: Inmon advocates a top‑down, enterprise‑wide approach; Kimball promotes a bottom‑up, business‑driven method, each influencing model choice and implementation speed.
Why modeling is needed: Consistent models enable cross‑departmental reporting, improve performance, reduce redundancy, and support scalable analytics.
Data warehouse modeling methods: Dimensional models (star, snowflake, constellation), normalized (3NF) models, Data Vault (hub‑link‑satellite), and Anchor models each serve different scalability and flexibility needs.
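As an illustration of the dimensional approach, the following sketch builds a tiny star schema in an in-memory SQLite database using Python's standard `sqlite3` module. The table names, column names, and sample rows are hypothetical.

```python
import sqlite3


def build_star_schema():
    """Create one fact table surrounded by two dimension tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE dim_date (
            date_key INTEGER PRIMARY KEY, year INTEGER, month INTEGER);
        CREATE TABLE dim_product (
            product_key INTEGER PRIMARY KEY, name TEXT, category TEXT);
        CREATE TABLE fact_sales (
            date_key INTEGER REFERENCES dim_date(date_key),
            product_key INTEGER REFERENCES dim_product(product_key),
            quantity INTEGER,
            amount REAL);
        INSERT INTO dim_date VALUES (20240101, 2024, 1), (20240201, 2024, 2);
        INSERT INTO dim_product VALUES
            (1, 'Widget', 'Hardware'), (2, 'Manual', 'Books');
        INSERT INTO fact_sales VALUES
            (20240101, 1, 2, 20.0),
            (20240101, 2, 1, 5.0),
            (20240201, 1, 3, 30.0);
    """)
    return conn


def sales_by_category_and_month(conn):
    """Typical star-schema query: join the fact to its dimensions, then aggregate."""
    query = """
        SELECT p.category, d.year, d.month, SUM(f.amount) AS total
        FROM fact_sales AS f
        JOIN dim_date AS d ON d.date_key = f.date_key
        JOIN dim_product AS p ON p.product_key = f.product_key
        GROUP BY p.category, d.year, d.month
        ORDER BY p.category, d.year, d.month
    """
    return conn.execute(query).fetchall()


if __name__ == "__main__":
    for row in sales_by_category_and_month(build_star_schema()):
        print(row)
```

A snowflake variant would further normalize `dim_product` by moving `category` into its own table, trading an extra join for less redundancy.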
Why layered architecture: Layering clarifies data structure, eases lineage tracking, promotes reusable intermediate layers, simplifies complexity, and isolates raw data anomalies from downstream processes.
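The isolation benefit can be shown with a small sketch in which each layer is a function that reads only from the layer beneath it. The layer names (raw, detail, summary, application) and the sample records are assumptions for illustration, not a prescribed layering.

```python
# Raw layer: data as it arrived, untyped and possibly dirty (hypothetical sample).
RAW_ORDERS = [
    {"order_id": "1", "customer": " alice ", "amount": "10.50"},
    {"order_id": "2", "customer": "BOB", "amount": "n/a"},  # anomaly: bad amount
    {"order_id": "3", "customer": "alice", "amount": "4.50"},
]


def to_detail_layer(raw_rows):
    """Clean and type raw rows; rows that fail parsing are set aside."""
    clean, rejected = [], []
    for row in raw_rows:
        try:
            clean.append({
                "order_id": int(row["order_id"]),
                "customer": row["customer"].strip().lower(),
                "amount": float(row["amount"]),
            })
        except (KeyError, ValueError):
            rejected.append(row)
    return clean, rejected


def to_summary_layer(detail_rows):
    """Reusable intermediate layer: total amount per customer."""
    totals = {}
    for row in detail_rows:
        totals[row["customer"]] = totals.get(row["customer"], 0.0) + row["amount"]
    return totals


def top_customer(summary):
    """Application layer: answers one question using only the summary layer."""
    return max(summary, key=summary.get)


if __name__ == "__main__":
    detail, rejected = to_detail_layer(RAW_ORDERS)
    summary = to_summary_layer(detail)
    print("rejected raw rows:", rejected)
    print("summary:", summary)
    print("top customer:", top_customer(summary))
```

Because the malformed amount is caught when building the detail layer, the summary and application functions never see it, and any number of downstream reports can reuse the same summary.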
Architects' Tech Alliance
Sharing project experiences, insights into cutting-edge architectures, focusing on cloud computing, microservices, big data, hyper-convergence, storage, data protection, artificial intelligence, industry practices and solutions.
