Data Warehouse vs. Database: Core Differences and Building a Data Platform
This article explains what a data warehouse is and how it differs from a traditional database. It then outlines how to design and build one (model selection, subject-domain division, the bus matrix, layered architecture, and data governance), and finally introduces the data middle platform and how it differs from data lakes and big-data platforms.
1. What Is a Data Warehouse?
A data warehouse is a subject‑oriented, integrated, relatively stable collection of historical data designed to support management decision‑making. It integrates heterogeneous data sources, reorganizes them by subject, and stores data that is generally read‑only.
2. Data Warehouse vs. Database
Purpose & Design: Databases handle transactional processing with frequently updated data; data warehouses focus on analytical processing of integrated, historical data.
Usage: Databases store current transactional data (e.g., sales orders); data warehouses store historical data for reporting.
Design Paradigm: Databases are typically normalized (often to 3NF) to reduce redundancy and keep inserts and updates efficient; data warehouses often denormalize so that analytical queries need fewer joins.
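The normalization trade-off can be made concrete with a small sketch. It assumes an in-memory SQLite database and hypothetical table names (customers, orders, sales_wide) that do not come from the original article; it only illustrates that the same report can be answered from a normalized pair of tables with a join, or from a denormalized wide table without one.

```python
import sqlite3


def build_demo():
    """Create normalized tables plus a denormalized copy (all names are hypothetical)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        -- Normalized, transaction-style tables: each customer attribute is stored once.
        CREATE TABLE customers (
            customer_id INTEGER PRIMARY KEY,
            name TEXT,
            city TEXT
        );
        CREATE TABLE orders (
            order_id INTEGER PRIMARY KEY,
            customer_id INTEGER REFERENCES customers(customer_id),
            amount REAL,
            order_date TEXT
        );
        INSERT INTO customers VALUES (1, 'Ann', 'Berlin'), (2, 'Bo', 'Oslo');
        INSERT INTO orders VALUES
            (10, 1, 50.0, '2024-01-05'),
            (11, 1, 30.0, '2024-01-06'),
            (12, 2, 20.0, '2024-01-06');

        -- Denormalized, warehouse-style table: customer attributes are repeated per order.
        CREATE TABLE sales_wide AS
        SELECT o.order_id, o.order_date, o.amount,
               c.name AS customer_name, c.city AS customer_city
        FROM orders o
        JOIN customers c ON c.customer_id = o.customer_id;
    """)
    return conn


def revenue_by_city_normalized(conn):
    # The report needs a join because the city lives in the customers table.
    return conn.execute("""
        SELECT c.city, SUM(o.amount)
        FROM orders o
        JOIN customers c ON c.customer_id = o.customer_id
        GROUP BY c.city
        ORDER BY c.city
    """).fetchall()


def revenue_by_city_wide(conn):
    # The same report reads a single table, at the cost of redundant storage.
    return conn.execute("""
        SELECT customer_city, SUM(amount)
        FROM sales_wide
        GROUP BY customer_city
        ORDER BY customer_city
    """).fetchall()


if __name__ == "__main__":
    conn = build_demo()
    print(revenue_by_city_normalized(conn))
    print(revenue_by_city_wide(conn))
```

Both functions return the same totals. The difference is where the cost is paid: the normalized form keeps writes cheap and consistent, while the wide table moves the join work to load time.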
3. How to Build a Data Warehouse
Model Choice: Flexible; not limited to a single modeling method.
Data Orientation: Driven by real‑world business scenarios.
Design Principles: Flexibility, scalability, technical reliability, and cost‑effectiveness.
Research: Survey the business, the requirements, and the existing data.
Define subject areas: Derive the domains from the research findings.
Construct the bus matrix and dimensional models.
Design the layered architecture.
Implement the models.
Apply data governance.
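A bus matrix is essentially a grid of business processes against the dimensions they use. The sketch below is one possible way to represent it in Python; the process and dimension names are hypothetical examples, not taken from the article. It shows how dimensions shared by several processes (candidates for conformed dimensions) can be read off the matrix.

```python
# Hypothetical bus matrix: business process -> dimensions it is analyzed by.
BUS_MATRIX = {
    "orders": {"date", "customer", "product"},
    "shipments": {"date", "customer", "warehouse"},
    "returns": {"date", "product"},
}


def conformed_dimensions(matrix, min_processes=2):
    """Return dimensions used by at least `min_processes` business processes."""
    counts = {}
    for dims in matrix.values():
        for dim in dims:
            counts[dim] = counts.get(dim, 0) + 1
    return sorted(dim for dim, n in counts.items() if n >= min_processes)


def render_matrix(matrix):
    """Render the matrix as plain text, one row per business process."""
    dims = sorted(set().union(*matrix.values()))
    header = "process".ljust(12) + "".join(d.ljust(11) for d in dims)
    rows = [
        name.ljust(12) + "".join(("X" if d in used else "-").ljust(11) for d in dims)
        for name, used in matrix.items()
    ]
    return "\n".join([header] + rows)


if __name__ == "__main__":
    print(render_matrix(BUS_MATRIX))
    print("shared dimensions:", conformed_dimensions(BUS_MATRIX))
```

Dimensions that appear in several rows are the ones worth modeling once and reusing across fact tables.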
4. What Is a Data Middle Platform?
A data middle platform unifies data collection, computation, storage, and processing, standardizes data definitions, and provides reusable data services (e.g., APIs) that directly support business operations, reducing duplicate development and siloed solutions.
5. Key Distinctions Among Data Platform, Data Warehouse, Data Middle Platform, and Data Lake
Data Platform: Provides compute and storage capabilities.
Data Warehouse: Uses the platform's capabilities to store subject-oriented tables organized according to a modeling methodology.
Data Middle Platform: Packages platform and warehouse functions into a productized, integrated service (often exposed via APIs).
Data Lake: Large repository for raw structured and unstructured data, serving as a source for warehouses.
Overall, the middle platform is business‑centric, offers stronger data reuse, and delivers faster services by building on top of the warehouse and platform.
6. Related Big‑Data Systems
Data Warehouse Design Center: Subject-driven, layered design using dimensional modeling.
Data Asset Center: Manages data assets, lineage, and access frequency (data heat).
Data Quality Center: Monitors and validates data to catch issues early.
Metric System: Handles metric definitions, calculations, and governance.
Data Map: Provides metadata indexing, dictionary, lineage, and feature queries.
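Lineage, as tracked by the asset center and the data map, can be pictured as a directed graph of tables. The following is a minimal sketch under that assumption; the table names are hypothetical, and a real system would typically derive the graph from job definitions or SQL parsing rather than a hand-written dictionary.

```python
# Hypothetical lineage: table -> tables it is built from (its direct upstreams).
LINEAGE = {
    "report_city": ["summary_city_daily"],
    "summary_city_daily": ["detail_orders"],
    "detail_orders": ["raw_orders", "raw_customers"],
}


def upstream(table, lineage):
    """All direct and indirect sources of `table`."""
    seen = set()
    stack = list(lineage.get(table, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(lineage.get(current, []))
    return sorted(seen)


def downstream(table, lineage):
    """All tables that would be affected if `table` changed (impact analysis)."""
    # Invert the graph: source table -> tables built from it.
    children = {}
    for target, sources in lineage.items():
        for source in sources:
            children.setdefault(source, []).append(target)
    return upstream(table, children)


if __name__ == "__main__":
    print(upstream("report_city", LINEAGE))
    print(downstream("raw_orders", LINEAGE))
```

The upstream view answers "where does this number come from?", and the downstream view answers "what breaks if this table is wrong?".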
7. Building a Data Middle Platform
Assess current state: business, data, IT, organization.
Define architecture: business, technical, application, organizational.
Construct assets: unified warehouse layer, tag layer, application layer.
Utilize data: output and apply data services.
Operate continuously: iterate and improve.
Successful implementation requires top‑down leadership and cross‑functional execution.
8. Core Priorities of a Data Warehouse
Data Integration: Consolidate heterogeneous sources into a consistent view for analysis.
Data Quality: Ensure accuracy and reliability; poor quality data undermines trust.
9. Conceptual, Logical, and Physical Models
Conceptual Model (CDM): High‑level business view, defines entities and relationships without attributes.
Logical Model (LDM): Detailed blueprint based on business rules, includes attributes, primary/foreign keys, and normalization.
Physical Model (PDM): Implements the logical model in a specific technology, defining tables, columns, indexes, and possible denormalization.
10. Slowly Changing Dimensions (SCD)
Type 1 (overwrite): Replace the old value in place; no history is kept.
Type 2 (add new row, vertical expansion): Insert a new row with its own surrogate key and effective/expiry timestamps, preserving full history.
Type 3 (add columns, horizontal expansion): Store the previous value alongside the current one, keeping only limited history.
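The three strategies can be sketched on an in-memory list of rows. This is a simplified illustration, not a production pattern: the `CustomerRow` fields, the `OPEN_END` sentinel date, and the function names are all assumptions introduced for the example.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional

# Assumed sentinel meaning "this row is the current version".
OPEN_END = date(9999, 12, 31)


@dataclass
class CustomerRow:
    surrogate_key: int
    customer_id: str                      # natural (business) key
    city: str
    valid_from: date
    valid_to: date = OPEN_END
    previous_city: Optional[str] = None   # only used by the add-column strategy


def scd_overwrite(rows: List[CustomerRow], customer_id: str, new_city: str) -> None:
    """Overwrite: replace the value in place, losing the old one."""
    for row in rows:
        if row.customer_id == customer_id:
            row.city = new_city


def scd_add_row(rows: List[CustomerRow], customer_id: str,
                new_city: str, change_date: date) -> None:
    """Add new row: expire the current row and append a new version."""
    current = next(r for r in rows
                   if r.customer_id == customer_id and r.valid_to == OPEN_END)
    current.valid_to = change_date
    next_key = max(r.surrogate_key for r in rows) + 1
    rows.append(CustomerRow(next_key, customer_id, new_city, change_date))


def scd_add_column(rows: List[CustomerRow], customer_id: str, new_city: str) -> None:
    """Add column: keep exactly one previous value next to the current one."""
    for row in rows:
        if row.customer_id == customer_id:
            row.previous_city = row.city
            row.city = new_city


if __name__ == "__main__":
    rows = [CustomerRow(1, "C1", "Berlin", date(2023, 1, 1))]
    scd_add_row(rows, "C1", "Oslo", date(2024, 6, 1))
    for row in rows:
        print(row)
```

With the add-row strategy a fact table can join to the surrogate key that was current when the fact occurred, which is why it is the usual choice when history matters.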
11. Understanding Metadata
Business Metadata: Describes data meaning, subject definitions, business logic, standard metrics, and dimensions.
Technical Metadata: Includes source details (IP, port, type), ETL processes, data cleaning rules, and processing logic.
Management Metadata: Covers governance processes, organizational roles, and responsibilities.
12. Determining Subject Domains
Subject domains group related data for analysis and are defined based on business processes, stakeholder needs, functional areas, or departmental boundaries.
13. Controlling Data Quality
Validation mechanisms (e.g., daily row-count checks between source and target).
Content comparison on sampled records.
Monthly audits that compare the full data set.
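The checks above might look roughly like this in code. The sketch assumes source and target rows are already loaded into memory as tuples keyed by a primary key; the function names and the fingerprinting scheme are illustrative choices, not something the article prescribes.

```python
import hashlib
import random


def row_count_check(source_rows, target_rows, tolerance=0):
    """Pass if the row counts differ by at most `tolerance`."""
    return abs(len(source_rows) - len(target_rows)) <= tolerance


def row_fingerprint(row):
    """Hash a row's values so whole rows can be compared cheaply.

    Values are joined with '|', so this simple scheme assumes the values
    themselves do not contain that character.
    """
    joined = "|".join(str(value) for value in row)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


def sample_mismatches(source_by_key, target_by_key, sample_size, seed=0):
    """Compare a random sample of source rows against the target.

    Returns the sampled keys that are missing from the target or whose
    content differs. Passing sample_size >= len(source_by_key) turns the
    sample into a full comparison, as in a periodic audit.
    """
    rng = random.Random(seed)
    keys = sorted(source_by_key)
    sampled = rng.sample(keys, min(sample_size, len(keys)))
    return sorted(
        key for key in sampled
        if key not in target_by_key
        or row_fingerprint(source_by_key[key]) != row_fingerprint(target_by_key[key])
    )


if __name__ == "__main__":
    source = {1: (1, "a"), 2: (2, "b"), 3: (3, "c")}
    target = {1: (1, "a"), 2: (2, "B")}
    print(row_count_check(list(source.values()), list(target.values())))
    print(sample_mismatches(source, target, sample_size=3))
```

Row counts catch dropped or duplicated loads cheaply; the sampled comparison catches rows that arrived but with altered content.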
14. Modeling Approaches: Top‑Down vs. Bottom‑Up
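Bill Inmon advocates a top-down, data-driven approach: build an integrated, enterprise-wide warehouse first, then derive data marts from it. Ralph Kimball promotes a bottom-up, business-driven approach: build dimensional models for specific analytical needs and tie them together through shared dimensions.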
15. Why Model a Data Warehouse?
Proper modeling ensures consistent, performant, cost‑effective data structures, facilitates cross‑department reporting, reduces redundancy, and improves user efficiency.
16. Modeling Methods
Dimensional Models: Star, Snowflake, Constellation.
Normalization (3NF) Model: High‑level ER design, suitable for upstream data.
Data Vault: Hub‑Link‑Satellite architecture for integration.
Anchor Model: 6NF, highly extensible K‑V structure (rarely used).
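Of these methods, the star schema is the simplest to show in a few lines. The sketch below assumes an in-memory SQLite database with one fact table and two dimension tables; every table and column name is a hypothetical example rather than a design from the article.

```python
import sqlite3


def build_star_schema():
    """Create a tiny star schema: one fact table surrounded by two dimensions."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE dim_date (
            date_key INTEGER PRIMARY KEY,   -- e.g. 20240105
            full_date TEXT,
            year INTEGER,
            month INTEGER
        );
        CREATE TABLE dim_product (
            product_key INTEGER PRIMARY KEY,
            product_name TEXT,
            category TEXT
        );
        -- The fact table holds measures plus foreign keys to each dimension.
        CREATE TABLE fact_sales (
            date_key INTEGER REFERENCES dim_date(date_key),
            product_key INTEGER REFERENCES dim_product(product_key),
            quantity INTEGER,
            amount REAL
        );
        INSERT INTO dim_date VALUES
            (20240105, '2024-01-05', 2024, 1),
            (20240210, '2024-02-10', 2024, 2);
        INSERT INTO dim_product VALUES
            (1, 'Laptop', 'Hardware'),
            (2, 'Mouse', 'Hardware'),
            (3, 'Antivirus', 'Software');
        INSERT INTO fact_sales VALUES
            (20240105, 1, 1, 900.0),
            (20240105, 2, 2, 40.0),
            (20240210, 3, 1, 60.0),
            (20240210, 2, 1, 20.0);
    """)
    return conn


def revenue_by_category(conn):
    # Slice the facts by a product attribute: one join to one dimension.
    return conn.execute("""
        SELECT p.category, SUM(f.amount)
        FROM fact_sales f
        JOIN dim_product p ON p.product_key = f.product_key
        GROUP BY p.category
        ORDER BY p.category
    """).fetchall()


def revenue_by_month(conn):
    # Slice the same facts by a date attribute instead.
    return conn.execute("""
        SELECT d.month, SUM(f.amount)
        FROM fact_sales f
        JOIN dim_date d ON d.date_key = f.date_key
        GROUP BY d.month
        ORDER BY d.month
    """).fetchall()


if __name__ == "__main__":
    conn = build_star_schema()
    print(revenue_by_category(conn))
    print(revenue_by_month(conn))
```

A snowflake schema would further normalize the dimensions (for example, moving category into its own table), and a constellation would add more fact tables sharing these dimensions.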
17. Why Layer a Data Warehouse?
Clarifies data structure and lineage.
Facilitates reuse of intermediate data.
Reduces duplicate computation.
Simplifies complex problems.
Isolates raw data anomalies from downstream tasks.
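The benefits listed above can be seen in a toy three-layer pipeline. This is a sketch under assumptions: the layer names (raw, detail, summary), the record fields, and the function names are hypothetical and are not the article's own layer definitions.

```python
from collections import defaultdict

# Raw layer: records exactly as received, including a malformed one.
RAW_EVENTS = [
    {"order_id": "1", "city": " Berlin ", "amount": "50.0"},
    {"order_id": "2", "city": "Oslo", "amount": "not-a-number"},  # anomaly
    {"order_id": "3", "city": "berlin", "amount": "30.0"},
]


def build_detail_layer(raw_events):
    """Clean and type the raw records; quarantine anything that fails."""
    clean, rejected = [], []
    for record in raw_events:
        try:
            clean.append({
                "order_id": int(record["order_id"]),
                "city": record["city"].strip().title(),
                "amount": float(record["amount"]),
            })
        except (KeyError, ValueError):
            # The anomaly stops here instead of breaking downstream jobs.
            rejected.append(record)
    return clean, rejected


def build_summary_layer(detail_rows):
    """Aggregate once; every downstream consumer reuses this result."""
    totals = defaultdict(float)
    for row in detail_rows:
        totals[row["city"]] += row["amount"]
    return dict(totals)


# Application layer: two consumers that share the same summary table.
def top_city(summary):
    return max(summary, key=summary.get)


def total_revenue(summary):
    return sum(summary.values())


if __name__ == "__main__":
    detail, rejected = build_detail_layer(RAW_EVENTS)
    summary = build_summary_layer(detail)
    print(summary, top_city(summary), total_revenue(summary))
    print("quarantined:", rejected)
```

The malformed record is isolated in the detail layer, the aggregation is computed once in the summary layer, and both application functions reuse it rather than re-reading the raw data.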
IT Architects Alliance
