
Why Data Warehouses Matter: From Basics to the Hadoop Ecosystem

This article explains the purpose of data as a strategic asset, compares traditional databases with data warehouses, outlines key characteristics and related concepts of data warehouses, and introduces the Hadoop ecosystem components that support large‑scale data storage and analysis.

Yanxuan Tech Team

0. Preface

Data is a valuable asset for every organization, serving two main purposes: preserving operational records and enabling analytical decision‑making. For example, the quality assurance team uses quality-related data to objectively reflect development and testing workload, efficiency, and quality; to identify weak spots; and to benchmark performance across time and teams.

1. Database & Data Warehouse

Both databases and data warehouses organize and manage data through a database system based on a data model, but they differ in purpose and design. Traditional databases (e.g., MySQL, Oracle) store transactional data and support online transaction processing (OLTP), focusing on response time, security, integrity, and concurrency.
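To make the OLTP side concrete, here is a minimal transactional sketch using Python's built-in sqlite3 module; the table and values are invented for illustration:

```python
import sqlite3

# A toy OLTP-style transaction: move stock between two warehouses.
# Either both updates commit together, or neither does (integrity).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stock (warehouse TEXT PRIMARY KEY, qty INTEGER)")
conn.executemany("INSERT INTO stock VALUES (?, ?)", [("A", 100), ("B", 50)])

try:
    with conn:  # opens a transaction; commits on success, rolls back on error
        conn.execute("UPDATE stock SET qty = qty - 10 WHERE warehouse = 'A'")
        conn.execute("UPDATE stock SET qty = qty + 10 WHERE warehouse = 'B'")
except sqlite3.Error:
    pass  # rollback has already restored a consistent state

print(conn.execute("SELECT * FROM stock ORDER BY warehouse").fetchall())
# [('A', 90), ('B', 60)]
```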

A data warehouse is an integrated environment for analytical processing (OLAP). It does not generate or consume data; instead, it ingests data from external sources, often from multiple heterogeneous sources, and stores it in a format optimized for large‑scale queries and reporting.
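By contrast, a warehouse load is typically a scheduled batch job that pulls from several heterogeneous sources and writes one unified, query-friendly copy. A minimal ETL sketch, assuming two hypothetical source files (orders.csv and orders.json) and an invented target layout:

```python
import csv
import json
from datetime import date

def extract():
    """Pull raw records from two hypothetical heterogeneous sources."""
    with open("orders.csv", newline="") as f:   # source 1: CSV export
        yield from csv.DictReader(f)
    with open("orders.json") as f:              # source 2: JSON dump
        yield from json.load(f)

def transform(row):
    """Map both source formats onto one unified schema."""
    return {
        "order_id": str(row.get("id") or row.get("order_id")),
        "amount": float(row.get("amount", 0)),
        "load_date": date.today().isoformat(),  # warehouse-side load timestamp
    }

def load(rows):
    """Append the batch to the warehouse table (here: a CSV stand-in)."""
    with open("dw_orders.csv", "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["order_id", "amount", "load_date"])
        writer.writerows(rows)

load(transform(r) for r in extract())
```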

Key differences:

Typical Workload : Data warehouse – analysis, reporting, and big data processing; Database – transaction processing.

Data Source : Data warehouse – aggregated from many sources; Database – captured from a single source.

Data Capture : Data warehouse – batch writes per scheduled ETL; Database – continuous writes optimized for high‑throughput transactions.

Normalization : Data warehouse – denormalized, loosely structured schemas; Database – highly normalized, static schemas.

Storage : Data warehouse – column‑oriented for fast analytical queries; Database – row‑oriented for high‑throughput writes (illustrated in the sketch after this list).

Access : Data warehouse – large scans optimized to minimize I/O and maximize throughput; Database – high volumes of small read and write operations.
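To see why the two storage layouts favor different workloads, consider the same toy table laid out both ways; an analytical aggregate over one column touches far less data in the columnar layout. The data below is invented for illustration:

```python
# Row-oriented vs. column-oriented layout of the same table.
rows = [
    {"order_id": 1, "user": "alice", "amount": 30.0},
    {"order_id": 2, "user": "bob",   "amount": 45.5},
    {"order_id": 3, "user": "carol", "amount": 12.0},
]

# Row store: an aggregate must touch every field of every record.
total_row_store = sum(r["amount"] for r in rows)

# Column store: each column lives in its own contiguous array, so an
# aggregate over one column reads only that column's data.
columns = {
    "order_id": [1, 2, 3],
    "user": ["alice", "bob", "carol"],
    "amount": [30.0, 45.5, 12.0],
}
total_column_store = sum(columns["amount"])

assert total_row_store == total_column_store == 87.5
```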

Thus, data warehouses complement rather than replace traditional databases.

2. Characteristics of Data Warehouses

Subject‑oriented : Organized around business subjects to facilitate query and analysis.

Integrated : Data from diverse sources is cleansed, transformed, and standardized (ETL) to provide a unified view; a small example follows this list.

Relatively Stable : Data is primarily read‑only for analysis; updates occur in batch, preserving historical records.

Historical : Records include time stamps, enabling trend analysis over periods.
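As a concrete illustration of the "Integrated" property, the sketch below maps two sources that encode the same facts differently onto one warehouse convention; the code tables and date formats are assumptions, not from the original article:

```python
from datetime import datetime

# Hypothetical per-source encodings of the same gender field.
GENDER_CODES = {"m": "male", "f": "female", "1": "male", "0": "female"}

def standardize(record, date_format):
    """Cleanse one source record into the warehouse's unified conventions."""
    return {
        "user_id": str(record["user_id"]).strip(),
        "gender": GENDER_CODES[str(record["gender"]).lower()],
        "signup_date": datetime.strptime(record["signup"], date_format).strftime("%Y-%m-%d"),
    }

# Source A uses "m"/"f" and US-style dates; source B uses 1/0 and ISO dates.
a = standardize({"user_id": " 42", "gender": "M", "signup": "03/15/2024"}, "%m/%d/%Y")
b = standardize({"user_id": "43", "gender": 1, "signup": "2024-03-15"}, "%Y-%m-%d")
assert a["gender"] == b["gender"] == "male"
assert a["signup_date"] == b["signup_date"] == "2024-03-15"
```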

3. Related Concepts

Data Domain : An abstract collection of business dimensions that remains stable over time.

Business Process : Indivisible events such as order placement, payment, or bug submission.

Time Period : Defines the temporal scope for statistics (e.g., last 30 days).

Modifier : Qualifiers that further specify a metric (e.g., PC vs. wireless terminal).

Modifier Type : Categorization of modifiers (e.g., terminal type).

Subject : The specific aspect to be analyzed, consisting of dimensions and measures.

Dimension : Attributes describing entities (e.g., geographic, temporal).

Dimension Attribute : Individual fields within a dimension (e.g., country name).

Atomic Metric : A measure defined on a single business process (e.g., payment amount).

Derived Metric : An atomic metric combined with optional modifiers and a time period (see the sketch after this list).

Granularity : The level of detail of the data.

Fact Table : Stores records of business events, together with their measures, at a defined granularity.

Dimension Table : Stores descriptive attributes for “who, what, where, when” aspects of events.
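Several of these concepts compose naturally: a fact table joined to dimension tables forms a star schema, and a derived metric is an atomic metric narrowed by a modifier and a time period. A minimal sketch with invented tables and values:

```python
from datetime import date, timedelta

# Dimension table: descriptive attributes ("who/what/where/when").
dim_terminal = {1: "PC", 2: "wireless"}

# Fact table: one row per payment event, at order granularity.
fact_payments = [
    {"order_id": 101, "terminal_id": 1, "pay_date": date(2024, 3, 10), "amount": 30.0},
    {"order_id": 102, "terminal_id": 2, "pay_date": date(2024, 3, 11), "amount": 45.5},
    {"order_id": 103, "terminal_id": 1, "pay_date": date(2023, 12, 1), "amount": 12.0},
]

# Derived metric = atomic metric (payment amount) + modifier (PC terminal)
# + time period (last 30 days, relative to an assumed "today").
today = date(2024, 3, 20)
pc_amount_30d = sum(
    f["amount"]
    for f in fact_payments
    if dim_terminal[f["terminal_id"]] == "PC"          # modifier
    and today - f["pay_date"] <= timedelta(days=30)    # time period
)
print(pc_amount_30d)  # 30.0
```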

4. Hadoop System

[Figure: Hadoop ecosystem diagram]

HDFS : Distributed file system providing fault‑tolerant, high‑throughput storage for large data sets.

MapReduce : Distributed computation model that processes data in parallel via map and reduce phases.

HBase : Column‑oriented NoSQL database (Bigtable clone) offering scalable, real‑time read/write access.

Zookeeper : Coordination service for naming, configuration, and synchronization in distributed environments.

Sqoop : Tool for bulk data transfer between relational databases and Hadoop.

Pig : High‑level data flow language (Pig Latin) that compiles to MapReduce jobs.

Mahout : Library of scalable machine‑learning algorithms.

Flume : Distributed log collection system for ingesting large volumes of streaming data.

Hive : Data warehouse infrastructure that provides an SQL‑like query language (HQL) translating to MapReduce.

YARN : Resource management layer enabling multiple processing frameworks on Hadoop.

Tez : DAG‑based execution engine that breaks MapReduce into finer‑grained stages.

Spark : In‑memory parallel processing engine, reportedly up to 100× faster than MapReduce for in‑memory workloads (a usage sketch follows this list).

Kafka : Distributed messaging system for high‑throughput real‑time data streams.

Ambari : Web‑based tool for provisioning, managing, and monitoring Hadoop clusters.
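As a usage sketch of the ecosystem's query layer, the PySpark snippet below runs a warehouse-style aggregation; the file names and schema are assumptions, and a local Spark installation is presumed:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# A minimal Spark job in the warehouse style: read, aggregate, write.
spark = SparkSession.builder.appName("dw-sketch").getOrCreate()

# Hypothetical fact table exported as CSV; schema inferred for brevity.
orders = spark.read.csv("fact_orders.csv", header=True, inferSchema=True)

# Daily order count and revenue: the kind of scan-heavy aggregate
# that warehouse-oriented engines are built for.
daily = (
    orders.groupBy("order_date")
          .agg(F.count("*").alias("orders"), F.sum("amount").alias("revenue"))
          .orderBy("order_date")
)

daily.write.mode("overwrite").parquet("daily_summary.parquet")
spark.stop()
```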

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

Tags: Analytics, ETL, Hadoop
Written by Yanxuan Tech Team

NetEase Yanxuan Tech Team shares e-commerce tech insights and quality finds for mindful living. This is the public portal for NetEase Yanxuan's technology and product teams, featuring weekly tech articles, team activities, and job postings.