Mastering Data Projects: From Collection to Modeling in the Big Data Era
This article walks through the four essential stages of building a data project—data collection, modeling, analysis, and application—explaining key principles, common models such as 3NF, star/snowflake, cube, and wide tables, and comparing offline versus real‑time pipelines.
Overview
In the era of big data, rapid advances in computing and storage enable data‑driven business growth. Successful data projects require close collaboration between technical and business teams, with engineers deepening business understanding and business users learning how to leverage data.
The article outlines the four key stages of building a data project: data collection, data modeling, data analysis, and data application.
Data Collection
Accurate, complete, and timely data sources are the foundation of any data‑driven initiative. Data collection typically involves three categories: front‑end logs (user actions), back‑end logs (service events), and business data (database tables).
Effective log collection follows the principles of “completeness”, “granularity”, and “timeliness”. Completeness means covering all user types, platforms, and data sources; granularity requires capturing detailed event information (who, when, where, how, what); timeliness ensures data is fresh enough for real‑time decision making.
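To make the granularity principle concrete, here is a small sketch of what a single front‑end event might carry. The field names and the helper function are purely illustrative assumptions, not a real tracking schema.

```python
# Hypothetical front-end event payload covering the "who / when / where / how / what" dimensions.
import json
import time
import uuid

def build_event(user_id: str, event_name: str, page: str, channel: str, properties: dict) -> dict:
    """Assemble one tracking event with the five granularity dimensions."""
    return {
        "event_id": str(uuid.uuid4()),          # unique id for downstream de-duplication
        "user_id": user_id,                     # who: acting user or anonymous device id
        "event_time": int(time.time() * 1000),  # when: client timestamp in milliseconds
        "page": page,                           # where: page or screen of the action
        "channel": channel,                     # how: platform / entry channel (app, web, mini-program)
        "event_name": event_name,               # what: the action itself
        "properties": properties,               # what: action-specific attributes
    }

if __name__ == "__main__":
    event = build_event("u_1001", "add_to_cart", "/product/42", "app", {"sku": "42", "price": 19.9})
    print(json.dumps(event, ensure_ascii=False))
```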
Various collection methods—full‑stack, visual, or code‑based instrumentation—are chosen based on the product's stage and requirements. The internal platform at Weimeng provides event‑model‑driven tracking‑point registration, composite events, instrumentation testing, and data‑quality monitoring.
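A minimal sketch of what code‑based instrumentation against a registered event model can look like follows; the registry contents and the transport are hypothetical stand‑ins, not the actual platform API.

```python
# Events must be registered (event-model-driven) before they can be reported.
REGISTERED_EVENTS = {"page_view", "add_to_cart", "checkout"}  # assumed registry contents

def track(event_name: str, payload: dict) -> None:
    """Reject unregistered events so every reported event matches the agreed model."""
    if event_name not in REGISTERED_EVENTS:
        raise ValueError(f"event '{event_name}' is not registered; register it before instrumenting")
    # A real client would batch and send asynchronously; printing keeps the sketch runnable.
    print(f"send -> {event_name}: {payload}")

track("add_to_cart", {"user_id": "u_1001", "sku": "42"})
```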
Data Modeling
Data models are built for specific analytical needs and are not one‑size‑fits‑all. Common models in data warehouses include 3NF, dimensional (star and snowflake), cube, and wide‑table models.
3NF Model
First Normal Form requires atomic column values; Second Normal Form additionally requires every non‑key attribute to depend on the whole primary key (no partial dependencies); Third Normal Form further eliminates transitive dependencies, so non‑key attributes may not depend on other non‑key attributes. While 3NF suits OLTP systems, the many joins it forces can hinder analytical queries.
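The decomposition below is a toy illustration of the 3NF idea with made‑up order data: in a single table, customer_city would depend on customer_id rather than on the order key (a transitive dependency), so customer attributes are moved into their own table. SQLite is used only to keep the sketch self‑contained.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (                 -- customer attributes stored once, keyed by customer_id
    customer_id   INTEGER PRIMARY KEY,
    customer_name TEXT NOT NULL,
    customer_city TEXT NOT NULL
);
CREATE TABLE orders (                    -- orders keep only a foreign key, no duplicated name/city
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customers(customer_id),
    order_date  TEXT NOT NULL,
    amount      REAL NOT NULL
);
""")
conn.execute("INSERT INTO customers VALUES (1, 'Alice', 'Shanghai')")
conn.execute("INSERT INTO orders VALUES (100, 1, '2024-05-01', 19.9)")

# Analytical questions now require a join, which is the trade-off 3NF makes against redundancy.
print(conn.execute("""
    SELECT o.order_id, c.customer_city, o.amount
    FROM orders o JOIN customers c ON o.customer_id = c.customer_id
""").fetchone())
```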
Dimensional Model
Two main variants are the star schema and the snowflake schema. The star schema centers a fact table (measurable business events such as orders, holding measures and foreign keys) on denormalized dimension tables (descriptive context such as user, product, and date). The snowflake schema further normalizes dimensions into sub‑dimension tables to reduce redundancy, at the cost of extra joins.
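The following sketch shows a minimal star schema with hypothetical table and column names; a real warehouse would build this in Hive, Spark, or an MPP engine rather than SQLite, which is used here only to keep the example runnable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, category TEXT, brand TEXT);
CREATE TABLE dim_date    (date_id TEXT PRIMARY KEY, year INTEGER, month INTEGER);
CREATE TABLE fact_sales  (                       -- measures plus foreign keys to the dimensions
    product_id INTEGER REFERENCES dim_product(product_id),
    date_id    TEXT    REFERENCES dim_date(date_id),
    quantity   INTEGER,
    revenue    REAL
);
""")
conn.executemany("INSERT INTO dim_product VALUES (?, ?, ?)",
                 [(1, "phone", "A"), (2, "laptop", "B")])
conn.executemany("INSERT INTO dim_date VALUES (?, ?, ?)",
                 [("2024-05-01", 2024, 5), ("2024-06-01", 2024, 6)])
conn.executemany("INSERT INTO fact_sales VALUES (?, ?, ?, ?)",
                 [(1, "2024-05-01", 2, 1998.0), (2, "2024-06-01", 1, 6999.0)])

# A typical dimensional query: join the fact to its dimensions, group by dimension attributes.
for row in conn.execute("""
    SELECT d.month, p.category, SUM(f.revenue)
    FROM fact_sales f
    JOIN dim_product p ON f.product_id = p.product_id
    JOIN dim_date d    ON f.date_id = d.date_id
    GROUP BY d.month, p.category
"""):
    print(row)
```

In a snowflake variant, dim_product would itself be split further (for example, brand into its own table) and reached through an additional join.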
Cube Model
Cubes store pre‑aggregated fact data across multiple dimensions, enabling fast multi‑dimensional analysis (roll‑up and drill‑down). Tools such as Apache Kylin pre‑compute combinations, while ClickHouse offers real‑time query capabilities without pre‑aggregation.
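As a toy illustration of the pre‑aggregation idea, the sketch below sums revenue for every combination of dimensions ahead of time, so roll‑up (fewer dimensions) and drill‑down (more dimensions) become lookups. Engines such as Kylin do this at scale; the data and names here are made up.

```python
from itertools import combinations
from collections import defaultdict

rows = [
    {"month": "2024-05", "region": "east", "category": "phone",  "revenue": 1998.0},
    {"month": "2024-05", "region": "west", "category": "phone",  "revenue":  999.0},
    {"month": "2024-06", "region": "east", "category": "laptop", "revenue": 6999.0},
]
dimensions = ("month", "region", "category")

cube = defaultdict(float)
for row in rows:
    for r in range(len(dimensions) + 1):              # every cuboid: (), (month,), (month, region), ...
        for dims in combinations(dimensions, r):
            key = (dims, tuple(row[d] for d in dims))
            cube[key] += row["revenue"]

# Roll-up to month level, then drill down to month x region, via pre-computed keys.
print(cube[(("month",), ("2024-05",))])                  # 2997.0
print(cube[(("month", "region"), ("2024-05", "east"))])  # 1998.0
```

ClickHouse, by contrast, would scan the detail rows and aggregate at query time instead of storing every combination.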
Wide‑Table Model
A wide table denormalizes facts and their frequently used dimension attributes into a single table, trading extra storage for join‑free queries; columnar engines such as ClickHouse handle this model efficiently in modern big‑data environments.
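The sketch below shows the wide‑table idea with pandas and invented column names: fact rows are joined with their dimensions once, materialized as a single denormalized table, and later queries aggregate without any joins (in practice this would be a scheduled ETL job writing to ClickHouse or a similar store).

```python
import pandas as pd

fact_sales = pd.DataFrame({"product_id": [1, 2], "date_id": ["2024-05-01", "2024-06-01"],
                           "revenue": [1998.0, 6999.0]})
dim_product = pd.DataFrame({"product_id": [1, 2], "category": ["phone", "laptop"], "brand": ["A", "B"]})
dim_date = pd.DataFrame({"date_id": ["2024-05-01", "2024-06-01"], "month": [5, 6]})

# One-off denormalization step producing the wide table.
wide_sales = fact_sales.merge(dim_product, on="product_id").merge(dim_date, on="date_id")

# Downstream analysis touches a single table: no joins at query time.
print(wide_sales.groupby(["month", "category"])["revenue"].sum())
```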
Modeling Approaches: Offline vs Real‑Time
Offline models ingest data into HDFS, transform it with batch tools (Hive, Spark), and load results into OLAP or OLTP stores. Real‑time models process streams via Kafka, store dimension data in KV stores like HBase, and output results to databases or message queues. High‑performance OLAP databases (ClickHouse, StarRocks) can also provide near‑real‑time analytics.
Key differences:
Layering: Offline models often have many layers to trade space for speed; real‑time models have fewer layers to reduce latency.
Storage: Offline relies on HDFS; real‑time uses MQ, KV stores, or OLAP databases.
ETL: Offline uses batch engines (Hive, Spark); real‑time uses streaming engines (Flink, Storm).
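To make the offline/real‑time contrast above concrete, here is a toy sketch: the batch function recomputes a metric over the full dataset (as a scheduled Hive/Spark job would), while the streaming function updates state one event at a time (as a Flink job consuming Kafka would). The data and function names are invented.

```python
from collections import defaultdict

events = [
    {"user_id": "u1", "amount": 10.0},
    {"user_id": "u2", "amount": 25.0},
    {"user_id": "u1", "amount": 5.0},
]

def batch_revenue_by_user(all_events):
    """Offline style: read everything, aggregate once, overwrite the result table."""
    totals = defaultdict(float)
    for e in all_events:
        totals[e["user_id"]] += e["amount"]
    return dict(totals)

running_totals = defaultdict(float)
def on_event(event):
    """Streaming style: update state incrementally and emit the fresh value immediately."""
    running_totals[event["user_id"]] += event["amount"]
    return event["user_id"], running_totals[event["user_id"]]

print(batch_revenue_by_user(events))   # available only after the batch window closes
for e in events:
    print(on_event(e))                 # available per event, with low latency
```

In a real pipeline the streaming state would live in a KV store such as HBase, with results emitted to a message queue or an OLAP database, as described above.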