
Why 80% of Data Analysis Time Is Spent on Data Preparation—and How to Master It

Data preparation is commonly estimated to consume about 80% of the analytics workflow, which makes data collection, quality assurance, and governance critical pillars. This article covers metadata, master data, storage layers such as data lakes and warehouses, and rigorous preprocessing, all of which help turn raw information into reliable insights.

Data Thinking Notes

Good data is the foundation of any analysis; without it, even the most dazzling visualizations are meaningless. A commonly cited estimate is that data preparation occupies roughly 80% of the total time in a data analysis project, which makes collecting, cleaning, and readying data the most labor‑intensive tasks.

The DAMA International Data Management Body of Knowledge defines data management (DM) as the set of processes that plan, establish, execute, and monitor activities throughout a data asset’s lifecycle to deliver, control, protect, and enhance its value.

DAMA categorises data‑management functions into eleven areas: data governance, data architecture, data modelling and design, data storage and operations, data security, data integration and interoperability, document and content management, reference and master data, data warehousing and business intelligence, metadata, and data quality.

01 Data “Manage” (Management)

This layer covers the four data tiers—metadata, master data, reference data, and general (transaction) data—ensuring source reliability, content accuracy, security, and granularity.

Metadata: data about data, such as names, attributes, classifications, and tags.

Reference Data: standardized values that classify other data, acting as a data dictionary.

Master Data: authoritative, high‑value data about core business entities, often called “golden” data.

General Data: transactional data that changes with business operations.
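To make the four tiers concrete, here is a minimal Python sketch of how they might relate in one small system. All names (COUNTRY_CODES, ColumnMetadata, Customer, Order, validate_order) are hypothetical and chosen only for illustration; real systems will model these tiers differently.

```python
from dataclasses import dataclass

# Reference data: a controlled vocabulary used to classify other data.
COUNTRY_CODES = {"US": "United States", "DE": "Germany"}


# Metadata: data about data (here, a description of one column).
@dataclass(frozen=True)
class ColumnMetadata:
    name: str
    dtype: str
    description: str
    tags: tuple = ()


# Master data: the authoritative record for a core business entity.
@dataclass(frozen=True)
class Customer:
    customer_id: str
    name: str
    country_code: str  # expected to be a key of COUNTRY_CODES


# General (transaction) data: created and changed by business operations.
@dataclass(frozen=True)
class Order:
    order_id: str
    customer_id: str  # points at a master-data record
    amount: float


def validate_order(order, customers):
    """Return a list of problems; an empty list means the order is consistent."""
    problems = []
    customer = customers.get(order.customer_id)
    if customer is None:
        problems.append(f"unknown customer_id {order.customer_id!r}")
    elif customer.country_code not in COUNTRY_CODES:
        problems.append(f"unknown country_code {customer.country_code!r}")
    if order.amount < 0:
        problems.append("negative amount")
    return problems
```

The point of the sketch is the direction of dependency: transaction data is checked against master data, and master data is checked against reference data, while metadata describes all of them.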

02 Data “Store” (Storage)

Storage comprises three key concepts: data lakes, data warehouses, and data marts. A data lake holds massive raw data of various types, a warehouse structures historical and current data for strategic use, and a data mart provides curated, department‑specific datasets for immediate analysis.
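The relationship between the three storage concepts can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not how production lakes or warehouses are built: the sample records, the field names, and the functions load_warehouse and build_mart are all hypothetical.

```python
import json
from collections import defaultdict

# "Data lake": raw records kept as-is, in whatever shape they arrived.
lake = [
    '{"order_id": "O1", "dept": "sales", "amount": "120.50", "ts": "2024-01-03"}',
    '{"order_id": "O2", "dept": "sales", "amount": "80", "ts": "2024-01-04"}',
    '{"order_id": "O3", "dept": "support", "amount": "15.25", "ts": "2024-01-04"}',
    "not json at all",
]


def load_warehouse(raw_records):
    """Parse raw records into typed, uniformly structured rows."""
    rows = []
    for raw in raw_records:
        try:
            rec = json.loads(raw)
            rows.append({
                "order_id": rec["order_id"],
                "dept": rec["dept"],
                "amount": float(rec["amount"]),
                "date": rec["ts"],
            })
        except (ValueError, KeyError, TypeError):
            continue  # unparseable records stay in the lake only
    return rows


def build_mart(warehouse_rows, dept):
    """Aggregate one department's rows into daily totals."""
    totals = defaultdict(float)
    for row in warehouse_rows:
        if row["dept"] == dept:
            totals[row["date"]] += row["amount"]
    return dict(totals)


warehouse = load_warehouse(lake)
sales_mart = build_mart(warehouse, "sales")
```

Here the lake keeps everything, the warehouse keeps only what fits an agreed structure, and the mart is a narrow, ready-to-use view for one department.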

03 Data “Compute” (Processing)

Processing involves data preprocessing, cleaning, and transformation—collectively the “compute” step. It includes simple cleaning, advanced algorithmic cleansing, and the ETL pipeline (Extract, Transform, Load) that links data cleaning, transformation, and integration.

Effective preprocessing distinguishes useful “information units” from noisy data, preventing a data lake from turning into a data swamp.
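As a rough illustration of the ETL steps described above, the following sketch uses only the Python standard library. The sample CSV, the table name payments, and the cleaning rules (trim whitespace, drop rows with missing fields, keep the first valid row per id) are assumptions made for the example, not rules prescribed by the article.

```python
import csv
import io
import sqlite3

# Hypothetical raw input with typical defects: padding, gaps, duplicates.
RAW_CSV = """id,name,amount
1, Alice ,10.5
2,Bob,
2,Bob,7
3,,4
"""


def extract(text):
    """Extract: read raw CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: trim fields, drop incomplete rows, deduplicate by id."""
    seen, clean = set(), []
    for row in rows:
        key = row["id"].strip()
        name = row["name"].strip()
        amount = row["amount"].strip()
        if not name or not amount:
            continue  # missing required field
        if key in seen:
            continue  # a valid row with this id was already kept
        seen.add(key)
        clean.append((int(key), name, float(amount)))
    return clean


def load(rows, conn):
    """Load: write the cleaned rows into a SQLite table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS payments "
        "(id INTEGER PRIMARY KEY, name TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO payments VALUES (?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), conn)
```

Keeping the three stages as separate functions makes each cleaning rule easy to test and to replace when requirements change.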

04 Data “Standard” (Regulation)

Standards define the rules for data itself (data standards) and for data management (governance policies). Good standards exhibit six qualities—uniqueness, uniformity, universality, stability, foresight, feasibility—and follow modular and systematic principles. They span semantic, structural, and content standards to align terminology, naming, and labeling across the organization.
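One way to make naming and terminology standards enforceable is to express them as automated checks. The sketch below assumes two hypothetical rules that an organization might adopt: column names use lower snake_case (a structural standard), and each concept has one approved term (a semantic standard). NAME_PATTERN, PREFERRED_TERMS, and check_column_name are illustrative names, not part of any published standard.

```python
import re

# Hypothetical structural standard: lower snake_case, starting with a letter.
NAME_PATTERN = re.compile(r"^[a-z][a-z0-9]*(_[a-z0-9]+)*$")

# Hypothetical semantic standard: discouraged term -> approved term.
PREFERRED_TERMS = {"client": "customer", "cust": "customer", "amt": "amount"}


def check_column_name(name):
    """Return a list of standard violations for one column name."""
    issues = []
    if not NAME_PATTERN.match(name):
        issues.append("not lower snake_case")
    for part in name.lower().split("_"):
        if part in PREFERRED_TERMS:
            issues.append(f"use '{PREFERRED_TERMS[part]}' instead of '{part}'")
    return issues
```

For example, check_column_name("customer_id") returns an empty list, while check_column_name("client_amt") reports both discouraged terms.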

05 Data “Govern” (Governance)

Governance implements the standards through policies, organizational structures, and mechanisms, ensuring data quality, consistency, and accuracy. Successful governance requires top‑down commitment, dedicated data stewardship teams, and balanced mechanisms that address responsibility, monitoring, and execution.
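The three mechanisms named above (responsibility, monitoring, execution) can be hinted at in code, although governance itself is mostly organizational. In this minimal sketch every quality rule carries an owner, a monitor runs the rules, and failures are grouped by the team responsible for fixing them. The rule names, team names, and the functions shown are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class QualityRule:
    name: str
    owner: str                     # responsibility: who fixes failures
    check: Callable[[dict], bool]  # returns True when a row passes


# Hypothetical rules and steward teams.
RULES = [
    QualityRule("email_present", "crm-stewards", lambda r: bool(r.get("email"))),
    QualityRule(
        "age_in_range",
        "analytics-stewards",
        lambda r: isinstance(r.get("age"), int) and 0 <= r["age"] <= 120,
    ),
]


def monitor(rows, rules=RULES):
    """Monitoring: return {owner: [(rule_name, row_index), ...]} for failures."""
    failures = {}
    for i, row in enumerate(rows):
        for rule in rules:
            if not rule.check(row):
                failures.setdefault(rule.owner, []).append((rule.name, i))
    return failures
```

The returned mapping is what an execution step would act on, for instance by opening one ticket per owner.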

In practice, the biggest challenge is not technology but convincing all departments to cooperate in data collection and quality assurance.

Overall, mastering the five pillars, “Manage” (management), “Store”, “Compute”, “Standard”, and “Govern” (governance), provides a comprehensive framework for data asset management. The first three are industry‑standard knowledge, while the latter two need to be customized to each organization’s context.
