
Comprehensive Guide to Metadata Management, Data Quality, and Optimization in Big Data Systems

This article provides an in-depth overview of metadata concepts, their technical and business classifications, value in data management, applications such as data profiling and lineage, optimization techniques for compute and storage, lifecycle management, and comprehensive data quality assurance practices within large‑scale big data environments.


Chapter 1 Metadata

1.1 Overview of Metadata

Metadata connects source data, the data warehouse, and data applications, recording the entire lifecycle of data from generation to consumption. It mainly records warehouse model definitions, mapping relationships, the status of warehouse data, and the execution status of ETL tasks.

1.1.1 Definition of Metadata

Metadata is divided into two categories: Technical Metadata and Business Metadata.

Technical Metadata: Stores technical details of the data-warehouse system, such as table names, partition information, owners, file sizes, table types, lifecycles, and column definitions, along with ETL task metadata (job logs, parameters, execution times, etc.). It also covers data-synchronization, computation, and scheduling information, as well as data-quality and operations metadata (monitoring logs, alerts, and fault records).

Business Metadata: Describes data from a business perspective, providing a semantic layer that allows non-technical users to understand warehouse data.

1.1.2 Value of Metadata

Metadata underpins data management, data content extraction, and data application:

Supports governance in computing, storage, cost, quality, security, and modeling.

Enables extraction and analysis of data domains, topics, and business attributes (e.g., building knowledge graphs, tagging data).

Supports data products and applications along the delivery chain, ensuring data reaches consumers accurately and on time.

1.1.3 Building a Unified Metadata System

The quality of metadata directly impacts data‑management accuracy. The goal is to connect data ingestion, processing, and consumption, standardize metadata models, provide a unified service endpoint, and ensure stable, high‑quality metadata output.

1.2 Metadata Applications

Metadata supports data-driven decision making and digital operations:

Helps analysts discover trends and take action.

Allows data consumers to quickly locate required data.

Guides ETL engineers in model design, task optimization, and job deprecation.

Assists operations engineers in storage, compute, and system optimization.

1.2.1 Data Profile

The core idea is to build a clear profile of complex data assets using graph computation and label-propagation techniques (a minimal propagation sketch follows the list below). Four label types are created:

Basic labels (storage, access, security level).

Warehouse labels (incremental vs. full, reproducibility, lifecycle).

Business labels (topic domain, product line, business type).

Potential labels (possible application scenarios such as social, media, advertising, e‑commerce, finance).
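
As a rough illustration of the propagation step, business labels can be pushed one hop downstream along table-level lineage; the metadata tables meta_lineage and meta_table_label are hypothetical names used only for this sketch.

-- Hypothetical inputs: meta_lineage(src_table, dst_table), meta_table_label(table_name, label_type, label_value)
-- One propagation step: a downstream table inherits the business labels of its direct upstream tables.
INSERT INTO TABLE meta_table_label
SELECT DISTINCT
    l.dst_table AS table_name,
    t.label_type,
    t.label_value
FROM meta_lineage l
JOIN meta_table_label t
  ON t.table_name = l.src_table
WHERE t.label_type = 'business';
-- Repeating this step until no new labels appear approximates label propagation over the lineage graph.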

1.2.2 Metadata Portal

Front‑end product: data map for data discovery.

Back‑end product: one‑stop data management (cost, security, quality).

1.2.3 Application‑Link Analysis

Generates table‑level, field‑level, and application‑level lineage. Table‑level lineage can be derived from MR job logs or task dependencies. Common use cases include impact analysis, importance analysis, deprecation analysis, link tracing, root‑cause analysis, and fault diagnosis.
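
As a simple illustration, table-level lineage stored as edges already answers the most common impact question, namely what sits directly downstream of a given table; the meta_lineage table and the example table name are assumptions for this sketch.

-- Hypothetical lineage edges: meta_lineage(src_table, dst_table, job_id)
-- Direct impact analysis: which tables read dwd_trade_order, and through how many jobs?
SELECT
    dst_table,
    COUNT(DISTINCT job_id) AS consuming_jobs
FROM meta_lineage
WHERE src_table = 'dwd_trade_order'
GROUP BY dst_table
ORDER BY consuming_jobs DESC;
-- Walking these edges repeatedly (or with a graph engine) yields the full downstream link for deprecation analysis or fault tracing.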

1.2.4 Data Modeling

Metadata‑driven warehouse modeling improves efficiency. It records table basics (downstream usage, query count, join count, aggregation count, production time), table relationships, field basics (name, comment, query count, join count, aggregation count, filter count), and clarifies SQL operations (SELECT, JOIN, GROUP BY, WHERE).

In star‑schema design, metadata helps filter tables based on downstream usage, field characteristics (e.g., time fields), and primary‑foreign key relationships.
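
For instance, field-level usage metadata can surface dimension and key candidates: columns that downstream SQL frequently joins or groups on are natural dimension keys. The meta_field_usage table below is a hypothetical name.

-- Hypothetical field-usage metadata: meta_field_usage(table_name, field_name, query_cnt, join_cnt, group_by_cnt, filter_cnt)
-- Rank the fields of a fact table by how often downstream SQL joins or aggregates on them.
SELECT
    field_name,
    join_cnt,
    group_by_cnt,
    filter_cnt
FROM meta_field_usage
WHERE table_name = 'dwd_trade_order'
ORDER BY (join_cnt + group_by_cnt) DESC
LIMIT 20;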

1.2.5 Driving ETL Development

Chapter 2 Compute Management

2.1 System Optimization

2.1.1 HBO (History‑Based Optimizer)

When tasks are stable, resource allocation can be estimated from historical executions, improving CPU, memory, and instance-concurrency settings while reducing execution time. HBO can also dynamically adjust instance counts for large-scale events based on data growth.
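
A minimal sketch of the idea, assuming a hypothetical task-execution history table: derive resource settings from recent runs instead of fixed defaults, with some headroom over the observed peak.

-- Hypothetical execution history: meta_task_history(task_name, run_date, peak_mem_mb, instance_cnt, duration_sec)
-- Suggest per-task resources from the last 30 runs, leaving roughly 20% headroom over peak memory.
SELECT
    task_name,
    CAST(MAX(peak_mem_mb) * 1.2 AS INT) AS suggested_mem_mb,
    CAST(AVG(instance_cnt) AS INT)      AS suggested_instances
FROM meta_task_history
WHERE run_date >= DATE_SUB(CURRENT_DATE, 30)
GROUP BY task_name;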

2.1.2 CBO (Cost‑Based Optimizer)

CBO selects the lowest‑cost execution plan using collected statistics. It introduces JoinReorder, AutoMapJoin, and predicate push‑down. Limitations include UDF push‑down restrictions, nondeterministic functions, and implicit type conversions.
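
In Hive-style engines (shown here only as an analogy; the source describes Alibaba's internal platform), the cost-based optimizer depends on statistics that must be collected explicitly:

-- Enable the CBO and gather the table and column statistics it relies on (Hive syntax).
SET hive.cbo.enable=true;
SET hive.stats.fetch.column.stats=true;
ANALYZE TABLE dwd_trade_order PARTITION (ds = '20240101') COMPUTE STATISTICS;
ANALYZE TABLE dwd_trade_order PARTITION (ds = '20240101') COMPUTE STATISTICS FOR COLUMNS;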

2.2 Task Optimization

2.2.1 Map Skew

Uneven file size distribution causes some Map instances to process far more data, leading to long tails. Mitigation includes merging small files upstream and using distribute by rand() to rebalance.
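
A minimal sketch of the rebalancing idea, with an illustrative table name: materialize an intermediate copy whose rows are shuffled randomly, so the heavy downstream step reads evenly sized inputs instead of following file or key boundaries.

-- Spread skewed input evenly by shuffling on a random value before the expensive processing step.
CREATE TABLE tmp_order_balanced AS
SELECT *
FROM dwd_trade_order
WHERE ds = '20240101'
DISTRIBUTE BY rand();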

2.2.2 Join Skew

Data skew in joins creates long tails, especially during high‑traffic events. Solutions: use MapJoin when one side is small, replace nulls with random values, or split hot keys into hot and non‑hot partitions before joining.
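
Two of the listed fixes, sketched in Hive-style SQL with illustrative table names (shop_id is assumed to be a string column): a MAPJOIN hint when the dimension side is small, and replacing null join keys with random values so they stop piling onto a single reducer.

-- Fix 1: broadcast the small dimension table, so no shuffle (and no skew) happens on the join key.
SELECT /*+ MAPJOIN(s) */
    o.order_id,
    s.shop_name
FROM dwd_trade_order o
JOIN dim_shop s
  ON o.shop_id = s.shop_id;

-- Fix 2: scatter null keys randomly; they can never match, but they no longer crowd one reducer.
SELECT
    o.order_id,
    s.shop_name
FROM dwd_trade_order o
LEFT JOIN dim_shop s
  ON COALESCE(o.shop_id, CONCAT('null_', rand())) = s.shop_id;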

2.2.3 Reduce Skew

Key distribution imbalance causes Reduce‑side long tails. Approaches include pre‑aggregating, handling hot keys separately, adjusting dynamic partition numbers, and reducing multiple DISTINCT operations by early GROUP BY.
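
One of the listed approaches, sketched with illustrative names: replacing several COUNT(DISTINCT ...) over a skewed table with an early GROUP BY that deduplicates first, so the final aggregation only counts already-unique rows.

-- Instead of: SELECT ds, COUNT(DISTINCT buyer_id), COUNT(DISTINCT seller_id) FROM dwd_trade_order GROUP BY ds;
SELECT
    ds,
    COUNT(IF(id_type = 'buyer', 1, NULL))  AS buyer_cnt,
    COUNT(IF(id_type = 'seller', 1, NULL)) AS seller_cnt
FROM (
    SELECT ds, 'buyer'  AS id_type FROM dwd_trade_order GROUP BY ds, buyer_id
    UNION ALL
    SELECT ds, 'seller' AS id_type FROM dwd_trade_order GROUP BY ds, seller_id
) t
GROUP BY ds;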

Chapter 3 Storage and Cost Management

3.1 Data Compression

Archive compression reduces three‑copy storage from a 1:3 ratio to about 1:1.5, at the cost of longer recovery time and reduced read performance. Suitable for cold backup and log data.

3.2 Data Redistribution

Column‑store tables have varying data distribution; redistributing data (using distribute by and sort by) mitigates column hotspots and saves storage.
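
A sketch of the idea with illustrative table and column names: rewriting a partition so that rows with similar values sit together usually improves column-store encoding and compression.

-- Rewrite the partition clustered and sorted by a highly repetitive column; similar values compress far better.
INSERT OVERWRITE TABLE dwd_trade_order PARTITION (ds = '20240101')
SELECT order_id, shop_id, buyer_id, amount
FROM dwd_trade_order
WHERE ds = '20240101'
DISTRIBUTE BY shop_id
SORT BY shop_id, buyer_id;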

3.3 Storage Governance Optimization

Storage-governance targets include unmanaged tables, empty tables, tables not accessed within 62 days, tables with neither updates nor associated tasks, large development tables with no access, tables with overly long lifecycles, and similar cases.
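
For example, access metadata turns the "not accessed in 62 days" rule into a simple query; meta_table_access is a hypothetical metadata table.

-- Hypothetical access metadata: meta_table_access(table_name, owner, size_gb, last_access_date)
-- Governance candidates: tables nobody has read for 62 days, largest first.
SELECT table_name, owner, size_gb
FROM meta_table_access
WHERE last_access_date < DATE_SUB(CURRENT_DATE, 62)
ORDER BY size_gb DESC;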

3.4 Lifecycle Management

The goal is to meet business needs with minimal storage cost, maximizing data value.

3.4.1 Lifecycle Policies

Periodic deletion

Permanent deletion

Permanent retention

Extreme storage

Cold‑data management

Incremental-to-full table merge (using order dates as partitions so that only one copy of each order is kept; a merge sketch follows this list).
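
A minimal merge sketch with illustrative table and column names, partitioning the full table by order creation date so each order is stored exactly once; this naive version rewrites every partition, whereas a production job would rebuild only the partitions touched by the increment.

SET hive.exec.dynamic.partition.mode=nonstrict;
-- Merge the daily increment into the full table; the most recently updated record wins per order_id.
INSERT OVERWRITE TABLE ods_order_full PARTITION (order_date)
SELECT order_id, status, amount, order_date
FROM (
    SELECT order_id, status, amount, order_date, update_time,
           ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY update_time DESC) AS rn
    FROM (
        SELECT order_id, status, amount, order_date, update_time FROM ods_order_full
        UNION ALL
        SELECT order_id, status, amount, order_date, update_time FROM ods_order_delta WHERE ds = '20240102'
    ) u
) t
WHERE rn = 1;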

3.4.2 General Lifecycle Matrix

Historical data grading:

P0 – critical data (non‑recoverable, e.g., transactions, logs).

P1 – important business data (non‑recoverable).

P2 – important data (recoverable, e.g., intermediate ETL outputs).

P3 – less important data (recoverable, e.g., reports).

3.5 Data Cost Measurement

Data cost consists of storage cost, compute cost, and scan cost, reflecting upstream‑downstream dependencies.

Scan cost – scanning upstream tables.

Storage cost – resources consumed by tables.

Compute cost – CPU consumption during processing.

3.6 Data Usage Billing

Based on the three cost components, billing can be split into compute, storage, and scan fees. Cost measurement helps evaluate processing complexity, chain length, and dependency rationality, guiding model optimization and improving data‑integration efficiency.
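
As a rough illustration, the three components can be metered per table and rolled into a bill; the metering table below and the unit prices are purely hypothetical placeholders.

-- Hypothetical metering table: meta_cost(table_name, storage_gb_day, compute_cpu_hour, scan_gb)
SELECT
    table_name,
    storage_gb_day   * 0.002 AS storage_fee,
    compute_cpu_hour * 0.05  AS compute_fee,
    scan_gb          * 0.01  AS scan_fee,
    storage_gb_day * 0.002 + compute_cpu_hour * 0.05 + scan_gb * 0.01 AS total_fee
FROM meta_cost
ORDER BY total_fee DESC;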

Chapter 4 Data Quality

4.1 Data Quality Assurance Principles

Alibaba evaluates data quality on four dimensions: completeness, accuracy, consistency, and timeliness.

Completeness: No missing records or fields.

Accuracy: Data matches real-world facts; rule-based checks detect anomalies.

Consistency: Uniform representation across multiple warehouse branches (e.g., user ID format).

Timeliness: Data is produced promptly for downstream consumption (e.g., daily reports before 9 AM).

4.2 Overview of Data Quality Methods

Alibaba’s data‑quality framework includes consumption‑scene awareness, production‑stage checkpoints, risk‑point monitoring, quality scoring, and supporting tools.

4.2.1 Consumption‑Scene Awareness

Data engineers face petabytes of data; asset‑level classification (A1‑A5) helps prioritize protection and identify obsolete data.

4.2.2 Production‑Stage Checkpoints

Online (OLTP) and offline (OLAP) systems have distinct checkpoints. Online checkpoints ensure business changes are communicated downstream; offline checkpoints involve code review, regression testing, and task‑level validation.

4.2.3 Risk‑Point Monitoring

Online monitoring uses the real‑time Business Check Platform (BCP) to apply rule‑based validation and generate alerts. Offline monitoring relies on Data Quality Center (DQC) for accuracy checks and Mosad for timeliness alerts, with strong (blocking) and weak (non‑blocking) rules.
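
A typical volatility rule of the kind such platforms apply can be expressed as a simple check query; the table name, partitions, and the 10% threshold are illustrative. A strong rule would block downstream tasks when the check fails, while a weak rule would only raise an alert.

-- Compare today's row count against yesterday's; flag a deviation of more than 10%.
SELECT
    t.cnt AS today_cnt,
    y.cnt AS yesterday_cnt,
    CASE WHEN ABS(t.cnt - y.cnt) > 0.1 * y.cnt THEN 'ALERT' ELSE 'PASS' END AS check_result
FROM (SELECT COUNT(*) AS cnt FROM dwd_trade_order WHERE ds = '20240102') t
CROSS JOIN
     (SELECT COUNT(*) AS cnt FROM dwd_trade_order WHERE ds = '20240101') y;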

4.2.4 Quality Scoring

Metrics include night‑shift rate (frequency of after‑hours interventions) and quality‑event count. Serious incidents are escalated to faults, classified from P1 to P4 based on impact, and reviewed to prevent recurrence.

4.2.5 Fault Handling Process

Faults are identified, resolved quickly, and communicated to stakeholders. Post‑mortems document causes, responsibilities, and corrective actions.


Tags: optimization, metadata, big data, data quality, data warehouse
Written by Big Data Technology & Architecture (Wang Zhiwu), a big data expert dedicated to sharing big data technology.
