Unlocking Data Value: How Metadata Drives Efficient Data Management and Quality

This comprehensive guide explains how metadata connects source data, warehouses, and applications, outlines its technical and business classifications, demonstrates its value for data management, profiling, portals, and ETL development, and details optimization, storage, lifecycle, and quality practices essential for robust big‑data operations.

Data Thinking Notes

Nov 28, 2022

Metadata Overview

Metadata connects source data, data warehouses, and data applications, recording the entire lifecycle from generation to consumption. It includes technical metadata (system details, storage info, job logs, data synchronization, quality, and operations) and business metadata (a semantic layer for non‑technical users).

Value of Metadata

Supports data governance across computing, storage, cost, quality, security, and modeling.

Enables extraction and analysis of data domains, topics, and business attributes for knowledge‑graph construction.

Ensures accurate, timely product data by linking MaxCompute and application data.

Unified Metadata System Construction

The goal is to bridge data ingestion, processing, and consumption, standardize metadata models, provide a unified service endpoint, and guarantee stable, high‑quality metadata output.

Metadata Applications

Data‑driven decision making, rapid data discovery for users, guidance for ETL engineers in model design and task optimization, and operational insights for cluster storage, compute, and system tuning.

Data Profile

Core idea: build a clear lineage graph using graph computation and label‑propagation algorithms, creating four label types—basic, warehouse, business, and potential.

Metadata Portal

Front‑end data map for data discovery and consumption.

Back‑end data management for cost, security, and quality control.

Application Link Analysis

Generates table‑level, field‑level, and application‑level lineage via MR task‑log parsing or task‑dependency parsing. Common uses include impact analysis, importance analysis, offline analysis, link tracing, root‑cause investigation, and fault diagnosis.
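
As a rough illustration of impact analysis over table‑level lineage, the sketch below walks a lineage graph breadth‑first to list everything downstream of a changed table. The table names and the LINEAGE dictionary are hypothetical; a real system would derive the edges from task logs or task dependencies as described above.

```python
from collections import deque

# Hypothetical table-level lineage: each key is an upstream table,
# each value lists the tables that read from it directly.
LINEAGE = {
    "ods_orders": ["dwd_orders"],
    "dwd_orders": ["dws_orders_daily", "dws_user_orders"],
    "dws_orders_daily": ["ads_sales_report"],
    "dws_user_orders": [],
    "ads_sales_report": [],
}


def downstream_impact(lineage, table):
    """Return every table reachable downstream of `table`, breadth-first."""
    seen = {table}  # guards against revisiting tables (and against cycles)
    order = []
    queue = deque([table])
    while queue:
        current = queue.popleft()
        for child in lineage.get(current, []):
            if child not in seen:
                seen.add(child)
                order.append(child)
                queue.append(child)
    return order


print(downstream_impact(LINEAGE, "dwd_orders"))
# ['dws_orders_daily', 'dws_user_orders', 'ads_sales_report']
```

Reversing the edges and running the same traversal gives upstream link tracing, which is the direction usually needed for root‑cause investigation.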

Data Modeling

Metadata‑driven warehouse modeling improves efficiency. It draws on table basic metadata (downstream usage, query count, association count, aggregation count, production time), association metadata (related tables, types, fields, counts), and field metadata (name, comment, query count, association count, aggregation count, filter count). Here query, association, aggregation, and filter refer to the SQL SELECT, JOIN, GROUP BY, and WHERE operations, respectively.

ETL Development Driven by Metadata

Chapter 2: Compute Management

System Optimization

History‑Based Optimizer (HBO)

Improves CPU utilization.

Improves memory utilization.

Increases instance concurrency.

Reduces execution time.

Cost‑Based Optimizer (CBO)

Uses collected statistics to estimate the cost of each candidate execution plan and selects the cheapest one. Supports join reordering, auto‑MapJoin, rule whitelist/blacklist, and predicate push‑down while taking into account UDF restrictions, non‑deterministic functions, and implicit type casts.
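
To make the idea of cost‑based join reordering concrete, here is a toy sketch. It is not how MaxCompute or any real optimizer computes cost: the cost model (sum of estimated intermediate row counts for a left‑deep plan), the row counts, and the selectivities are all assumptions chosen for illustration.

```python
from itertools import permutations


def plan_cost(order, rows, selectivity):
    """Toy cost: sum of estimated intermediate result sizes for a left-deep join order."""
    joined = [order[0]]
    current = float(rows[order[0]])
    cost = 0.0
    for table in order[1:]:
        # Combine the selectivities of all join predicates between the new
        # table and the tables already joined (1.0 means no predicate).
        sel = 1.0
        for prev in joined:
            sel *= selectivity.get(frozenset((prev, table)), 1.0)
        current = current * rows[table] * sel
        cost += current
        joined.append(table)
    return cost


def best_join_order(rows, selectivity):
    """Enumerate all left-deep orders and keep the one with the lowest toy cost."""
    return min(permutations(rows), key=lambda order: plan_cost(order, rows, selectivity))


# Hypothetical statistics: row counts per table and join selectivities.
rows = {"orders": 1_000_000, "users": 10_000, "regions": 10}
selectivity = {
    frozenset(("orders", "users")): 1 / 10_000,
    frozenset(("users", "regions")): 1 / 10,
}

best = best_join_order(rows, selectivity)
print(best, plan_cost(best, rows, selectivity))
```

With these made‑up numbers, joining the two small tables first keeps the first intermediate result small, so that order scores lower than starting from the large orders table. Exhaustive enumeration is only workable for a handful of tables.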

Task Optimization

Map Skew

Uneven file size distribution causes some Map instances to process far more data, leading to long tails. Solutions: merge small files and adjust node parameters, or redistribute data using distribute by rand().
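
One simple way to spot the long tail described above is to compare each Map instance's input size with the median. The sketch below assumes per‑instance input sizes are available from job metadata; the function name and the 2x threshold are hypothetical choices, not values from the source.

```python
from statistics import median


def long_tail_instances(input_sizes, ratio=2.0):
    """Return indices of Map instances whose input exceeds `ratio` times the median."""
    typical = median(input_sizes)
    return [i for i, size in enumerate(input_sizes) if size > ratio * typical]


# Hypothetical input sizes in MB for five Map instances.
sizes = [256, 240, 260, 2048, 250]
print(long_tail_instances(sizes))  # [3]
```

Instances flagged this way are candidates for small‑file merging or for redistribution with distribute by rand().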

Join Skew

When a join is skewed, use MapJoin if one side is small, randomize null values, or split hot keys into hot and non‑hot partitions before joining.
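
The null‑randomization idea can be sketched as follows: rows with a missing join key all hash to the same partition, so replacing each missing key with a random placeholder that matches nothing spreads them out. The placeholder prefix, bucket count, and the CRC‑based partitioner are assumptions for this simulation only.

```python
import random
import zlib
from collections import Counter


def salt_null_key(key, rng, buckets=1000, prefix="__null__"):
    """Replace a missing join key with a random placeholder that matches no real key."""
    if key is None:
        return f"{prefix}{rng.randrange(buckets)}"
    return key


def partition_of(key, num_partitions):
    """Deterministic stand-in for a shuffle partitioner."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


rng = random.Random(0)
keys = [None] * 9000 + [f"user_{i}" for i in range(1000)]

# Without salting, every null key lands on one partition (modelled here as "").
before = Counter(partition_of("" if k is None else k, 8) for k in keys)
# With salting, null keys are spread over many placeholder values.
after = Counter(partition_of(salt_null_key(k, rng), 8) for k in keys)

print("largest partition before:", max(before.values()))
print("largest partition after:", max(after.values()))
```

Because the placeholders never equal a real key, the salted rows still find no match, which is the same join result they had as nulls.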

Reduce Skew

Key distribution imbalance causes Reduce‑side long tails. Mitigations include handling multiple COUNT(DISTINCT) operations, separating hot keys, reducing small partitions, and performing early GROUP BY to limit data explosion.
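
A common general technique for spreading a hot key across reducers is two‑stage aggregation with a random salt. The source does not spell out how hot keys are separated, so treat this as one possible reading rather than the documented method; the salt count and seed are arbitrary.

```python
import random
from collections import Counter


def two_stage_count(keys, num_salts=4, seed=0):
    """Count keys in two stages so that a hot key is split across several groups."""
    rng = random.Random(seed)
    # Stage 1: partial counts per (key, salt); a hot key is divided among the salts.
    partial = Counter((key, rng.randrange(num_salts)) for key in keys)
    # Stage 2: merge the partial counts back into one total per key.
    total = Counter()
    for (key, _salt), count in partial.items():
        total[key] += count
    return partial, total


keys = ["hot"] * 1000 + ["a", "b"]
partial, total = two_stage_count(keys)
print(total["hot"], max(partial.values()))
```

The totals match a plain count, while the largest stage‑1 group is smaller than the hot key's full row count.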

Chapter 3: Storage and Cost Management

Data Compression

Archive compression lowers the storage ratio from about 1:3 to about 1:1.5, roughly halving the physical space needed per unit of data, but it increases read latency; it is suitable for cold backup and log data.

Data Redistribution

Adjust column‑store distribution via DISTRIBUTE BY and SORT BY to avoid hotspots and save storage space; this is typically applied to tables where redistribution yields a benefit of more than 15%.

Storage Governance

Unmanaged tables

Empty tables

Tables not accessed in the last 62 days

Tables without updates or tasks

Large development‑library tables without access

Long‑lifecycle tables
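
Two of the governance checks above (empty tables and tables not accessed in the last 62 days) can be sketched as simple rules over table metadata. The TableStats fields and flag names are hypothetical; only the 62‑day window comes from the list above.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import List, Optional


@dataclass
class TableStats:
    name: str
    size_bytes: int
    last_access: Optional[date]  # None means no recorded access


def governance_flags(table: TableStats, today: date, idle_days: int = 62) -> List[str]:
    """Return the governance flags that apply to one table."""
    flags = []
    if table.size_bytes == 0:
        flags.append("empty")
    # Flag tables with no recorded access, or whose last access is older than the window.
    if table.last_access is None or today - table.last_access > timedelta(days=idle_days):
        flags.append("idle")
    return flags


today = date(2022, 11, 28)
print(governance_flags(TableStats("tmp_a", 0, date(2022, 9, 1)), today))   # ['empty', 'idle']
print(governance_flags(TableStats("dwd_b", 10, date(2022, 11, 1)), today))  # []
```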

Lifecycle Management

The purpose is to meet maximum business demand with minimal storage cost, maximizing data value.

Lifecycle Policies

Periodic deletion

Permanent deletion

Permanent retention

Extreme storage

Cold‑data management

Incremental‑table merge with full‑table strategy

Asset Level Matrix

Defines P0, P1, P2, and P3 levels with corresponding importance.

Data Asset Levels

Five levels: Destructive (A1), Global (A2), Local (A3), General (A4), Unknown (A5). When the same data appears in multiple scenarios, the highest applicable level takes precedence.
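
The precedence rule can be expressed in a few lines. This sketch assumes A1 is the highest level and A5 the lowest, following the order in which the levels are listed; defaulting to A5 (Unknown) when no scenario is known is also an assumption.

```python
# Lower rank number means higher asset level (assumed ordering: A1 highest).
LEVEL_RANK = {"A1": 1, "A2": 2, "A3": 3, "A4": 4, "A5": 5}


def effective_level(levels):
    """Return the highest asset level among the scenarios that use the data."""
    return min(levels, key=LEVEL_RANK.__getitem__, default="A5")


print(effective_level(["A3", "A1", "A4"]))  # A1
print(effective_level([]))                  # A5
```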

Chapter 4: Data Quality

Quality Assurance Principles

Four dimensions: completeness, accuracy, consistency, timeliness.

Completeness

Ensures records and fields are fully present.

Accuracy

Ensures data matches real‑world facts; uses rule‑based validation.

Consistency

Maintains uniform definitions across multiple data warehouses.

Timeliness

Data must be produced promptly to support decision making.

Data Quality Methods Overview

Four steps: consumption‑scenario awareness, production‑process checkpoints, risk‑point monitoring, and quality measurement.

Consumption Scenario Awareness

With data at petabyte scale, engineers cannot protect everything equally and must identify which data requires protection. The solution is to assign asset‑level labels (A1‑A5).

Production Process Checkpoints

Online checkpoints include release‑platform notifications and database change alerts; offline checkpoints involve code scanning with SQLSCAN, pre‑release testing (code review, regression testing), dry‑run and real‑run testing, and change notifications before node or data refresh.

Risk Point Monitoring

Online monitoring via BCP (real‑time business check platform) with rule‑based alerts; offline monitoring via DQC (data quality center) for accuracy checks and Mosad for error, delay, and custom alerts, supporting strong (blocking) and weak (non‑blocking) rules.
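
The difference between strong (blocking) and weak (non‑blocking) rules can be sketched as below. This is not the DQC API; the Rule class, the rule names, and the sample checks are hypothetical and only illustrate the blocking behaviour described above.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Rule:
    name: str
    check: Callable[[List[dict]], bool]  # returns True when the data passes
    strong: bool                         # True: a failure blocks downstream tasks


def run_rules(rows: List[dict], rules: List[Rule]) -> Tuple[bool, List[str]]:
    """Evaluate all rules; return (blocked, names of failed rules)."""
    alerts = []
    blocked = False
    for rule in rules:
        if not rule.check(rows):
            alerts.append(rule.name)  # every failure raises an alert
            if rule.strong:
                blocked = True        # only strong rules block downstream
    return blocked, alerts


RULES = [
    Rule("row_count_positive", lambda rows: len(rows) > 0, strong=True),
    Rule("user_id_not_null", lambda rows: all(r["user_id"] is not None for r in rows), strong=False),
]

print(run_rules([{"user_id": 1}, {"user_id": None}], RULES))  # (False, ['user_id_not_null'])
print(run_rules([], RULES))                                   # (True, ['row_count_positive'])
```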

Quality Measurement

Metrics include the night‑shift rate (how often engineers must handle incidents overnight), quality events, the fault handling process, and fault levels (P1‑P4), with post‑mortem reviews to prevent recurrence.
