Unlocking Data Value: How Metadata Drives Efficient Data Management and Quality
This comprehensive guide explains how metadata connects source data, warehouses, and applications, outlines its technical and business classifications, demonstrates its value for data management, profiling, portals, and ETL development, and details optimization, storage, lifecycle, and quality practices essential for robust big‑data operations.
Metadata Overview
Metadata connects source data, data warehouses, and data applications, recording the entire lifecycle from generation to consumption. It includes technical metadata (system details, storage info, job logs, data synchronization, quality, and operations) and business metadata (a semantic layer for non‑technical users).
Value of Metadata
Supports data governance across computing, storage, cost, quality, security, and modeling.
Enables extraction and analysis of data domains, topics, and business attributes for knowledge‑graph construction.
Ensures accurate, timely product data by linking MaxCompute and application data.
Unified Metadata System Construction
The goal is to bridge data ingestion, processing, and consumption, standardize metadata models, provide a unified service endpoint, and guarantee stable, high‑quality metadata output.
Metadata Applications
Data‑driven decision making, rapid data discovery for users, guidance for ETL engineers in model design and task optimization, and operational insights for cluster storage, compute, and system tuning.
Data Profile
Core idea: build a clear lineage graph using graph computation and label‑propagation algorithms, creating four label types—basic, warehouse, business, and potential.
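As a rough illustration of pushing labels along a lineage graph, the sketch below lets each table inherit the labels of its upstream tables. The table names, the labels, and the "inherit the union of upstream labels" rule are assumptions made for this example; it is a simple downstream inheritance, not the community-detection form of label propagation and not necessarily the algorithm the article refers to.

```python
from collections import deque

def propagate_labels(edges, seed_labels):
    """Propagate labels downstream along lineage edges (upstream -> downstream)."""
    downstream = {}
    for src, dst in edges:
        downstream.setdefault(src, []).append(dst)

    labels = {table: set(tags) for table, tags in seed_labels.items()}
    queue = deque(seed_labels)
    while queue:
        table = queue.popleft()
        for child in downstream.get(table, []):
            before = len(labels.setdefault(child, set()))
            labels[child] |= labels[table]
            # Re-visit the child only if it gained a new label.
            if len(labels[child]) > before:
                queue.append(child)
    return labels

# Hypothetical lineage: two source tables feed one application table.
edges = [
    ("ods_orders", "dwd_orders"),
    ("dwd_orders", "ads_sales"),
    ("ods_users", "ads_sales"),
]
seeds = {"ods_orders": {"trade"}, "ods_users": {"member"}}
result = propagate_labels(edges, seeds)
```

Because labels only ever grow, the loop terminates even if the lineage graph contains a cycle.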
Metadata Portal
Front‑end data map for data discovery and consumption.
Back‑end data management for cost, security, and quality control.
Application Link Analysis
Generates table‑level, field‑level, and application‑level lineage by parsing MapReduce (MR) task logs or task dependencies. Common uses include impact analysis, importance analysis, offline analysis, link tracing, root‑cause investigation, and fault diagnosis.
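To make the task-dependency route concrete, here is a minimal sketch that derives table-level lineage from task declarations and then runs an impact analysis on it. The task record shape (`inputs`, `output`) and all table and task names are hypothetical; a real scheduler exposes this information in its own format.

```python
from collections import deque

def build_table_lineage(tasks):
    """Turn task declarations into upstream -> downstream table edges."""
    lineage = {}
    for task in tasks:
        for source in task["inputs"]:
            lineage.setdefault(source, set()).add(task["output"])
    return lineage

def impacted_tables(lineage, changed_table):
    """Impact analysis: every table reachable downstream of changed_table."""
    impacted, queue = set(), deque([changed_table])
    while queue:
        current = queue.popleft()
        for child in lineage.get(current, ()):
            if child not in impacted:
                impacted.add(child)
                queue.append(child)
    return impacted

# Hypothetical task definitions.
tasks = [
    {"name": "load_dwd_orders", "inputs": ["ods_orders"], "output": "dwd_orders"},
    {"name": "build_ads_sales", "inputs": ["dwd_orders", "dim_users"], "output": "ads_sales"},
]
lineage = build_table_lineage(tasks)
```

Root-cause investigation is the same traversal run in the opposite direction, over a reversed edge map.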
Data Modeling
Metadata‑driven warehouse modeling improves efficiency. It draws on table basic metadata (downstream usage, query count, association count, aggregation count, production time), association metadata (related tables, types, fields, counts), and field metadata (name, comment, query count, association count, aggregation count, filter count). The counts correspond to SQL operations: query count to SELECT, association count to JOIN, aggregation count to GROUP BY, and filter count to WHERE.
ETL Development Driven by Metadata
Chapter 2: Compute Management
System Optimization
History‑Based Optimizer (HBO)
Improves CPU utilization.
Improves memory utilization.
Increases instance concurrency.
Reduces execution time.
Cost‑Based Optimizer (CBO)
Uses collected statistics to estimate the cost of each execution plan and selects the cheapest one. Supports join reordering, auto‑MapJoin, rule whitelist/blacklist, and predicate push‑down, while taking into account UDF restrictions, non‑deterministic functions, and implicit type casts.
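The join-reordering idea can be sketched with a toy cost model. The example below assumes that the cost of a left-deep plan is the sum of its intermediate result sizes and that each join predicate has a known selectivity; the table names, row counts, and selectivities are invented, and a production optimizer uses far richer statistics.

```python
from itertools import permutations

def plan_cost(order, row_counts, selectivity):
    """Cost of a left-deep join order = sum of intermediate result sizes."""
    joined = [order[0]]
    rows = row_counts[order[0]]
    cost = 0.0
    for table in order[1:]:
        # Combine the selectivity of every predicate linking `table`
        # to a table that is already joined (1.0 means a cross join).
        factor = 1.0
        for prior in joined:
            factor *= selectivity.get(frozenset((prior, table)), 1.0)
        rows = rows * row_counts[table] * factor
        cost += rows
        joined.append(table)
    return cost

def best_join_order(row_counts, selectivity):
    """Enumerate all left-deep orders and keep the cheapest one."""
    return min(
        permutations(row_counts),
        key=lambda order: plan_cost(order, row_counts, selectivity),
    )

# Hypothetical statistics.
row_counts = {"orders": 1_000_000, "users": 10_000, "regions": 10}
selectivity = {
    frozenset(("orders", "users")): 1 / 10_000,
    frozenset(("users", "regions")): 1 / 10,
}
best = best_join_order(row_counts, selectivity)
```

With these numbers the cheapest plans join the two small tables first and bring in `orders` last, which is the intuition behind join reordering. Exhaustive enumeration is only practical for a handful of tables.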
Task Optimization
Map Skew
Uneven file size distribution causes some Map instances to process far more data, leading to long tails. Solutions: merge small files and adjust node parameters, or redistribute data using DISTRIBUTE BY rand().
Join Skew
When a join is skewed, use MapJoin if one side is small, randomize null values, or split hot keys into hot and non‑hot partitions before joining.
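The effect of randomizing null join keys can be simulated in a few lines. This sketch assumes hash partitioning across four reducers and uses `zlib.crc32` as a stand-in for the engine's shuffle hash; the placeholder format and the data volumes are made up for illustration.

```python
import random
import zlib
from collections import Counter

def partition_of(key, num_reducers):
    """Stable hash partitioning, standing in for the shuffle step."""
    return zlib.crc32(str(key).encode()) % num_reducers

def salt_null_keys(keys, rng):
    """Replace each null key with a random placeholder that matches no real key."""
    return [k if k is not None else f"__null_{rng.random()}" for k in keys]

def reducer_loads(keys, num_reducers):
    """Count how many rows each reducer would receive."""
    return Counter(partition_of(k, num_reducers) for k in keys)

# 900 rows with a null join key plus 100 rows with real keys.
keys = [None] * 900 + list(range(100))
before = reducer_loads(keys, 4)
after = reducer_loads(salt_null_keys(keys, random.Random(0)), 4)
```

Before salting, every null row lands on the same reducer; afterwards the null rows spread across all reducers. Since null keys never match in an inner join, the placeholders do not change the join result.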
Reduce Skew
Key distribution imbalance causes Reduce‑side long tails. Mitigations include handling multiple COUNT(DISTINCT) operations, separating hot keys, reducing small partitions, and performing early GROUP BY to limit data explosion.
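One common way to tame a skewed COUNT(DISTINCT) is to deduplicate first and count second. The sketch below mirrors that two-stage rewrite in plain Python; it is an illustration of the general technique under assumed input shapes, not the article's exact procedure.

```python
from collections import defaultdict

def count_distinct_two_stage(rows):
    """rows: iterable of (group_key, value). Returns distinct-value count per key."""
    # Stage 1: deduplicate on (group_key, value). In a distributed job this
    # shuffle spreads a hot group_key over many reducers, because the value
    # is part of the shuffle key.
    deduped = set(rows)
    # Stage 2: a plain count per group_key over the deduplicated pairs.
    counts = defaultdict(int)
    for group_key, _ in deduped:
        counts[group_key] += 1
    return dict(counts)

# Hypothetical (seller, buyer) pairs with a repeated row.
rows = [("a", 1), ("a", 1), ("a", 2), ("b", 1)]
distinct_counts = count_distinct_two_stage(rows)
```

In SQL terms this corresponds to an inner GROUP BY on both columns followed by an outer COUNT per key.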
Chapter 3: Storage and Cost Management
Data Compression
Archive compression brings the storage ratio down from 1:3 to about 1:1.5, but it increases read latency, so it suits cold backups and log data.
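A quick worked example shows what the two ratios mean in practice, assuming they are read as logical data to physical storage (so 1:3 means three units on disk per unit of data). That reading, and the 100 TB figure, are assumptions for illustration.

```python
def physical_storage_tb(logical_tb, physical_per_logical):
    """Physical footprint for a given logical volume and storage ratio."""
    return logical_tb * physical_per_logical

# 100 TB of logical data under each ratio.
default_tb = physical_storage_tb(100, 3.0)
archived_tb = physical_storage_tb(100, 1.5)
saved_tb = default_tb - archived_tb
```

Under this reading, archiving halves the physical footprint of the tables it is applied to.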
Data Redistribution
Adjust column‑store distribution via DISTRIBUTE BY and SORT BY to avoid hotspots and save storage space, typically applied to tables with >15% redistribution benefit.
Storage Governance
Unmanaged tables
Empty tables
Tables not accessed in the last 62 days
Tables without updates or tasks
Large development‑library tables without access
Long‑lifecycle tables
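Several of these governance categories can be derived directly from table metadata. The sketch below covers three of them; the `TableMeta` fields, the flag names, and the reading of "unmanaged" as "has no owner" are assumptions, while the 62-day window comes from the list above.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional

@dataclass
class TableMeta:
    name: str
    row_count: int
    last_access: Optional[date]  # None means no recorded access
    has_owner: bool              # assumption: "unmanaged" = no owner

def governance_flags(table, today, idle_days=62):
    """Return the governance categories a table falls into."""
    flags = []
    if not table.has_owner:
        flags.append("unmanaged")
    if table.row_count == 0:
        flags.append("empty")
    if table.last_access is None or today - table.last_access > timedelta(days=idle_days):
        flags.append("not_accessed_recently")
    return flags

# Hypothetical table that is empty and was last read three months ago.
today = date(2024, 3, 1)
stale = TableMeta("tmp_orders_copy", 0, date(2023, 12, 1), True)
stale_flags = governance_flags(stale, today)
```

Flagged tables are candidates for review rather than automatic deletion; the remaining categories need task and lifecycle metadata that this sketch does not model.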
Lifecycle Management
The purpose is to meet maximum business demand with minimal storage cost, maximizing data value.
Lifecycle Policies
Periodic deletion
Permanent deletion
Permanent retention
Extreme storage
Cold‑data management
Incremental‑table merge with full‑table strategy
Asset Level Matrix
Defines P0, P1, P2, P3 levels with corresponding importance.
Data Asset Levels
Five levels: Destructive (A1), Global (A2), Local (A3), General (A4), Unknown (A5). Higher‑level assets take precedence when data appears in multiple scenarios.
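The precedence rule can be expressed as a small lookup. This sketch assumes A1 is the most critical level and A5 the least, following the order in which the levels are listed.

```python
# Most critical first, following the order given in the text.
ASSET_LEVELS = ["A1", "A2", "A3", "A4", "A5"]

def effective_level(levels_from_scenarios):
    """Pick the most critical level among all scenarios that use the data."""
    return min(levels_from_scenarios, key=ASSET_LEVELS.index)

# A table used in a local report (A3), a global product (A2), and an ad hoc job (A4).
level = effective_level(["A3", "A2", "A4"])
```

An unrecognized label raises `ValueError`, which is usually preferable to silently treating it as low priority.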
Chapter 4: Data Quality
Quality Assurance Principles
Four dimensions: completeness, accuracy, consistency, timeliness.
Completeness
Ensures records and fields are fully present.
Accuracy
Ensures data matches real‑world facts; uses rule‑based validation.
Consistency
Maintains uniform definitions across multiple data warehouses.
Timeliness
Data must be produced promptly to support decision making.
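Two of these dimensions lend themselves to simple programmatic checks. The sketch below measures completeness as the share of fully populated records and expresses accuracy as per-field validation rules; the record layout, field names, and the rule itself are hypothetical.

```python
def completeness_ratio(records, required_fields):
    """Share of records in which every required field is present and non-null."""
    if not records:
        return 1.0
    complete = sum(
        all(record.get(field) is not None for field in required_fields)
        for record in records
    )
    return complete / len(records)

def accuracy_violations(records, rules):
    """rules maps field -> predicate. Return (row index, field) pairs that break a rule."""
    return [
        (index, field)
        for index, record in enumerate(records)
        for field, is_valid in rules.items()
        if record.get(field) is not None and not is_valid(record[field])
    ]

# Hypothetical order records: one missing amount, one negative amount.
records = [
    {"id": 1, "amount": 10.0},
    {"id": 2, "amount": None},
    {"id": 3, "amount": -5.0},
]
ratio = completeness_ratio(records, ["id", "amount"])
violations = accuracy_violations(records, {"amount": lambda value: value >= 0})
```

Missing values are counted against completeness only, so the same record is not reported twice under different dimensions.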
Data Quality Methods Overview
Four steps: consumption‑scenario awareness, production‑process checkpoints, risk‑point monitoring, and quality measurement.
Consumption Scenario Awareness
Data engineers must identify which of the PB‑scale data needs protection; the solution is to assign asset‑level labels (A1‑A5).
Production Process Checkpoints
Online checkpoints include release‑platform notifications and database change alerts; offline checkpoints involve code scanning with SQLSCAN, pre‑release testing (code review, regression testing), dry‑run and real‑run testing, and change notifications before node or data refresh.
Risk Point Monitoring
Online monitoring via BCP (real‑time business check platform) with rule‑based alerts; offline monitoring via DQC (data quality center) for accuracy checks and Mosad for error, delay, and custom alerts, supporting strong (blocking) and weak (non‑blocking) rules.
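The difference between strong and weak rules can be sketched as follows. This is a generic illustration and not the DQC API: the rule tuple format, the exception name, and the example rules are all assumptions.

```python
class BlockingRuleFailed(Exception):
    """Raised when a strong rule fails, so downstream tasks do not run."""

def run_checks(rows, rules):
    """rules: list of (name, predicate, strength), strength is 'strong' or 'weak'."""
    warnings = []
    for name, predicate, strength in rules:
        if predicate(rows):
            continue
        if strength == "strong":
            # Strong rule: stop the pipeline at this node.
            raise BlockingRuleFailed(name)
        # Weak rule: record an alert and keep going.
        warnings.append(name)
    return warnings

# Hypothetical rules for an output table.
rules = [
    ("row_count_positive", lambda rows: len(rows) > 0, "strong"),
    ("no_null_ids", lambda rows: all(r.get("id") is not None for r in rows), "weak"),
]
alerts = run_checks([{"id": None}], rules)
```

Here the null id only produces an alert, whereas an empty table would raise `BlockingRuleFailed` and keep bad data from flowing downstream.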
Quality Measurement
Metrics include the night‑shift rate (how often on‑call staff must get up at night to handle data incidents), quality events, the fault handling process, and fault levels (P1‑P4), with post‑mortem reviews to prevent recurrence.
Data Thinking Notes
Sharing insights on data architecture, governance, and middle platforms, exploring AI in data, and linking data with business scenarios.
