How Alibaba Cloud’s Data Lake Metadata Warehouse Transforms Big Data Management
This article explains the challenges of data lake adoption and details Alibaba Cloud’s metadata warehouse architecture, construction, search capabilities, asset analysis, fine‑grained profiling, and lifecycle management that together enable efficient, cloud‑native big data management.
Data Lake Metadata Warehouse Introduction
Data lake adoption raises several challenges: data is hard to identify and locate, data‑asset management is weak, and there is no systematic solution for optimizing storage formats.
Alibaba Cloud proposes a cloud‑native, fully managed solution built on a metadata warehouse and massive compute pool to address these challenges.
Metadata Warehouse Architecture
The warehouse aggregates lake metadata and analysis data, then runs ETL, analysis, and computation over them to produce analysis, metric, and index stores that support upper‑layer applications.
The right side of the architecture diagram shows a cloud‑native compute pool that runs Spark tasks such as Analyze, Indexing, Compaction, and Tiering. These tasks write their results back to the warehouse, which enriches the metric and index stores and enables real‑time data quality analysis.
The control layer offers online services such as catalog, search, and metric services; the optimization engine analyzes metrics and generates tasks for the compute pool. Together these provide four capabilities:
Metadata capability: fast search and discovery of unknown data via a search index.
Storage optimization: statistical analysis of storage, partition‑level details, hot‑cold tiering, automatic partition tiering.
Query optimization: automatic small‑file merging without user intervention.
Lake format management: metadata acceleration and automatic optimization.
Metadata Warehouse Construction
The raw data in the warehouse consists of three categories:
Storage data: OSS file‑level information (size, path, type, update time) from access logs.
Metadata: catalog, database, table, partition, function definitions, indexes, attributes, stats, sourced from engine metadata services.
Engine behavior data: lineage, task execution details, file counts, dependencies, used for data maps and lifecycle management.
These data reach the warehouse through log services, Spark batch jobs, offline sync, Spark Structured Streaming, and real‑time consumption. The warehouse uses Hologres for real‑time write, update, and analysis.
For large‑scale analysis that can tolerate latency, offline MaxCompute processing produces detailed data for the control layer; for analysis with strict real‑time requirements, Hologres provides instant analytics for DLF control.
Alibaba Cloud DLF Data Lake Management and Optimization
Metadata Search
Metadata search solves the data‑finding pain point with two modes: full‑text search via indexed columns delivering millisecond responses, and multi‑column precise queries (by database, table, column, location, creation time, etc.).
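The two search modes can be sketched as Elasticsearch query bodies. This is only an illustration: the field names (database, table, column, location, create_time, description) are hypothetical, since the article does not publish the index mapping, and the exact‑match filters assume those fields are mapped as keyword fields.

```python
from typing import Any, Dict, List, Optional

# Hypothetical field names; the real DLF index mapping is not given in the article.
SEARCH_FIELDS = ["database", "table", "column", "location", "description"]


def full_text_query(text: str, size: int = 20) -> Dict[str, Any]:
    """Mode 1: match free text against every indexed field."""
    return {
        "size": size,
        "query": {"multi_match": {"query": text, "fields": SEARCH_FIELDS}},
    }


def precise_query(
    database: Optional[str] = None,
    table: Optional[str] = None,
    column: Optional[str] = None,
    created_after: Optional[str] = None,
) -> Dict[str, Any]:
    """Mode 2: combine exact conditions on several columns with AND semantics."""
    filters: List[Dict[str, Any]] = []
    if database:
        filters.append({"term": {"database": database}})
    if table:
        filters.append({"term": {"table": table}})
    if column:
        filters.append({"term": {"column": column}})
    if created_after:
        filters.append({"range": {"create_time": {"gte": created_after}}})
    return {"query": {"bool": {"filter": filters}}}
```

Either dictionary could be sent as the body of a search request. The filter context is used for the precise mode because exact conditions do not need relevance scoring.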
Elasticsearch is used as the index store, synchronized in near real‑time via Spark Streaming from DML logs.
Because log records can arrive out of order or be lost, each record's most recent update timestamp is used to apply changes in the correct order, and a daily offline sync backfills any missing records.
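One common way to get this behavior is a last‑write‑wins upsert keyed on the update timestamp, with the daily sync replayed through the same path. The sketch below uses an in‑memory dictionary in place of Elasticsearch; the class and field names are hypothetical and not taken from DLF.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class MetaEvent:
    key: str            # e.g. "db.table"; hypothetical document id
    updated_at: int     # update timestamp carried by the log record
    deleted: bool = False


class SearchIndex:
    """In-memory stand-in for the index store."""

    def __init__(self) -> None:
        self.docs: Dict[str, MetaEvent] = {}

    def apply(self, event: MetaEvent) -> bool:
        """Apply an event only if it is newer than what is already indexed."""
        current = self.docs.get(event.key)
        if current is not None and current.updated_at >= event.updated_at:
            return False  # stale or duplicate record, ignore it
        # Deletes are kept as tombstones so that a late, older event
        # cannot resurrect a removed table.
        self.docs[event.key] = event
        return True

    def reconcile(self, snapshot: Iterable[MetaEvent]) -> int:
        """Daily compensation: replay a snapshot, return how many records were repaired."""
        return sum(self.apply(event) for event in snapshot)

    def visible(self) -> List[str]:
        """Keys that a search would currently return."""
        return [key for key, event in self.docs.items() if not event.deleted]
```

Because apply is idempotent and order‑insensitive, the streaming path and the offline sync can safely overlap. A real reconciliation would also need to remove documents that no longer exist in the source, which this sketch leaves out.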
Data Asset Analysis
Analysis dimensions include resource statistics, trend changes, storage ranking, and storage tiering, providing a comprehensive view of lake assets.
Resource statistics show total storage, number of databases/tables, and API access volume. Trend changes display 7‑day, 30‑day, and yearly variations. Storage ranking highlights cost‑heavy tables, and storage tiering reveals distribution of storage types, formats, and file sizes.
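As a rough illustration of two of these dimensions, the following sketch computes a storage ranking and a trend change from already collected numbers. The input shapes (per‑partition sizes, one total per day) are assumptions made for the example, not the warehouse's actual schema.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def storage_ranking(
    partition_sizes: List[Tuple[str, int]], top_n: int = 3
) -> List[Tuple[str, int]]:
    """Sum partition sizes per table and return the largest tables first."""
    totals: Dict[str, int] = defaultdict(int)
    for table, size_bytes in partition_sizes:
        totals[table] += size_bytes
    # Sort by size descending, then by name so that ties are deterministic.
    return sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]


def trend_change(daily_totals: List[int], window_days: int) -> int:
    """Change in total storage over the last `window_days` days.

    `daily_totals` holds one value per day, oldest first.
    """
    if len(daily_totals) <= window_days:
        raise ValueError("not enough history for the requested window")
    return daily_totals[-1] - daily_totals[-1 - window_days]
```

For example, calling trend_change with window_days=7 or window_days=30 on the same series gives the 7‑day and 30‑day variations.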
DataProfile – Fine‑grained Table Analysis
DataProfile extends engine stats with additional metrics such as small‑file ratio, hot‑cold degree, and tiering information. Because engine‑generated stats may be incomplete, the system proactively triggers stats analysis tasks to compute missing metrics.
When a DML event occurs, the warehouse records it, the stats cluster consumes the event, launches analysis tasks (including small‑file and tiering metrics), and writes results back to the metric store.
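The event‑driven flow can be sketched as a handler that recomputes a table's profile whenever a DML event arrives. Everything here is illustrative: the 8 MiB small‑file threshold, the event shape, and the list_files callback are assumptions, and a plain dictionary stands in for the metric store.

```python
from typing import Callable, Dict, List

SMALL_FILE_BYTES = 8 * 1024 * 1024  # assumed threshold, not a DLF default


def small_file_ratio(file_sizes: List[int], threshold: int = SMALL_FILE_BYTES) -> float:
    """Fraction of files whose size is below the threshold."""
    if not file_sizes:
        return 0.0
    small = sum(1 for size in file_sizes if size < threshold)
    return small / len(file_sizes)


def handle_dml_event(
    event: Dict[str, str],
    list_files: Callable[[str], List[int]],
    metric_store: Dict[str, Dict[str, float]],
) -> Dict[str, float]:
    """Recompute the profile of the table named in the event and store it."""
    table = event["table"]
    sizes = list_files(table)  # hypothetical lookup of current file sizes
    profile = {
        "file_count": len(sizes),
        "total_bytes": sum(sizes),
        "small_file_ratio": small_file_ratio(sizes),
    }
    metric_store[table] = profile  # write the result back to the metric store
    return profile
```

In the real system the analysis runs as tasks on the stats cluster; the handler above only shows which inputs and outputs such a task would have.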
Lifecycle Management
The lifecycle module uses the OSS storage classes (Standard, Infrequent Access, Archive, and Cold Archive) to freeze rarely used data and thaw it on demand, reducing storage costs.
By combining OSS tiering with engine metadata, the system provides partition‑level lifecycle policies that automatically archive or restore data based on recent modification time, creation time, partition value, and access frequency.
A rule engine evaluates these metrics, triggers archiving tasks via a distributed scheduler, and executes file‑level archiving using JindoSDK.
This approach simplifies user interaction with OSS files while enhancing data tier management.
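A minimal sketch of the rule evaluation step, assuming rules are expressed as idle‑time and access‑count thresholds and are checked coldest tier first. The Rule and PartitionStats types are hypothetical, and the sketch stops at producing a task list; it does not call the scheduler or JindoSDK.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional, Tuple


@dataclass
class PartitionStats:
    table: str
    partition: str              # e.g. "dt=2024-01-01"
    last_modified: datetime
    accesses_last_30d: int
    storage_class: str = "Standard"


@dataclass
class Rule:
    min_idle_days: int          # partition must be unmodified for at least this long
    max_accesses: int           # and accessed at most this many times recently
    target_class: str           # storage class to move the partition to


def evaluate(p: PartitionStats, rules: List[Rule], now: datetime) -> Optional[str]:
    """Return the target storage class for a partition, or None if nothing changes."""
    idle = now - p.last_modified
    for rule in rules:  # rules are assumed to be ordered coldest tier first
        if idle >= timedelta(days=rule.min_idle_days) and p.accesses_last_30d <= rule.max_accesses:
            return rule.target_class if rule.target_class != p.storage_class else None
    return None


def plan_archiving(
    partitions: List[PartitionStats], rules: List[Rule], now: datetime
) -> List[Tuple[str, str, str]]:
    """Produce (table, partition, target_class) tasks for a scheduler to execute."""
    tasks = []
    for p in partitions:
        target = evaluate(p, rules, now)
        if target is not None:
            tasks.append((p.table, p.partition, target))
    return tasks
```

Keeping evaluation separate from execution means the same rules can be previewed as a dry run before any file is actually moved between tiers.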
Alibaba Cloud Big Data AI Platform
The Alibaba Cloud Big Data AI Platform builds on Alibaba’s leading cloud infrastructure, big‑data and AI engineering capabilities, scenario algorithms, and extensive industry experience to offer enterprises and developers a one‑stop, cloud‑native big‑data and AI capability suite. It boosts AI development efficiency, enables large‑scale AI deployment across industries, and drives business value.