
Huolala’s Metadata Platform: Scaling Data Lineage, AI Search & Cost Governance

Huolala’s data team details the evolution of its metadata management platform—covering architecture, stages from early Hive‑ETL to real‑time field‑level lineage, AI‑driven smart search, cost‑governance mechanisms, and security classifications—showcasing practical solutions for data discoverability, efficiency, and protection at scale.

Huolala Tech

Dec 5, 2024


Abstract

This article is based on a presentation by Zhang Fang and Wu Gang at DataFunSummit2024. It describes the evolution and practice of Huolala’s big‑data metadata management system, covering data lineage, AI‑powered search, cost governance, data security, and future plans.

1. Huolala Big Data System Overview

The overall architecture is built bottom‑up. The foundation and ingestion layers provide storage, compute, and data‑ingestion capabilities. Above them, the platform and data‑warehouse layer integrates data‑development, governance, and asset‑management tools. The service layer supports diverse application scenarios, and the top application layer delivers decision‑support and business‑empowerment services.

The metadata management platform serves as the metadata hub, responsible for metadata management, data‑asset management, and cost governance.

2. Metadata Management Platform

As the big‑data system scales, four challenges arise: locating data, understanding upstream and downstream relationships, driving governance, and managing assets. The platform is built to address these four core issues.

2.1 Platform Evolution

Early stage (pre‑2021): simple business, small asset scale, Hive ETL, basic metadata queries.

Development stage (2021‑2022): rapid growth, need for “find data” and “find relationships”, built a metadata system supporting asset name/description search and data lineage.

Mature stage (post‑2022): open‑source and self‑built components, support for many asset types, real‑time field‑level lineage, cost governance, and AI‑intelligent search using large models.

2.2 System Architecture

The architecture consists of four core layers: collection, service, storage, and platform data‑warehouse, with diverse accessors on one side and rich applications on the other.

Accessors: infrastructure, platform tools, and business systems that provide source data.

Collection layer: adapts and collects various metadata from producers.

Service layer: offers query and analysis services.

Platform data‑warehouse layer: processes cost‑governance data and analyzes resource consumption.

Application layer: supports scenarios such as data discovery, relationship tracing, and asset management.

3. Practice

3.1 Data Lineage

The data flow follows four stages: collection → storage → compute → application. Lineage records each stage, enabling quick root‑cause analysis when issues arise.

Architecture Evolution

Version 1.0: basic lineage via Hive hook, capturing input/output tables.

Non‑SQL links: standardized format for MySQL‑to‑Hive, Hive‑to‑downstream, etc.
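One way to picture a standardized format for non‑SQL links is a single edge record that every producer emits, whatever engine moves the data. The sketch below is illustrative only: the class name `LineageEdge`, its fields (`source`, `target`, `link_type`, `task_id`), and the `system.database.table` naming are hypothetical and are not taken from Huolala's actual schema.

```python
from dataclasses import dataclass, asdict
import json


@dataclass(frozen=True)
class LineageEdge:
    """One hop of lineage. All field names are hypothetical."""
    source: str     # fully qualified upstream asset
    target: str     # fully qualified downstream asset
    link_type: str  # kind of link, e.g. "mysql_to_hive"
    task_id: str    # the job that moves the data

    def to_json(self) -> str:
        # sort_keys keeps the serialized form deterministic
        return json.dumps(asdict(self), sort_keys=True)


edge = LineageEdge(
    source="mysql.orders_db.orders",
    target="hive.ods.orders",
    link_type="mysql_to_hive",
    task_id="sync_orders_daily",
)
print(edge.to_json())
```

Because every link type shares one shape, the collection layer can validate and store SQL‑derived and non‑SQL lineage through the same path.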

Version 2.0: real‑time incremental updates, task‑change reporting, and field‑level lineage.

Storage upgrade to Neo4j graph database for efficient multi‑layer node queries.
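To show what a multi‑layer node query means in practice, here is a minimal in‑memory sketch of a bounded downstream traversal. It is a stand‑in for the graph database, not Neo4j's API: the `downstream` function, the adjacency‑dict representation, and the table names are all assumptions for illustration. In a graph database the same lookup is typically a single variable‑length path query.

```python
from collections import deque


def downstream(graph, start, max_depth):
    """Return {node: depth} for nodes reachable from start within max_depth hops."""
    seen = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        depth = seen[node]
        if depth == max_depth:
            continue  # do not expand beyond the requested number of layers
        for child in graph.get(node, ()):
            if child not in seen:
                seen[child] = depth + 1
                queue.append(child)
    del seen[start]  # report only the affected downstream nodes
    return seen


# Hypothetical lineage: each key lists its direct downstream tables.
graph = {
    "ods.orders": ["dwd.order_detail"],
    "dwd.order_detail": ["dws.order_daily", "dws.driver_daily"],
    "dws.order_daily": ["ads.revenue_report"],
}
impact = downstream(graph, "ods.orders", max_depth=2)
print(impact)
```

With `max_depth=2` the report table three hops away is left out, which is the kind of bounded impact analysis that becomes expensive with repeated joins in a relational store.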

Lineage parsing raised three challenges: associating lineage with the task that produced it, keeping lineage current when tasks change, and handling temporary tables. The team addressed them, respectively, by extending HiveConf to carry task information, implementing task‑change reporting, and using a cache‑based merging strategy that collapses temporary‑table nodes into direct lineage.
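The effect of collapsing temporary‑table nodes can be sketched as a rewrite over lineage edges. This is not the cache‑based mechanism the talk describes; it is a simple batch version of the same outcome. The function name `collapse_temp_tables` and the `tmp.` prefix used to recognize temporary tables are assumptions.

```python
def collapse_temp_tables(edges, is_temp):
    """Rewrite (source, target) edges so temporary tables disappear."""
    upstream = {}
    for src, dst in edges:
        upstream.setdefault(dst, set()).add(src)

    def resolve(table, visiting):
        # A real table resolves to itself; a temp table resolves to
        # the real tables found by walking its own upstream edges.
        if not is_temp(table):
            return {table}
        if table in visiting:  # guard against cycles among temp tables
            return set()
        result = set()
        for parent in upstream.get(table, ()):
            result |= resolve(parent, visiting | {table})
        return result

    collapsed = set()
    for src, dst in edges:
        if is_temp(dst):
            continue  # edges into temp tables are folded into later hops
        for real_src in resolve(src, frozenset()):
            collapsed.add((real_src, dst))
    return collapsed


edges = [
    ("ods.orders", "tmp.stage_1"),
    ("ods.drivers", "tmp.stage_1"),
    ("tmp.stage_1", "tmp.stage_2"),
    ("tmp.stage_2", "dwd.order_detail"),
]
direct = collapse_temp_tables(edges, lambda t: t.startswith("tmp."))
print(sorted(direct))
```

The two chained temporary tables vanish, leaving direct edges from both source tables to `dwd.order_detail`.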

3.2 AI‑Intelligent Search

Search evolved from simple keyword matching to Retrieval‑Augmented Generation (RAG) with large models. The system extracts context from metadata, stores the resulting vectors in a vector database, performs semantic retrieval, re‑ranks the results with a lightweight model, and generates answers from prompts. It supports multi‑keyword queries, data‑definition lookups, and business consultation.
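The retrieve, re‑rank, and prompt steps can be sketched end to end with toy components. Everything here is a placeholder: `embed` is a bag‑of‑words counter rather than a real embedding model, the in‑memory list stands in for the vector database, `rerank` is a trivial name‑overlap heuristic rather than a trained model, and no large model is actually called. The asset names are hypothetical.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy embedding: bag-of-words counts (stand-in for a real model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(query, docs, k):
    """Semantic retrieval: top-k documents by similarity to the query."""
    q = embed(query)
    ranked = sorted(docs, key=lambda d: cosine(q, embed(d["text"])), reverse=True)
    return ranked[:k]


def rerank(query, candidates):
    """Toy re-ranker: prefer assets whose name shares terms with the query."""
    terms = set(embed(query))
    return sorted(
        candidates, key=lambda d: len(terms & set(embed(d["name"]))), reverse=True
    )


def build_prompt(query, contexts):
    context = "\n".join(f"- {d['name']}: {d['text']}" for d in contexts)
    return f"Answer using only the metadata below.\n{context}\nQuestion: {query}"


docs = [
    {"name": "dws.order_daily", "text": "daily order count and order amount per city"},
    {"name": "dim.driver", "text": "driver profile with vehicle type and city"},
    {"name": "ads.revenue_report", "text": "monthly revenue report for finance"},
]
query = "which table has daily order amount"
top = rerank(query, retrieve(query, docs, k=2))
prompt = build_prompt(query, top)
print(prompt)
```

Splitting retrieval from re‑ranking lets the first stage stay cheap and recall‑oriented while the second stage, applied to only a few candidates, can afford a more precise scoring method.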

3.3 Cost Governance

Metadata drives asset‑level cost visualization, reduction of storage and compute waste, and automated lifecycle management (archiving after 90 days, deletion after 180 days). A health‑score model quantifies governance impact, with department‑level dashboards and incentive mechanisms encouraging cost‑saving actions.
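The lifecycle rule can be expressed as a small decision function. The source gives only the two thresholds, so the rest is assumed: that the clock counts days since an asset was last accessed, that the boundaries are inclusive, and the names `lifecycle_action`, `ARCHIVE_AFTER_DAYS`, and `DELETE_AFTER_DAYS`.

```python
from datetime import date

ARCHIVE_AFTER_DAYS = 90   # threshold from the article
DELETE_AFTER_DAYS = 180   # threshold from the article


def lifecycle_action(last_access, today):
    """Return 'keep', 'archive', or 'delete' for an asset.

    Assumption: thresholds count days since the asset was last accessed.
    """
    idle_days = (today - last_access).days
    if idle_days >= DELETE_AFTER_DAYS:
        return "delete"
    if idle_days >= ARCHIVE_AFTER_DAYS:
        return "archive"
    return "keep"


today = date(2024, 12, 5)
for last_access in (date(2024, 11, 1), date(2024, 8, 1), date(2024, 1, 1)):
    print(last_access, lifecycle_action(last_access, today))
```

Checking the deletion threshold first keeps the rule unambiguous for assets that have passed both limits.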

3.4 Data Security

Metadata is classified into four levels (C1‑C4) following financial data‑security guidelines. Classification guides permission controls, encryption, and audit. Scenarios include table‑level security, data‑download approvals, and automated labeling of sensitive fields. Policies restrict access based on employee tenure, role, and outsourcing status, with escalation for C3/C4 data.
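A classification‑driven access check might look like the sketch below. The article names the inputs (tenure, role, outsourcing status) and the escalation for C3/C4, but not the actual rules, so every threshold and role name here is hypothetical, as is the assumption that C4 is the most sensitive level.

```python
from dataclasses import dataclass
from enum import IntEnum


class Level(IntEnum):
    # Assumption: a higher number means more sensitive data.
    C1 = 1
    C2 = 2
    C3 = 3
    C4 = 4


@dataclass
class Employee:
    tenure_days: int
    role: str
    outsourced: bool


DATA_ROLES = {"analyst", "engineer"}  # hypothetical role names


def access_decision(emp, level):
    """Return 'allow', 'escalate', or 'deny'. All rules are hypothetical."""
    if emp.outsourced and level >= Level.C3:
        return "deny"
    if level >= Level.C3:
        return "escalate"  # C3/C4 always require extra approval
    if level >= Level.C2 and (emp.tenure_days < 90 or emp.role not in DATA_ROLES):
        return "escalate"
    return "allow"


veteran = Employee(tenure_days=400, role="analyst", outsourced=False)
print(access_decision(veteran, Level.C1))
print(access_decision(veteran, Level.C3))
```

Keeping the decision in one pure function makes the policy easy to audit and to test against the classification labels stored in metadata.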

4. Future Planning

More efficient AI‑driven retrieval services.

Standardized SDK for data lineage to improve quality and reduce duplication.

Further automation of cost‑governance to continuously reduce expenses.

