Huolala’s Metadata Platform: Scaling Data Lineage, AI Search & Cost Governance
Huolala’s data team details the evolution of its metadata management platform, covering its architecture, the stages from early Hive ETL to real-time field-level lineage, AI-driven smart search, cost-governance mechanisms, and security classification. Together these offer practical solutions for data discoverability, efficiency, and protection at scale.
Abstract
This article is based on a presentation by Zhang Fang and Wu Gang at DataFunSummit2024. It describes the evolution and practice of Huolala’s big-data metadata management system, including data lineage, AI-powered intelligent search, cost governance, data security, and future plans.
1. Huolala Big Data System Overview
The overall architecture is built bottom‑up. The foundation and ingestion layers provide storage, compute, and data‑ingestion capabilities. Above them, the platform and data‑warehouse layer integrates data‑development, governance, and asset‑management tools. The service layer supports diverse application scenarios, and the top application layer delivers decision‑support and business‑empowerment services.
The metadata management platform serves as the metadata hub, responsible for metadata management, data‑asset management, and cost governance.
2. Metadata Management Platform
As the big-data system scales, four challenges arise: locating data, understanding upstream and downstream relationships, driving governance, and managing assets. The platform is built to address these four core issues.
2.1 Platform Evolution
Early stage (pre-2021): simple business, small asset scale, Hive ETL, and basic metadata queries.
Development stage (2021-2022): rapid growth created the need to “find data” and “find relationships”, so the team built a metadata system supporting search by asset name and description, plus data lineage.
Mature stage (post-2022): a mix of open-source and self-built components, support for many asset types, real-time field-level lineage, cost governance, and AI-intelligent search using large models.
2.2 System Architecture
The architecture consists of four core layers: collection, service, storage, and platform data‑warehouse, with diverse accessors on one side and rich applications on the other.
Accessors: infrastructure, platform tools, and business systems that provide source data.
Collection layer: adapts and collects various metadata from producers.
Service layer: offers query and analysis services.
Platform data‑warehouse layer: processes cost‑governance data and analyzes resource consumption.
Application layer: supports scenarios such as data discovery, relationship tracing, and asset management.
3. Practice
3.1 Data Lineage
The data flow follows four stages: collection → storage → compute → application. Lineage records each stage, enabling quick root‑cause analysis when issues arise.
Architecture Evolution
Version 1.0: basic lineage via Hive hook, capturing input/output tables.
Non‑SQL links: standardized format for MySQL‑to‑Hive, Hive‑to‑downstream, etc.
Version 2.0: real‑time incremental updates, task‑change reporting, and field‑level lineage.
Storage upgrade to Neo4j graph database for efficient multi‑layer node queries.
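To illustrate what a multi-layer node query looks like, here is a minimal Python sketch of a depth-bounded upstream traversal over table-level lineage. It uses an in-memory adjacency map rather than Neo4j, and the table names and the `upstream_within` helper are hypothetical, not taken from Huolala's system.

```python
from collections import deque

# Hypothetical lineage edges: each table maps to the tables it reads from.
UPSTREAM = {
    "ads.order_report": ["dws.order_daily"],
    "dws.order_daily": ["dwd.order_detail", "dim.city"],
    "dwd.order_detail": ["ods.order"],
    "dim.city": [],
    "ods.order": [],
}

def upstream_within(table, max_depth, graph=UPSTREAM):
    """Return {ancestor: depth} for every upstream table within max_depth hops."""
    depths = {}
    queue = deque([(table, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not expand beyond the requested number of layers
        for parent in graph.get(node, []):
            if parent not in depths and parent != table:
                depths[parent] = depth + 1
                queue.append((parent, depth + 1))
    return depths
```

A graph database performs this kind of hop-by-hop expansion natively, which is why it suits lineage queries that span several layers.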
Parsing raised three challenges: associating lineage with the tasks that produce it, keeping lineage current when tasks are updated, and handling temporary tables. These were solved, respectively, by extending HiveConf to carry task information, implementing task-change reporting, and using a cache-based merging strategy that collapses temporary-table nodes into direct lineage.
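The temporary-table merge can be sketched as follows. This is a minimal Python illustration under stated assumptions: the source does not describe the cache design, so the sketch simply buffers the inputs of each temporary table in a dictionary, and the `collapse_temp_tables` function, the `tmp.` naming rule, and the table names are all hypothetical.

```python
def collapse_temp_tables(edges, is_temp):
    """Rewrite (source, target) edges so temporary tables no longer appear.

    A chain such as A -> tmp -> B becomes A -> B. Chains of several
    temporary tables are followed until a non-temporary source is found.
    """
    # Buffer the inputs of every temporary table (stand-in for the cache).
    inputs_of = {}
    for src, dst in edges:
        if is_temp(dst):
            inputs_of.setdefault(dst, set()).add(src)

    def real_sources(table, seen):
        if not is_temp(table):
            return {table}
        if table in seen:
            return set()  # guard against cycles among temporary tables
        seen = seen | {table}
        result = set()
        for src in inputs_of.get(table, ()):
            result |= real_sources(src, seen)
        return result

    collapsed = set()
    for src, dst in edges:
        if is_temp(dst):
            continue  # edges into temporary tables are absorbed
        for real in real_sources(src, frozenset()):
            collapsed.add((real, dst))
    return collapsed
```

For example, with edges `ods.order -> tmp.a -> tmp.b -> dwd.order`, the function yields the single direct edge `ods.order -> dwd.order`.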
3.2 AI‑Intelligent Search
Search has evolved from simple keyword matching to Retrieval-Augmented Generation (RAG) with large models. The system extracts context, stores vectors in a vector database, performs semantic retrieval, re-ranks the results with a lightweight model, and generates answers via prompts. It supports multi-keyword queries, data definition, and business consultation.
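The retrieve, re-rank, and prompt steps can be sketched in a few lines of Python. This is a toy illustration only: the bag-of-words `embed` function stands in for a real embedding model, the dictionary `INDEX` stands in for a vector database, the token-overlap `rerank` stands in for the lightweight re-ranking model, and the asset names and descriptions are hypothetical. The final call to a large model is omitted.

```python
import math
import re
from collections import Counter

# Hypothetical metadata documents: asset name -> description.
ASSETS = {
    "dwd.order_detail": "order detail table with one row per freight order",
    "dim.city": "city dimension table with city name and region",
    "dws.driver_daily": "daily driver activity summary with completed order count",
}

def embed(text):
    """Toy embedding: bag-of-words term counts (a real system would use a model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Stand-in for the vector database: precomputed embeddings per asset.
INDEX = {name: embed(name + " " + desc) for name, desc in ASSETS.items()}

def retrieve(query, k=2):
    """Semantic retrieval step: top-k assets by cosine similarity."""
    q = embed(query)
    ranked = sorted(INDEX, key=lambda name: cosine(q, INDEX[name]), reverse=True)
    return ranked[:k]

def rerank(query, candidates):
    """Stand-in re-ranker: prefer candidates whose name shares tokens with the query."""
    q_tokens = set(embed(query))
    return sorted(candidates, key=lambda name: len(q_tokens & set(embed(name))), reverse=True)

def build_prompt(query, k=2):
    """Assemble the retrieved metadata into a prompt for the large model."""
    context = "\n".join(f"- {name}: {ASSETS[name]}" for name in rerank(query, retrieve(query, k)))
    return f"Answer using only the metadata below.\n{context}\nQuestion: {query}"
```

Keeping retrieval and re-ranking as separate steps lets a cheap similarity search narrow the candidates before a more careful model orders them.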
3.3 Cost Governance
Metadata drives asset-level cost visualization, reduction of wasted storage and compute, and automated lifecycle management (archiving after 90 days, deletion after 180 days). A health-score model quantifies governance impact, and department-level dashboards and incentive mechanisms encourage cost-saving actions.
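The lifecycle rule can be expressed as a small decision function. The sketch below is written in Python and rests on one loudly stated assumption: the article gives the 90-day and 180-day thresholds but does not say what they are measured from, so the code measures from the last access date. The `lifecycle_action` name is hypothetical.

```python
from datetime import date

ARCHIVE_AFTER_DAYS = 90   # threshold stated in the article
DELETE_AFTER_DAYS = 180   # threshold stated in the article

def lifecycle_action(last_accessed, today):
    """Return 'keep', 'archive', or 'delete' for an asset.

    Assumption: age is counted from the last access date; the article
    does not specify the reference point.
    """
    idle_days = (today - last_accessed).days
    if idle_days >= DELETE_AFTER_DAYS:
        return "delete"
    if idle_days >= ARCHIVE_AFTER_DAYS:
        return "archive"
    return "keep"
```

A scheduled job could apply this function to each asset's metadata record and hand the resulting actions to the storage layer.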
3.4 Data Security
Metadata is classified into four levels (C1‑C4) following financial data‑security guidelines. Classification guides permission controls, encryption, and audit. Scenarios include table‑level security, data‑download approvals, and automated labeling of sensitive fields. Policies restrict access based on employee tenure, role, and outsourcing status, with escalation for C3/C4 data.
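A classification-driven access check might look like the following Python sketch. Every rule in it is an illustrative assumption rather than Huolala's actual policy: that C4 is the most sensitive level, the 90-day tenure threshold, the allowed role names, and the `access_decision` function itself are all hypothetical. Only the inputs (level, tenure, role, outsourcing status) and the escalation for C3/C4 come from the article.

```python
from dataclasses import dataclass

LEVELS = ("C1", "C2", "C3", "C4")       # assumption: C4 is the most sensitive
ALLOWED_ROLES = {"analyst", "engineer"}  # hypothetical role names

@dataclass
class Requester:
    tenure_days: int
    role: str
    is_outsourced: bool

def access_decision(level, requester, min_tenure_days=90):
    """Return 'allow', 'needs_approval', or 'deny' for data of the given level.

    All thresholds and rules here are illustrative assumptions.
    """
    if level not in LEVELS:
        raise ValueError(f"unknown level: {level}")
    sensitive = level in ("C3", "C4")
    if sensitive and requester.is_outsourced:
        return "deny"
    if sensitive:
        return "needs_approval"  # escalation for C3/C4 data
    if requester.role not in ALLOWED_ROLES:
        return "needs_approval"
    if requester.tenure_days < min_tenure_days:
        return "needs_approval"
    return "allow"
```

Returning a three-way decision instead of a boolean leaves room for the approval workflow that the escalation path requires.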
4. Future Planning
More efficient AI‑driven retrieval services.
Standardized SDK for data lineage to improve quality and reduce duplication.
Further automation of cost‑governance to continuously reduce expenses.
