
Design and Evolution of Volcano Engine DataLeap Data Catalog System

This article details the architecture, design decisions, and iterative improvements of the Data Catalog product within Volcano Engine's DataLeap suite, covering metadata management, ingestion pipelines, search optimization, lineage capabilities, storage layer enhancements, and future development directions.

Big Data Technology & Architecture

Jul 12, 2023

Abstract: Data Catalog aggregates technical and business metadata to help data producers organize data and data consumers find and understand data, supporting data development and governance.

Background: Explains metadata concepts, the role of Data Catalog in providing richer business context, and its value for both producers and consumers in complex big‑data environments.

Old version pain points: The early implementation, based on LinkedIn WhereHows, ran into scalability, extensibility, and maintainability problems as the number of storage engines and user scenarios grew.

New version goals: Aim to simplify metadata organization for producers, improve data discovery for consumers, and reduce new metadata integration time from months to days with a streamlined architecture.

Research and upgrade ideas: The team surveyed industry products and defined upgrade directions focused on search, lineage, and the adoption of Apache Atlas‑based data modeling.

Technical and product overview: Architecture consists of metadata ingestion (ETL Bridge, MQ, Clients), core services (Catalog Service, Ingestion Service, Resource Control Plane, Q&A Service, ML Service, API Layer), and storage layers (Meta Store on HBase, Index Store on Elasticsearch, Model Store on HDFS).

Key technologies: A unified data model (Type, Entity, Attribute, Relationship) with extensive inheritance and composition; an adjusted type loading mechanism; standardized connectors built from a Source, Diff Operator, Event Generate Operator, and Sink, run as a Bridge Job; search optimization using rule‑based patterns and aggressive personalization; lineage construction across RDS, MQ, and compute/storage systems; storage optimizations such as enabling JanusGraph MultiPreFetch and removing global GUID uniqueness checks.
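As a rough illustration of two ideas named above, the sketch below models a type with inherited attributes and a connector that diffs source metadata against stored metadata to emit change events for a sink. It is a minimal sketch under stated assumptions, not DataLeap's actual implementation: the names `TypeDef`, `Entity`, `Event`, `make_entity`, `diff_events`, and `run_bridge_job` are hypothetical, and the CREATE/UPDATE/DELETE event vocabulary is assumed for the example.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional, Tuple


@dataclass
class TypeDef:
    """Hypothetical metadata type; attributes are inherited from its super type."""
    name: str
    attributes: List[str]
    super_type: Optional["TypeDef"] = None

    def all_attributes(self) -> List[str]:
        # Inherited attributes come first, then the type's own attributes.
        inherited = self.super_type.all_attributes() if self.super_type else []
        return inherited + [a for a in self.attributes if a not in inherited]


@dataclass(frozen=True)
class Entity:
    """Hypothetical instance of a type, identified by a qualified name."""
    type_name: str
    qualified_name: str
    attributes: Tuple[Tuple[str, object], ...]  # sorted (key, value) pairs


@dataclass(frozen=True)
class Event:
    """Hypothetical change event; action is CREATE, UPDATE, or DELETE (assumed)."""
    action: str
    qualified_name: str


def make_entity(type_def: TypeDef, qualified_name: str, **attrs) -> Entity:
    # Reject attributes that the type (including its super types) does not declare.
    unknown = set(attrs) - set(type_def.all_attributes())
    if unknown:
        raise ValueError(f"unknown attributes for {type_def.name}: {sorted(unknown)}")
    return Entity(type_def.name, qualified_name, tuple(sorted(attrs.items())))


def diff_events(source: Iterable[Entity], stored: Iterable[Entity]) -> List[Event]:
    """Diff step plus event generation: compare source and stored metadata."""
    src = {e.qualified_name: e for e in source}
    old = {e.qualified_name: e for e in stored}
    events: List[Event] = []
    for name in sorted(src.keys() | old.keys()):
        if name not in old:
            events.append(Event("CREATE", name))
        elif name not in src:
            events.append(Event("DELETE", name))
        elif src[name] != old[name]:
            events.append(Event("UPDATE", name))
    return events


def run_bridge_job(source: Iterable[Entity], stored: Iterable[Entity],
                   sink: Callable[[Event], None]) -> List[Event]:
    """Run one pass: diff, generate events, and hand each event to the sink."""
    events = diff_events(source, stored)
    for event in events:
        sink(event)
    return events
```

Emitting only the differences, rather than rewriting every entity on each run, is one plausible reason for placing a diff step between the source and the sink: downstream stores then receive only the changes they need to apply.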

Future work: Transform metadata into valuable data assets, broaden intelligent features like recommendation and auto‑tagging, and open connector capabilities for ToB collaborations and tighter integration with reporting tools.



Written by

Big Data Technology & Architecture

Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
