Design and Evolution of Volcano Engine DataLeap Data Catalog System
This article details the architecture, design decisions, and iterative improvements of the Data Catalog product within Volcano Engine's DataLeap suite, covering metadata management, ingestion pipelines, search optimization, lineage capabilities, storage layer enhancements, and future development directions.
Abstract: Data Catalog aggregates technical and business metadata to help data producers organize data and data consumers find and understand data, supporting data development and governance.
Background: Explains metadata concepts, the role of Data Catalog in providing richer business context, and its value for both producers and consumers in complex big‑data environments.
Old version pain points: The early implementation, based on LinkedIn WhereHows, ran into scalability, extensibility, and maintainability problems as the number of storage engines and user scenarios grew.
New version goals: Aim to simplify metadata organization for producers, improve data discovery for consumers, and reduce new metadata integration time from months to days with a streamlined architecture.
Research and upgrade ideas: The team surveyed comparable industry products and set three upgrade directions: better search, better lineage, and a data model based on Apache Atlas.
Technical and product overview: Architecture consists of metadata ingestion (ETL Bridge, MQ, Clients), core services (Catalog Service, Ingestion Service, Resource Control Plane, Q&A Service, ML Service, API Layer), and storage layers (Meta Store on HBase, Index Store on Elasticsearch, Model Store on HDFS).
Key technologies: Unified data model (Type, Entity, Attribute, Relationship) with extensive inheritance and composition; adjusted type loading mechanism; standardized connectors (Source, Diff Operator, Event Generate Operator, Sink, Bridge Job); search optimization using rule‑based patterns and aggressive personalization; lineage construction across RDS, MQ, and compute/storage systems; storage optimizations such as enabling JanusGraph MultiPreFetch and removing global GUID uniqueness checks.
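The standardized connector flow named above (Source, Diff Operator, Event Generate Operator, Sink, run by a Bridge Job) can be sketched as follows. This is a minimal illustration under stated assumptions, not the DataLeap implementation: the class and function names (`Entity`, `Event`, `diff_operator`, `event_generate_operator`, `ListSink`, `run_bridge_job`), the use of a qualified name as the comparison key, and the in-memory sink are all hypothetical and chosen only to show how a diff step can turn a full metadata scan into create, update, and delete events.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List


@dataclass
class Entity:
    """Hypothetical minimal metadata entity: a typed bag of attributes."""
    type_name: str
    qualified_name: str
    attributes: Dict[str, object] = field(default_factory=dict)


@dataclass
class Event:
    """Hypothetical change event emitted toward the catalog."""
    action: str  # "create", "update" or "delete"
    entity: Entity


def diff_operator(source: Iterable[Entity], catalog: Iterable[Entity]) -> Dict[str, List[Entity]]:
    """Compare what the source system reports with what the catalog holds.

    Assumption: entities are matched by qualified_name, and an entity
    counts as updated when its attributes differ.
    """
    current = {e.qualified_name: e for e in source}
    known = {e.qualified_name: e for e in catalog}
    created = [current[k] for k in sorted(current.keys() - known.keys())]
    deleted = [known[k] for k in sorted(known.keys() - current.keys())]
    updated = [
        current[k]
        for k in sorted(current.keys() & known.keys())
        if current[k].attributes != known[k].attributes
    ]
    return {"create": created, "update": updated, "delete": deleted}


def event_generate_operator(diff: Dict[str, List[Entity]]) -> List[Event]:
    """Turn the diff result into a flat, ordered list of events."""
    return [
        Event(action, entity)
        for action in ("create", "update", "delete")
        for entity in diff[action]
    ]


class ListSink:
    """Stand-in sink that collects events in memory instead of publishing them."""

    def __init__(self) -> None:
        self.events: List[Event] = []

    def write(self, events: Iterable[Event]) -> None:
        self.events.extend(events)


def run_bridge_job(source: Iterable[Entity], catalog: Iterable[Entity], sink: ListSink) -> List[Event]:
    """Wire the steps together: source -> diff -> events -> sink."""
    events = event_generate_operator(diff_operator(source, catalog))
    sink.write(events)
    return events


source = [
    Entity("hive_table", "db.orders", {"owner": "alice"}),
    Entity("hive_table", "db.users", {"owner": "bob"}),
]
catalog = [
    Entity("hive_table", "db.orders", {"owner": "carol"}),
    Entity("hive_table", "db.tmp"),
]
sink = ListSink()
run_bridge_job(source, catalog, sink)
summary = [(ev.action, ev.entity.qualified_name) for ev in sink.events]
```

In this example `summary` holds one create event for `db.users`, one update for `db.orders`, and one delete for `db.tmp`. Keeping the diff and event steps independent of any particular source or sink is what would let a new metadata source be added by writing only its Source, which is in line with the stated goal of cutting integration time.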
Future work: Transform metadata into valuable data assets, broaden intelligent features like recommendation and auto‑tagging, and open connector capabilities for ToB collaborations and tighter integration with reporting tools.
Big Data Technology & Architecture
Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
