
How ByteDance Built a Scalable Data Catalog: Key Technologies and Future Plans

This article walks through ByteDance's Data Catalog: its unified metadata model, standardized ingestion connectors, search optimization techniques, lineage capabilities, and storage layer enhancements. It highlights key technical designs and performance improvements, and outlines future work to advance data governance and asset utilization.

ByteDance Data Platform

Data Catalog aggregates technical and business metadata so that data producers can organize their data and data consumers can find and understand it, in support of data development and governance.

01 - Data Model Unification

Unifying different metadata models reduces integration and maintenance costs. The system follows Apache Atlas’s design, defining Types, Entities, Attributes, and Relationships.

Type: describes a class of metadata and is composed of attributes, e.g., a Hive table.

Entity: an instance of a Type; entities can be nested within other entities.

Attribute: a property of a Type, with its own type name.

Relationship: a special Entity describing a link between two Entities.

Extensive use of inheritance and composition allows parent Types (e.g., DataStore) to be shared across Hive and ClickHouse tables. User behaviors such as bookmarking or liking an asset are modeled as separate entities linked via relationships.
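The type system described above can be sketched in a few lines of Python. This is a minimal illustration, not Apache Atlas's or ByteDance's actual API: the class names (AttributeDef, TypeDef, Entity, Relationship), the attribute names, and the sample types are all assumptions chosen for the example, and Relationship is kept as a standalone class for brevity rather than as a special Entity.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class AttributeDef:
    name: str
    type_name: str  # e.g. "string" or the name of another Type


@dataclass
class TypeDef:
    name: str
    attributes: list[AttributeDef] = field(default_factory=list)
    super_type: Optional["TypeDef"] = None  # single inheritance, for brevity

    def all_attributes(self) -> dict[str, AttributeDef]:
        """Return own attributes plus those inherited from parent Types."""
        inherited = self.super_type.all_attributes() if self.super_type else {}
        return {**inherited, **{a.name: a for a in self.attributes}}


@dataclass
class Entity:
    type_def: TypeDef
    values: dict[str, object]

    def __post_init__(self) -> None:
        # Reject attributes that the Type (or its parents) does not declare.
        unknown = set(self.values) - set(self.type_def.all_attributes())
        if unknown:
            raise ValueError(f"{self.type_def.name} has no attributes {sorted(unknown)}")


@dataclass
class Relationship:
    name: str
    end1: Entity
    end2: Entity


# A shared parent Type reused by two concrete table Types (hypothetical names).
data_store = TypeDef("DataStore", [AttributeDef("qualifiedName", "string"),
                                   AttributeDef("owner", "string")])
hive_table = TypeDef("HiveTable", [AttributeDef("partitionKeys", "array<string>")],
                     super_type=data_store)
clickhouse_table = TypeDef("ClickHouseTable", [AttributeDef("engine", "string")],
                           super_type=data_store)

# A "like" is its own entity, linked to the liked asset through a relationship.
like_type = TypeDef("Like", [AttributeDef("user", "string")])
orders = Entity(hive_table, {"qualifiedName": "hive.sales.orders", "owner": "alice"})
like = Entity(like_type, {"user": "bob"})
like_target = Relationship("like_target", like, orders)
```

Because HiveTable and ClickHouseTable both inherit from DataStore, anything written against the parent's attributes (ownership, naming) works for either table type without duplication.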

02 - Data Ingestion Standardization

After unifying the type system, the next step is standardizing the ingestion process. Each metadata type is wrapped in a connector, and an SDK simplifies connector development. Each connector is structured as a Flink-style pipeline made up of a Source, a Diff Operator, an Event Generate Operator, a Sink, and a Bridge Job.
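A rough sketch of how such a connector could fit together is shown below. It assumes that the Source yields a full snapshot keyed by qualified name, that the Diff Operator compares it with the previous snapshot, and that the Event Generate Operator turns each difference into a change event for the Sink. All function and field names are hypothetical, and the Bridge Job is left out because its role is not described here.

```python
from dataclasses import dataclass
from typing import Callable

Snapshot = dict[str, dict]  # qualified name -> metadata attributes


@dataclass(frozen=True)
class ChangeEvent:
    kind: str  # "create", "update", or "delete"
    qualified_name: str
    payload: dict


def diff_operator(previous: Snapshot, current: Snapshot) -> list[tuple[str, str]]:
    """Compare two snapshots and return (kind, qualified_name) pairs."""
    changes = []
    for name in sorted(previous.keys() | current.keys()):
        if name not in previous:
            changes.append(("create", name))
        elif name not in current:
            changes.append(("delete", name))
        elif previous[name] != current[name]:
            changes.append(("update", name))
    return changes


def event_generate_operator(changes: list[tuple[str, str]],
                            current: Snapshot) -> list[ChangeEvent]:
    """Attach the current payload (empty for deletes) to each change."""
    return [ChangeEvent(kind, name, current.get(name, {})) for kind, name in changes]


def run_connector(source: Callable[[], Snapshot],
                  previous: Snapshot,
                  sink: Callable[[ChangeEvent], None]) -> Snapshot:
    """Source -> Diff -> Event Generate -> Sink; returns the new snapshot."""
    current = source()
    for event in event_generate_operator(diff_operator(previous, current), current):
        sink(event)
    return current


# Example run with an in-memory source and a list as the sink.
previous_snapshot = {"db.a": {"columns": 3}, "db.b": {"columns": 5}}
received: list[ChangeEvent] = []
run_connector(lambda: {"db.a": {"columns": 4}, "db.c": {"columns": 1}},
              previous_snapshot, received.append)
```

Emitting only the differences keeps the load on the catalog proportional to what changed, not to the size of the source system.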

03 - Search Optimization

Search is the most widely used feature in Data Catalog, serving over 70% of daily users. Two optimization strategies are applied:

Rule‑based optimizations for strong query patterns, such as recognizing “database.table” syntax.

Aggressive personalization based on limited user behavior data, boosting scores for queries similar to a user’s past interactions.
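The two strategies can be illustrated with a small sketch. The regular expression, the boost factor of 1.5, and the idea of boosting results the user has interacted with before are assumptions made for this example; the actual rules and scoring used in production are not described here.

```python
import re

# Rule: a query that looks exactly like "database.table" is a qualified lookup.
QUALIFIED_NAME = re.compile(r"^\s*([A-Za-z_]\w*)\.([A-Za-z_]\w*)\s*$")


def parse_query(query: str) -> dict:
    """Recognize the strong "database.table" pattern, else fall back to text."""
    match = QUALIFIED_NAME.match(query)
    if match:
        return {"database": match.group(1), "table": match.group(2)}
    return {"text": query.strip()}


def personalize(results: list[tuple[str, float]],
                past_assets: set[str],
                boost: float = 1.5) -> list[tuple[str, float]]:
    """Multiply the score of assets the user touched before, then re-rank."""
    rescored = [(name, score * boost if name in past_assets else score)
                for name, score in results]
    return sorted(rescored, key=lambda item: item[1], reverse=True)


# "sales.orders" is routed to an exact lookup; free text goes to full-text search.
structured = parse_query("sales.orders")
free_text = parse_query("daily revenue")

# A slightly lower-scored asset overtakes the top hit once it is boosted.
ranked = personalize([("sales.orders", 1.0), ("sales.refunds", 0.8)],
                     past_assets={"sales.refunds"})
```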

04 - Lineage Capability

End‑to‑end lineage tracks data from sources like RDS and MQ through processing and storage to downstream metrics and reports. The system collects lineage information in near real‑time, writes it to the catalog, and exposes it via APIs for downstream consumption.
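As a simple illustration of what a lineage API might answer, the sketch below stores lineage as directed edges and walks them breadth-first to find everything downstream of a source. The LineageGraph class and the asset names are hypothetical; the production system stores lineage in the catalog's graph engine instead of in memory.

```python
from collections import defaultdict, deque


class LineageGraph:
    """In-memory directed graph: an edge points from upstream to downstream."""

    def __init__(self) -> None:
        self._downstream: dict[str, set[str]] = defaultdict(set)

    def add_edge(self, upstream: str, downstream: str) -> None:
        self._downstream[upstream].add(downstream)

    def downstream_of(self, start: str) -> list[str]:
        """Breadth-first walk returning every asset reachable from start."""
        seen = {start}
        queue = deque([start])
        order = []
        while queue:
            node = queue.popleft()
            # .get avoids creating empty entries for leaf nodes.
            for nxt in sorted(self._downstream.get(node, ())):
                if nxt not in seen:
                    seen.add(nxt)
                    order.append(nxt)
                    queue.append(nxt)
        return order


graph = LineageGraph()
graph.add_edge("rds.orders", "hive.ods.orders")
graph.add_edge("mq.clicks", "hive.ods.clicks")
graph.add_edge("hive.ods.orders", "hive.dw.order_metrics")
graph.add_edge("hive.ods.clicks", "hive.dw.order_metrics")
graph.add_edge("hive.dw.order_metrics", "report.daily_sales")

impacted = graph.downstream_of("rds.orders")
```

A query like this is the basis of impact analysis: before changing the source table, an owner can see which tables and reports would be affected.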

05 - Storage Layer Optimization

The catalog uses Apache Atlas with JanusGraph as the graph engine and Elasticsearch for indexing. To handle millions of vertices and edges, two key optimizations were applied:

Read Optimization: Enable MultiPreFetch

Batch‑parallel fetching of vertex properties reduces query latency for high‑degree vertices.
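The general idea behind batch-parallel fetching can be shown independently of JanusGraph. The sketch below is not JanusGraph's implementation or configuration: FakeBackend is a stand-in that counts round trips, and the batch size and worker count are arbitrary. It only shows why grouping vertex IDs into batches and fetching the batches concurrently needs far fewer round trips than one request per vertex.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class FakeBackend:
    """Stand-in for a storage backend; counts how many requests it receives."""

    def __init__(self, properties: dict[int, dict]) -> None:
        self._properties = properties
        self._lock = threading.Lock()
        self.round_trips = 0

    def get_many(self, vertex_ids: list[int]) -> dict[int, dict]:
        with self._lock:  # keep the counter exact under concurrency
            self.round_trips += 1
        return {vid: self._properties[vid] for vid in vertex_ids}


def prefetch_properties(backend: FakeBackend,
                        vertex_ids: list[int],
                        batch_size: int = 100,
                        workers: int = 4) -> dict[int, dict]:
    """Split IDs into batches and fetch the batches in parallel."""
    batches = [vertex_ids[i:i + batch_size]
               for i in range(0, len(vertex_ids), batch_size)]
    merged: dict[int, dict] = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for part in pool.map(backend.get_many, batches):
            merged.update(part)
    return merged


# 250 neighbors of a high-degree vertex: 3 batched requests instead of 250.
backend = FakeBackend({vid: {"name": f"v{vid}"} for vid in range(250)})
properties = prefetch_properties(backend, list(range(250)))
```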

Write Optimization: Remove Global GUID Uniqueness Check

Replacing the global GUID uniqueness check with a business-level uniqueness check on qualifiedName substantially reduced write time in the team's performance benchmarks.
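A minimal sketch of the business-level check follows. It assumes uniqueness is enforced per (type name, qualifiedName) pair and that GUIDs are random UUIDs generated without a lookup; EntityStore and DuplicateEntityError are hypothetical names, and the dictionaries stand in for whatever index the real storage layer uses.

```python
import uuid


class DuplicateEntityError(Exception):
    """Raised when an entity with the same business key already exists."""


class EntityStore:
    def __init__(self) -> None:
        self._guid_by_key: dict[tuple[str, str], str] = {}  # business key -> guid
        self._entities: dict[str, dict] = {}                # guid -> attributes

    def create(self, type_name: str, qualified_name: str, attributes: dict) -> str:
        # Business-level check: one lookup on (type, qualifiedName).
        key = (type_name, qualified_name)
        if key in self._guid_by_key:
            raise DuplicateEntityError(f"{type_name} {qualified_name} already exists")
        # The GUID is generated randomly and is not checked against all entities.
        guid = str(uuid.uuid4())
        self._guid_by_key[key] = guid
        self._entities[guid] = {"qualifiedName": qualified_name, **attributes}
        return guid


store = EntityStore()
first_guid = store.create("HiveTable", "hive.sales.orders", {"owner": "alice"})
try:
    store.create("HiveTable", "hive.sales.orders", {"owner": "bob"})
    duplicate_rejected = False
except DuplicateEntityError:
    duplicate_rejected = True
```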

Future Work

Upcoming efforts focus on converting metadata into valuable data assets, expanding intelligent capabilities such as recommendation and auto‑tagging, and exposing connector functionality as a marketplace offering.

Written by

ByteDance Data Platform

The ByteDance Data Platform team empowers all ByteDance business lines by lowering data‑application barriers, aiming to build data‑driven intelligent enterprises, enable digital transformation across industries, and create greater social value. Internally it supports most ByteDance units; externally it delivers data‑intelligence products under the Volcano Engine brand to enterprise customers.
