How ByteDance Optimized Data Catalog Performance with Apache Atlas and JanusGraph
This article details ByteDance's 2021 overhaul of its Data Catalog system, the performance regressions encountered after switching to Apache Atlas, and the step‑by‑step backend optimizations—including JanusGraph tuning, Gremlin query refactoring, parallel processing, and write‑path improvements—that reduced latency from minutes to seconds.
In 2021 ByteDance rebuilt its Data Catalog system, switching the storage layer to Apache Atlas. The migration exposed severe performance regressions.
Background
Originally based on LinkedIn's WhereHows and supporting only Hive, the system accumulated many patches over time, adding MySQL, Elasticsearch, and an internal graph database (veGraph). This made the codebase hard to maintain and scale.
The new version retained all product capabilities but replaced the storage layer with Apache Atlas. Importing existing data exposed severe read/write latency and CPU spikes: writing metadata for a Hive table with more than 3,000 columns saturated the CPU at 100% and timed out after several minutes, and displaying the details of a table with only dozens of columns took over a minute.
Overall Optimization Approach
Before diving into details, here is our general methodology for optimizing business-oriented web services:
Set realistic performance targets; avoid premature or excessive tuning.
Focus on the most impactful bottleneck to maximize ROI.
Select solutions with the best cost‑performance trade‑off.
Validate changes rapidly with automated tests or monitoring.
Specific Optimizations for Data Catalog
JanusGraph Configuration
Two JanusGraph settings proved critical:
<code>query.batch=true
query.batch-property-prefetch=true</code>
Enabling batch queries (and property prefetching) reduced the number of remote calls during metadata traversal.
Gremlin Query Refactoring
Original Gremlin query for counting downstream entity types took 2–3 seconds:
<code>g.V().has('__typeName', 'BusinessDomain')
.has('__qualifiedName', eq('XXXX'))
.out('r:DataStoreBusinessDomainRelationship')
.groupCount().by('__typeName')
.profile();</code>
After simplifying the traversal and moving the aggregation to the property level, execution time dropped to ~50 ms:
<code>g.V().has('__typeName', 'BusinessDomain')
.has('__qualifiedName', eq('XXXX'))
.out('r:DataStoreBusinessDomainRelationship')
.values('__typeName').groupCount().by()
.profile();</code>
Entity Graph Retrieval Adjustments
We modified <code>mapVertexToAtlasEntity</code> to fetch only the required vertex properties, enabled multi-property prefetch, limited relationship retrieval to one level of depth, and allowed selective retrieval by edge type, cutting detail-page latency from ~1 minute to under a second.
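The core idea of fetching only required properties can be illustrated with a toy projection. The class and method names below are simplified stand-ins, not the real Atlas types:

```java
import java.util.*;

/** Simplified stand-in for a vertex-to-entity mapper that copies only required properties. */
public class SelectiveFetch {
    static Map<String, Object> mapVertexToEntity(Map<String, Object> vertexProps,
                                                 Set<String> requiredProps) {
        Map<String, Object> entity = new HashMap<>();
        for (String key : requiredProps) {
            Object value = vertexProps.get(key);
            if (value != null) entity.put(key, value); // anything not requested is never copied
        }
        return entity;
    }
}
```

In the real system the saving comes from not reading unneeded properties from the graph backend at all, rather than merely skipping them during mapping.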
Parallel Processing for Large Lineage Queries
For N‑layer lineage extraction we introduced a thread pool with few core threads and many max threads, processing each newly fetched vertex in parallel before aggregating results, turning minute‑scale operations into seconds.
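The layer-by-layer fan-out described above can be sketched with a JDK thread pool over a toy in-memory adjacency map. This is an illustrative sketch, not ByteDance's actual code; pool sizes and names are assumptions:

```java
import java.util.*;
import java.util.concurrent.*;

/** Toy sketch of the parallel N-layer lineage walk described above. */
public class ParallelLineage {

    /** Breadth-first walk over an adjacency map, expanding each layer's vertices in parallel. */
    static Set<String> downstream(Map<String, List<String>> graph, String start, int maxDepth) {
        // Few core threads, larger max pool — mirroring the pool shape described above.
        ExecutorService pool = new ThreadPoolExecutor(
                2, 16, 30, TimeUnit.SECONDS,
                new SynchronousQueue<>(), new ThreadPoolExecutor.CallerRunsPolicy());
        try {
            Set<String> seen = ConcurrentHashMap.newKeySet();
            seen.add(start);
            List<String> frontier = List.of(start);
            for (int depth = 0; depth < maxDepth && !frontier.isEmpty(); depth++) {
                // Expand every vertex of the current layer concurrently...
                List<Future<List<String>>> futures = new ArrayList<>();
                for (String v : frontier) {
                    futures.add(pool.submit(() -> graph.getOrDefault(v, List.of())));
                }
                // ...then aggregate the results into the next layer.
                List<String> next = new ArrayList<>();
                for (Future<List<String>> f : futures) {
                    try {
                        for (String n : f.get()) {
                            if (seen.add(n)) next.add(n);
                        }
                    } catch (InterruptedException | ExecutionException e) {
                        throw new RuntimeException(e);
                    }
                }
                frontier = next;
            }
            seen.remove(start);
            return seen;
        } finally {
            pool.shutdown();
        }
    }
}
```

In the real system each task would issue a graph query for one vertex's neighbors, so the pool parallelizes the remote calls that dominate the latency.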
Write‑Path Optimizations
Metadata tables with >3000 columns suffered from costly uniqueness checks on <code>guid</code> and <code>qualifiedName</code>. By removing the global-unique check for <code>guid</code> and adding a <code>Global_Unique</code> index for <code>__qualifiedName</code>, write latency dropped from minutes to seconds.
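With JanusGraph's management API, a unique composite index on <code>__qualifiedName</code> can be declared along these lines. This is a schema-configuration sketch assuming a standard JanusGraph setup; the index name is illustrative:

```java
// Schema sketch — requires an open JanusGraph instance; not a standalone program.
JanusGraphManagement mgmt = graph.openManagement();
PropertyKey qualifiedName = mgmt.getPropertyKey("__qualifiedName");
mgmt.buildIndex("byQualifiedNameUnique", Vertex.class)
    .addKey(qualifiedName)
    .unique()                 // uniqueness is enforced by the index, not by an extra read
    .buildCompositeIndex();
mgmt.commit();
```

Pushing the uniqueness guarantee into the index avoids the per-write lookup that made wide-table writes so expensive.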
Results
CPU usage normalized; write‑throughput for wide tables improved dramatically.
Read latency for high‑column tables reduced from >1 minute to <1 second.
Overall system stability increased, enabling smoother feature rollout.
Conclusion
Performance tuning of business‑focused backend services should start from concrete use‑case bottlenecks, apply inexpensive yet effective fixes, and verify impact continuously. Simple configuration tweaks, query rewrites, and selective data fetching often yield the biggest gains without costly infrastructure changes.
ByteDance Data Platform
The ByteDance Data Platform team empowers all ByteDance business lines by lowering data‑application barriers, aiming to build data‑driven intelligent enterprises, enable digital transformation across industries, and create greater social value. Internally it supports most ByteDance units; externally it delivers data‑intelligence products under the Volcano Engine brand to enterprise customers.