How ByteDance Optimized Data Catalog Performance with Apache Atlas and JanusGraph
This article details ByteDance's 2021 overhaul of its Data Catalog system, the performance regressions encountered after switching to Apache Atlas, and the step‑by‑step backend optimizations—including JanusGraph tuning, Gremlin query refactoring, parallel processing, and write‑path improvements—that reduced latency from minutes to seconds.
In 2021 ByteDance rebuilt its Data Catalog system, switching the storage layer to Apache Atlas. The migration exposed severe performance regressions.
Background
Originally based on LinkedIn's WhereHows and supporting only Hive, the system accumulated many patches over time, adding MySQL, Elasticsearch, and an internal graph database (veGraph). This made the codebase hard to maintain and scale.
The new version retained all product capabilities but replaced the storage layer with Apache Atlas. Importing existing data exposed severe read/write latency and CPU spikes: writing metadata for a Hive table with more than 3,000 columns saturated the CPU at 100% and timed out after several minutes, and displaying the details of a table with only dozens of columns took over a minute.
Overall Optimization Approach
Before diving into details, here is our general methodology for optimizing business-oriented web services:
Set realistic performance targets; avoid premature or excessive tuning.
Focus on the most impactful bottleneck to maximize ROI.
Select solutions with the best cost‑performance trade‑off.
Validate changes rapidly with automated tests or monitoring.
Specific Optimizations for Data Catalog
JanusGraph Configuration
Two JanusGraph settings proved critical:
<code>query.batch=true
query.batch-property-prefetch=true</code>
Enabling batch queries (and property prefetching) reduced the number of remote calls during metadata traversal.
Gremlin Query Refactoring
Original Gremlin query for counting downstream entity types took 2–3 seconds:
<code>g.V().has('__typeName', 'BusinessDomain')
.has('__qualifiedName', eq('XXXX'))
.out('r:DataStoreBusinessDomainRelationship')
.groupCount().by('__typeName')
.profile();</code>
After simplifying the traversal and moving the aggregation to the property level, execution time dropped to ~50 ms:
<code>g.V().has('__typeName', 'BusinessDomain')
.has('__qualifiedName', eq('XXXX'))
.out('r:DataStoreBusinessDomainRelationship')
.values('__typeName').groupCount().by()
.profile();</code>
Entity Graph Retrieval Adjustments
We modified <code>mapVertexToAtlasEntity</code> to fetch only the required vertex properties, enabled multi-property prefetch, limited relationship retrieval to one level of depth, and allowed selective retrieval by edge type, cutting detail-page latency from ~1 minute to under a second.
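The core idea of fetching only required properties can be illustrated with a toy projection. The class and method names below are simplified stand-ins, not the real Atlas types:

```java
import java.util.*;

/** Simplified stand-in for a vertex-to-entity mapper that copies only required properties. */
public class SelectiveFetch {
    static Map<String, Object> mapVertexToEntity(Map<String, Object> vertexProps,
                                                 Set<String> requiredProps) {
        Map<String, Object> entity = new HashMap<>();
        for (String key : requiredProps) {
            Object value = vertexProps.get(key);
            if (value != null) entity.put(key, value); // anything not requested is never copied
        }
        return entity;
    }
}
```

In the real system the saving comes from not reading unneeded properties from the graph backend at all, rather than merely skipping them during mapping.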
Parallel Processing for Large Lineage Queries
For N‑layer lineage extraction we introduced a thread pool with few core threads and many max threads, processing each newly fetched vertex in parallel before aggregating results, turning minute‑scale operations into seconds.
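The layer-by-layer fan-out described above can be sketched with a JDK thread pool over a toy in-memory adjacency map. This is an illustrative sketch, not ByteDance's actual code; pool sizes and names are assumptions:

```java
import java.util.*;
import java.util.concurrent.*;

/** Toy sketch of the parallel N-layer lineage walk described above. */
public class ParallelLineage {

    /** Breadth-first walk over an adjacency map, expanding each layer's vertices in parallel. */
    static Set<String> downstream(Map<String, List<String>> graph, String start, int maxDepth) {
        // Few core threads, larger max pool — mirroring the pool shape described above.
        ExecutorService pool = new ThreadPoolExecutor(
                2, 16, 30, TimeUnit.SECONDS,
                new SynchronousQueue<>(), new ThreadPoolExecutor.CallerRunsPolicy());
        try {
            Set<String> seen = ConcurrentHashMap.newKeySet();
            seen.add(start);
            List<String> frontier = List.of(start);
            for (int depth = 0; depth < maxDepth && !frontier.isEmpty(); depth++) {
                // Expand every vertex of the current layer concurrently...
                List<Future<List<String>>> futures = new ArrayList<>();
                for (String v : frontier) {
                    futures.add(pool.submit(() -> graph.getOrDefault(v, List.of())));
                }
                // ...then aggregate the results into the next layer.
                List<String> next = new ArrayList<>();
                for (Future<List<String>> f : futures) {
                    try {
                        for (String n : f.get()) {
                            if (seen.add(n)) next.add(n);
                        }
                    } catch (InterruptedException | ExecutionException e) {
                        throw new RuntimeException(e);
                    }
                }
                frontier = next;
            }
            seen.remove(start);
            return seen;
        } finally {
            pool.shutdown();
        }
    }
}
```

In the real system each task would issue a graph query for one vertex's neighbors, so the pool parallelizes the remote calls that dominate the latency.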
Write‑Path Optimizations
Metadata tables with >3000 columns suffered from costly uniqueness checks on <code>guid</code> and <code>qualifiedName</code>. By removing the global-unique check for <code>guid</code> and adding a <code>Global_Unique</code> index for <code>__qualifiedName</code>, write latency dropped from minutes to seconds.
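With JanusGraph's management API, a unique composite index on <code>__qualifiedName</code> can be declared along these lines. This is a schema-configuration sketch assuming a standard JanusGraph setup; the index name is illustrative:

```java
// Schema sketch — requires an open JanusGraph instance; not a standalone program.
JanusGraphManagement mgmt = graph.openManagement();
PropertyKey qualifiedName = mgmt.getPropertyKey("__qualifiedName");
mgmt.buildIndex("byQualifiedNameUnique", Vertex.class)
    .addKey(qualifiedName)
    .unique()                 // uniqueness is enforced by the index, not by an extra read
    .buildCompositeIndex();
mgmt.commit();
```

Pushing the uniqueness guarantee into the index avoids the per-write lookup that made wide-table writes so expensive.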
Results
CPU usage normalized; write‑throughput for wide tables improved dramatically.
Read latency for high‑column tables reduced from >1 minute to <1 second.
Overall system stability increased, enabling smoother feature rollout.
Conclusion
Performance tuning of business‑focused backend services should start from concrete use‑case bottlenecks, apply inexpensive yet effective fixes, and verify impact continuously. Simple configuration tweaks, query rewrites, and selective data fetching often yield the biggest gains without costly infrastructure changes.
ByteDance Data Platform
The ByteDance Data Platform team empowers all ByteDance business lines by lowering data‑application barriers, aiming to build data‑driven intelligent enterprises, enable digital transformation across industries, and create greater social value. Internally it supports most ByteDance units; externally it delivers data‑intelligence products under the Volcano Engine brand to enterprise customers.