
ByteGraph: ByteDance’s Self‑Developed Graph Database – Architecture, Data Model, Query Language, and Operational Challenges

This article introduces ByteDance’s self‑developed graph database ByteGraph, covering its fundamentals, use‑case scenarios, data model and Gremlin query language, architecture and implementation details, and key challenges such as indexing, hot‑spot handling, resource allocation, high availability, and offline‑online data fusion.

DataFunTalk

Guest: Chen Hongzhi, Ph.D., ByteDance

Editor: Wang Liuyue, Shanghai University of International Business and Economics

Platform: DataFunTalk

As a fundamental data structure, graph data appears in many scenarios such as social networks, risk control, recommendation, and protein analysis in bio‑informatics. Efficient storage, query, computation and analysis of massive graph data is a hot topic in the industry. This article introduces ByteDance’s self‑developed graph database ByteGraph, its internal applications and the challenges it faces.

Outline

Understanding graph databases

Application scenario examples

Data model and query language

ByteGraph architecture and implementation

Key problem analysis

01 Understanding Graph Databases

ByteDance currently operates three self‑developed graph data products. Compared with relational databases, graph models consist of vertices, edges and properties, enabling more efficient traversal and attribute filtering for queries such as “How many employees are in the company of Zhang San’s friend?”.

Graph databases have gained popularity in the past five years, with query languages like Cypher and the open‑source Gremlin. Modern graph databases are distributed, requiring solutions for data loss, replica consistency and sharding.

Some systems combine graph storage and graph computation; ByteDance currently uses two separate systems.

02 Application Scenario Examples

1. ByteGraph business data model – Initiated in 2018 to replace MySQL for storing user behavior and friend relationships on Toutiao, later extended to Douyin and other micro‑services.

2. Deployed business scenarios – Currently deployed on roughly 1.5 × 10⁴ physical machines, serving over 600 business clusters.

03 Data Model and Query Language

1. Directed property‑graph modeling – Property graphs store attributes on vertices and edges, resembling a key‑value representation of relational tables (e.g., “User A follows User B”).
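The directed property-graph model above can be sketched in a few lines. This is a minimal illustration of ours (not ByteGraph's internal representation): both vertices and edges carry their own property maps, so a relationship like "User A follows User B" becomes a typed edge with attributes.

```python
from dataclasses import dataclass

# Minimal property-graph sketch: vertices and edges each carry properties.
@dataclass(frozen=True)
class Vertex:
    vid: str
    label: str
    props: tuple = ()  # immutable (key, value) pairs

@dataclass(frozen=True)
class Edge:
    src: str
    dst: str
    etype: str
    props: tuple = ()

a = Vertex("user:A", "user", (("name", "A"),))
b = Vertex("user:B", "user", (("name", "B"),))
follows = Edge(a.vid, b.vid, "follows", (("since", "2021-06-01"),))
print(follows.etype)  # → follows
```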

2. Gremlin query language interface – Gremlin is a Turing‑complete graph traversal language, easier for Python‑oriented analysts than SQL‑like languages. Example: retrieve one‑hop neighbors of User A whose fan count exceeds 100.
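The one-hop example can be written in Gremlin as something like `g.V("A").out("follow").has("fans", gt(100))`. The plain-Python equivalent below (a toy of ours, not ByteGraph's engine) shows what the traversal computes over a small adjacency map:

```python
# Toy data: vertex property "fan count" and an out-edge adjacency list.
fans = {"B": 250, "C": 80, "D": 1200}
out_edges = {"A": ["B", "C", "D"]}  # A follows B, C, D

def one_hop_with_min_fans(src, min_fans):
    """Return one-hop neighbors of `src` whose fan count exceeds `min_fans`."""
    return [v for v in out_edges.get(src, []) if fans.get(v, 0) > min_fans]

print(one_hop_with_min_fans("A", 100))  # → ['B', 'D']
```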

04 ByteGraph Architecture and Implementation

1. Overall architecture – Consists of a Graph Query Engine (GQ), a Graph Storage Engine (GS), and a disk storage layer. Computation and storage are separated; each layer runs a cluster of processes.

2. Read/write flow – Example read: client selects a GQ instance, discovers the target machine, forwards the request to the appropriate GS, which checks cache or pulls data from the KV store.
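The read flow can be sketched as follows. Class and method names here are ours for illustration: the storage engine serves a cache hit from memory, and on a miss pulls the value from the disk-backed KV store and fills the cache on the way back.

```python
class GraphStorage:
    """Illustrative GS read path: cache first, then KV store."""
    def __init__(self, kv):
        self.kv = kv      # disk-backed key-value store (a dict here)
        self.cache = {}   # in-memory cache

    def read(self, key):
        if key in self.cache:        # cache hit: serve from memory
            return self.cache[key]
        value = self.kv[key]         # miss: pull from the KV store
        self.cache[key] = value      # populate cache for later reads
        return value
```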

3. GQ implementation

Parser stage: a recursive-descent parser builds an abstract syntax tree.

Query-plan generation: applies rule-based (RBO) and cost-based (CBO) optimizations to produce an execution plan.

Plan execution: pushes operators down to GS partitions, merges the partial results, and returns the final answer.

RBO leverages Gremlin’s built‑in rules, operator push‑down, and custom fusion; CBO quantifies costs based on vertex in/out degree statistics.
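One way a degree-based CBO decision might look, as a toy of ours rather than ByteGraph's actual planner: given out-degree statistics for the endpoints of a traversal, start expansion from the lower-degree side, since fewer intermediate vertices means fewer operators pushed down to GS.

```python
def pick_expansion_side(deg_stats, left, right):
    """Choose the cheaper starting point for a traversal by out-degree."""
    return left if deg_stats[left] <= deg_stats[right] else right

# An ordinary user vs. a celebrity account with millions of followers:
deg = {"A": 5, "celebrity": 2_000_000}
print(pick_expansion_side(deg, "A", "celebrity"))  # → A
```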

4. GS implementation

Storage structure – A partition (one source vertex + edge type) is stored as a B‑tree; each B‑tree has its own WAL and log‑id, facilitating GNN‑style sampling.

Log management – Single‑writer per B‑tree prevents concurrent write conflicts; WAL entries are written first, data is persisted during compaction.
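The per-tree WAL with a monotonic log-id can be sketched as below (our simplification; the real engine persists pages to the KV store during compaction, which we model as truncating the log):

```python
class PartitionLog:
    """Single-writer WAL for one B-tree partition."""
    def __init__(self):
        self.log_id = 0
        self.wal = []

    def append(self, record):
        self.log_id += 1                        # per-tree monotonic id
        self.wal.append((self.log_id, record))  # write-ahead before data
        return self.log_id

    def compact(self, persisted_up_to):
        # Drop entries whose effects are already persisted to data pages.
        self.wal = [(i, r) for i, r in self.wal if i > persisted_up_to]
```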

Cache implementation

Graph‑native cache that understands graph semantics and supports partial push‑down for one‑hop queries.

High‑performance LRU cache with NUMA‑aware and cache‑line‑aware design, supporting Intel AEP.

Write‑through cache with configurable sync policies and negative‑cache support.

Separation of cache and storage enables independent scaling.
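The LRU portion of the cache design can be illustrated with a minimal sketch; the NUMA-aware, cache-line-aware, and AEP aspects described above are hardware-level concerns that plain Python cannot express.

```python
from collections import OrderedDict

class LRUCache:
    """Minimal LRU cache: evicts the least recently used entry when full."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.data = OrderedDict()

    def get(self, key):
        if key not in self.data:
            return None
        self.data.move_to_end(key)  # mark as most recently used
        return self.data[key]

    def put(self, key, value):
        if key in self.data:
            self.data.move_to_end(key)
        self.data[key] = value
        if len(self.data) > self.capacity:
            self.data.popitem(last=False)  # evict least recently used
```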

05 Key Problem Analysis

1. Indexing

Local index : built on a given source vertex and edge type to accelerate attribute filtering and sorting.

Global index : currently supports point‑attribute global lookup; consistency is ensured via distributed transactions.
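A local index of the kind described above can be sketched as a sorted edge list per (source vertex, edge type); names here are ours. Keeping edges ordered by an attribute such as a timestamp lets range filters and "latest N" queries avoid a full scan:

```python
import bisect

class LocalEdgeIndex:
    """Edges of one (source vertex, edge type), sorted by one attribute."""
    def __init__(self):
        self.keys, self.edges = [], []  # parallel arrays, sorted by key

    def insert(self, key, edge):
        i = bisect.bisect_left(self.keys, key)
        self.keys.insert(i, key)
        self.edges.insert(i, edge)

    def range(self, lo, hi):
        """Edges whose key lies in [lo, hi], found by binary search."""
        i = bisect.bisect_left(self.keys, lo)
        j = bisect.bisect_right(self.keys, hi)
        return self.edges[i:j]
```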

2. Hot‑spot read/write

Hot‑spot reads (e.g., frequently refreshed video likes) are handled by multiple GQ instances and a copy‑on‑write GS, achieving >200 k QPS. Hot‑spot writes use group‑commit to batch writes into KV, reducing IOPS pressure.
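The group-commit idea for hot-spot writes can be sketched as buffering writes and flushing them to KV in one batch once the buffer fills, trading a little latency for far fewer IOPS (a simplification of ours; real group commit also flushes on a timer):

```python
class GroupCommitter:
    """Batch hot-spot writes into a single flush to the KV store."""
    def __init__(self, kv, batch_size):
        self.kv, self.batch_size, self.buf = kv, batch_size, []

    def write(self, key, value):
        self.buf.append((key, value))
        if len(self.buf) >= self.batch_size:
            self.flush()

    def flush(self):
        for k, v in self.buf:   # one batched KV round-trip in practice
            self.kv[k] = v
        self.buf.clear()
```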

3. Light/Heavy query resource allocation – Separate thread pools for lightweight (high‑frequency) and heavyweight (complex) queries; heavy pool can absorb light queries when idle.
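The pool separation can be sketched with two executors, sized so heavy queries cannot starve light traffic (sizes and names are ours for illustration):

```python
from concurrent.futures import ThreadPoolExecutor

light_pool = ThreadPoolExecutor(max_workers=8)  # high-frequency simple queries
heavy_pool = ThreadPoolExecutor(max_workers=2)  # complex multi-hop queries

def submit(query, is_heavy):
    """Route a query to the pool matching its weight class."""
    pool = heavy_pool if is_heavy else light_pool
    return pool.submit(query)

f = submit(lambda: "one-hop result", is_heavy=False)
print(f.result())  # → one-hop result
```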

4. High availability

Metro‑area dual‑datacenter deployment with low inter‑datacenter latency, following a one‑writer, many‑reader replication strategy.

Wide‑area disaster recovery across regions such as Singapore and the US, using binlog replication and hybrid logical clocks for event ordering.
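A hybrid logical clock (our sketch, not ByteGraph's implementation) combines physical time with a logical counter so binlog events from different regions can be totally ordered even under clock skew:

```python
import time

class HLC:
    """Hybrid logical clock: (wall_ms, logical) tuples compare causally."""
    def __init__(self):
        self.wall, self.logical = 0, 0

    def now(self):
        pt = int(time.time() * 1000)
        if pt > self.wall:
            self.wall, self.logical = pt, 0
        else:
            self.logical += 1  # physical clock stalled: bump the counter
        return (self.wall, self.logical)

    def observe(self, remote_wall, remote_logical):
        """Merge a remote timestamp so later local events sort after it."""
        pt = int(time.time() * 1000)
        new_wall = max(self.wall, remote_wall, pt)
        if new_wall == self.wall and new_wall == remote_wall:
            self.logical = max(self.logical, remote_logical) + 1
        elif new_wall == self.wall:
            self.logical += 1
        elif new_wall == remote_wall:
            self.logical = remote_logical + 1
        else:
            self.logical = 0
        self.wall = new_wall
        return (self.wall, self.logical)
```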

5. Offline‑online data stream fusion

Offline data is bulk‑imported into ByteGraph, online writes are ingested continuously, and the two streams are merged on the internal data platform for offline analytics.

Thank you for reading.

Written by

DataFunTalk

Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.
