How ByteGraph Powers TikTok’s Real‑Time Graph Queries: Architecture, Gremlin, and Scaling

This article details ByteGraph, the high‑performance graph database behind TikTok, covering its motivation over relational stores, data model, Gremlin query language, multi‑layer architecture, indexing, hot‑spot handling, and offline‑online data pipelines, while answering common technical questions.

dbaplus Community

Jul 6, 2021

Understanding Graph Databases

Traditional relational databases struggle with massive, highly connected data such as TikTok’s user‑follow, like, and contact graphs, leading to costly daily batch jobs, stale recommendations, and high latency. Graph databases store data as vertices, edges, and properties, enabling efficient traversals that reduce network and memory overhead.

Why Graph Over Relational

Unlike multi‑table joins, a graph query traverses relationships directly from each vertex to its neighbors, so it touches only the data along the path and costs less compute. For example, finding a user’s friends and the companies they work for can be expressed in a single Gremlin statement instead of complex SQL joins.
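
To make the friends-and-companies example concrete, here is a minimal Python sketch of a two-hop traversal over a toy in-memory graph. It is not ByteGraph code: the sample data, the edge labels friend and works_at, and the helpers add_edge, out, and friends_companies are all hypothetical. In Gremlin, the same idea would look roughly like g.V().has('name', 'alice').out('friend').out('works_at').dedup().

```python
from collections import defaultdict

# Hypothetical toy graph: (source vertex, edge label) -> list of target vertices.
edges = defaultdict(list)


def add_edge(src, label, dst):
    edges[(src, label)].append(dst)


add_edge("alice", "friend", "bob")
add_edge("alice", "friend", "carol")
add_edge("bob", "works_at", "AcmeCorp")
add_edge("carol", "works_at", "Globex")
add_edge("carol", "works_at", "AcmeCorp")


def out(vertices, label):
    """Follow outgoing edges with the given label from every vertex in `vertices`."""
    for v in vertices:
        yield from edges.get((v, label), [])


def friends_companies(user):
    # Roughly: g.V().has('name', user).out('friend').out('works_at').dedup()
    seen = []
    for company in out(out([user], "friend"), "works_at"):
        if company not in seen:  # keep first-seen order, drop duplicates
            seen.append(company)
    return seen


print(friends_companies("alice"))
```

Each hop reads only the adjacency lists of the vertices reached so far, which is the property the paragraph above contrasts with joining whole tables.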

ByteGraph Evolution and Use Cases

ByteGraph, developed by ByteDance’s infrastructure team, now supports millions of QPS on a single machine, multi‑dimensional sorting (by follow time or relationship strength), and multi‑hop traversals across billions of edges. It powers TikTok’s recommendation engine, e‑commerce graph, knowledge graph, and service‑dependency mapping.

Typical scenarios include:

Online storage of user relationships for real‑time recommendation.

Knowledge‑graph queries for search and education.

Service‑dependency graphs for operational monitoring.

Data Model and Query Language

ByteGraph uses a directed property graph: each vertex has a unique (uid, app) pair and a type; edges have a source, target, type, and attributes (e.g., timestamp, location). Gremlin, a Turing‑complete graph query language, is the primary interface, allowing expressive queries such as filtering by vertex properties or edge attributes.
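
The data model described above can be sketched with a few Python dataclasses. This is only an illustration under stated assumptions: the class names VertexKey, Vertex, Edge, and PropertyGraph, the field names, and the follow edge type are hypothetical and do not come from ByteGraph's actual API.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class VertexKey:
    uid: int  # user or entity id
    app: int  # application id; (uid, app) is assumed unique


@dataclass
class Vertex:
    key: VertexKey
    vtype: str
    props: dict = field(default_factory=dict)


@dataclass
class Edge:
    src: VertexKey
    dst: VertexKey
    etype: str
    props: dict = field(default_factory=dict)  # e.g. timestamp, location


class PropertyGraph:
    def __init__(self):
        self.vertices = {}   # VertexKey -> Vertex
        self.out_edges = {}  # (src key, edge type) -> list of Edge

    def add_vertex(self, vertex):
        self.vertices[vertex.key] = vertex

    def add_edge(self, edge):
        self.out_edges.setdefault((edge.src, edge.etype), []).append(edge)

    def out(self, src, etype, predicate=lambda e: True):
        """Targets of directed edges of one type, filtered by an edge predicate."""
        return [e.dst for e in self.out_edges.get((src, etype), []) if predicate(e)]


g = PropertyGraph()
a, b, c = VertexKey(1, 7), VertexKey(2, 7), VertexKey(3, 7)
for key in (a, b, c):
    g.add_vertex(Vertex(key, "user"))
g.add_edge(Edge(a, b, "follow", {"ts": 100}))
g.add_edge(Edge(a, c, "follow", {"ts": 200}))

# Filter by an edge attribute: whom did `a` follow at or after ts=150?
recent = g.out(a, "follow", lambda e: e.props["ts"] >= 150)
print(recent)
```

Because edges are directed, the follow edges from a are not visible when starting from b or c unless reverse edges are stored as well.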

ByteGraph Architecture

The system is split into three independent layers that can be scaled horizontally:

Query Engine Layer: parses Gremlin strings, applies rule‑based (RBO) and cost‑based (CBO) optimizations, generates logical and physical plans, and dispatches sub‑queries to storage nodes.

Storage Engine Layer: maps graph partitions to a distributed KV store (Abase, ByteKV, HBase, RocksDB), organizes edges in B‑tree‑like pages, and supports write‑ahead logging (WAL) and 1‑PC read‑committed transactions.

Disk Storage Layer: persists KV pages; future versions will use a native graph store.
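
How a query engine might dispatch sub-queries to storage nodes can be sketched as follows. The article does not say how ByteGraph maps partitions to nodes, so the hash partitioning here is an assumption, and the node names and the functions node_for and plan_subqueries are hypothetical.

```python
import zlib
from collections import defaultdict

# Hypothetical storage node names; a real deployment would discover these.
STORAGE_NODES = ["store-0", "store-1", "store-2"]


def node_for(uid, app):
    """Pick a storage node by hashing the vertex key (assumed scheme)."""
    key = f"{uid}:{app}".encode()
    # crc32 is stable across processes, unlike Python's built-in hash().
    return STORAGE_NODES[zlib.crc32(key) % len(STORAGE_NODES)]


def plan_subqueries(vertex_keys, edge_type):
    """Group one-hop lookups by the node that owns each source vertex."""
    plan = defaultdict(list)
    for uid, app in vertex_keys:
        plan[node_for(uid, app)].append((uid, app, edge_type))
    return dict(plan)


frontier = [(uid, 7) for uid in range(1, 9)]
plan = plan_subqueries(frontier, "follow")
for node, lookups in sorted(plan.items()):
    print(node, len(lookups))
```

Batching the lookups per node means one request per storage node for each hop, rather than one request per vertex.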

Read/Write Flow

Writes are first logged to WAL, applied to in‑memory B‑tree pages, and later flushed to disk, minimizing write amplification. Reads locate the appropriate storage node, fetch the page from cache or disk, and return results.
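
The write path above (log first, apply in memory, flush later) can be sketched in a few lines of Python. This is a simplified model, not ByteGraph's implementation: the class EdgeStore, its method names, and the use of in-process lists and dicts in place of a real log file and KV store are all assumptions made for illustration.

```python
import json


class EdgeStore:
    def __init__(self):
        self.wal = []         # stand-in for an append-only log file
        self.mem_pages = {}   # dirty in-memory pages: src -> list of (dst, ts)
        self.disk_pages = {}  # stand-in for pages persisted in the KV store

    def write_edge(self, src, dst, ts):
        record = {"src": src, "dst": dst, "ts": ts}
        self.wal.append(json.dumps(record))  # 1. durably log the change first
        self._apply(record)                  # 2. then update the in-memory page

    def _apply(self, record):
        src = record["src"]
        # Load the page from "disk" on first touch, then modify it in memory.
        page = self.mem_pages.setdefault(src, list(self.disk_pages.get(src, [])))
        page.append((record["dst"], record["ts"]))

    def flush(self):
        # 3. Persist dirty pages in one batch; the log can then be truncated.
        for src, page in self.mem_pages.items():
            self.disk_pages[src] = list(page)
        self.mem_pages.clear()
        self.wal.clear()

    def read_edges(self, src):
        # Serve from the in-memory page if present, otherwise from disk.
        if src in self.mem_pages:
            return list(self.mem_pages[src])
        return list(self.disk_pages.get(src, []))

    def recover(self):
        # After a crash, memory is lost: replay the log over the disk state.
        self.mem_pages = {}
        for line in self.wal:
            self._apply(json.loads(line))


store = EdgeStore()
store.write_edge("a", "b", 1)
store.flush()
store.write_edge("a", "c", 2)  # logged and in memory, not yet flushed
store.mem_pages = {}           # simulate a crash that wipes memory
store.recover()
print(store.read_edges("a"))
```

Several writes to the same page are absorbed in memory and persisted once at flush time, which is how this pattern limits write amplification.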

Indexing

ByteGraph provides both local (per‑vertex‑type) and global indexes. Local indexes accelerate queries on edge attributes (e.g., age, timestamp). Global indexes map attribute values to all matching vertex IDs, maintained with distributed transactions for consistency.
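
The two index kinds can be illustrated with a small Python sketch. It assumes a local index is a sorted structure over one edge list and a global index is a map from attribute value to vertex IDs; the class names LocalEdgeIndex and GlobalIndex and their methods are hypothetical, and the distributed transactions that keep a real global index consistent are not modeled.

```python
import bisect
from collections import defaultdict


class LocalEdgeIndex:
    """One edge list kept sorted by a single attribute (here: timestamp)."""

    def __init__(self):
        self._keys = []  # sorted timestamps
        self._dsts = []  # target vertices, aligned with _keys

    def add(self, ts, dst):
        pos = bisect.bisect_right(self._keys, ts)
        self._keys.insert(pos, ts)
        self._dsts.insert(pos, dst)

    def range(self, lo, hi):
        """Targets with lo <= ts <= hi, located by binary search."""
        start = bisect.bisect_left(self._keys, lo)
        end = bisect.bisect_right(self._keys, hi)
        return self._dsts[start:end]


class GlobalIndex:
    """Maps an attribute value to every vertex ID that carries it."""

    def __init__(self):
        self._by_value = defaultdict(set)

    def put(self, vertex_id, value):
        self._by_value[value].add(vertex_id)

    def lookup(self, value):
        return sorted(self._by_value.get(value, ()))


follows = LocalEdgeIndex()
follows.add(30, "carol")
follows.add(10, "alice")
follows.add(20, "bob")
print(follows.range(10, 20))

by_city = GlobalIndex()
by_city.put(1, "Beijing")
by_city.put(2, "Shanghai")
by_city.put(3, "Beijing")
print(by_city.lookup("Beijing"))
```

The local index answers range queries without scanning the whole edge list, while the global index answers a question that no single starting vertex could: which vertices anywhere in the graph have a given value.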

Hot‑Spot Mitigation

Super‑nodes (e.g., a celebrity with millions of followers) are split into multiple edge pages once a threshold (e.g., 2 000 edges) is exceeded, preventing write and read bottlenecks.
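
Threshold-based page splitting can be sketched as below. This is a flat, single-level simplification of a B-tree-like layout, not ByteGraph's actual structure: the class EdgePages, its methods, and the tiny demonstration threshold of 4 (standing in for something like the 2 000 mentioned above) are assumptions.

```python
import bisect

MAX_EDGES_PER_PAGE = 4  # tiny value for demonstration only


class EdgePages:
    """Sorted edge list of one vertex, stored as several bounded pages."""

    def __init__(self, max_edges=MAX_EDGES_PER_PAGE):
        self.max_edges = max_edges
        self.pages = [[]]  # each page is a sorted list of (ts, dst)

    def _page_index(self, key):
        # Choose the last page whose first entry is <= key (default: first page).
        idx = 0
        for i, page in enumerate(self.pages):
            if page and page[0] <= key:
                idx = i
        return idx

    def insert(self, ts, dst):
        key = (ts, dst)
        i = self._page_index(key)
        page = self.pages[i]
        bisect.insort(page, key)
        if len(page) > self.max_edges:
            # Split the overfull page in half; only this page is rewritten.
            mid = len(page) // 2
            self.pages[i:i + 1] = [page[:mid], page[mid:]]

    def scan(self):
        return [edge for page in self.pages for edge in page]


celebrity = EdgePages()
for ts in range(10):
    celebrity.insert(ts, f"fan-{ts}")
print(len(celebrity.pages), [len(p) for p in celebrity.pages])
```

A write to a super-node then modifies one small page instead of rewriting a single huge value, and a read of a time range loads only the pages that overlap it.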

Offline/Online Data Integration

Bulk data from MySQL, Hive, Redis, HBase is ingested via internal MapReduce pipelines. Real‑time writes arrive through Gremlin SDKs or Kafka. Daily snapshots are exported to Hive for offline analytics and model training.

Key Technical Q&A

Gremlin vs. GSQL: Gremlin is more pipeline‑oriented and natural‑language‑like, while GSQL resembles SQL and may become a standard.

High Availability: Currently not supported.

Graph Computations: Not provided by ByteGraph; a separate system handles tasks like triangle counting.

Why Not DGraph: ByteDance’s massive, globally distributed workloads exceed the capabilities of existing open‑source solutions.

Super‑Node Handling: B‑tree page splitting keeps performance stable even for nodes with >100 M edges.

OLTP vs. OLAP: ByteGraph serves both, with distinct optimizations for each.
