Design and Optimization of REDgraph: Distributed Parallel Multi‑hop Query for Large‑Scale Social Graphs
This article presents the design, challenges, and performance‑focused optimizations of REDgraph, a large‑scale graph database used at Xiaohongshu, detailing its architecture, edge‑partitioning strategy, distributed parallel query implementation, and experimental results that demonstrate significant latency reductions for multi‑hop queries.
REDgraph is a graph database developed to handle the massive social graph of Xiaohongshu, improving query efficiency by leveraging graph‑native storage and processing for relationship‑intensive workloads.
The presentation outlines five parts: background introduction, analysis of the original architecture, distributed parallel query implementation, optimization strategies, and a summary with future outlook.
Graph databases outperform traditional relational databases on multi‑hop queries, as illustrated by a Gremlin example that reduces a complex three‑join SQL statement to a concise four‑line traversal.
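The cited Gremlin example is not reproduced in this summary, but the core advantage can be sketched in Python over a hypothetical follower graph: a multi‑hop traversal is just a chain of neighbor lookups, whereas the relational version needs one self‑join per hop. The data and function names below are illustrative, not REDgraph's API.

```python
# Hypothetical follower graph stored natively as adjacency lists.
# In a relational store, each extra hop means another self-join on a
# follows(src, dst) table; here it is another round of dict lookups.
follows = {
    "alice": ["bob", "carol"],
    "bob": ["dave"],
    "carol": ["dave", "erin"],
    "dave": ["frank"],
    "erin": [],
    "frank": [],
}

def k_hop(graph, start, k):
    """Return the set of vertices reachable in exactly k hops."""
    frontier = {start}
    for _ in range(k):
        frontier = {dst for src in frontier for dst in graph.get(src, [])}
    return frontier

print(sorted(k_hop(follows, "alice", 3)))  # → ['frank']
```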
In Xiaohongshu, graph queries support real‑time recommendation, community risk control, and offline task scheduling, but three‑hop queries suffered from high latency (P99 > 50 ms for recommendation, >200 ms for risk control).
Analysis shows that three‑hop queries combine OLTP‑level concurrency requirements with far larger data access and moderate computational complexity, making them difficult to serve within strict latency targets.
The original REDgraph architecture follows a NewSQL‑like separation of meta‑server, query‑server, and store‑server, using edge‑partitioning (hash‑based) to distribute graph data.
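The article only states that partitioning is hash‑based; a minimal sketch of the idea (the shard count, hash function, and key choice here are assumptions, not REDgraph's actual scheme) might look like:

```python
# Minimal sketch of hash-based edge partitioning: routing every edge by
# a hash of its source vertex keeps all out-edges of a vertex on one
# store server, so a neighbor lookup touches a single shard.
NUM_SHARDS = 4  # illustrative shard count, not REDgraph's actual layout

def shard_of(src_vertex: str) -> int:
    # Toy deterministic hash; a real system would use a hash chosen
    # for balance and stability across restarts.
    return sum(src_vertex.encode()) % NUM_SHARDS

def partition_edges(edges):
    shards = {i: [] for i in range(NUM_SHARDS)}
    for src, dst in edges:
        shards[shard_of(src)].append((src, dst))
    return shards

shards = partition_edges([("a", "b"), ("a", "c"), ("b", "c")])
# Both out-edges of "a" land in the same shard.
```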
Initial optimizations (version 1.0) focused on custom algorithms, fan‑out limiting, operator push‑down, read‑only replicas, and GC tuning, but suffered from limited generality.
A new distributed parallel query framework was introduced, moving execution control to store servers, enabling operator parallelism, removing synchronization barriers, and allowing direct result forwarding.
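The direct‑forwarding idea can be sketched as follows (server layout, hash function, and message shapes are assumptions for illustration): each store server expands its part of the frontier and the results are routed straight to the shard owning the next hop, rather than funnelling every intermediate set back through the query server.

```python
# Sketch of direct result forwarding between store servers.
def shard_of(v: str, n: int) -> int:
    return sum(v.encode()) % n  # toy deterministic hash

class StoreServer:
    def __init__(self):
        self.adj = {}  # {src: [dst, ...]} owned by this shard

    def expand(self, frontier):
        return {dst for src in frontier for dst in self.adj.get(src, [])}

def build_servers(edges, n):
    servers = [StoreServer() for _ in range(n)]
    for src, dst in edges:
        servers[shard_of(src, n)].adj.setdefault(src, []).append(dst)
    return servers

def multi_hop(servers, start, hops):
    frontier = {start}
    for _ in range(hops):
        # Route each frontier vertex to its owning shard...
        buckets = {}
        for v in frontier:
            buckets.setdefault(shard_of(v, len(servers)), set()).add(v)
        # ...and let every shard expand its bucket (in parallel in the
        # real system; sequential here), with no global barrier at the
        # query server between hops.
        frontier = set().union(*(servers[s].expand(b)
                                 for s, b in buckets.items()))
    return frontier

servers = build_servers([("a", "b"), ("b", "c"), ("c", "d")], 2)
```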
Additional enhancements include a short‑lived NeighborCache to avoid duplicate neighbor lookups, load‑balancing via replica selection and cluster scaling, and lightweight flow‑control using ACK‑based stage tracking.
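The short‑lived NeighborCache can be sketched as a per‑query memo over neighbor lookups (class and method names here are hypothetical): within one multi‑hop query, many paths revisit the same vertex, so its neighbor list is fetched from storage once and then reused.

```python
# Sketch of a short-lived, per-query NeighborCache.
class NeighborCache:
    def __init__(self, fetch_fn):
        self._fetch = fetch_fn   # real lookup against the store
        self._cache = {}         # vertex -> neighbor list
        self.misses = 0          # actual storage fetches

    def neighbors(self, v):
        if v not in self._cache:
            self.misses += 1
            self._cache[v] = self._fetch(v)
        return self._cache[v]

adj = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": ["e"]}
cache = NeighborCache(lambda v: adj.get(v, []))

# Expand three hops from "a", tracking paths: "d" is reached via both
# "b" and "c", so neighbors("d") is called twice but fetched only once.
paths = [["a"]]
for _ in range(3):
    paths = [p + [n] for p in paths for n in cache.neighbors(p[-1])]
```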
Performance tests on the LDBC‑SNB SF100 dataset (≈300 M vertices, 1.8 B edges) show that one‑ and two‑hop queries retain baseline performance, while three‑hop queries achieve a 50‑60 % latency reduction, keeping P99 under 200 ms; four‑hop queries still see a 50‑70 % improvement, though their latency remains at the level of seconds.
The work demonstrates that applying MPP concepts to a graph database can unlock previously infeasible multi‑hop queries, delivering substantial speedups and a scalable path forward for ever‑growing social data workloads.
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.