Distributed Parallel Multi‑Hop Query Optimization in REDgraph Graph Database
To meet Xiaohongshu’s strict P99 latency targets for multi‑hop graph queries, the REDgraph team built a distributed parallel execution framework that eliminates global barriers, pushes operators down to storage nodes, and introduces routing operators and a NeighborCache, cutting three‑hop query latency by more than 50% and making production deployment viable.
Multi‑hop queries are essential for deep data insight in Xiaohongshu’s online services, but they often fail to meet strict P99 latency requirements. To address this, the storage team built a distributed parallel query framework on top of their graph database REDgraph, reducing multi‑hop latency by more than 50% and making three‑hop queries viable in production.
The article first introduces the business need for graph‑based multi‑hop queries in scenarios such as community recommendation, risk control, and e‑commerce, and explains why traditional SQL or KV solutions are inefficient.
REDgraph’s architecture follows a shared‑nothing, compute‑storage separation model with three node types: Meta service (metadata, schema, permissions), Graph service (query parsing, optimization, scheduling, stateless execution), and Storage service (graph API, Raft‑based replication, RocksDB engine). The system adopts edge‑partitioning to distribute billions of vertices and edges across the cluster.
Initial optimization (Version 1.0) focused on custom algorithms, fan‑out limiting, operator push‑down, read‑only replicas, and GC tuning, but suffered from low generality. The new design draws on MPP concepts and introduces a fully distributed parallel execution pipeline:
Eliminate global barriers in the query layer to avoid blocking on slow storage nodes.
Parallelize most operators (Project, InnerJoin, etc.) across storage nodes rather than executing them serially on the query server.
Introduce routing operators (FORWARD, CONVERGE, MERGE) to move parts of the execution plan to storage nodes.
Implement a NeighborCache on storage nodes to deduplicate start‑vertex IDs and avoid repeated neighbor lookups. The cache is a map keyed by vertex ID plus edge type, mapping to the neighbor list, with a short TTL and memory‑aware eviction.
Balance load by routing reads to the least‑loaded replica and by expanding the cluster when hotspotting occurs.
Use an Acker mechanism with XOR‑based stage counters to track distributed stage progress and enable safe stage transitions without a global barrier.
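To make the NeighborCache idea above concrete, here is a minimal sketch, assuming a single‑threaded cache keyed by (vertex ID, edge type) with a short TTL and LRU eviction as a stand‑in for the article's "memory‑aware" policy; the class and method names are hypothetical, not REDgraph's actual API:

```cpp
#include <chrono>
#include <cstdint>
#include <list>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical storage-node neighbor cache: entries expire after a short
// TTL, and an LRU list bounds the number of cached neighbor lists.
class NeighborCache {
public:
    NeighborCache(size_t max_entries, std::chrono::milliseconds ttl)
        : max_entries_(max_entries), ttl_(ttl) {}

    // Returns true and fills `out` on a fresh hit; stale or missing -> false.
    bool Get(uint64_t vid, uint32_t edge_type, std::vector<uint64_t>* out) {
        auto it = map_.find(Key(vid, edge_type));
        if (it == map_.end()) return false;
        Entry& e = it->second;
        if (Clock::now() - e.inserted > ttl_) {      // expired: evict lazily
            lru_.erase(e.lru_pos);
            map_.erase(it);
            return false;
        }
        lru_.splice(lru_.begin(), lru_, e.lru_pos);  // mark most-recently used
        *out = e.neighbors;
        return true;
    }

    void Put(uint64_t vid, uint32_t edge_type, std::vector<uint64_t> neighbors) {
        std::string key = Key(vid, edge_type);
        if (map_.count(key)) return;                 // already cached
        if (map_.size() >= max_entries_) {           // evict least-recently used
            map_.erase(lru_.back());
            lru_.pop_back();
        }
        lru_.push_front(key);
        map_[key] = Entry{std::move(neighbors), Clock::now(), lru_.begin()};
    }

private:
    using Clock = std::chrono::steady_clock;
    struct Entry {
        std::vector<uint64_t> neighbors;
        Clock::time_point inserted;
        std::list<std::string>::iterator lru_pos;
    };
    static std::string Key(uint64_t vid, uint32_t edge_type) {
        return std::to_string(vid) + "#" + std::to_string(edge_type);
    }
    size_t max_entries_;
    std::chrono::milliseconds ttl_;
    std::list<std::string> lru_;                     // front = most recent
    std::unordered_map<std::string, Entry> map_;
};
```

Because many multi‑hop expansions revisit the same start vertices within milliseconds, even a small cache with a short TTL can deduplicate a large fraction of neighbor lookups without risking stale reads for long.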
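The XOR‑based Acker can be sketched as follows (a minimal illustration in the style of Apache Storm's acker, with hypothetical names; the article does not show REDgraph's actual implementation): each sub‑task gets a unique 64‑bit ID that is XORed into a per‑stage counter once on dispatch and once on completion, so the counter returns to zero exactly when every dispatched task has acked.

```cpp
#include <cstdint>
#include <unordered_map>

// Hypothetical per-stage ack tracker: a stage's XOR accumulator reaches
// zero only when every dispatched task ID has also been acked, letting
// the coordinator advance stages without a global barrier.
class StageAcker {
public:
    // Called when a sub-task is dispatched to a storage node.
    void Dispatch(uint32_t stage, uint64_t task_id) { xor_[stage] ^= task_id; }
    // Called when a storage node reports the sub-task complete.
    void Ack(uint32_t stage, uint64_t task_id) { xor_[stage] ^= task_id; }
    // The stage may transition once its accumulated XOR is back to zero.
    bool StageComplete(uint32_t stage) const {
        auto it = xor_.find(stage);
        return it == xor_.end() || it->second == 0;
    }
private:
    std::unordered_map<uint32_t, uint64_t> xor_;
};
```

The appeal of this scheme is that tracking any number of in‑flight tasks costs one 64‑bit word per stage, and acks can arrive in any order, which fits a pipeline where storage nodes finish at different times.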
Performance tests using the LDBC SNB dataset (SF100, 300 M vertices, 1.8 B edges) show that one‑ and two‑hop queries have latency comparable to the original engine, while three‑hop queries achieve a 50‑60% latency reduction, keeping P99 under 50 ms for high‑degree cases and under 200 ms for limited cases. Four‑hop queries still take seconds, though the new framework yields a 50‑70% improvement there as well.
The solution is framework‑level, requires no changes to business logic, and is planned for rollout across multiple online services in the second half of the year. It also provides valuable insights for other database products within the company.
Xiaohongshu Tech REDtech
Official account of the Xiaohongshu tech team, sharing tech innovations and problem insights, advancing together.