CStore: A Native Graph Storage Engine for Large-Scale Graph Analysis
CStore is a native graph storage engine implemented in Rust for large-scale graph analysis. It supports petabyte-scale data with an array-plus-linked-list storage layout, multi-level indexing, and efficient compaction; this article also covers build instructions and the roadmap for its open-source development.
CStore is a native graph storage engine written in Rust, optimized for graph analysis scenarios. It can store graphs with billions of vertices and trillions of edges, and has been used internally at Ant Group for petabyte‑scale workloads.
Graph storage engines can be classified into several types based on their storage mechanisms, such as list-based (e.g., Neo4j), hash-plus-list (e.g., ArangoDB), key-value (e.g., Titan/JanusGraph/HugeGraph), and traditional relational (e.g., AgensGraph). CStore adopts an array-plus-linked-list structure.
In CStore, graph data is modeled as an attribute graph consisting of vertices and edges, each with associated properties. Input vertex IDs are normalized to compact 4-byte IDs (a step called "ID化", i.e., ID conversion), reducing memory consumption and enabling O(1) primary-key lookups via an array index.
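To make the ID-conversion step concrete, here is a minimal sketch (type and field names are illustrative assumptions, not CStore's actual implementation): arbitrary external vertex IDs are mapped to dense 4-byte IDs, so per-vertex state can live in a plain array and reverse lookup becomes O(1) indexing.

```rust
use std::collections::HashMap;

/// Hypothetical sketch of ID normalization: external vertex IDs are mapped
/// to dense 4-byte internal IDs in assignment order.
struct IdMapper {
    to_internal: HashMap<String, u32>, // external ID -> dense internal ID
    to_external: Vec<String>,          // internal ID -> external ID
}

impl IdMapper {
    fn new() -> Self {
        IdMapper { to_internal: HashMap::new(), to_external: Vec::new() }
    }

    /// Assign (or reuse) a dense 4-byte ID for an external vertex ID.
    fn get_or_assign(&mut self, external: &str) -> u32 {
        if let Some(&id) = self.to_internal.get(external) {
            return id;
        }
        let id = self.to_external.len() as u32;
        self.to_internal.insert(external.to_string(), id);
        self.to_external.push(external.to_string());
        id
    }

    /// O(1) reverse lookup via array indexing.
    fn external_of(&self, id: u32) -> &str {
        &self.to_external[id as usize]
    }
}

fn main() {
    let mut mapper = IdMapper::new();
    let a = mapper.get_or_assign("user:alice");
    let b = mapper.get_or_assign("user:bob");
    assert_eq!(mapper.get_or_assign("user:alice"), a); // idempotent
    println!("alice -> {a}, bob -> {b} ({})", mapper.external_of(b));
}
```

Because internal IDs are dense and 4 bytes wide, an adjacency array indexed by internal ID replaces a hash lookup on the hot read path.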
After ID conversion, data is serialized into a PrimaryKey and GraphData. The PrimaryKey stores the vertex/edge ID, while GraphData contains a SecondKey (graph metadata) and Property (attributes). SecondKey encodes variable‑size target IDs, timestamps, and other meta‑information.
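The separation described above can be sketched as Rust types; the field names and byte layout here are assumptions for illustration, not CStore's on-disk format. The point is that the fixed-size SecondKey (graph metadata) can be encoded and indexed independently of the variable-size Property payload.

```rust
/// Illustrative types only (not CStore's actual schema): a record splits
/// into a fixed-size PrimaryKey plus GraphData, whose SecondKey and
/// Property parts are stored and indexed independently.
#[allow(dead_code)]
#[derive(Debug, Clone, Copy)]
struct PrimaryKey {
    src_id: u32, // 4-byte normalized vertex ID
}

#[derive(Debug, Clone, Copy)]
struct SecondKey {
    target_id: u32, // normalized target vertex ID
    timestamp: u64, // write/event time
    label: u16,     // label code (meta-information)
}

#[allow(dead_code)]
#[derive(Debug, Clone)]
struct GraphData {
    second_key: SecondKey,
    property: Vec<u8>, // serialized attributes, kept in a separate file
}

/// Fixed-size binary encoding of the metadata part keeps index files
/// compact and cheap to scan during compaction.
fn encode_second_key(k: &SecondKey) -> [u8; 14] {
    let mut buf = [0u8; 14];
    buf[0..4].copy_from_slice(&k.target_id.to_le_bytes());
    buf[4..12].copy_from_slice(&k.timestamp.to_le_bytes());
    buf[12..14].copy_from_slice(&k.label.to_le_bytes());
    buf
}

fn main() {
    let k = SecondKey { target_id: 42, timestamp: 1_700_000_000, label: 3 };
    let bytes = encode_second_key(&k);
    println!("encoded {} bytes of graph metadata", bytes.len());
}
```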
The storage layout offers two main benefits: fixed‑size binary encoding reduces memory usage, and separating graph metadata from properties allows independent indexing and reduces compaction overhead.
CStore builds multi‑level indexes: a partition index (based on start‑vertex ID and write time), a sparse primary‑key index (default interval 2048, resident in memory), and secondary indexes (min‑max, Bloom filter) that accelerate hotspot queries by filtering on label, timestamp, etc.
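The sparse primary-key index can be illustrated with a small sketch, assuming the default interval of 2048 mentioned above (the struct and method names are invented for this example): only every 2048th sorted key is kept in memory, and a lookup binary-searches the sparse entries to find the on-disk block that may contain the key.

```rust
/// Sketch of a sparse primary-key index with the default interval of 2048.
const INTERVAL: usize = 2048;

struct SparseIndex {
    keys: Vec<u32>,    // every INTERVAL-th key from the sorted run
    offsets: Vec<u64>, // file offset of the block starting at that key
}

impl SparseIndex {
    fn build(sorted_keys: &[u32], block_size: u64) -> Self {
        let mut keys = Vec::new();
        let mut offsets = Vec::new();
        for (i, &k) in sorted_keys.iter().enumerate().step_by(INTERVAL) {
            keys.push(k);
            offsets.push((i / INTERVAL) as u64 * block_size);
        }
        SparseIndex { keys, offsets }
    }

    /// Returns the offset of the block that may hold `key`, or None if
    /// the key precedes the first indexed key.
    fn locate(&self, key: u32) -> Option<u64> {
        match self.keys.binary_search(&key) {
            Ok(i) => Some(self.offsets[i]),
            Err(0) => None,
            Err(i) => Some(self.offsets[i - 1]),
        }
    }
}

fn main() {
    // 10,000 sorted keys -> only 5 entries resident in memory.
    let sorted: Vec<u32> = (0..10_000).map(|i| i * 2).collect();
    let idx = SparseIndex::build(&sorted, 64 * 1024);
    println!("sparse entries: {}", idx.keys.len());
    println!("candidate block for key 5000: {:?}", idx.locate(5000));
}
```

Keeping one entry per 2048 keys shrinks the resident index by three orders of magnitude while adding at most one block scan per lookup; the min-max and Bloom-filter secondary indexes then prune that scan further.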
Data is written to disk in GraphData Segments. When a segment reaches a threshold, it is flushed as a partition, with data for the same start‑vertex ID stored contiguously. Sorted SecondKeys are indexed, while Properties are stored in separate attribute files.
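The flush-on-threshold behavior can be sketched as follows; this is a toy model under assumed names (Segment, write, flush), not CStore's actual code. Edges are buffered per start-vertex ID, and once the buffered size crosses a threshold the segment is flushed as a partition with each vertex's edges contiguous and sorted.

```rust
use std::collections::BTreeMap;

/// Toy segment: buffers edges by start-vertex ID and flushes a partition
/// once the buffered byte count reaches a threshold.
struct Segment {
    buffer: BTreeMap<u32, Vec<u32>>, // start-vertex ID -> target IDs
    bytes: usize,
    threshold: usize,
}

impl Segment {
    fn new(threshold: usize) -> Self {
        Segment { buffer: BTreeMap::new(), bytes: 0, threshold }
    }

    /// Buffer one edge; returns the flushed partition when full.
    fn write(&mut self, src: u32, dst: u32) -> Option<Vec<(u32, Vec<u32>)>> {
        self.buffer.entry(src).or_default().push(dst);
        self.bytes += 8; // two 4-byte normalized IDs
        if self.bytes >= self.threshold {
            return Some(self.flush());
        }
        None
    }

    /// Emit edges grouped by start-vertex ID, each adjacency list sorted,
    /// so data for the same start vertex is stored contiguously.
    fn flush(&mut self) -> Vec<(u32, Vec<u32>)> {
        self.bytes = 0;
        let mut out: Vec<(u32, Vec<u32>)> =
            std::mem::take(&mut self.buffer).into_iter().collect();
        for (_, dsts) in out.iter_mut() {
            dsts.sort_unstable();
        }
        out
    }
}

fn main() {
    let mut seg = Segment::new(32); // flush after 4 edges (4 * 8 bytes)
    seg.write(2, 9);
    seg.write(1, 5);
    seg.write(2, 3);
    if let Some(partition) = seg.write(1, 4) {
        println!("flushed partition: {partition:?}");
    }
}
```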
Index files ("is" files) follow an LSM‑Tree structure: each level contains sorted key‑value blocks, enabling efficient reads and writes. CStore supports single‑layer multi‑threaded compaction, where multiple compactor threads work in parallel to merge files.
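A hedged, miniature sketch of the parallel compaction idea (the function names are invented; a real compactor streams a k-way merge over on-disk blocks rather than sorting in memory): several compactor threads each merge a disjoint group of sorted runs into one larger sorted run, with no coordination needed between groups.

```rust
use std::thread;

/// Merge a group of sorted runs into one sorted run. A real compactor
/// streams a k-way merge; concatenate-and-sort keeps the sketch short.
fn merge_sorted(runs: Vec<Vec<u32>>) -> Vec<u32> {
    let mut out: Vec<u32> = runs.into_iter().flatten().collect();
    out.sort_unstable();
    out
}

/// One compactor thread per group of files; groups are independent, so
/// threads merge in parallel without coordinating.
fn compact_parallel(groups: Vec<Vec<Vec<u32>>>) -> Vec<Vec<u32>> {
    let handles: Vec<_> = groups
        .into_iter()
        .map(|g| thread::spawn(move || merge_sorted(g)))
        .collect();
    handles.into_iter().map(|h| h.join().unwrap()).collect()
}

fn main() {
    let groups = vec![
        vec![vec![1, 4, 9], vec![2, 3, 8]], // group for compactor thread 1
        vec![vec![5, 7], vec![6, 10]],      // group for compactor thread 2
    ];
    let merged = compact_parallel(groups);
    println!("merged runs: {merged:?}");
}
```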
Compilation of CStore requires Rust and a C++ toolchain. Example installation commands:
# Install Rust.
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
# Install the nightly toolchain.
rustup update && rustup toolchain install nightly && rustc --version
# Install other dependencies.
yum install make gcc gcc-c++ protobuf-devel protobuf clang

After cloning the TuGraph-Analytics repository, the CStore source can be built with:
git clone https://github.com/TuGraph-family/tugraph-analytics.git
cd tugraph-analytics/geaflow-cstore && make build

The provided Makefile offers targets such as build-dev, build-release, fmt, clippy, test-all, bench-all, doc, and others for code style checks, testing, benchmarking, and documentation generation.
Future plans for CStore include deeper graph‑analysis optimizations, columnar storage, graph fusion, materialized views, and a remote compactor to alleviate resource contention during compaction. The project is fully open‑source under the TuGraph‑Analytics repository, and contributors are welcomed.