How Ctrip Scaled Nebula Graph: Architecture, Blue‑Green Deployment, and Performance Tuning
This article details Ctrip's adoption of Nebula Graph, covering the motivations, distributed architecture, three deployment patterns, Kubernetes‑based operators, client session management, blue‑green read/write splitting, extensive performance tuning, and future roadmap for their production graph database platform.
Background
With the rapid growth of data volume and complexity, Ctrip needed a technology that could model and query highly connected data efficiently, leading to increased interest in graph databases.
Why Nebula Graph
Open‑source version provides horizontal scalability.
Native storage layer offers better performance than solutions built on third‑party stores.
Supports openCypher‑compatible query syntax, easing migration from Neo4j.
Active community since 2019 with major internet companies participating.
Clear codebase and low technical debt, suitable for further development.
Nebula Graph Architecture & Cluster Deployment
Nebula Graph follows a compute‑storage separation design composed of three services: graphd (computing), metad (metadata), and storaged (graph data). Three deployment modes are used:
Three‑datacenter deployment: provides fault tolerance across data centers but incurs cross‑datacenter latency.
Single‑datacenter deployment: avoids cross‑datacenter latency but loses availability if the datacenter fails.
Blue‑green (active‑passive) deployment: combines the advantages of the above by allowing traffic shifting and supporting read/write separation for high‑performance core services.
Middleware and Operations Management
The team built a Kubernetes CRD and Operator to manage Nebula Graph clusters, integrated deployment UI pages, and used a sidecar to collect core metrics, sending them via Telegraf to Ctrip's Hickwall monitoring system with alerts.
Automatic internal domain name allocation was added to keep stable node addresses within the cluster.
Client Enhancements
Session Management: Introduced a Session Pool that keeps Session objects in a queue, giving clients borrow‑and‑return semantics, pre‑generated sessions, and dynamic resizing when the configuration changes.
Blue‑Green with Read/Write Splitting: Modified the client to route reads to both clusters while writes go only to the primary, enabling seamless traffic switching.
Traffic Allocation: Implemented weighted round‑robin routing across per‑IDC Session Pools, which avoids a virtual IP (VIP) and the forwarding overhead that comes with it.
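The Session Pool and the weighted per‑IDC routing described above can be sketched roughly as follows. This is a minimal illustration, not Ctrip's client code: the class names SessionPool and WeightedRouter, the generic session type S, and the method names are all hypothetical, and dynamic resizing of the pool is left out for brevity.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Supplier;

/** Hypothetical per-IDC pool: sessions are created up front and reused via borrow/return. */
class SessionPool<S> {
    private final String idc;
    private final BlockingQueue<S> idle;

    SessionPool(String idc, int size, Supplier<S> factory) {
        this.idc = idc;
        this.idle = new ArrayBlockingQueue<>(size);
        for (int i = 0; i < size; i++) {
            idle.add(factory.get()); // pre-generate sessions so callers do not pay the setup cost
        }
    }

    String idc() { return idc; }

    int idleCount() { return idle.size(); }

    /** Waits up to timeoutMs for a free session; returns null on timeout or interruption. */
    S borrow(long timeoutMs) {
        try {
            return idle.poll(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return null;
        }
    }

    /** Hands a borrowed session back to the pool. */
    void giveBack(S session) {
        idle.offer(session);
    }
}

/** Simple weighted round-robin: each pool appears in the rotation `weight` times. */
class WeightedRouter<S> {
    private final List<SessionPool<S>> rotation = new ArrayList<>();
    private final AtomicLong counter = new AtomicLong();

    void addPool(SessionPool<S> pool, int weight) {
        for (int i = 0; i < weight; i++) {
            rotation.add(pool);
        }
    }

    /** Picks the pool for the next request. */
    SessionPool<S> next() {
        int idx = (int) (counter.getAndIncrement() % rotation.size());
        return rotation.get(idx);
    }
}
```

With weights of 3 and 1, three of every four requests borrow from the first IDC's pool. Shifting traffic between blue and green clusters then amounts to changing the weights, with no VIP in the request path. A production router would likely interleave the picks more smoothly than this repeated‑entry rotation does.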
Structured Query Builder (procedure‑style DSL):
Builder.match()
    .vertex("v")
    .hasTag("user")
    .property("name", "XXX", DataType.String())
    .edge("e", Direction.OUTGOING)
    .type("follow")
    .type("serve")
    .vertex("v2")
    .ret("v2", "Friends")
System Tuning Practices
Hotel cluster instability: Fixed a misconfigured metad_server_address, set a session idle timeout (session_idle_timeout_secs=86400, i.e. 24 hours), and raised session_reclaim_interval_secs to 30 s, which reduced the buildup of session metadata.
High CPU on storaged:
Balanced data shards across nodes to mitigate dense‑point hotspots.
Ran a RocksDB compaction, increased rocksdb_block_cache to 8192 (the flag is expressed in MB), enabled prefix filtering, and disabled auto‑compactions where appropriate.
Adjusted RocksDB parameters (write_buffer_size=134217728, i.e. 128 MiB, and max_background_compactions=4) to reduce write amplification.
Lock contention:
Reduced thread pool sizes (num_io_threads, num_worker_threads, reader_handlers) to lower contention.
Experimented with switching the block cache to ClockCache, but abandoned it because of stability issues.
Disabled the block cache and index/filter block caching, set max_open_files=-1, and turned off compression on lower levels, which kept CPU usage below 30% even under doubled traffic.
Service downtime during bulk writes: Lowered the maximum number of statements per batch from 500 to 200, which avoids a Nebula Graph bug that caused recursive execution and graphd crashes.
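The bulk‑write cap can be enforced on the client side by splitting a large write into batches before sending them. The sketch below assumes the limit applies to the number of statements submitted per request. StatementBatcher and its split method are hypothetical names for illustration, not part of the Nebula Graph client.

```java
import java.util.ArrayList;
import java.util.List;

final class StatementBatcher {
    /** Cap taken from the article: the limit was lowered from 500 to 200 statements. */
    static final int MAX_STATEMENTS_PER_BATCH = 200;

    private StatementBatcher() {}

    /** Splits statements into consecutive batches of at most maxPerBatch items each. */
    static List<List<String>> split(List<String> statements, int maxPerBatch) {
        if (maxPerBatch <= 0) {
            throw new IllegalArgumentException("maxPerBatch must be positive");
        }
        List<List<String>> batches = new ArrayList<>();
        for (int start = 0; start < statements.size(); start += maxPerBatch) {
            int end = Math.min(start + maxPerBatch, statements.size());
            // Copy the slice so each batch is independent of the input list.
            batches.add(new ArrayList<>(statements.subList(start, end)));
        }
        return batches;
    }
}
```

A caller would then submit each batch as its own request, for example by iterating over StatementBatcher.split(statements, StatementBatcher.MAX_STATEMENTS_PER_BATCH).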
Nebula Graph Secondary Development
Added operational commands for shard migration and leader transfer, facilitating manual interventions when needed.
Future Plans
Integrate with Ctrip's big‑data platform (Spark/Flink) for ETL and cross‑cluster data migration.
Provide Slowlog inspection to capture slow queries.
Introduce parameterized queries to prevent injection.
Enhance visualization with custom dashboards.
ITPUB
Official ITPUB account sharing technical insights, community news, and exciting events.
