How JanusGraph and Spark GraphX Surface High‑Value Users in 58 Tribe’s Social Network
This article details how 58 Tribe built a large‑scale graph database with JanusGraph, integrated it with Spark GraphX to compute degree, closeness and betweenness centralities, optimized batch imports, identified cheating and high‑value users, and achieved significant performance gains for social network analysis.
Background and Motivation
As the 58 Tribe social platform grew to millions of users, discovering valuable users, analyzing their social relationships, and detecting fraudulent behavior became increasingly difficult. Traditional relational databases handled the massive graph structure and multi‑level relationships poorly, which prompted the adoption of a dedicated graph database and large‑scale graph analytics.
Graph Database Survey
A technical survey compared three popular graph databases (Neo4j, JanusGraph and HugeGraph) across storage scalability, engine type, transaction support, partitioning, full‑text search, indexing and other features. JanusGraph was chosen because it supports distributed storage, plugs into multiple storage back‑ends (such as HBase and Cassandra) and offers full‑text and mixed indexing through Elasticsearch.
Technical Choice
While both Neo4j and JanusGraph provide strong query capabilities, Neo4j did not offer the distributed storage architecture the team needed. JanusGraph, combined with Spark for large‑scale computation, satisfied the requirements for handling billions of edges and running custom graph algorithms.
Social Network Centrality Metrics
Three centrality measures were employed to label users:
Degree Centrality: number of direct connections (e.g., follows, likes).
Closeness Centrality: inverse of the sum of shortest‑path distances to all other nodes.
Betweenness Centrality: number of shortest paths between other pairs of nodes that pass through a node.
High degree often indicates a “big V” (popular user), while high closeness and betweenness suggest influential connectors.
Degree Centrality Implementation
Spark GraphX provides the degrees API, as well as outDegrees and inDegrees, to compute degree centrality directly on the user graph.
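To make the three degree variants concrete, here is a minimal plain‑Python sketch of what they count on a small directed edge list. It is not GraphX code: the function name degree_counts and the sample FOLLOW edges are hypothetical, and GraphX would compute the same numbers in a distributed fashion.

```python
from collections import Counter

def degree_counts(edges):
    """Return (in_degree, out_degree, degree) Counters for directed (src, dst) edges."""
    in_deg, out_deg = Counter(), Counter()
    for src, dst in edges:
        out_deg[src] += 1   # src points at someone
        in_deg[dst] += 1    # dst is pointed at
    # Total degree is the sum of both directions.
    return in_deg, out_deg, in_deg + out_deg

# Hypothetical FOLLOW edges: (follower, followed)
follow_edges = [("a", "v"), ("b", "v"), ("c", "v"), ("v", "a"), ("b", "c")]
in_deg, out_deg, deg = degree_counts(follow_edges)
print(deg.most_common(1))  # [('v', 4)]
```

In this toy graph, user v has the most followers (in‑degree 3) and the highest total degree, which is the pattern the article associates with a “big V”.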
Closeness Centrality Implementation
GraphX’s ShortestPaths algorithm, a Pregel‑style message‑passing routine that computes hop counts from every vertex to a given set of landmark vertices, supplies the distances that are summed for closeness. To reduce memory pressure on massive graphs, a set of “key nodes” (top‑1000 users by degree per category) was selected as the landmarks, so each vertex kept distances only to these nodes rather than to every other vertex, dramatically lowering intermediate state size.
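The key‑node idea can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the team's Spark job: the graph is treated as undirected and unweighted, distances are hop counts found by breadth‑first search from each key node, and the names bfs_hops and landmark_closeness are hypothetical.

```python
from collections import deque, defaultdict

def bfs_hops(adj, source):
    """Hop distance from source to every vertex reachable from it."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist

def landmark_closeness(edges, key_nodes):
    """Closeness restricted to key nodes: 1 / sum of hop distances to them."""
    adj = defaultdict(set)
    for u, w in edges:          # assumption: edges are treated as undirected
        adj[u].add(w)
        adj[w].add(u)
    total = defaultdict(int)
    for k in key_nodes:         # one BFS per key node, not one per vertex
        for v, d in bfs_hops(adj, k).items():
            total[v] += d
    return {v: (1.0 / s if s > 0 else 0.0) for v, s in total.items()}

# Path graph a - b - c - d - e with two hypothetical key nodes.
edges = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "e")]
scores = landmark_closeness(edges, key_nodes=["b", "c"])
print(scores["e"])  # 0.2, since e is 3 hops from b and 2 hops from c
```

Each vertex keeps one number per key node instead of one per vertex in the graph, which is where the memory saving comes from.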
Betweenness Centrality Implementation
Because GraphX lacks a built‑in betweenness algorithm, the ShortestPaths message‑passing framework was adapted. The algorithm stores edge information for each iteration, enabling the counting of shortest‑path occurrences that traverse a given vertex. For very large graphs, a K‑step approximation was introduced, counting only shortest paths of length ≤ K, which reduced memory usage while preserving ranking consistency.
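The effect of the K‑step cutoff can be illustrated with a small single‑machine sketch. Note the substitution: this uses a Brandes‑style accumulation with a hop limit rather than the adapted GraphX message‑passing job described above, treats the graph as undirected and unweighted, and the function name k_step_betweenness is hypothetical.

```python
from collections import deque, defaultdict

def k_step_betweenness(edges, k):
    """Betweenness counting only shortest paths of length <= k hops."""
    adj = defaultdict(set)
    for u, w in edges:
        adj[u].add(w)
        adj[w].add(u)
    score = dict.fromkeys(adj, 0.0)
    for s in list(adj):
        dist = {s: 0}
        sigma = defaultdict(float)   # number of shortest paths from s
        sigma[s] = 1.0
        preds = defaultdict(list)    # predecessors on shortest paths
        order = []                   # vertices in non-decreasing distance
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            if dist[v] == k:
                continue             # do not expand beyond k hops
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = defaultdict(float)
        for w in reversed(order):    # accumulate dependencies back toward s
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
            if w != s:
                score[w] += delta[w]
    # Every undirected pair was visited from both endpoints.
    return {v: b / 2.0 for v, b in score.items()}

# Hub h linked to a, b, c; c also linked to d.
edges = [("h", "a"), ("h", "b"), ("h", "c"), ("c", "d")]
approx = k_step_betweenness(edges, k=2)
full = k_step_betweenness(edges, k=4)
print(approx["h"], approx["c"])  # 3.0 1.0
print(full["h"], full["c"])      # 5.0 3.0
```

On this toy graph the absolute scores shrink under the cutoff while the order (h first, then c) stays the same. A single small example does not prove that ranking is preserved in general.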
JanusGraph Architecture and Cluster
The production cluster connects JanusGraph to HBase for persistent storage and Elasticsearch for indexing. The schema defines:
Node label User with properties node_id, age, name, degree, closeness, betweenness.
Edge labels FOLLOW, LIKE, COMMENT with properties date, values.
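One way to picture this schema is as a small piece of data rendered into JanusGraph management statements. The sketch below is hypothetical throughout: the dictionary layout, the index name byNodeId, and every property data type are assumptions, since the source lists only label and property names. The generated lines use the JanusGraph management calls makePropertyKey, makeVertexLabel, makeEdgeLabel and buildIndex in Gremlin console syntax.

```python
# Assumed data types; the article gives property names only.
SCHEMA = {
    "vertex_labels": ["User"],
    "edge_labels": ["FOLLOW", "LIKE", "COMMENT"],
    "property_keys": {
        "node_id": "String", "age": "Integer", "name": "String",
        "degree": "Integer", "closeness": "Double", "betweenness": "Double",
        "date": "String", "values": "Double",
    },
    # Hypothetical composite index for exact-match lookups by node_id.
    "composite_indexes": {"byNodeId": ["node_id"]},
}

def render_management_script(schema):
    """Render the schema dict as Gremlin-console management statements."""
    lines = ["mgmt = graph.openManagement()"]
    for key, dtype in schema["property_keys"].items():
        lines.append(f"mgmt.makePropertyKey('{key}').dataType({dtype}.class).make()")
    for label in schema["vertex_labels"]:
        lines.append(f"mgmt.makeVertexLabel('{label}').make()")
    for label in schema["edge_labels"]:
        lines.append(f"mgmt.makeEdgeLabel('{label}').make()")
    for name, keys in schema["composite_indexes"].items():
        add_keys = "".join(f".addKey(mgmt.getPropertyKey('{k}'))" for k in keys)
        lines.append(
            f"mgmt.buildIndex('{name}', Vertex.class){add_keys}.buildCompositeIndex()"
        )
    lines.append("mgmt.commit()")
    return "\n".join(lines)

script = render_management_script(SCHEMA)
print(script)
```

Keeping the schema as data means labels, keys and indexes can be reviewed and versioned in one place before anything is applied to the cluster.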
Figures illustrate the overall JanusGraph architecture and the cluster component diagram.
Batch Import Enhancements
Initial imports via the JanusGraph server were slow for large volumes. A custom import tool, based on IBM’s janusgraph‑utils, was extended to:
Connect directly to HBase and Elasticsearch, bypassing the JanusGraph server.
Submit data in transactional batches.
Utilize multiple workers for parallel writes.
Automatically generate schema and indexes from a configuration file.
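The batching and parallel‑worker steps above can be sketched as follows. This is a generic illustration, not the extended janusgraph‑utils tool: the names chunked, import_rows and fake_write_batch are hypothetical, and the writer is a stand‑in where a real one would open a transaction, add the batch's vertices or edges, and commit once.

```python
from concurrent.futures import ThreadPoolExecutor

def chunked(rows, batch_size):
    """Yield consecutive slices of rows, each with at most batch_size items."""
    for start in range(0, len(rows), batch_size):
        yield rows[start:start + batch_size]

def import_rows(rows, write_batch, batch_size=1000, workers=4):
    """Split rows into batches and write them with a pool of workers.

    write_batch(batch) is expected to write one batch as a single
    transaction and return the number of rows it committed.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(write_batch, chunked(rows, batch_size)))

def fake_write_batch(batch):
    # Stand-in for a real writer; it only reports the batch size.
    return len(batch)

rows = [{"node_id": i} for i in range(2500)]
written = import_rows(rows, fake_write_batch, batch_size=1000, workers=4)
print(written)  # 2500
```

Committing once per batch amortizes transaction overhead, and the batch size bounds how much uncommitted data each worker holds in memory at a time.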
In performance tests, job runtime dropped from over two hours, with runs prone to out‑of‑memory (OOM) failures, to roughly thirty minutes, and executor memory consumption fell noticeably.
User Tagging and Detection
Users were classified into normal, cheating, intermediary and high‑value groups. Cheating users exhibit unusually high degree but low closeness and betweenness, while high‑value users score highly on all three metrics. The system automatically flags these users for punitive or promotional actions, improving community health.
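The tagging rules can be expressed as a small decision function. This is a sketch only: the thresholds are hypothetical placeholders, the function name tag_user is invented for illustration, and the rule for the intermediary group (high betweenness without high degree) is an assumption, since the article defines only the cheating and high‑value patterns.

```python
def tag_user(degree, closeness, betweenness, thresholds):
    """Assign one of four tags from three centrality scores."""
    high_deg = degree >= thresholds["degree"]
    high_clo = closeness >= thresholds["closeness"]
    high_btw = betweenness >= thresholds["betweenness"]
    if high_deg and high_clo and high_btw:
        return "high-value"      # high on all three metrics
    if high_deg and not high_clo and not high_btw:
        return "cheating"        # many links, but poorly embedded in the network
    if high_btw and not high_deg:
        return "intermediary"    # assumed rule: bridges groups without many links
    return "normal"

# Hypothetical cut-offs; real values would come from the score distributions.
thresholds = {"degree": 1000, "closeness": 0.5, "betweenness": 100.0}
print(tag_user(5000, 0.8, 450.0, thresholds))  # high-value
print(tag_user(5000, 0.1, 2.0, thresholds))    # cheating
```

In practice the cut‑offs would more likely be percentiles of each metric than fixed constants, so that they track the network as it grows.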
Results and Visualization
The integrated JanusGraph‑Spark solution successfully identified valuable and fraudulent users, generated centrality scores, and visualized the user graph. Comparative charts demonstrate the speedup of the new import pipeline and the memory/GC improvements after algorithm refinements.
Conclusion and Outlook
Combining JanusGraph with Spark GraphX enabled scalable social‑network analysis for 58 Tribe, delivering actionable insights and performance gains. Future work includes exploring additional graph algorithms (community detection, link prediction) and extending the framework to support richer user tagging and recommendation scenarios.
ITPUB