Graph-Based Anti-Fraud: Gang Mining and Node Representation for Account Security

This article describes how Baidu's account security team leverages large‑scale graph technology and graph neural networks to detect and characterize black‑industry cheating gangs. It presents a customized GraphSAGE link‑prediction model and shows its advantage over MLP and GCN embeddings for downstream risk‑control tasks.

Baidu Intelligent Testing

With China's internet population exceeding 1.01 billion, the rapid growth of online services has spawned a massive black‑industry ecosystem that conducts large‑scale, industrialized cheating across account systems, e‑commerce, and financial fraud, causing significant financial loss and user‑experience degradation.

To combat these threats, Baidu's account security strategy team built a graph‑based anti‑fraud framework that can process billions of nodes and edges, supporting multi‑granular (daily, weekly, monthly) and heterogeneous graph structures to uncover hidden cheating gangs.

Traditional gang‑mining methods rely on statistical feature filtering, which often fails to reveal the full scope of a cheating organization. By converting account–feature–device relationships into graph structures (figure omitted), the team can identify clusters of suspicious accounts, though exhaustive extraction still demands substantial effort.
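The clustering idea above can be sketched with a toy example: accounts and devices become nodes, shared-device relations become edges, and each connected component is a candidate gang. This is a minimal stdlib-only sketch; the node names and log format are hypothetical, and a production system would run this at billion-node scale on a distributed graph engine.

```python
from collections import defaultdict, deque

def build_graph(edges):
    """Build an undirected adjacency list from (account, device) pairs.
    Accounts sharing a device become linked through that device node."""
    adj = defaultdict(set)
    for account, device in edges:
        adj[account].add(device)
        adj[device].add(account)
    return adj

def connected_components(adj):
    """BFS over the association graph; each component is a candidate gang."""
    seen, components = set(), []
    for start in adj:
        if start in seen:
            continue
        queue, comp = deque([start]), set()
        seen.add(start)
        while queue:
            node = queue.popleft()
            comp.add(node)
            for nbr in adj[node]:
                if nbr not in seen:
                    seen.add(nbr)
                    queue.append(nbr)
        components.append(comp)
    return components

# Hypothetical sign-up logs: (account_id, device_id)
edges = [("acct1", "dev_A"), ("acct2", "dev_A"),
         ("acct3", "dev_B"), ("acct4", "dev_C"), ("acct5", "dev_C")]
gangs = connected_components(build_graph(edges))
# acct1 and acct2 land in one component via the shared device dev_A
```
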

The resulting association‑graph architecture (figure omitted) supports billions of nodes and edges, offering extensibility through simple configuration to adapt to new business scenarios.

However, hard associations (e.g., shared devices) can produce false positives, and dirty data or long‑term cross‑resource interactions may generate massive, noisy sub‑graphs that include legitimate accounts.

To improve discrimination, the team explored node‑embedding techniques such as DeepWalk, LINE, and node2vec, as well as graph neural networks (GCN, GAT, GraphSAGE). They adapted GraphSAGE for link prediction on account graphs: a two‑layer GraphSAGE model aggregates two‑hop neighbor information, combines the representations of a pair of target nodes, and predicts a link from the sigmoid of their dot product.
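The two-layer aggregation and dot-product scoring can be illustrated with a minimal pure-Python sketch. This uses mean aggregation with the learned weight matrices omitted for clarity (the article's model trains linear transforms per layer, and later compares a sum aggregator); the toy graph and feature values are invented for illustration.

```python
import math

def mean_neighbors(features, adj, node):
    """Element-wise mean of the feature vectors of a node's neighbors."""
    nbrs = adj[node]
    dim = len(features[node])
    return [sum(features[n][d] for n in nbrs) / len(nbrs) for d in range(dim)]

def sage_layer(features, adj):
    """One GraphSAGE layer: concatenate each node's own vector with its
    neighbors' mean, then L2-normalize. Learned weights are omitted here;
    in training each layer applies a linear map before normalization."""
    out = {}
    for node in features:
        h = features[node] + mean_neighbors(features, adj, node)
        norm = math.sqrt(sum(x * x for x in h)) or 1.0
        out[node] = [x / norm for x in h]
    return out

def link_score(emb, i, j):
    """Sigmoid of the dot product of two node embeddings."""
    dot = sum(a * b for a, b in zip(emb[i], emb[j]))
    return 1.0 / (1.0 + math.exp(-dot))

# Toy graph: two accounts sharing a device vs. an unrelated account.
adj = {"a1": ["d1"], "a2": ["d1"], "d1": ["a1", "a2"],
       "a3": ["d2"], "d2": ["a3"]}
feats = {"a1": [1.0, 0.0], "a2": [1.0, 0.0], "d1": [0.0, 1.0],
         "a3": [0.0, 0.5], "d2": [0.5, 0.0]}
emb = sage_layer(sage_layer(feats, adj), adj)  # two layers -> two-hop context
print(round(link_score(emb, "a1", "a2"), 3))   # → 0.731
```

Stacking the layer twice is what gives each node a view of its two-hop neighborhood; because `a1` and `a2` have identical features and neighborhoods, their unit-normalized embeddings coincide and the score is σ(1) ≈ 0.731, higher than the score against the unrelated `a3`.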

The model architecture is shown in a figure omitted here. The final prediction score for a candidate edge $(i, j)$ is computed as $score = \sigma(e_i \cdot e_j)$, where $e_i$ and $e_j$ are the GraphSAGE embeddings of the two target nodes and $\sigma$ is the sigmoid function.

Comparative experiments with MLP and GCN embeddings, visualized via t‑SNE/UMAP (see Figures 5‑7), demonstrate that GraphSAGE‑sum produces markedly tighter clusters for the same gang labels, indicating superior discriminative power.

In downstream tasks such as gang classification, augmenting a baseline XGBoost model with the learned node vectors raised accuracy above 90%, suggesting further gains with full‑scale training.
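The augmentation step amounts to concatenating each account's hand-crafted statistical features with its learned embedding before training the classifier. A minimal sketch, with hypothetical feature names and dimensions (the XGBoost training itself is omitted):

```python
EMB_DIM = 4  # dimensionality of the node embeddings (illustrative)

def fuse_features(stat_feats, embeddings):
    """Concatenate hand-crafted statistics with graph embeddings; accounts
    absent from the graph fall back to a zero embedding."""
    return {acct: vec + embeddings.get(acct, [0.0] * EMB_DIM)
            for acct, vec in stat_feats.items()}

# Hypothetical inputs: e.g. [login_count, ip_entropy] per account.
stats = {"acct1": [3.0, 0.2], "acct9": [1.0, 0.9]}
embs = {"acct1": [0.1, 0.4, -0.2, 0.7]}  # acct9 never appeared in the graph
fused = fuse_features(stats, embs)
# fused vectors would then be fed to the XGBoost gang classifier
```
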

Future work includes designing downstream tasks for massive gangs, improving the efficiency of node‑embedding generation under GPU constraints, and advancing graph sampling, visualization, and real‑time processing techniques.

References:

[1] Perozzi B, Al‑Rfou R, Skiena S. DeepWalk: Online learning of social representations. KDD 2014.
[2] Tang J, et al. LINE: Large‑scale information network embedding. WWW 2015.
[3] Grover A, Leskovec J. node2vec: Scalable feature learning for networks. KDD 2016.
[4] Kipf TN, Welling M. Semi‑supervised classification with graph convolutional networks. arXiv 2016.
[5] Veličković P, et al. Graph attention networks. arXiv 2017.
[6] Hamilton WL, Ying R, Leskovec J. Inductive representation learning on large graphs. NeurIPS 2017.
[7] Ying R, et al. Graph convolutional neural networks for web‑scale recommender systems. KDD 2018.
[8] Chen T, et al. XGBoost: Extreme gradient boosting. R package 2015.
