Black-Gray Industry Attack Detection Based on Community Encoding Using Graph Embedding
The article introduces a community-encoding detection framework built on GraphSAGE. It embeds entire graphs of user accounts, IP addresses, device IDs, and phone numbers, both homogeneous and heterogeneous, to identify previously unseen black-gray industry attacks. Deployed as an asynchronous near-real-time system, it reaches about 95% accuracy in IP-dimension risk identification, though computational cost and full automation remain open challenges.
This article presents a novel method for identifying black-gray industry (黑灰产) attacks using community encoding and large-scale graph embedding representation learning. As internet black-gray industries have evolved into platform-based, specialized, and refined operations, traditional detection methods face challenges in identifying unknown attacks.
The proposed approach combines graph-based community discovery with GraphSAGE embedding techniques. The method operates on both homogeneous graphs (where all nodes are of the same type, such as user account IDs) and heterogeneous graphs (where nodes can be different types like account IDs, IP addresses, device IDs, and phone numbers).
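The two graph types described above can be sketched in a few lines. This is a minimal illustration, not the article's implementation: the record layout, the sample data, and the function names are all hypothetical, and connected components stand in for the community discovery step, since the article does not name the algorithm it uses.

```python
from collections import defaultdict

# Hypothetical login records: (account_id, ip, device_id, phone).
RECORDS = [
    ("acct_1", "10.0.0.1", "dev_a", "phone_x"),
    ("acct_2", "10.0.0.1", "dev_b", "phone_y"),
    ("acct_3", "10.0.0.9", "dev_b", "phone_z"),
    ("acct_4", "10.0.0.7", "dev_c", "phone_w"),
]


def build_heterogeneous_graph(records):
    """Link each account node to the IP, device, and phone nodes it used."""
    adj = defaultdict(set)
    for account, ip, device, phone in records:
        account_node = ("account", account)
        for node in (("ip", ip), ("device", device), ("phone", phone)):
            adj[account_node].add(node)
            adj[node].add(account_node)
    return dict(adj)


def build_homogeneous_graph(records):
    """Account-only graph: two accounts are linked if they share a resource."""
    by_resource = defaultdict(set)
    for account, ip, device, phone in records:
        for key in (("ip", ip), ("device", device), ("phone", phone)):
            by_resource[key].add(account)
    # Start from every account so isolated accounts still appear as nodes.
    adj = {account: set() for account, *_ in records}
    for accounts in by_resource.values():
        for account in accounts:
            adj[account] |= accounts - {account}
    return adj


def connected_components(adj):
    """Assumed stand-in for community discovery: one community per component."""
    seen, communities = set(), []
    for start in list(adj):
        if start in seen:
            continue
        seen.add(start)
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            component.add(node)
            for neighbor in adj.get(node, ()):
                if neighbor not in seen:
                    seen.add(neighbor)
                    stack.append(neighbor)
        communities.append(component)
    return communities


hetero = build_heterogeneous_graph(RECORDS)
homo = build_homogeneous_graph(RECORDS)
communities = connected_components(homo)
print(sorted(sorted(c) for c in communities))
# → [['acct_1', 'acct_2', 'acct_3'], ['acct_4']]
```

Here acct_1 and acct_2 share an IP address and acct_2 and acct_3 share a device, so the three fall into one community while acct_4 stays isolated.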
The GraphSAGE algorithm consists of three main steps: (1) sampling a fixed number of neighbors for each node to keep computation bounded, (2) aggregating the sampled neighbors' information with a function such as mean aggregation, and (3) generating vector representations for downstream tasks. The method uses two sampling layers with up to 200 neighbors per layer, which keeps the per-node cost bounded and makes it suitable for large-scale datasets.
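The three steps above can be sketched as a forward pass in plain Python. This is a simplified illustration under stated assumptions, not the article's model: the weight matrices are fixed by hand rather than trained, the whole graph is processed at once instead of in mini-batches, and the toy graph, features, and function names are hypothetical. The default of 200 sampled neighbors and the two layers follow the figures quoted above.

```python
import math
import random


def sample_neighbors(adj, node, k, rng):
    """Step 1: keep at most k neighbors (all of them if there are fewer)."""
    neighbors = sorted(adj.get(node, ()))
    if len(neighbors) <= k:
        return neighbors
    return rng.sample(neighbors, k)


def mean_vector(vectors, dim):
    """Step 2: mean aggregation; a zero vector if there are no neighbors."""
    if not vectors:
        return [0.0] * dim
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def matvec(weights, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weights]


def l2_normalize(vec):
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else vec


def graphsage_embed(adj, features, weights_per_layer, num_samples=200, seed=0):
    """Step 3: one pass per layer; each layer concatenates the node's own
    vector with its neighbor mean, applies a linear map and ReLU, then
    L2-normalizes the result."""
    rng = random.Random(seed)
    hidden = dict(features)
    for weights in weights_per_layer:
        new_hidden = {}
        for node, h_self in hidden.items():
            sampled = sample_neighbors(adj, node, num_samples, rng)
            h_neigh = mean_vector([hidden[n] for n in sampled], len(h_self))
            combined = matvec(weights, h_self + h_neigh)
            activated = [max(0.0, x) for x in combined]
            new_hidden[node] = l2_normalize(activated)
        hidden = new_hidden
    return hidden


# Hypothetical toy graph with 2-dimensional input features.
ADJ = {"a": {"b", "c"}, "b": {"a"}, "c": {"a"}}
FEATURES = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [0.0, 1.0]}
# Hand-picked (untrained) weights: each row maps 4 inputs to 1 output.
LAYER = [[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0]]
embeddings = graphsage_embed(ADJ, FEATURES, [LAYER, LAYER])
```

In a trained model the weight matrices would be learned from labeled or unsupervised objectives; the fixed matrices here only show how self and neighbor information are combined at each layer.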
For engineering implementation, the system uses an asynchronous near-real-time architecture with a 10-minute staging area for recent requests. Offline partition logs are used for graph construction, community mining, and model training, while the staging area enables real-time prediction using the trained classification model.
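The staging area can be pictured as a sliding time window over incoming requests. The sketch below is an assumption about how such a buffer might look, not the article's system: it reads the "10-minute staging area" as a window holding the last 10 minutes of requests, and the class name, request fields, and `classify` callback (a stand-in for the trained classification model) are hypothetical.

```python
from collections import deque

WINDOW_SECONDS = 10 * 60  # the 10-minute window mentioned above


class StagingArea:
    """Holds only the requests seen within the last `window_seconds`.

    Assumes timestamps (in seconds) arrive in non-decreasing order.
    """

    def __init__(self, window_seconds=WINDOW_SECONDS):
        self.window_seconds = window_seconds
        self._requests = deque()  # (timestamp, request), oldest first

    def add(self, timestamp, request):
        self._requests.append((timestamp, request))
        self._evict(timestamp)

    def _evict(self, now):
        cutoff = now - self.window_seconds
        while self._requests and self._requests[0][0] < cutoff:
            self._requests.popleft()

    def recent(self, now):
        """Return the requests still inside the window at time `now`."""
        self._evict(now)
        return [request for _, request in self._requests]


def predict_recent(staging, now, classify):
    """Score every staged request with the offline-trained model."""
    return {req["account"]: classify(req) for req in staging.recent(now)}


staging = StagingArea()
staging.add(0, {"account": "acct_1", "ip": "10.0.0.1"})
staging.add(300, {"account": "acct_2", "ip": "10.0.0.1"})
staging.add(700, {"account": "acct_3", "ip": "10.0.0.9"})
# Hypothetical rule standing in for the trained classifier.
scores = predict_recent(staging, 700, lambda req: req["ip"] == "10.0.0.1")
print(scores)
# → {'acct_2': True, 'acct_3': False}
```

The request at time 0 falls outside the window once the request at 700 seconds arrives, so only the two most recent requests are scored.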
Key innovations include: (1) encoding entire community structures into representation vectors rather than relying on individual node attributes, (2) identifying previously unseen black-gray industry accounts by the similarity of their network structure to known cases, and (3) combining large-scale graph embedding with asynchronous prediction for practical deployment. The system achieved approximately 95% accuracy in IP-dimension risk identification.
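One simple way to picture community encoding is to pool the embeddings of a community's members into a single vector and compare it with vectors of known attack communities. The sketch below makes that concrete under loud assumptions: the article does not specify how communities are pooled, so mean pooling is an illustrative choice, and cosine similarity stands in for the trained classification model. All names and vectors are hypothetical.

```python
import math


def encode_community(member_embeddings):
    """Assumed pooling: average the members' embedding vectors."""
    count = len(member_embeddings)
    dim = len(member_embeddings[0])
    return [sum(e[i] for e in member_embeddings) / count for i in range(dim)]


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical 2-dimensional node embeddings for two communities.
known_attack = encode_community([[0.9, 0.1], [0.7, 0.3]])
unseen = encode_community([[0.8, 0.2], [0.8, 0.2]])
print(cosine_similarity(known_attack, unseen))
```

Because the comparison uses the community vector rather than any single account's attributes, a community made entirely of new accounts can still score close to a known attack pattern.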
Challenges remain in computational resource requirements, feature selection for graph algorithms, and achieving fully automated detection without human intervention.
Baidu Geek Talk