Weibo Social Network Analysis: Label Propagation, Similarity Measures, Community Detection, Influence Ranking and Spam User Identification
The article presents a comprehensive overview of algorithms for analyzing Weibo’s social network, covering label propagation, user similarity via LDA, temporal and interaction factors, community detection, influence ranking using PageRank variants, and methods for identifying spam accounts.
Introduction
Weibo is a widely used social platform where users perform actions such as posting original content, reposting, replying, reading, following, and mentioning others. These behaviors generate rich data that can be mined to understand user interests, influence, and the underlying network structure.
Label Propagation
The algorithm assumes that most of a user’s friends and followers share that user’s interests. Under this assumption, each user is assigned the most frequent label among its neighbors, optionally weighting friends and followers differently.
1. Initialize a subset of users with seed labels.
2. For each user, count the labels of its friends and followers and assign the most frequent one(s).
3. Repeat step 2 until label changes stabilize.
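The steps above can be sketched in a few lines of Python. This is a minimal illustration, not the article's implementation: the function name `propagate_labels`, the `(neighbor, weight)` edge format, the decision to keep seed labels fixed, and the alphabetical tie-break are all assumptions made for the example.

```python
from collections import Counter

def propagate_labels(neighbors, seeds, max_rounds=20):
    """neighbors: dict user -> list of (neighbor, weight) pairs.
    seeds: dict user -> label for the initially labeled users."""
    labels = dict(seeds)
    for _ in range(max_rounds):
        updated = dict(labels)
        for user, links in neighbors.items():
            if user in seeds:
                continue  # assumption: seed labels are never overwritten
            votes = Counter()
            for other, weight in links:
                if other in labels:
                    votes[labels[other]] += weight
            if votes:
                # Ties are broken alphabetically so the result is deterministic.
                updated[user] = max(sorted(votes), key=votes.get)
        if updated == labels:
            break  # no label changed in this round
        labels = updated
    return labels

# Hypothetical toy graph; weights could differ for friends and followers.
neighbors = {
    "alice": [("bob", 2.0), ("carol", 1.0)],
    "bob": [("alice", 2.0)],
    "carol": [("alice", 1.0), ("dave", 2.0)],
    "dave": [("carol", 2.0)],
    "erin": [("dave", 1.0), ("carol", 0.5)],
}
seeds = {"bob": "tech", "dave": "sports"}
result = propagate_labels(neighbors, seeds)
```

Each round reads labels from the previous round only (a synchronous update), which keeps the outcome independent of the order in which users are visited.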
User Similarity Calculation
When the simple majority-label assumption fails, the similarity between users can be used to weight each neighbor’s contribution. One approach aggregates all of a user’s posts into a bag‑of‑words vector and compares users by cosine similarity. A more robust approach fits LDA (Latent Dirichlet Allocation) to obtain a topic distribution per user and then compares these distributions using cosine similarity or KL divergence. Note that KL divergence is asymmetric and that a lower value means the two users are more similar.
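As a rough sketch of the two comparisons, the following Python computes cosine similarity on bag-of-words counts and a symmetrized KL divergence on topic distributions. The topic vectors `theta_a` and `theta_b` are made-up stand-ins for LDA output (no model is trained here), and averaging the two KL directions is one common way to obtain a symmetric score, not something the article prescribes.

```python
import math
from collections import Counter

def cosine_similarity(a, b):
    """a, b: dict term -> count (bag-of-words vectors)."""
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length.
    eps guards against log(0) when q has zero entries."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)

def symmetric_kl(p, q):
    """Average of both KL directions; lower means more similar."""
    return 0.5 * (kl_divergence(p, q) + kl_divergence(q, p))

# Bag-of-words built from each user's aggregated posts (toy data).
bow_a = Counter("new phone chip review phone".split())
bow_b = Counter("phone chip benchmark".split())
bow_sim = cosine_similarity(bow_a, bow_b)

# Hypothetical per-user topic distributions, as LDA would produce.
theta_a = [0.7, 0.2, 0.1]
theta_b = [0.6, 0.3, 0.1]
topic_distance = symmetric_kl(theta_a, theta_b)
```

Because the KL-based value is a distance rather than a similarity, it would need to be inverted or negated before being used as a neighbor weight.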
Temporal and Network Factors
Interests evolve over time, so recent posts (e.g., the latest 50) should be preferred when training LDA. Interaction frequencies, such as how often two users retweet or @mention each other, also indicate stronger similarity and can be incorporated as additional weighting factors.
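One possible way to apply both ideas is sketched below. The helper names and the logarithmic boost formula are assumptions chosen for illustration. The article only says that recency and interaction frequency should be taken into account, not how.

```python
import math

def latest_posts(posts, limit=50):
    """posts: list of (timestamp, text). Returns the texts of the
    `limit` most recent posts, newest first."""
    newest_first = sorted(posts, key=lambda p: p[0], reverse=True)
    return [text for _, text in newest_first[:limit]]

def interaction_weight(retweets, mentions, scale=1.0):
    """Hypothetical boost: 1.0 with no interaction, growing slowly
    (logarithmically) as retweets and @mentions accumulate."""
    return 1.0 + scale * math.log1p(retweets + mentions)

def weighted_similarity(content_sim, retweets, mentions):
    """Scale a content-based similarity by the interaction boost."""
    return content_sim * interaction_weight(retweets, mentions)

recent = latest_posts([(1, "a"), (3, "c"), (2, "b")], limit=2)
no_contact = weighted_similarity(0.5, retweets=0, mentions=0)
frequent_contact = weighted_similarity(0.5, retweets=3, mentions=2)
```

The logarithm keeps a handful of very active pairs from dominating the weights. A linear or capped boost would be an equally reasonable choice.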
Community Detection
Communities are defined as tightly connected groups whose members share high interest similarity and are within two hops of each other. Structural similarity measures include the inverse of the shortest‑path length between two users, the Jaccard similarity of the accounts they both follow (common friends), and the Jaccard similarity of their followers (common followers). These measures can be combined and fed into clustering algorithms such as K‑Means or DBSCAN, or used to drive a weighted label‑propagation process.
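The three structural measures can be combined as in the sketch below. The weights `(0.4, 0.3, 0.3)`, the function names, and the choice to treat follow links as undirected when counting hops are assumptions for illustration. The resulting pairwise score is the kind of value that could then be passed to a clustering step.

```python
from collections import deque

def jaccard(a, b):
    """Jaccard similarity of two collections, 0.0 if both are empty."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def shortest_path_length(graph, src, dst):
    """Breadth-first search; returns hop count or None if unreachable."""
    if src == dst:
        return 0
    seen = {src}
    queue = deque([(src, 0)])
    while queue:
        node, dist = queue.popleft()
        for nxt in graph.get(node, ()):
            if nxt == dst:
                return dist + 1
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None

def combined_similarity(u, v, following, weights=(0.4, 0.3, 0.3)):
    """following: dict user -> set of accounts that user follows."""
    followers, undirected = {}, {}
    for src, targets in following.items():
        for dst in targets:
            followers.setdefault(dst, set()).add(src)
            # Assumption: hops are counted ignoring edge direction.
            undirected.setdefault(src, set()).add(dst)
            undirected.setdefault(dst, set()).add(src)
    hops = shortest_path_length(undirected, u, v)
    path_sim = 1.0 / hops if hops else 0.0
    friend_sim = jaccard(following.get(u, ()), following.get(v, ()))
    follower_sim = jaccard(followers.get(u, ()), followers.get(v, ()))
    w_path, w_friend, w_follower = weights
    return w_path * path_sim + w_friend * friend_sim + w_follower * follower_sim

# Toy follow graph: "a" and "b" share followees and one follower.
following = {
    "a": {"x", "y"},
    "b": {"x", "y", "z"},
    "c": {"a", "b"},
    "d": {"a"},
}
score_ab = combined_similarity("a", "b", following)
```

In this toy graph the users are two hops apart (path similarity 0.5), share two of three followees (2/3), and share one of two followers (0.5).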
Influence Calculation
Borrowing from PageRank, a user’s influence is propagated through the follower graph: each user distributes its current influence equally to the users it follows, and a user’s new influence is the sum of received contributions. Iteration continues until convergence. Extensions incorporate activity metrics (post frequency, retweet/reply counts) and interaction networks (retweet, reply, @ graphs) to produce richer influence scores.
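A minimal sketch of this propagation follows. The damping factor and the even redistribution of influence from users who follow nobody come from standard PageRank and are assumptions here, since the article describes only the basic "split equally among followees" step. The name `influence_rank` is likewise hypothetical.

```python
def influence_rank(following, damping=0.85, tol=1e-9, max_iter=100):
    """following: dict user -> list of users that user follows.
    Influence flows from a follower to the accounts it follows."""
    users = set(following) | {v for targets in following.values() for v in targets}
    n = len(users)
    score = {u: 1.0 / n for u in users}
    for _ in range(max_iter):
        new = {u: (1.0 - damping) / n for u in users}
        for u in users:
            targets = following.get(u, ())
            if targets:
                # Split this user's influence equally among followees.
                share = damping * score[u] / len(targets)
                for v in targets:
                    new[v] += share
            else:
                # Assumption: users who follow no one spread influence evenly.
                share = damping * score[u] / n
                for v in users:
                    new[v] += share
        delta = sum(abs(new[u] - score[u]) for u in users)
        score = new
        if delta < tol:
            break  # converged
    return score

# Toy graph: everyone follows "star"; "c" also follows "a".
following = {"a": ["star"], "b": ["star"], "c": ["star", "a"], "star": []}
scores = influence_rank(following)
top_user = max(scores, key=scores.get)
```

The activity and interaction extensions mentioned above could be modeled by replacing the equal split with shares proportional to retweet or reply counts along each edge.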
Topic and Domain Factors
Influence scores can be applied to specific topics by first identifying posts related to a hot topic (via hashtags or LDA‑derived topics) and then ranking the participating users. Similarly, influence within a domain (e.g., IT) can be obtained by restricting the analysis to users labeled with that domain.
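The hashtag variant can be illustrated as follows, assuming influence scores have already been computed by some other means. The function name, the `(user, text)` post format, and the sample scores are all hypothetical. Matching by simple substring is a simplification of real topic detection.

```python
def rank_topic_participants(posts, hashtag, influence):
    """posts: list of (user, text); influence: dict user -> score.
    Returns users who posted with the hashtag, most influential first."""
    participants = {user for user, text in posts if hashtag in text}
    return sorted(participants, key=lambda u: influence.get(u, 0.0), reverse=True)

posts = [
    ("a", "#AI# new model released"),
    ("b", "lunch was great"),
    ("c", "some thoughts on #AI#"),
]
influence = {"a": 0.2, "b": 0.9, "c": 0.5}
ranking = rank_topic_participants(posts, "#AI#", influence)
```

User "b" has the highest overall influence but never posted about the topic, so it is excluded from the topic ranking.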
Spam User Identification
Spam accounts tend to post on a highly regular schedule (low entropy of the intervals between posts), use excessive @ mentions, include many URLs, and publish a high proportion of advertising content. Structural cues such as abnormal follow‑to‑follower ratios and weak triadic closure (the accounts a spammer connects to are rarely connected to each other) also help. These features feed into classifiers (logistic regression, decision trees, Naïve Bayes) and can be further refined with a PageRank‑style propagation of spam probabilities.
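A few of these features can be extracted as in the sketch below, which measures posting regularity as the Shannon entropy of binned inter-post intervals. The feature names, the 60-second bin width, and the crude substring counts for mentions and URLs are assumptions for illustration. A real system would pass such a feature dictionary to one of the classifiers named above.

```python
import math
from collections import Counter

def interval_entropy(timestamps, bin_seconds=60):
    """Shannon entropy (bits) of the gaps between consecutive posts,
    after grouping gaps into bins of `bin_seconds`."""
    ts = sorted(timestamps)
    gaps = [later - earlier for earlier, later in zip(ts, ts[1:])]
    if not gaps:
        return 0.0
    bins = Counter(int(gap // bin_seconds) for gap in gaps)
    total = len(gaps)
    return -sum((c / total) * math.log2(c / total) for c in bins.values())

def spam_features(posts, following_count, follower_count):
    """posts: list of dicts with 'time' (seconds) and 'text'."""
    n = max(len(posts), 1)
    return {
        "interval_entropy": interval_entropy([p["time"] for p in posts]),
        "mentions_per_post": sum(p["text"].count("@") for p in posts) / n,
        "urls_per_post": sum(p["text"].count("http") for p in posts) / n,
        "follow_ratio": following_count / max(follower_count, 1),
    }

# An account posting exactly every 10 minutes vs. one posting irregularly.
scheduled = interval_entropy([0, 600, 1200, 1800])
irregular = interval_entropy([0, 50, 700, 5000])

sample = spam_features(
    [{"time": 0, "text": "Buy now http://x.example @a @b"},
     {"time": 600, "text": "Deal http://y.example @c"}],
    following_count=2000,
    follower_count=10,
)
```

An account that posts on a fixed schedule places every interval in the same bin, so its entropy is 0, while irregular human-like posting spreads intervals across bins and yields a higher value.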
Conclusion
The discussed algorithms provide a foundation for mining Weibo data, though real‑world systems often combine many of these techniques and extend them with recommendation, hot‑topic tracking, and other advanced functionalities.