Social Network Analysis on Weibo: Label Propagation, User Similarity, Community Detection, Influence Ranking, and Spam User Identification
This article introduces a series of algorithms for analyzing the Weibo social network, including label propagation, LDA‑based user similarity, time‑aware and interaction‑aware similarity measures, community detection, influence ranking via PageRank variants, and methods for identifying spam users, illustrating how these techniques can be applied to large‑scale social media data.
Weibo is a widely used social application where users perform everyday actions such as creating original posts, reposting, replying, reading, following, and mentioning others. The first four actions target short posts, while following and mentioning concern user relationships: following someone makes you their fan (follower), and mentioning a user signals that you want them to see the post.
Weibo is considered a "self‑media" platform where ordinary users share news related to themselves. Recently, many people have leveraged their influence on such platforms for profit, raising the question of how personal influence on Weibo is calculated and what hidden algorithms manage our behavior.
From a social‑computing perspective, the characteristics of the Weibo network can inspire insights into real‑world social networks. Social network analysis has become a popular data‑mining topic, and this article briefly introduces several relevant algorithms that may also apply to other social platforms.
Label Propagation
Weibo has a massive user base with diverse interests. Tagging users with interest labels helps improve ad targeting and content recommendation. The first assumption:
Each user’s friends/followers with the same interest dominate.
This leads to the label propagation algorithm: each user adopts the most frequent label(s) among their friends and followers, possibly weighting friends and followers differently. The process is:
Assign initial labels to a subset of users.
For each user, count the labels of friends and followers and assign the most frequent label(s) to the user.
Repeat step 2 until labels stabilize.
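The three steps above can be sketched as follows. This is a minimal synchronous variant; the friend/follower weights and the convergence policy are assumptions, not details from the original description.

```python
from collections import Counter

def propagate_labels(neighbors, seed_labels, friend_weight=1.0,
                     follower_weight=0.5, max_iters=20):
    """Synchronous label propagation over a follow graph.

    neighbors: {user: (friends, followers)} as sets of user ids.
    seed_labels: {user: label} initial interest labels for a subset of users.
    Returns a {user: label} dict after convergence or max_iters rounds.
    """
    labels = dict(seed_labels)
    for _ in range(max_iters):
        updated = {}
        changed = False
        for user, (friends, followers) in neighbors.items():
            votes = Counter()
            for f in friends:
                if f in labels:
                    votes[labels[f]] += friend_weight
            for f in followers:
                if f in labels:
                    votes[labels[f]] += follower_weight
            if votes:
                best = votes.most_common(1)[0][0]
                updated[user] = best
                if labels.get(user) != best:
                    changed = True
            elif user in labels:
                # no labeled neighbors yet: keep the current label
                updated[user] = labels[user]
        labels = updated
        if not changed:
            break
    return labels
```

In practice one would keep seed labels fixed rather than letting neighbors overwrite them, and allow ties to produce multiple labels per user.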
User Similarity Calculation
The label propagation algorithm is simple but fails when the assumption does not hold—for example, users often follow friends out of politeness, not shared interests. To address this, we compute similarity between users to weigh the contribution of friends' or followers' labels.
The more similar a friend/follower is to a user, the more likely their label reflects the user’s interest.
Similarity can be measured using the content of a user’s posts (original and reposted). By aggregating all posts of a user, we can represent them as a bag‑of‑words vector and compute cosine similarity, though this is simplistic. A more sophisticated approach uses LDA (Latent Dirichlet Allocation) to obtain a topic distribution for each user’s posts, then measures similarity via cosine, KL‑divergence, etc.
LDA is a three-layer generative probabilistic model (document → topic → word). After estimating each user's topic distribution, similarity between users is derived from the distance between their topic vectors.
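A sketch of the topic-based similarity step, using scikit-learn's LDA implementation and cosine similarity over the resulting topic vectors (the topic count and preprocessing are assumptions; a real pipeline would also do Chinese word segmentation and stop-word filtering):

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

def user_topic_similarity(user_docs, n_topics=10, seed=0):
    """Fit LDA on per-user aggregated post text (one string per user)
    and return the cosine-similarity matrix of their topic distributions."""
    vec = CountVectorizer()
    counts = vec.fit_transform(user_docs)
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=seed)
    theta = lda.fit_transform(counts)   # one topic distribution per user
    theta = theta / np.linalg.norm(theta, axis=1, keepdims=True)
    return theta @ theta.T              # pairwise cosine similarity
```

KL-divergence (or its symmetric variant) can be substituted for cosine similarity by comparing the unnormalized rows of `theta` directly.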
Time and Network Factors
Users’ interests evolve over time, so aggregating all historical posts may be unreasonable. One can select the most recent N posts (e.g., the latest 50) for each user when training LDA, or adapt N per user based on posting patterns.
Beyond static similarity, interaction frequencies such as reposts and mentions provide additional signals. The following assumptions are introduced:
The higher the frequency a user reposts a friend’s posts, the greater their interest similarity.
The higher the frequency a user mentions a friend, the greater their interest similarity.
These factors can be quantified (e.g., as weights) and incorporated into the similarity calculation.
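One simple way to fold the two interaction assumptions into the similarity score is a weighted blend; the weights and the saturating transform below are illustrative choices, not values from the original text:

```python
import math

def combined_similarity(content_sim, reposts, mentions,
                        alpha=0.6, beta=0.25, gamma=0.15):
    """Blend content-based similarity with interaction signals.

    content_sim: topic-based similarity in [0, 1].
    reposts, mentions: how often the user reposted / mentioned the friend.
    Counts are squashed into [0, 1) with 1 - exp(-x), so the first few
    interactions matter a lot and very large counts saturate.
    """
    repost_score = 1.0 - math.exp(-reposts)
    mention_score = 1.0 - math.exp(-mentions)
    return alpha * content_sim + beta * repost_score + gamma * mention_score
```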
Community Detection
A Weibo community consists of tightly connected users. Two criteria define a community: high interest similarity among members and short relational distance (e.g., no more than two hops between any two members).
Relational similarity can be approximated by the inverse of the shortest path length in the directed follower graph, but this yields only a few discrete values. Additional implicit measures include:
The more common friends two users share, the higher their relational similarity.
The more common followers two users share, the higher their relational similarity.
These can be quantified using Jaccard similarity (intersection over union). Combining shortest‑path similarity, common‑friend similarity, and common‑follower similarity via a weighted function yields a final similarity score, which can be fed into clustering algorithms such as K‑Means or DBSCAN to obtain community clusters. A weighted label propagation can also be used to form communities.
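The weighted combination of the three relational signals might look like this (the weights are assumptions; shortest-path lengths are assumed precomputed, e.g. by BFS on the follower graph):

```python
def jaccard(a, b):
    """Jaccard similarity of two sets: |intersection| / |union|."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def relational_similarity(u, v, friends, followers, path_len,
                          w_path=0.4, w_friend=0.3, w_follower=0.3):
    """Weighted mix of inverse shortest-path distance, common-friend
    Jaccard, and common-follower Jaccard.

    friends, followers: {user: set of user ids}.
    path_len: shortest directed path length between u and v (0 = unreachable).
    """
    path_sim = 1.0 / path_len if path_len else 0.0
    return (w_path * path_sim
            + w_friend * jaccard(friends[u], friends[v])
            + w_follower * jaccard(followers[u], followers[v]))
```

The resulting scores form a similarity matrix that can be handed to K-Means (after embedding) or DBSCAN (as a precomputed distance, e.g. `1 - similarity`).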
Influence Calculation
Beyond community detection, influence ranking is a key application. Inspired by PageRank, the assumption is that a user followed by high‑influence users also has high influence.
The PageRank‑style influence algorithm for the follower network proceeds as:
Initialize all users with equal influence weight.
Distribute each user’s weight equally among the users they follow.
A user’s influence equals the sum of weights received from their followers.
Iterate steps 2 and 3 until convergence.
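The iteration above can be sketched as a standard power-iteration PageRank over the follow graph. The damping factor and the uniform handling of users who follow nobody are the usual PageRank conventions, assumed here rather than specified by the original text:

```python
def influence_rank(following, damping=0.85, iters=50, tol=1e-8):
    """PageRank-style influence over a follow graph.

    following: {user: set of users they follow}. Weight flows from a
    follower to each account they follow, so accounts followed by
    influential users accumulate influence.
    """
    users = set(following) | {v for tgts in following.values() for v in tgts}
    n = len(users)
    rank = {u: 1.0 / n for u in users}
    for _ in range(iters):
        nxt = {u: (1.0 - damping) / n for u in users}
        for follower in users:
            followees = following.get(follower) or ()
            if followees:
                share = damping * rank[follower] / len(followees)
                for v in followees:
                    nxt[v] += share
            else:  # follows nobody: spread weight uniformly (dangling node)
                share = damping * rank[follower] / n
                for v in users:
                    nxt[v] += share
        if max(abs(nxt[u] - rank[u]) for u in users) < tol:
            rank = nxt
            break
        rank = nxt
    return rank
```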
Other link‑based algorithms such as HITS or Hilltop can also be adapted. However, pure link‑based influence may over‑emphasize users with many followers while ignoring the quality of their activity. Therefore, additional factors—posting frequency, repost count, reply count—should be combined with the link‑based score.
Similarly, separate influence scores can be computed for reply, repost, and mention networks, then merged via weighted summation.
Topic and Domain Factors
With influence scores, one can analyze current hot topics to identify opinion leaders. For posts without explicit topic tags, LDA can infer the dominant topic of each short post (≤140 characters) and associate users with topics.
Running influence ranking on users participating in a topic yields the most influential contributors, useful for public opinion monitoring. Influence ranking within a specific label also provides domain‑level leaderboards (e.g., top influencers in IT).
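Grouping users by inferred topic before ranking might look like the sketch below, which takes an already-fitted scikit-learn LDA model and its vectorizer and assigns each post to its dominant topic (the function name and interface are illustrative):

```python
from collections import defaultdict

def users_by_topic(posts, authors, lda, vec):
    """Group users by the dominant LDA topic of their posts.

    posts: list of post texts; authors: parallel list of user ids;
    lda: fitted sklearn LatentDirichletAllocation; vec: its CountVectorizer.
    Returns {topic_id: set of user ids active in that topic}.
    """
    theta = lda.transform(vec.transform(posts))  # per-post topic distribution
    topic_users = defaultdict(set)
    for dist, author in zip(theta, authors):
        topic_users[int(dist.argmax())].add(author)
    return topic_users
```

Influence ranking can then be run on the follow subgraph induced by each topic's user set to surface per-topic opinion leaders.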
Spam User Identification
Spam (or “zombie”) users distort influence calculations. Detecting them improves both accuracy and efficiency. Spam detection considers both user attributes and link structure.
Typical spam characteristics include:
Highly regular posting patterns, measurable by the entropy of the posting-time distribution; lower entropy indicates more machine-like, regular timing, suggesting spam.
Excessive use of @ mentions for advertising.
High proportion of URLs in posts; content mismatch between post text and linked page can be detected via bag‑of‑words similarity.
Large fraction of posts classified as advertisements via text classification.
Unusual follower‑to‑friend ratios and low occurrence of triadic closure (follower triangles) compared to normal users.
These features can feed a machine‑learning classifier (e.g., logistic regression, decision tree, Naïve Bayes) to predict spam likelihood.
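A toy sketch of the classifier step, using logistic regression on the features listed above. The feature values and the tiny training set are entirely fabricated for illustration; a real system would train on thousands of labeled accounts and normalize feature scales:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Assumed feature order: posting-time entropy, mention rate, URL rate,
# ad-post rate, follower/friend ratio, follower-triangle count.
X_train = np.array([
    [4.1, 0.05, 0.10, 0.02, 1.20, 30],   # normal user (synthetic)
    [3.8, 0.10, 0.05, 0.01, 0.90, 45],   # normal user (synthetic)
    [0.5, 0.80, 0.90, 0.85, 0.01,  0],   # spam account (synthetic)
    [0.7, 0.70, 0.95, 0.90, 0.02,  1],   # spam account (synthetic)
])
y_train = np.array([0, 0, 1, 1])         # 1 = spam

clf = LogisticRegression().fit(X_train, y_train)
suspect = np.array([[0.6, 0.75, 0.92, 0.88, 0.015, 0]])
print(clf.predict_proba(suspect)[0, 1])  # predicted spam probability
```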
Additionally, link‑based analysis can be applied: normal users rarely follow spam accounts. By initializing spam probabilities (spam = 1, normal = 0) and propagating them through a PageRank‑like process with appropriate normalization, a refined spam probability for each user can be obtained.
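The propagation idea can be sketched as a BadRank-style iteration: a user inherits suspicion from the accounts they follow, since normal users rarely follow spam. The damping factor and the rule of anchoring seed scores are assumptions:

```python
def propagate_spam_scores(following, seeds, damping=0.85, iters=30):
    """Propagate spam probability through the follow graph.

    following: {user: set of followed accounts}.
    seeds: {user: initial spam probability} (spam = 1.0, known normal = 0.0).
    A user's score is the damped average score of the accounts they
    follow, floored at their own seed value so seeds stay anchored.
    """
    users = set(following) | {v for t in following.values() for v in t}
    score = {u: seeds.get(u, 0.0) for u in users}
    for _ in range(iters):
        nxt = {}
        for u in users:
            followed = following.get(u) or ()
            avg = (sum(score[v] for v in followed) / len(followed)
                   if followed else 0.0)
            nxt[u] = max(seeds.get(u, 0.0), damping * avg)
        score = nxt
    return score
```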