How Do We Analyze Influence and Spam on Sina Weibo? Algorithms Explained
This article introduces a range of algorithms for Sina Weibo—including tag propagation, user similarity via LDA, time‑aware weighting, community detection, PageRank‑based influence ranking, and spam user identification—to illustrate how social network analysis can uncover user interests, influence, and malicious behavior.
Weibo is a widely used social platform where users regularly create original posts, repost, reply, read, follow, and mention others. Understanding user interests, influence, and detecting spam requires a suite of algorithms that model both content and network structure.
Tag Propagation
Each user is assigned one or more interest tags. The basic assumption is that a user's friends or followers share the same interests. The algorithm iteratively updates tags based on the most frequent tags among a user's connections, optionally weighting friends and followers differently.
1. Initialize tags for a subset of users.
2. For each user, count the tags of their friends and followers and assign the most frequent tag(s).
3. Repeat step 2 until tag assignments stabilize.
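The steps above can be sketched in a few lines of Python. This is a minimal illustration, not the production algorithm: the function name `propagate_tags`, the choice to keep seed tags fixed, the synchronous update, and the alphabetical tie-break are all assumptions made to keep the example deterministic.

```python
from collections import Counter

def propagate_tags(neighbors, seed_tags, max_iters=20):
    """neighbors: user -> list of connected users (friends and followers).
    seed_tags: user -> tag for the initially labeled subset.
    Returns a dict user -> tag after propagation."""
    tags = dict(seed_tags)
    for _ in range(max_iters):
        updated = dict(tags)
        for user, conns in neighbors.items():
            if user in seed_tags:
                continue  # assumption: seed labels stay fixed
            counts = Counter(tags[c] for c in conns if c in tags)
            if counts:
                # Most frequent tag; ties broken alphabetically (assumption).
                updated[user] = min(counts, key=lambda t: (-counts[t], t))
        if updated == tags:
            break  # assignments have stabilized
        tags = updated
    return tags

neighbors = {
    "a": ["b", "c"],
    "b": ["a", "c"],
    "c": ["a", "b", "d"],
    "d": ["c", "e"],
    "e": ["d"],
}
result = propagate_tags(neighbors, {"a": "sports", "e": "music"})
```

Weighting friends and followers differently would amount to adding a per-connection weight to the counter instead of counting each connection once.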
User Similarity Calculation
When the simple assumption behind tag propagation fails (connected users do not always share interests), similarity between users can be computed explicitly. All of a user's posts are aggregated and represented as a bag-of-words vector; similarity can then be measured with cosine similarity or KL divergence. A more sophisticated method uses LDA to obtain a topic distribution for each user, then compares these distributions.
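As a rough illustration of the bag-of-words approach, the sketch below aggregates a user's posts into word counts and compares two users by cosine similarity. The helper names are hypothetical, and splitting on whitespace is an assumption made for brevity: real Weibo text would need a Chinese word segmenter first.

```python
import math
from collections import Counter

def bag_of_words(posts):
    """Aggregate all of a user's posts into one word-count vector."""
    counts = Counter()
    for post in posts:
        counts.update(post.lower().split())  # assumption: whitespace tokens
    return counts

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

user_a = bag_of_words(["football match tonight", "great football goal"])
user_b = bag_of_words(["football goal highlights"])
user_c = bag_of_words(["new album release"])
sim_ab = cosine_similarity(user_a, user_b)  # shares words, so > 0
sim_ac = cosine_similarity(user_a, user_c)  # no shared words, so 0.0
```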
LDA Generation Process
1. For each word position in a document, draw a topic from the document's topic distribution.
2. From the chosen topic, draw a word according to the topic's word distribution.
3. Repeat steps 1 and 2 until the document is fully generated.
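The generative story can be simulated directly. The sketch below is only the sampling loop described above: it assumes the document's topic distribution and each topic's word distribution are already given, whereas full LDA also draws them from Dirichlet priors and, in practice, infers them from data. The function name and the toy distributions are hypothetical.

```python
import random

def generate_document(doc_topic_dist, topic_word_dists, length, rng):
    """Sample `length` words: pick a topic per word position, then a word."""
    topics = list(doc_topic_dist)
    topic_weights = [doc_topic_dist[t] for t in topics]
    words = []
    for _ in range(length):
        # Step 1: draw a topic from the document's topic distribution.
        topic = rng.choices(topics, weights=topic_weights)[0]
        # Step 2: draw a word from that topic's word distribution.
        word_dist = topic_word_dists[topic]
        vocab = list(word_dist)
        word = rng.choices(vocab, weights=[word_dist[w] for w in vocab])[0]
        words.append(word)
    return words

topic_word_dists = {
    "sports": {"goal": 0.6, "match": 0.4},
    "music": {"album": 0.5, "concert": 0.5},
}
doc = generate_document({"sports": 0.8, "music": 0.2}, topic_word_dists,
                        length=10, rng=random.Random(0))
```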
The resulting per-user topic vectors are compared with cosine similarity or KL divergence, and the resulting score weights each connection's vote in the tag-propagation step.
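KL divergence is asymmetric, so one common workaround is to average the two directions before turning the result into a weight. The sketch below does that; the smoothing constant `eps` and the mapping `1 / (1 + d)` from divergence to weight are assumptions chosen for illustration, not something the method prescribes.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two topic distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def symmetric_kl(p, q):
    """Average of both directions, so the measure no longer depends on order."""
    return 0.5 * (kl_divergence(p, q) + kl_divergence(q, p))

def similarity_weight(p, q):
    """Map divergence to a weight in (0, 1]; identical distributions give 1."""
    return 1.0 / (1.0 + symmetric_kl(p, q))

user = [0.7, 0.2, 0.1]
close_friend = [0.6, 0.3, 0.1]
distant_friend = [0.1, 0.1, 0.8]
w_close = similarity_weight(user, close_friend)
w_distant = similarity_weight(user, distant_friend)
```

In tag propagation, each connection's tag would then count `w` times instead of once.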
Time and Network Factors
Interests evolve over time, so similarity should consider recent posts. Selecting the N most recent posts (e.g., the latest 50) for each user before LDA training captures temporal dynamics. Additionally, interaction types such as reposts, replies, and mentions provide extra network signals: higher repost or @ frequency between two users suggests greater similarity.
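A minimal sketch of both ideas follows. `latest_posts` keeps only the N newest posts before any topic modeling, and `interaction_adjusted_similarity` blends a content similarity with an interaction count. The blend weight `alpha`, the `saturation` cap, and both function names are hypothetical choices for illustration only.

```python
def latest_posts(posts, n=50):
    """posts: list of (timestamp, text). Returns the n newest texts, newest first."""
    newest_first = sorted(posts, key=lambda p: p[0], reverse=True)
    return [text for _, text in newest_first[:n]]

def interaction_adjusted_similarity(content_sim, interactions,
                                    alpha=0.7, saturation=10):
    """Blend content similarity with how often two users repost or @ each other.
    Interaction counts are capped at `saturation` and scaled to [0, 1]."""
    signal = min(interactions, saturation) / saturation
    return alpha * content_sim + (1 - alpha) * signal

posts = [(100, "old post"), (300, "newest post"), (200, "middle post")]
recent = latest_posts(posts, n=2)
score = interaction_adjusted_similarity(content_sim=0.5, interactions=4)
```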
Community Detection
Communities are groups of tightly connected users. Two similarity measures are introduced:
Common‑friend similarity (Jaccard of friend sets).
Common‑follower similarity (Jaccard of follower sets).
These measures, combined with shortest‑path similarity, can be fused (e.g., weighted sum) and fed into clustering algorithms such as K‑Means or DBSCAN to obtain community clusters.
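The similarity measures and their fusion might look like the sketch below. The Jaccard function follows the definition above; converting a shortest-path length `d` into a similarity via `1 / (1 + d)` and the particular fusion weights are assumptions. The clustering step itself is omitted.

```python
from collections import deque

def jaccard(a, b):
    """Size of the intersection over size of the union of two sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def shortest_path_length(graph, source, target):
    """Breadth-first search over an adjacency dict; None if unreachable."""
    if source == target:
        return 0
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        for nxt in graph.get(node, ()):
            if nxt == target:
                return dist + 1
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None

def fused_similarity(u, v, friends, followers, graph, weights=(0.4, 0.4, 0.2)):
    """Weighted sum of friend Jaccard, follower Jaccard, and path similarity."""
    d = shortest_path_length(graph, u, v)
    path_sim = 0.0 if d is None else 1.0 / (1.0 + d)  # assumption: 1 / (1 + d)
    w_friend, w_follower, w_path = weights
    return (w_friend * jaccard(friends[u], friends[v])
            + w_follower * jaccard(followers[u], followers[v])
            + w_path * path_sim)
```

The pairwise scores would then form a similarity matrix that a clustering algorithm can consume, directly or after conversion to distances.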
Influence Calculation
Borrowing from PageRank, influence is propagated through the follower network. The algorithm iteratively distributes influence weight from each user to the users they follow until convergence.
1. Assign equal initial influence to all users.
2. Distribute each user's influence equally among the users they follow.
3. Update each user's influence as the sum of contributions from their followers.
4. Repeat steps 2 and 3 until the influence scores stabilize.
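The iteration can be sketched as follows. Two details go beyond the steps listed above and are borrowed from standard PageRank as assumptions: a damping factor (0.85 here), and uniform redistribution of influence held by users who follow no one, so that the scores keep summing to 1. The function name is hypothetical.

```python
def influence_rank(follows, damping=0.85, tol=1e-9, max_iters=100):
    """follows: user -> list of distinct users that this user follows.
    Returns user -> influence score; scores sum to 1."""
    users = set(follows)
    for targets in follows.values():
        users.update(targets)
    n = len(users)
    score = {u: 1.0 / n for u in users}  # equal initial influence
    for _ in range(max_iters):
        # Influence held by users who follow nobody is spread uniformly.
        dangling = sum(score[u] for u in users if not follows.get(u))
        new = {u: (1 - damping) / n + damping * dangling / n for u in users}
        for u, targets in follows.items():
            if targets:
                share = damping * score[u] / len(targets)
                for t in targets:
                    new[t] += share  # contribution from follower u
        change = sum(abs(new[u] - score[u]) for u in users)
        score = new
        if change < tol:
            break  # scores have stabilized
    return score

follows = {"a": ["c"], "b": ["c"], "c": ["a"], "d": ["c"]}
scores = influence_rank(follows)
top_user = max(scores, key=scores.get)  # "c", followed by three users
```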
Additional factors—such as activity level, post quality (repost and reply counts), and interaction networks (reply, repost, @)—can be incorporated to refine the ranking.
Topic and Domain Factors
Influence scores can be applied to specific topics. By retrieving posts related to a hotspot topic (using hashtags or LDA‑derived topics) and running the influence algorithm, one can identify opinion leaders for that topic or domain.
Spam User Identification
Spam accounts exhibit distinctive patterns: regular posting intervals (low entropy), high @‑mention ratios, excessive URLs, and mismatched content between posts and linked pages. Structural cues such as abnormal follower‑friend ratios and lack of triadic closure also help. These features can be fed into classifiers (logistic regression, decision trees, Naïve Bayes) to flag spam users, and a PageRank‑style propagation can further estimate spam probabilities.
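A few of these features are easy to compute from raw posts. The sketch below derives posting-interval entropy, the share of posts containing an @-mention, and the share containing a URL. The 60-second bin width, the URL pattern, and the function names are assumptions, and the output is only a feature dictionary that a classifier would consume; no classifier is trained here.

```python
import math
import re
from collections import Counter

URL_RE = re.compile(r"https?://\S+")

def interval_entropy(timestamps, bin_seconds=60):
    """Shannon entropy (bits) of binned gaps between consecutive posts.
    Perfectly regular posting gives 0; irregular posting gives higher values."""
    ordered = sorted(timestamps)
    gaps = [later - earlier for earlier, later in zip(ordered, ordered[1:])]
    if not gaps:
        return 0.0
    bins = Counter(int(gap // bin_seconds) for gap in gaps)
    total = len(gaps)
    return sum((c / total) * math.log2(total / c) for c in bins.values())

def spam_features(posts):
    """posts: list of (timestamp_seconds, text). Returns a feature dict."""
    texts = [text for _, text in posts]
    n = len(texts) or 1  # avoid division by zero for empty accounts
    return {
        "interval_entropy": interval_entropy([ts for ts, _ in posts]),
        "mention_ratio": sum("@" in t for t in texts) / n,
        "url_ratio": sum(bool(URL_RE.search(t)) for t in texts) / n,
    }

bot_like = [(0, "deal http://x.example @a"), (600, "deal http://x.example @b"),
            (1200, "deal http://x.example @c"), (1800, "deal http://x.example @d")]
features = spam_features(bot_like)
```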
Conclusion
The presented algorithms provide a foundation for analyzing Sina Weibo data. While real‑world systems are more complex, the discussed methods—tag propagation, similarity via LDA, time‑aware weighting, community detection, influence ranking, and spam detection—demonstrate how social‑network analysis can uncover hidden patterns and improve platform services.
21CTO