How to Build a Python Weibo Hot Search Scraper and Automate Data Collection
This tutorial walks you through creating a Python web scraper that uses a logged-in Weibo session Cookie to fetch the real-time hot search list and the top three comments for each topic with requests, lxml and pandas, saves the data to Excel, and runs hourly through Windows Task Scheduler.
Page Analysis
Hot Search Page
The hot search list is available at https://s.weibo.com/top/summary?cate=realtimehot. After logging in, open the developer tools (F12), refresh the page (Ctrl+R), and copy the Cookie and User-Agent values from the request headers. To obtain XPath expressions for the required elements, use the browser's element inspector (for example, Copy XPath in Google Chrome).
For the detail page, similar steps are used to capture comment time, user name, forward count, comment count, like count, and comment content.
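As a minimal sketch of how such XPath expressions behave, the example below runs the same row and cell paths used later in this tutorial against a small hand-written HTML sample. The sample markup and the parse_rows helper are assumptions made for illustration; the real Weibo page may be structured differently and can change at any time.

```python
from lxml import etree

# Hand-written sample that imitates the table layout the XPath expects.
# This is NOT real Weibo markup, only an assumption for illustration.
SAMPLE_HTML = """
<div id="pl_top_realtimehot">
  <table>
    <tbody>
      <tr>
        <td><i class="icon-top"></i></td>
        <td><a href="/weibo?q=pinned">Pinned topic</a></td>
      </tr>
      <tr>
        <td>1</td>
        <td><a href="/weibo?q=first">First topic</a><span>123456</span></td>
      </tr>
    </tbody>
  </table>
</div>
"""


def parse_rows(html):
    """Return one dict per table row (hypothetical helper)."""
    tree = etree.HTML(html)
    rows = []
    for tr in tree.xpath('//*[@id="pl_top_realtimehot"]/table/tbody/tr'):
        # xpath() returns a list; an empty list means the node is missing
        rank = tr.xpath('./td[1]/text()')
        hot = tr.xpath('./td[2]/span/text()')
        rows.append({
            'rank': rank[0].strip() if rank else 'Pinned',
            'hot': hot[0].strip() if hot else 'Pinned',
            'title': tr.xpath('./td[2]/a/text()')[0],
            'url': 'https://s.weibo.com' + tr.xpath('./td[2]/a/@href')[0],
        })
    return rows


for row in parse_rows(SAMPLE_HTML):
    print(row)
```

Testing expressions against a saved copy of the page in this way makes it easier to see which paths return an empty list before running the full scraper.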
Scraping Code
First import the required modules.
import requests
from time import sleep
import pandas as pd
import numpy as np
from lxml import etree
import re
Define the global variables: headers holds the request headers, and all_df is the DataFrame that stores the collected data.
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.54 Safari/537.36',
    'Cookie': '''your Cookie here'''
}
all_df = pd.DataFrame(columns=['Rank', 'Hot', 'Title', 'Comment Time', 'User Name', 'Forward Count', 'Comment Count', 'Like Count', 'Comment Content'])
The function get_hot_list(url) fetches the hot-search page, extracts each entry's rank, hotness, title, and detail URL, then calls get_detail_page for further processing.
def get_hot_list(url):
    page_text = requests.get(url=url, headers=headers).text
    tree = etree.HTML(page_text)
    # Each table row is one hot-search entry
    tr_list = tree.xpath('//*[@id="pl_top_realtimehot"]/table/tbody/tr')
    for tr in tr_list:
        parse_url = tr.xpath('./td[2]/a/@href')[0]
        detail_url = 'https://s.weibo.com' + parse_url
        title = tr.xpath('./td[2]/a/text()')[0]
        try:
            rank = tr.xpath('./td[1]/text()')[0]
            hot = tr.xpath('./td[2]/span/text()')[0]
        except IndexError:
            # A row without a rank or hotness value is labeled as pinned
            rank = 'Pinned'
            hot = 'Pinned'
        get_detail_page(detail_url, title, rank, hot)
The function get_detail_page(detail_url, title, rank, hot) parses the detail page and extracts the top three comments, appending each record to all_df.
def get_detail_page(detail_url, title, rank, hot):
    global all_df
    try:
        page_text = requests.get(url=detail_url, headers=headers).text
    except requests.RequestException:
        # Skip this topic if the detail page cannot be fetched
        return None
    tree = etree.HTML(page_text)
    result_df = pd.DataFrame(columns=np.array(all_df.columns))
    # Collect the first three entries on the detail page
    for i in range(1, 4):
        # Common XPath prefix of the i-th entry
        base = f'//*[@id="pl_feedlist_index"]/div[4]/div[{i}]/div[2]'
        try:
            comment_time = tree.xpath(base + '/div[1]/div[2]/p[1]/a/text()')[0]
            comment_time = re.sub(r'\s', '', comment_time)
            user_name = tree.xpath(base + '/div[1]/div[2]/p[2]/@nick-name')[0]
            forward_count = tree.xpath(base + '/div[2]/ul/li[1]/a/text()')[1].strip()
            comment_count = tree.xpath(base + '/div[2]/ul/li[2]/a/text()')[0].strip()
            like_count = tree.xpath(base + '/div[2]/ul/li[3]/a/button/span[2]/text()')[0]
            comment = tree.xpath(base + '/div[1]/div[2]/p[2]//text()')
            comment = ' '.join(comment).strip()
            result_df.loc[len(result_df), :] = [rank, hot, title, comment_time, user_name, forward_count, comment_count, like_count, comment]
        except Exception as e:
            # Report the problem and move on to the next entry
            print(e)
            continue
    # DataFrame.append was removed in pandas 2.0; pd.concat replaces it
    all_df = pd.concat([all_df, result_df], ignore_index=True)
Finally, run the scraper and save the results.
if __name__ == '__main__':
    url = 'https://s.weibo.com/top/summary?cate=realtimehot'
    get_hot_list(url)
    all_df.to_excel('Weibo_Hot_Search.xlsx', index=False)
Note: Exceptions are caught so that a single failed request or missing element does not stop the whole run.
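One possible way to keep the script running without hiding unrelated bugs is to catch only the errors you expect and log them. The sketch below is an illustration rather than part of the original script: first_or_default and run_step are hypothetical helper names, and the set of caught exception types is an assumption you may want to adjust.

```python
import logging

logger = logging.getLogger("weibo_scraper")


def first_or_default(nodes, default=""):
    """Return the first XPath result as stripped text, or default if empty."""
    if not nodes:
        return default
    return str(nodes[0]).strip()


def run_step(func, *args, fallback=None, **kwargs):
    """Call func and return fallback when an expected error occurs."""
    try:
        return func(*args, **kwargs)
    except (IndexError, ValueError, OSError) as exc:
        # requests.RequestException derives from IOError (an alias of
        # OSError), so network failures raised by requests land here too.
        logger.warning("step %s failed: %s", getattr(func, "__name__", func), exc)
        return fallback


# An empty XPath result falls back to a default instead of raising
print(first_or_default([], default="Pinned"))
print(first_or_default(["  12345 "]))
```

With helpers like these, a missing element yields a default value and a failed request is logged, while unexpected errors such as a typo in a variable name still surface immediately.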
Scheduling the Script
To run the script automatically every hour, use Windows Task Scheduler. Export the notebook as a .py file and adjust the Cookie and output path. Then create a new task, add a trigger that repeats every hour, and set the action to start the Python interpreter with the script path as its argument.
After confirming, the task will run at the scheduled times, or you can start it manually.
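Two details may be worth handling once the script runs every hour: each run writes to the same file name, and the task can also be created from the command line instead of the graphical wizard. The sketch below shows one possible approach under stated assumptions. The helper names, the task name, and the example paths are hypothetical, and the schtasks command is only built and printed here, not executed. It also assumes that neither path contains spaces, since quoting inside the /TR value needs extra care.

```python
import sys
from datetime import datetime
from pathlib import Path


def timestamped_output(directory, now=None):
    """Build an output path such as Weibo_Hot_Search_20240102_0304.xlsx."""
    now = now or datetime.now()
    return Path(directory) / f"Weibo_Hot_Search_{now:%Y%m%d_%H%M}.xlsx"


def build_schtasks_command(task_name, script_path, python_exe=sys.executable):
    """Return the argument list for an hourly schtasks job (Windows only)."""
    if " " in str(python_exe) or " " in str(script_path):
        raise ValueError("this sketch assumes paths without spaces")
    return [
        "schtasks", "/Create",
        "/TN", task_name,                       # task name
        "/TR", f"{python_exe} {script_path}",   # program and its argument
        "/SC", "HOURLY", "/MO", "1",            # repeat every 1 hour
        "/F",                                   # overwrite an existing task
    ]


# Example values below are placeholders, not paths from the article
print(timestamped_output("output", datetime(2024, 1, 2, 3, 4)))
print(build_schtasks_command("WeiboHotSearch",
                             r"C:\scripts\weibo_hot_search.py",
                             r"C:\Python\python.exe"))
```

On Windows, the returned list could be passed to subprocess.run(cmd, check=True) to register the task. Passing the result of timestamped_output to all_df.to_excel would keep one file per run instead of overwriting the previous one.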
