Build a Real‑Time Weibo Hot Search Scraper with Python
This tutorial shows how to use Python's requests, lxml, and pandas libraries to crawl Weibo's real‑time hot search list, extract ranking, popularity, titles, and detailed comments, and schedule the script to run automatically each hour.
Page Analysis
Hot Search Page
The hot list homepage (https://s.weibo.com/top/summary?cate=realtimehot) displays fifty items. For each item we need to capture the rank, heat, title, and detail‑page link. After logging in, open the developer tools (F12), refresh the page (Ctrl+R), and record the Cookie and User‑Agent values from the request headers.
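Use Google Chrome's developer tools to obtain the XPath expressions for the required elements: right-click the element in the Elements panel and choose Copy > Copy XPath.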
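A copied XPath can be tried on a small HTML fragment with lxml before it is run against the live site. The markup below is a hypothetical, simplified stand-in for the hot-search table, not a copy of Weibo's page, and parse_rows is an illustrative name; the real structure should be confirmed in the browser.

```python
from lxml import etree

# Hypothetical sample markup: an assumption for illustration only.
SAMPLE_HTML = """
<div id="pl_top_realtimehot">
  <table>
    <tbody>
      <tr>
        <td>1</td>
        <td><a href="/weibo?q=example">Example topic</a><span>123456</span></td>
      </tr>
    </tbody>
  </table>
</div>
"""

def parse_rows(html):
    """Return one dict per table row, using XPaths relative to each <tr>."""
    tree = etree.HTML(html)
    rows = []
    for tr in tree.xpath('//*[@id="pl_top_realtimehot"]/table/tbody/tr'):
        rows.append({
            'rank': tr.xpath('./td[1]/text()')[0],
            'link': tr.xpath('./td[2]/a/@href')[0],
            'title': tr.xpath('./td[2]/a/text()')[0],
            'heat': tr.xpath('./td[2]/span/text()')[0],
        })
    return rows

print(parse_rows(SAMPLE_HTML))
```

Browsers add a tbody element to tables whose source omits it, so an XPath copied from the developer tools can contain a tbody step that is missing from the downloaded HTML. If an expression returns an empty list, that step is worth checking first.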
Detail Page
From the detail page we extract comment time, user name, forward count, comment count, like count, and comment content. The extraction method is similar to the hot‑search page.
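Text nodes returned by XPath are usually padded with spaces and line breaks, so the extracted fields need a little cleanup. A minimal sketch of two helpers follows; strip_whitespace and join_text_nodes are hypothetical names introduced here, and the sample strings are made up.

```python
import re

def strip_whitespace(raw):
    """Remove every whitespace character, e.g. from a timestamp string."""
    return re.sub(r'\s', '', raw)

def join_text_nodes(nodes):
    """Join the text nodes of one element into a single line of text."""
    parts = [node.strip() for node in nodes]
    return ' '.join(part for part in parts if part)

print(strip_whitespace('\n   12:30  \n'))
print(join_text_nodes(['\n  First part', '  ', 'second part\n']))
```

Note that strip_whitespace also removes spaces inside the string, which is fine for a compact timestamp but not for free text; that is why the comment body is joined rather than stripped of all whitespace.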
Scraping Code
Import the required modules.
import requests
from time import sleep
import pandas as pd
import numpy as np
from lxml import etree
import re

Define global variables. headers holds the request headers; all_df is the DataFrame that stores the collected data.
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.54 Safari/537.36',
    'Cookie': '''your Cookie here'''
}
all_df = pd.DataFrame(columns=['Rank', 'Heat', 'Title', 'Comment Time', 'User Name', 'Forward Count', 'Comment Count', 'Like Count', 'Comment'])

The function get_hot_list(url) sends a request to the hot‑search page, parses the table rows, extracts the detail‑page URL, title, rank, and heat, then calls get_detail_page for each item.
def get_hot_list(url):
    # Download the hot-search page and parse it into an element tree
    page_text = requests.get(url=url, headers=headers).text
    tree = etree.HTML(page_text)
    tr_list = tree.xpath('//*[@id="pl_top_realtimehot"]/table/tbody/tr')
    for tr in tr_list:
        parse_url = tr.xpath('./td[2]/a/@href')[0]
        detail_url = 'https://s.weibo.com' + parse_url
        title = tr.xpath('./td[2]/a/text()')[0]
        try:
            rank = tr.xpath('./td[1]/text()')[0]
            hot = tr.xpath('./td[2]/span/text()')[0]
        except IndexError:
            # A row without rank or heat text is treated as the pinned entry
            rank = 'Pinned'
            hot = 'Pinned'
        get_detail_page(detail_url, title, rank, hot)

The function get_detail_page(detail_url, title, rank, hot) fetches the detail page, extracts up to three top comments, and appends the results to all_df.
def get_detail_page(detail_url, title, rank, hot):
    global all_df
    try:
        page_text = requests.get(url=detail_url, headers=headers).text
    except requests.RequestException:
        # Skip this topic if the detail page cannot be downloaded
        return None
    tree = etree.HTML(page_text)
    result_df = pd.DataFrame(columns=np.array(all_df.columns))
    for i in range(1, 4):
        # Shared XPath prefix of the i-th comment card
        base = f'//*[@id="pl_feedlist_index"]/div[4]/div[{i}]/div[2]'
        try:
            comment_time = tree.xpath(base + '/div[1]/div[2]/p[1]/a/text()')[0]
            comment_time = re.sub(r'\s', '', comment_time)
            user_name = tree.xpath(base + '/div[1]/div[2]/p[2]/@nick-name')[0]
            forward_count = tree.xpath(base + '/div[2]/ul/li[1]/a/text()')[1].strip()
            comment_count = tree.xpath(base + '/div[2]/ul/li[2]/a/text()')[0].strip()
            like_count = tree.xpath(base + '/div[2]/ul/li[3]/a/button/span[2]/text()')[0]
            comment = tree.xpath(base + '/div[1]/div[2]/p[2]//text()')
            comment = ' '.join(comment).strip()
            result_df.loc[len(result_df), :] = [rank, hot, title, comment_time, user_name, forward_count, comment_count, like_count, comment]
        except Exception as e:
            # Report the failure and move on to the next comment
            print(e)
            continue
    # DataFrame.append was removed in pandas 2.0; pd.concat replaces it
    all_df = pd.concat([all_df, result_df], ignore_index=True)

Finally, schedule the script to run hourly using the Windows Task Scheduler. Adjust the Cookie and output file path, then export the notebook as a .py file and create a scheduled task that executes the script.
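if __name__ == '__main__':
    url = 'https://s.weibo.com/top/summary?cate=realtimehot'
    get_hot_list(url)
    all_df.to_excel('WeiboHotSearch.xlsx', index=False)

The try/except blocks skip any comment that fails to parse and any detail page that fails to download, so a single bad item does not stop the run. Writing .xlsx files with to_excel requires an Excel engine such as openpyxl to be installed.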
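Because the scheduled task reruns the script each hour with a fixed file name, every run overwrites the previous workbook. One possible way to keep each snapshot, sketched below, is to put the run time in the file name. The helper build_output_path and the naming pattern are illustrative assumptions, not part of the original script.

```python
from datetime import datetime
from pathlib import Path

def build_output_path(directory, now=None):
    """Return a path whose file name carries a YYYYMMDD_HHMM timestamp."""
    now = now or datetime.now()
    stamp = now.strftime('%Y%m%d_%H%M')
    return Path(directory) / f'WeiboHotSearch_{stamp}.xlsx'

# Example with a fixed time so the result is reproducible
print(build_output_path('.', datetime(2024, 1, 1, 9, 0)).name)
```

In the main block, the call would then become all_df.to_excel(build_output_path('.'), index=False), which writes one file per run.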
