How to Build a Python Weibo Hot Search Scraper and Automate Data Collection

This tutorial walks you through building a Python web scraper that reuses the Cookie from a logged-in Weibo session, extracts the real-time hot search list and the top three comments for each entry using requests, lxml and pandas, saves the data to Excel, and schedules the script to run hourly with Windows Task Scheduler.

Python Crawling & Data Mining

Page Analysis

Hot Search Page

The hot search list is available at https://s.weibo.com/top/summary?cate=realtimehot. After logging in, open the developer tools (F12), refresh the page (Ctrl+R), and copy the Cookie and User-Agent from the request headers. Use a browser tool such as Google Chrome's developer tools (right-click an element in the Elements panel, then Copy > Copy XPath) to obtain the XPath expressions for the required elements.
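
One caveat when copying XPath expressions from the browser: the Elements panel shows the page after the browser has normalized it, and browsers add a tbody element to tables whose markup omits one, whereas lxml parses the markup as served. The sketch below uses a made-up HTML fragment (not Weibo's real markup) to show how a copied expression can be checked against the raw source before relying on it.

```python
from lxml import etree

# Hypothetical fragment for illustration only; it is not Weibo's real markup.
RAW_HTML = '<div id="demo"><table><tr><td>1</td><td>Topic</td></tr></table></div>'

tree = etree.HTML(RAW_HTML)

# Expression as a browser might report it, including a tbody the browser added
with_tbody = tree.xpath('//*[@id="demo"]/table/tbody/tr')
# Expression that matches the markup as it was actually served
without_tbody = tree.xpath('//*[@id="demo"]/table/tr')

print(len(with_tbody), len(without_tbody))
```

If an expression returns an empty list when run against the text returned by requests, printing that text and comparing it with the Elements panel usually shows where the two differ.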

Hot search page layout

Repeat the same steps on a detail page to obtain the XPath expressions for the comment time, user name, forward count, comment count, like count, and comment content.

Detail page data

Scraping Code

First import the required modules.

import requests
from time import sleep
import pandas as pd
import numpy as np
from lxml import etree
import re

Define two global variables: headers, the request headers, and all_df, the DataFrame that stores the collected data.

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.54 Safari/537.36',
    'Cookie': '''your Cookie here'''
}
all_df = pd.DataFrame(columns=['Rank', 'Hot', 'Title', 'Comment Time', 'User Name', 'Forward Count', 'Comment Count', 'Like Count', 'Comment Content'])
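
Because the Cookie belongs to a login session and may need replacing, one option is to read it from an environment variable instead of hard-coding it. This is an assumption of the sketch, not part of the original script: the variable name WEIBO_COOKIE and the helper build_headers are hypothetical.

```python
import os

USER_AGENT = (
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
    '(KHTML, like Gecko) Chrome/94.0.4606.54 Safari/537.36'
)

def build_headers(env_var='WEIBO_COOKIE'):
    """Build the request headers, taking the Cookie from an environment variable.

    env_var is a hypothetical name chosen for this sketch.
    """
    cookie = os.environ.get(env_var)
    if not cookie:
        # Fail early with a clear message instead of scraping a logged-out page
        raise RuntimeError(f'Set the {env_var} environment variable to your Weibo Cookie')
    return {'User-Agent': USER_AGENT, 'Cookie': cookie}

# Usage, once the variable is set:
# headers = build_headers()
```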

The function get_hot_list(url) fetches the hot‑search page, extracts each entry’s rank, hotness, title, and detail URL, then calls get_detail_page for further processing.

def get_hot_list(url):
    # Download the hot search page and parse it into an element tree
    page_text = requests.get(url=url, headers=headers).text
    tree = etree.HTML(page_text)
    # Each table row is one hot search entry
    tr_list = tree.xpath('//*[@id="pl_top_realtimehot"]/table/tbody/tr')
    for tr in tr_list:
        parse_url = tr.xpath('./td[2]/a/@href')[0]
        detail_url = 'https://s.weibo.com' + parse_url
        title = tr.xpath('./td[2]/a/text()')[0]
        try:
            rank = tr.xpath('./td[1]/text()')[0]
            hot = tr.xpath('./td[2]/span/text()')[0]
        except IndexError:
            # A row without a rank or hotness value is treated as the pinned entry
            rank = 'Pinned'
            hot = 'Pinned'
        get_detail_page(detail_url, title, rank, hot)
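
The row-parsing logic can be tried offline before sending any requests. The sketch below runs the same XPath expressions against a made-up fragment that mimics the row layout those expressions expect; Weibo's real markup may differ, and parse_hot_rows is a hypothetical helper, not part of the original script.

```python
from lxml import etree

# Made-up fragment for illustration: one pinned row and one ranked row.
SAMPLE_HTML = (
    '<div id="pl_top_realtimehot"><table><tbody>'
    '<tr><td><i></i></td><td><a href="/weibo?q=pinned">Pinned topic</a></td></tr>'
    '<tr><td>1</td><td><a href="/weibo?q=b">Topic B</a><span>12345</span></td></tr>'
    '</tbody></table></div>'
)

def parse_hot_rows(page_text):
    """Return one dict per table row, using 'Pinned' when rank or hotness is missing."""
    tree = etree.HTML(page_text)
    rows = []
    for tr in tree.xpath('//*[@id="pl_top_realtimehot"]/table/tbody/tr'):
        title = tr.xpath('./td[2]/a/text()')[0]
        detail_url = 'https://s.weibo.com' + tr.xpath('./td[2]/a/@href')[0]
        try:
            rank = tr.xpath('./td[1]/text()')[0]
            hot = tr.xpath('./td[2]/span/text()')[0]
        except IndexError:
            # xpath() returned an empty list, so indexing [0] failed
            rank = hot = 'Pinned'
        rows.append({'rank': rank, 'hot': hot, 'title': title, 'url': detail_url})
    return rows

rows = parse_hot_rows(SAMPLE_HTML)
for row in rows:
    print(row)
```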

The function get_detail_page(detail_url, title, rank, hot) parses the detail page and extracts the top three comments, appending each record to all_df.

def get_detail_page(detail_url, title, rank, hot):
    global all_df
    try:
        page_text = requests.get(url=detail_url, headers=headers).text
    except requests.RequestException:
        # Skip this entry if the detail page cannot be fetched
        return None
    tree = etree.HTML(page_text)
    result_df = pd.DataFrame(columns=np.array(all_df.columns))
    for i in range(1, 4):
        # XPath prefix shared by every field of the i-th comment
        card = f'//*[@id="pl_feedlist_index"]/div[4]/div[{i}]/div[2]'
        try:
            comment_time = tree.xpath(card + '/div[1]/div[2]/p[1]/a/text()')[0]
            # Raw string avoids an invalid escape sequence warning for \s
            comment_time = re.sub(r'\s', '', comment_time)
            user_name = tree.xpath(card + '/div[1]/div[2]/p[2]/@nick-name')[0]
            forward_count = tree.xpath(card + '/div[2]/ul/li[1]/a/text()')[1].strip()
            comment_count = tree.xpath(card + '/div[2]/ul/li[2]/a/text()')[0].strip()
            like_count = tree.xpath(card + '/div[2]/ul/li[3]/a/button/span[2]/text()')[0]
            comment = tree.xpath(card + '/div[1]/div[2]/p[2]//text()')
            comment = ' '.join(comment).strip()
            result_df.loc[len(result_df), :] = [rank, hot, title, comment_time, user_name, forward_count, comment_count, like_count, comment]
        except Exception as e:
            # A missing field skips this comment only
            print(e)
            continue
    # DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
    all_df = pd.concat([all_df, result_df], ignore_index=True)
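
An alternative to growing all_df inside get_detail_page (sketched here as an assumption, not the original design) is to collect each record in a plain Python list and build the DataFrame once at the end. This also avoids DataFrame.append, which was removed in pandas 2.0. The helper build_frame and the sample values are hypothetical.

```python
import pandas as pd

COLUMNS = ['Rank', 'Hot', 'Title', 'Comment Time', 'User Name',
           'Forward Count', 'Comment Count', 'Like Count', 'Comment Content']

def build_frame(records):
    """Turn a list of 9-item records (in COLUMNS order) into a DataFrame."""
    return pd.DataFrame(records, columns=COLUMNS)

# Made-up records standing in for what the scraper would collect
records = []
records.append(['1', '12345', 'Topic B', '10-01 09:00', 'user_a', '3', '5', '8', 'first comment'])
records.append(['1', '12345', 'Topic B', '10-01 09:05', 'user_b', '0', '1', '2', 'second comment'])

all_df = build_frame(records)
print(all_df.shape)
```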

Finally, run the scraper and save the results.

if __name__ == '__main__':
    url = 'https://s.weibo.com/top/summary?cate=realtimehot'
    get_hot_list(url)
    all_df.to_excel('Weibo_Hot_Search.xlsx', index=False)
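
Because to_excel overwrites an existing file of the same name, an hourly run would replace the previous hour's results. A possible workaround, shown as a sketch with a hypothetical helper timestamped_path, is to put the run time in the file name.

```python
from datetime import datetime
from pathlib import Path

def timestamped_path(out_dir, now=None):
    """Return out_dir/Weibo_Hot_Search_YYYYMMDD_HHMM.xlsx for the given time."""
    now = now or datetime.now()
    return Path(out_dir) / f'Weibo_Hot_Search_{now:%Y%m%d_%H%M}.xlsx'

example = timestamped_path('output', datetime(2021, 10, 1, 9, 0))
print(example.name)

# In the script, the save step could then become:
# all_df.to_excel(timestamped_path('.'), index=False)
```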

Note: Exceptions raised by a missing rank or comment field, or by a failed detail page request, are caught so that one bad entry does not stop the run.

Excel output example

Scheduling the Script

To run the script automatically every hour, use Windows Task Scheduler. Export the notebook as a .py file, adjust the Cookie and output path, then create a new task, set a trigger that repeats hourly, and set the action to run the Python interpreter with the path of the .py file as its argument.
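
The same kind of task can also be registered from the command line with schtasks, the command-line counterpart of Task Scheduler. The sketch below only builds and prints the command and does not create anything; the task name and both paths are placeholders, and build_schtasks_command is a hypothetical helper.

```python
import sys

def build_schtasks_command(task_name, script_path, python_exe=sys.executable):
    """Build an argument list for schtasks that runs script_path every hour."""
    # Quote both paths so spaces in either one do not split the command
    run_target = f'"{python_exe}" "{script_path}"'
    return ['schtasks', '/Create', '/SC', 'HOURLY', '/TN', task_name, '/TR', run_target]

# Placeholder values; replace them with your own interpreter and script paths
cmd = build_schtasks_command(
    'WeiboHotSearch',
    r'C:\scripts\weibo_hot.py',
    python_exe=r'C:\Python39\python.exe',
)
print(cmd)
```

On Windows, passing the list to subprocess.run(cmd, check=True) would create the task; the graphical steps described above achieve the same result.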

Task Scheduler – Create Task
Task Scheduler – Set Trigger
Task Scheduler – Set Action

After confirming, the task will run at the scheduled times, or you can start it manually.

Running result screenshot
Written by

Python Crawling & Data Mining
