How to Build a Python Weibo Hot Search Scraper and Automate Data Collection

This tutorial walks you through creating a Python web scraper that reuses the Cookie from a logged-in Weibo session, extracts the real-time hot search list and the top comments for each entry using requests, lxml and pandas, saves the data to Excel, and schedules the script to run hourly with Windows Task Scheduler.

Python Crawling & Data Mining

Mar 15, 2022

Page Analysis

Hot Search Page

The hot search list is available at https://s.weibo.com/top/summary?cate=realtimehot. After logging in, open the browser's developer tools (F12), refresh the page (Ctrl+R), and copy the Cookie and User-Agent headers from the captured request. To obtain the XPath expressions for the required elements, use the browser's own tooling, for example Copy XPath in the Elements panel of Chrome DevTools.

For the detail page, similar steps are used to capture comment time, user name, forward count, comment count, like count, and comment content.
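Before wiring an XPath expression into the scraper, it can help to try it against a small piece of HTML. The following is a minimal sketch: SAMPLE_HTML is a made-up fragment that only imitates the table layout the expressions in this tutorial assume, and the real page markup may differ.

```python
from lxml import etree

# Hypothetical fragment imitating the hot-search table; not the real page source.
SAMPLE_HTML = """
<div id="pl_top_realtimehot">
  <table>
    <tbody>
      <tr>
        <td>1</td>
        <td><a href="/weibo?q=example-one">Example topic one</a><span>12345</span></td>
      </tr>
      <tr>
        <td>2</td>
        <td><a href="/weibo?q=example-two">Example topic two</a><span>6789</span></td>
      </tr>
    </tbody>
  </table>
</div>
"""

tree = etree.HTML(SAMPLE_HTML)

# Same row expression the scraper uses for the hot-search list
rows = tree.xpath('//*[@id="pl_top_realtimehot"]/table/tbody/tr')

# Relative expressions are evaluated against each row
titles = [row.xpath('./td[2]/a/text()')[0] for row in rows]
links = [row.xpath('./td[2]/a/@href')[0] for row in rows]

print(titles)
print(links)
```

One thing to watch for: browsers add a tbody element to the DOM even when the raw HTML has none, while lxml parses the HTML as sent. If an expression copied from the browser returns an empty list, removing the tbody step is worth trying.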

Scraping Code

First import the required modules.

import requests
from time import sleep
import pandas as pd
import numpy as np
from lxml import etree
import re

Define two global variables: headers, the request headers, and all_df, the DataFrame that stores the collected data.

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.54 Safari/537.36',
    'Cookie': '''your Cookie here'''
}
all_df = pd.DataFrame(columns=['Rank', 'Hot', 'Title', 'Comment Time', 'User Name', 'Forward Count', 'Comment Count', 'Like Count', 'Comment Content'])

The function get_hot_list(url) fetches the hot‑search page, extracts each entry’s rank, hotness, title, and detail URL, then calls get_detail_page for further processing.

def get_hot_list(url):
    # Fetch the hot-search page; the timeout keeps a stalled request from hanging the run
    page_text = requests.get(url=url, headers=headers, timeout=10).text
    tree = etree.HTML(page_text)
    # One table row per hot-search entry
    tr_list = tree.xpath('//*[@id="pl_top_realtimehot"]/table/tbody/tr')
    for tr in tr_list:
        parse_url = tr.xpath('./td[2]/a/@href')[0]
        detail_url = 'https://s.weibo.com' + parse_url
        title = tr.xpath('./td[2]/a/text()')[0]
        try:
            rank = tr.xpath('./td[1]/text()')[0]
            hot = tr.xpath('./td[2]/span/text()')[0]
        except IndexError:
            # Rows without a rank or hotness value are labeled as pinned
            rank = 'Pinned'
            hot = 'Pinned'
        get_detail_page(detail_url, title, rank, hot)
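The try/except around the rank and hotness lookups guards against XPath queries that return an empty list. An alternative is a small helper that returns a default instead of raising. This is only a sketch; first_or_default is a hypothetical name, not part of the original script.

```python
def first_or_default(matches, default=''):
    """Return the first item of an XPath result list, stripped, or default when the list is empty."""
    return matches[0].strip() if matches else default


# XPath text() queries return plain lists of strings, so ordinary lists stand in here
print(first_or_default(['  3 ']))
print(first_or_default([], default='Pinned'))
```

With such a helper each field falls back independently, whereas the try/except version labels both rank and hotness as pinned as soon as either lookup fails.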

The function get_detail_page(detail_url, title, rank, hot) parses the detail page and extracts the top three comments, appending each record to all_df.

def get_detail_page(detail_url, title, rank, hot):
    global all_df
    try:
        page_text = requests.get(url=detail_url, headers=headers, timeout=10).text
    except requests.RequestException:
        # Skip this topic if the detail page cannot be fetched
        return None
    tree = etree.HTML(page_text)
    result_df = pd.DataFrame(columns=all_df.columns)
    for i in range(1, 4):
        # All fields of the i-th entry share this XPath prefix
        base = f'//*[@id="pl_feedlist_index"]/div[4]/div[{i}]/div[2]'
        try:
            comment_time = tree.xpath(f'{base}/div[1]/div[2]/p[1]/a/text()')[0]
            comment_time = re.sub(r'\s', '', comment_time)
            user_name = tree.xpath(f'{base}/div[1]/div[2]/p[2]/@nick-name')[0]
            forward_count = tree.xpath(f'{base}/div[2]/ul/li[1]/a/text()')[1].strip()
            comment_count = tree.xpath(f'{base}/div[2]/ul/li[2]/a/text()')[0].strip()
            like_count = tree.xpath(f'{base}/div[2]/ul/li[3]/a/button/span[2]/text()')[0]
            comment = tree.xpath(f'{base}/div[1]/div[2]/p[2]//text()')
            comment = ' '.join(comment).strip()
            result_df.loc[len(result_df), :] = [rank, hot, title, comment_time, user_name, forward_count, comment_count, like_count, comment]
        except Exception as e:
            # A missing element skips this entry only; the rest are still processed
            print(e)
            continue
    # DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
    if not result_df.empty:
        all_df = pd.concat([all_df, result_df], ignore_index=True)
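Growing a DataFrame one row at a time copies data on every step. A common alternative is to collect plain records in a list and build the DataFrame once at the end. The sketch below illustrates that pattern with made-up values; add_record and rows are hypothetical names, not part of the original script.

```python
import pandas as pd

COLUMNS = ['Rank', 'Hot', 'Title', 'Comment Time', 'User Name',
           'Forward Count', 'Comment Count', 'Like Count', 'Comment Content']

rows = []  # one dict per scraped comment


def add_record(*values):
    """Store one record; values must follow the order of COLUMNS."""
    if len(values) != len(COLUMNS):
        raise ValueError(f'expected {len(COLUMNS)} values, got {len(values)}')
    rows.append(dict(zip(COLUMNS, values)))


# Made-up sample records, for illustration only
add_record('1', '12345', 'Example topic one', '5 minutes ago', 'user_a', '10', '20', '30', 'First comment')
add_record('1', '12345', 'Example topic one', '8 minutes ago', 'user_b', '1', '2', '3', 'Second comment')

# Build the DataFrame in a single step
df = pd.DataFrame(rows, columns=COLUMNS)
print(df[['Rank', 'Title', 'User Name']])
```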

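Finally, run the scraper and save the results. Writing .xlsx files with to_excel requires an Excel writer package such as openpyxl to be installed.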

if __name__ == '__main__':
    url = 'https://s.weibo.com/top/summary?cate=realtimehot'
    get_hot_list(url)
    all_df.to_excel('Weibo_Hot_Search.xlsx', index=False)

Note: Exceptions are caught so that a failed request or a missing page element skips the affected record instead of aborting the whole run.
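Because the script writes to a fixed filename, every run overwrites the previous output. If the results of repeated runs should be kept, one option is to put a timestamp in the filename. A minimal sketch, where timestamped_path is a hypothetical helper and the naming scheme is an assumption:

```python
from datetime import datetime
from pathlib import Path


def timestamped_path(output_dir, now=None):
    """Return an output path such as Weibo_Hot_Search_20220315_0900.xlsx inside output_dir."""
    now = now or datetime.now()
    return Path(output_dir) / f'Weibo_Hot_Search_{now:%Y%m%d_%H%M}.xlsx'


# A fixed datetime is passed here only to make the example reproducible
print(timestamped_path('.', datetime(2022, 3, 15, 9, 0)))
```

The returned path can be passed straight to all_df.to_excel(...) in place of the fixed filename.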

Scheduling the Script

To run the script automatically every hour, use Windows Task Scheduler. Export the notebook as a .py file and adjust the Cookie and output path. Then create a new task, set a trigger that repeats hourly, and specify the Python interpreter as the program to run, with the path to the .py file as its argument.

After confirming, the task will run at the scheduled times, or you can start it manually.
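The same task can also be registered from the command line with schtasks, the command-line front end to Task Scheduler. The sketch below only builds and prints the command. The task name and both paths are placeholders, and the paths are assumed to contain no spaces (paths with spaces need extra quoting inside the /TR value).

```python
import subprocess


def build_schtasks_command(task_name, python_exe, script_path):
    """Build the argument list for an hourly scheduled task."""
    return [
        'schtasks', '/Create',
        '/SC', 'HOURLY',                       # run once every hour
        '/TN', task_name,                      # task name shown in Task Scheduler
        '/TR', f'{python_exe} {script_path}',  # program and argument to run
    ]


def register_task(command):
    """Register the task; call this on Windows only."""
    subprocess.run(command, check=True)


# Placeholder name and paths, for illustration only
command = build_schtasks_command(
    'WeiboHotSearch',
    r'C:\Python\python.exe',
    r'C:\scripts\weibo_hot_search.py',
)
print(command)
```

Calling register_task(command) on a Windows machine would create the task. It is left uncalled here so the example has no side effects.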
