Build a Real‑Time Weibo Hot Search Scraper with Python

This tutorial shows how to use Python's requests, lxml, and pandas libraries to crawl Weibo's real‑time hot search list, extract ranking, popularity, titles, and detailed comments, and schedule the script to run automatically each hour.

Python Crawling & Data Mining

Feb 18, 2022

Page Analysis

Hot Search Page

The hot list homepage (https://s.weibo.com/top/summary?cate=realtimehot) displays fifty items. For each item we need to capture the rank, heat, title, and detail‑page link. After logging in, open the developer tools (F12), refresh the page (Ctrl+R), and copy the Cookie and User‑Agent values from the request headers shown in the Network panel.

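Use Google Chrome's developer tools to obtain the XPath expressions for the required elements: right‑click an element in the Elements panel and choose Copy > Copy XPath.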
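
To see how such XPath expressions behave before pointing them at the live page, they can be tried against a small hand‑written fragment. The sketch below assumes a simplified table that mirrors the column layout used later in this tutorial (rank in the first cell, link and heat in the second). The sample HTML and the helper name parse_rows are illustrative only and are not Weibo's actual markup.

```python
from lxml import etree

# Hand-written sample; the real page's markup may differ.
SAMPLE_HTML = """
<div id="pl_top_realtimehot">
  <table>
    <tbody>
      <tr><td></td><td><a href="/weibo?q=pinned">Pinned topic</a></td></tr>
      <tr><td>1</td><td><a href="/weibo?q=first">First topic</a><span>123456</span></td></tr>
    </tbody>
  </table>
</div>
"""

def parse_rows(html):
    """Return one dict per table row, using XPath relative to each <tr>."""
    tree = etree.HTML(html)
    rows = []
    for tr in tree.xpath('//*[@id="pl_top_realtimehot"]/table/tbody/tr'):
        # xpath() returns a list; an empty list means the node is absent.
        rank = tr.xpath('./td[1]/text()')
        heat = tr.xpath('./td[2]/span/text()')
        rows.append({
            'rank': rank[0].strip() if rank else 'Pinned',
            'heat': heat[0].strip() if heat else 'Pinned',
            'title': tr.xpath('./td[2]/a/text()')[0],
            'href': tr.xpath('./td[2]/a/@href')[0],
        })
    return rows

rows = parse_rows(SAMPLE_HTML)
```

Checking for an empty result list before indexing avoids relying on an exception to detect a row without rank or heat.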

Detail Page

From the detail page we extract comment time, user name, forward count, comment count, like count, and comment content. The extraction method is similar to the hot‑search page.

Scraping Code

Import the required modules.

import requests
from time import sleep
import pandas as pd
import numpy as np
from lxml import etree
import re

Define two global variables: headers holds the request headers, and all_df is the DataFrame that stores the collected data.

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.54 Safari/537.36',
    'Cookie': '''your Cookie here'''
}
all_df = pd.DataFrame(columns=['Rank','Heat','Title','Comment Time','User Name','Forward Count','Comment Count','Like Count','Comment'])
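
Pasting a login cookie straight into the script makes it easy to leak when the file is shared. One alternative is to read it from an environment variable. The variable name WEIBO_COOKIE and the helper build_headers below are assumptions made for this sketch, not part of the original script.

```python
import os

USER_AGENT = (
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
    '(KHTML, like Gecko) Chrome/94.0.4606.54 Safari/537.36'
)

def build_headers(env=None):
    """Build request headers, reading the cookie from WEIBO_COOKIE (hypothetical name)."""
    # `env` defaults to the process environment; a dict can be passed for testing.
    env = os.environ if env is None else env
    cookie = env.get('WEIBO_COOKIE', '').strip()
    if not cookie:
        raise RuntimeError('Set the WEIBO_COOKIE environment variable first')
    return {'User-Agent': USER_AGENT, 'Cookie': cookie}
```

Calling headers = build_headers() at the top of the script would then take the place of the hard‑coded dictionary.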

The function get_hot_list(url) sends a request to the hot‑search page, parses the table rows, extracts the detail‑page URL, title, rank, and heat, then calls get_detail_page for each item.

def get_hot_list(url):
    # Download the hot-search page and parse it into an element tree.
    page_text = requests.get(url=url, headers=headers).text
    tree = etree.HTML(page_text)
    # Each <tr> of the table is one hot-search entry.
    tr_list = tree.xpath('//*[@id="pl_top_realtimehot"]/table/tbody/tr')
    for tr in tr_list:
        parse_url = tr.xpath('./td[2]/a/@href')[0]
        detail_url = 'https://s.weibo.com' + parse_url
        title = tr.xpath('./td[2]/a/text()')[0]
        try:
            rank = tr.xpath('./td[1]/text()')[0]
            hot = tr.xpath('./td[2]/span/text()')[0]
        except IndexError:
            # The pinned entry has no rank or heat value.
            rank = 'Pinned'
            hot = 'Pinned'
        get_detail_page(detail_url, title, rank, hot)
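
The call to requests.get above has no timeout and does not check the status code, so a stalled connection could hang an unattended run. A hedged sketch of a small wrapper follows. The name fetch_html, its parameters, and the retry policy are assumptions rather than part of the original script; the get argument defaults to requests.get and exists only so the function can be exercised without network access.

```python
import time
import requests

def fetch_html(url, headers, retries=3, timeout=10, pause=2.0, get=requests.get):
    """Return the response body, retrying on network errors and HTTP error codes."""
    attempts = max(1, retries)  # always try at least once
    last_error = None
    for attempt in range(attempts):
        try:
            response = get(url, headers=headers, timeout=timeout)
            response.raise_for_status()  # treat 4xx/5xx responses as failures
            return response.text
        except requests.RequestException as exc:
            last_error = exc
            if attempt < attempts - 1:
                time.sleep(pause)  # brief pause before the next attempt
    raise last_error
```

With this in place, page_text = fetch_html(url, headers) could stand in for the direct requests.get(...).text call.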

The function get_detail_page(detail_url, title, rank, hot) fetches the detail page, extracts up to three top comments, and appends the results to all_df.

def get_detail_page(detail_url, title, rank, hot):
    global all_df
    try:
        page_text = requests.get(url=detail_url, headers=headers).text
    except requests.RequestException:
        # Skip this topic if the detail page cannot be fetched.
        return None
    tree = etree.HTML(page_text)
    result_df = pd.DataFrame(columns=all_df.columns)
    # Common XPath prefix of the i-th entry in the feed list.
    card = '//*[@id="pl_feedlist_index"]/div[4]/div[{}]/div[2]'
    for i in range(1, 4):
        base = card.format(i)
        try:
            comment_time = tree.xpath(f'{base}/div[1]/div[2]/p[1]/a/text()')[0]
            comment_time = re.sub(r'\s', '', comment_time)  # raw string for the regex
            user_name = tree.xpath(f'{base}/div[1]/div[2]/p[2]/@nick-name')[0]
            forward_count = tree.xpath(f'{base}/div[2]/ul/li[1]/a/text()')[1].strip()
            comment_count = tree.xpath(f'{base}/div[2]/ul/li[2]/a/text()')[0].strip()
            like_count = tree.xpath(f'{base}/div[2]/ul/li[3]/a/button/span[2]/text()')[0]
            comment = tree.xpath(f'{base}/div[1]/div[2]/p[2]//text()')
            comment = ' '.join(comment).strip()
            result_df.loc[len(result_df), :] = [rank, hot, title, comment_time, user_name, forward_count, comment_count, like_count, comment]
        except Exception as e:
            # A missing field skips only this entry.
            print(e)
            continue
    if result_df.empty:
        return None
    # DataFrame.append was removed in pandas 2.0; pd.concat replaces it.
    all_df = result_df if all_df.empty else pd.concat([all_df, result_df], ignore_index=True)
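
Growing a DataFrame one row at a time copies data on every step. A common alternative is to gather plain dictionaries in a list and build the DataFrame once at the end. The sketch below assumes the same nine columns as all_df; the records list, the add_comment helper, and the sample values are hypothetical and only illustrate the pattern.

```python
import pandas as pd

COLUMNS = ['Rank', 'Heat', 'Title', 'Comment Time', 'User Name',
           'Forward Count', 'Comment Count', 'Like Count', 'Comment']

records = []  # one dict per scraped entry

def add_comment(*values):
    """Pair the values with COLUMNS and store them as one record."""
    if len(values) != len(COLUMNS):
        raise ValueError(f'expected {len(COLUMNS)} values, got {len(values)}')
    records.append(dict(zip(COLUMNS, values)))

# Made-up sample row, for illustration only.
add_comment('1', '123456', 'Sample topic', '10 minutes ago', 'sample_user',
            '12', '34', '56', 'Sample comment text')

# Build the DataFrame once, after all rows have been collected.
all_df = pd.DataFrame(records, columns=COLUMNS)
```

This also removes the need for a global DataFrame that every call has to reassign.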

Finally, add the entry point below, which fetches the hot list and writes the results to an Excel file. To run it every hour, set the Cookie and the output file path, export the notebook as a .py file, and create an hourly task in Windows Task Scheduler that executes the script.

if __name__ == '__main__':
    url = 'https://s.weibo.com/top/summary?cate=realtimehot'
    get_hot_list(url)
    all_df.to_excel('WeiboHotSearch.xlsx', index=False)
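
Because the file name is fixed, each hourly run would overwrite the previous export. One option is to put the run time in the file name. The output_path helper and the naming pattern below are assumptions for illustration.

```python
from datetime import datetime
from pathlib import Path

def output_path(folder='.', now=None):
    """Return a path such as WeiboHotSearch_20220218_1500.xlsx inside `folder`."""
    # `now` can be passed explicitly; otherwise the current local time is used.
    now = datetime.now() if now is None else now
    return Path(folder) / f"WeiboHotSearch_{now:%Y%m%d_%H%M}.xlsx"
```

The export line would then become all_df.to_excel(output_path(), index=False). If a folder other than the current one is passed, it has to exist before the script runs.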

The try/except blocks in get_detail_page skip a detail page that fails to load and any entry whose fields cannot be extracted, so one bad item does not abort the whole run.
