Automate Weibo Hot Search Scraping and Daily Email Reports with Python

This tutorial shows how to use Python to fetch the Weibo hot search list, clean and organize the data with pandas, send the results via QQ email, and schedule the whole process to run automatically each evening.

Python Crawling & Data Mining

Sep 6, 2021

One, Grab Hot Search Data

Open the Weibo hot search page at https://s.weibo.com/top/summary. The page lists 50 hot topics, each with a title, a search volume, a category tag, and a link. Download the HTML with the requests library, then use the re module to extract the four fields with these patterns: <a href="(.*?)" target="_blank">.*?</a> for the link, <a href=".*?" target="_blank">(.*?)</a> for the title, <span>(.*?)</span> for the search volume, and <td class="td-03">(.*?)</td> for the category tag.

Inspect the page source (F12 → Elements) to confirm the HTML structure and locate the patterns.

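import re
import requests
url = 'https://s.weibo.com/top/summary?cate=realtimehot'  # Weibo hot search page
ret = requests.get(url, timeout=10)  # timeout so a stalled request cannot hang the script
test = ret.text  # raw HTML of the page
# One pattern per field; each captures the text matched inside the parentheses.
u_href = '<a href="(.*?)" target="_blank">.*?</a>'
u_title = '<a href=".*?" target="_blank">(.*?)</a>'
u_amount = '<span>(.*?)</span>'
u_category = '<td class="td-03">(.*?)</td>'
title = re.findall(u_title, test)        # topic titles
amount = re.findall(u_amount, test)      # search volumes (as strings)
category = re.findall(u_category, test)  # raw category cells (may be empty)
href = re.findall(u_href, test)          # relative links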
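
To check what each pattern captures without touching the network, the sketch below runs the same four regular expressions against a small hand-written fragment. The fragment and the helper name extract_fields are assumptions made for illustration: the markup is modeled on the patterns above, not copied from Weibo, whose real page may differ or change.

```python
import re

# Hypothetical fragment shaped like the rows the tutorial's patterns expect.
SAMPLE_HTML = (
    '<tr><td class="td-02">'
    '<a href="/weibo?q=topic-a" target="_blank">Topic A</a>'
    '<span>1234567</span></td>'
    '<td class="td-03"><i class="icon-txt">热</i></td></tr>'
    '<tr><td class="td-02">'
    '<a href="/weibo?q=topic-b" target="_blank">Topic B</a>'
    '<span>7654</span></td>'
    '<td class="td-03"></td></tr>'
)


def extract_fields(html):
    """Apply the four patterns and return the raw, uncleaned lists."""
    return {
        'href': re.findall('<a href="(.*?)" target="_blank">.*?</a>', html),
        'title': re.findall('<a href=".*?" target="_blank">(.*?)</a>', html),
        'amount': re.findall('<span>(.*?)</span>', html),
        'category': re.findall('<td class="td-03">(.*?)</td>', html),
    }


fields = extract_fields(SAMPLE_HTML)
for name, values in fields.items():
    print(name, values)
```

Note that the category list still holds raw cell contents at this point: one cell contains an <i> tag and the other is an empty string, which is why a separate cleaning step is needed.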

Two, Data Cleaning

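The raw lists contain extra elements: the first title, which has no rank; empty strings; and recommendation entries tagged "荐". Clean them by slicing off the extra entries, removing blanks, and prefixing each relative link with the site address to form a full URL. Then build a pandas DataFrame, group the rows by category tag in the order "爆", "沸", "热", "新", "空" (where "空", meaning "empty", is a placeholder for entries with no tag), and save the result as a CSV file.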

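import pandas as pd
from urllib.parse import urljoin
# Drop the unranked first entry and the last two matches.
title = title[1:-2]
href = href[1:-2]
# Turn relative links into full URLs.
href = [urljoin('https://s.weibo.com/', h) for h in href]
# Remove empty search-volume entries.
amount = [a for a in amount if a != '']
# Reduce each category cell to its tag text; cells without a tag become '空' ("empty").
for i in range(len(category)):
    tags = re.findall('<i class=".*?">(.*?)</i>', category[i])
    category[i] = tags[0] if tags and tags[0] != '' else '空'
# Drop the first cell and the recommendation ('荐') entries.
category = category[1:]
category = [c for c in category if c != '荐']
df = pd.DataFrame()
df['关键词'] = title  # column name means "keyword"
df['amount'] = amount
df['category'] = category
df['href'] = href
# Sort by search volume, largest first. The scraped values are strings, so compare
# them as numbers (assumes each value contains a run of digits; needs pandas >= 1.1).
df = df.sort_values(
    'amount',
    key=lambda s: pd.to_numeric(s.str.extract(r'(\d+)', expand=False), errors='coerce'),
    ascending=False,
)
# Group rows by category tag in a fixed order, keeping the volume order inside each group.
order = ['爆', '沸', '热', '新', '空']
df = pd.concat([df[df['category'] == tag] for tag in order], ignore_index=True)
df.to_csv('微博热搜.csv', encoding='gbk')  # write the table as a GBK-encoded CSV file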
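
The ordering this step aims for (group by tag in a fixed order, then by search volume) can also be expressed as a single sort key. The sketch below uses plain Python on made-up rows rather than pandas, so it is an illustration of the idea and not the tutorial's method. The helper name order_rows and the sample data are hypothetical, and it assumes every amount is a string of digits.

```python
# Fixed display order of the category tags used in the tutorial.
CATEGORY_ORDER = ['爆', '沸', '热', '新', '空']


def order_rows(rows):
    """Sort rows by category rank, then by search volume, largest first."""
    rank = {tag: i for i, tag in enumerate(CATEGORY_ORDER)}
    return sorted(rows, key=lambda r: (rank[r['category']], -int(r['amount'])))


# Made-up rows standing in for the cleaned scrape results.
rows = [
    {'title': 'Topic A', 'amount': '98000', 'category': '热'},
    {'title': 'Topic B', 'amount': '1200000', 'category': '沸'},
    {'title': 'Topic C', 'amount': '450000', 'category': '热'},
    {'title': 'Topic D', 'amount': '30000', 'category': '空'},
]
ordered_titles = [r['title'] for r in order_rows(rows)]
print(ordered_titles)
```

The int conversion matters here: compared as strings, '98000' sorts after '450000' because '9' is greater than '4'.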

Three, Send Email and Schedule

Obtain an SMTP authorization code for your QQ mailbox (Settings → POP3/SMTP → Enable); the script logs in with this code instead of the account password. Use smtplib and the email.mime classes to compose an HTML email, attach the CSV file, and send it.

import smtplib
from email.mime.application import MIMEApplication
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText
number = 'your QQ email address'  # sender
smtp = 'SMTP authorization code for that mailbox'
to = "recipient's email address"  # does not have to be a QQ mailbox

mer = MIMEMultipart()
# HTML body: the first row as the top entry, then the next five rows as links.
# The Chinese text reads "Weibo hot search list", "The hottest entry is",
# and "Top five hot searches".
head = '''
<p>微博热搜榜信息</p>
<p>最热门词条为</p>
<p><a href="{}">{}</a></p>
<p>排名前五的热搜</p>
<p><a href="{}">{}</a></p>
<p><a href="{}">{}</a></p>
<p><a href="{}">{}</a></p>
<p><a href="{}">{}</a></p>
<p><a href="{}">{}</a></p>
'''.format(df.iloc[0,:]['href'], df.iloc[0,:]['关键词'],
           df.iloc[1,:]['href'], df.iloc[1,:]['关键词'],
           df.iloc[2,:]['href'], df.iloc[2,:]['关键词'],
           df.iloc[3,:]['href'], df.iloc[3,:]['关键词'],
           df.iloc[4,:]['href'], df.iloc[4,:]['关键词'],
           df.iloc[5,:]['href'], df.iloc[5,:]['关键词'])
mer.attach(MIMEText(head, 'html', 'utf-8'))
# Attach the CSV as a binary file; the with-block closes the file handle.
with open('微博热搜.csv', 'rb') as f:
    fujian = MIMEApplication(f.read(), _subtype='octet-stream')
fujian.add_header('Content-Disposition', 'attachment', filename=('utf-8', '', '微博热搜.csv'))
mer.attach(fujian)
mer['Subject'] = '每日微博热搜榜单'  # "Daily Weibo hot search list"
mer['From'] = number
mer['To'] = to
# Connect to QQ Mail's SMTP server over SSL and log in with the authorization code.
s = smtplib.SMTP_SSL('smtp.qq.com', 465)
s.login(number, smtp)
s.send_message(mer)
s.quit()
print('Sent successfully')
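
The twelve positional arguments passed to format are easy to get out of step with the template. One alternative is to generate the links in a loop, as in the sketch below, which builds the message but does not send it. The function name build_report, the English wording, the example addresses, and the sample rows are all hypothetical.

```python
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText
from html import escape


def build_report(rows, sender, recipient):
    """Build (but do not send) an HTML report from (title, href) pairs."""
    links = ''.join(
        '<p><a href="{}">{}</a></p>'.format(escape(href), escape(title))
        for title, href in rows
    )
    msg = MIMEMultipart()
    msg['Subject'] = 'Daily Weibo hot search report'
    msg['From'] = sender
    msg['To'] = recipient
    msg.attach(MIMEText('<p>Weibo hot searches</p>' + links, 'html', 'utf-8'))
    return msg


# Made-up rows standing in for the first rows of the DataFrame.
sample_rows = [
    ('Topic A', 'https://s.weibo.com/weibo?q=a'),
    ('Topic B', 'https://s.weibo.com/weibo?q=b'),
]
message = build_report(sample_rows, 'sender@example.com', 'receiver@example.com')

# Decode the HTML part back to text to inspect what would be sent.
body = message.get_payload()[0].get_payload(decode=True).decode('utf-8')
print(body)
```

Because the function only returns the message, the body can be inspected locally before it is handed to the same SMTP_SSL login and send_message calls used in this section.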

Use the third-party schedule library to run the job every evening at 18:00. The script has to stay running for the schedule to fire.

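import schedule
import time

def email():
    # Put the scraping, cleaning, and email-sending code from the sections above here,
    # so each run fetches fresh data before sending.
    pass

schedule.every().day.at("18:00").do(email)  # register the daily 18:00 job
while True:
    schedule.run_pending()  # run the job if it is due
    time.sleep(5)           # check again in five seconds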
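
If installing the schedule package is not an option, a similar daily trigger can be approximated with the standard library by sleeping until the next 18:00. This is a swapped-in technique, not what the tutorial uses, and seconds_until is a hypothetical helper written for this sketch.

```python
from datetime import datetime, timedelta


def seconds_until(hour, minute, now):
    """Return the seconds from `now` to the next occurrence of hour:minute."""
    target = now.replace(hour=hour, minute=minute, second=0, microsecond=0)
    if target <= now:
        target += timedelta(days=1)  # today's slot has passed; use tomorrow's
    return (target - now).total_seconds()


# Fixed timestamps so the results are reproducible.
before = seconds_until(18, 0, datetime(2021, 9, 6, 17, 30))  # half an hour early
after = seconds_until(18, 0, datetime(2021, 9, 6, 18, 30))   # half an hour late
print(before, after)
```

A long-running script could then loop over time.sleep(seconds_until(18, 0, datetime.now())) followed by a call to the job function.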

With these steps, the script automatically crawls Weibo hot topics, cleans the data, emails a formatted report, and repeats daily.
