Scrape, Clean, and Visualize Tencent Video Comments with Python – A Full Guide

This article walks through using Python to crawl Tencent Video's "Offer" season 2 comments, merge and clean the CSV data, perform exploratory analysis, generate visualizations and word clouds, and apply Baidu's open‑source NLP model for sentiment scoring, providing complete code snippets for each step.

Python Crawling & Data Mining

Dec 2, 2020

Preface

Hello, I am J‑bro. This tutorial demonstrates how to collect over 130,000 Danmu (bullet‑screen comments) from the second season of the variety show "Offer", clean the data, visualize it, and conduct sentiment analysis.

Data Acquisition

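The second season of "Offer" is streamed exclusively on Tencent Video. Four episodes (including the interview episode) were crawled separately. The spider below fetches the interview episode; for the other episodes, change target_id, vid, the video length in the loop, and the output filename.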

# -*- coding: utf-8 -*-
#@Time : 2020/11/30 21:35 
#@Author : 公众号 菜J学Python
#@File : tengxun_danmu.py

import requests
import json
import time
import pandas as pd

target_id = "6130942571%26"  # interview episode target_id
vid = "%3Dt0034o74jpr"      # interview episode vid
df = pd.DataFrame()
for page in range(15, 3214, 30):  # one request per 30-second window; the video is 3214 seconds long
    headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36'}
    # target_id and vid already carry the URL-encoded "&" (%26) and "=" (%3D)
    url = 'https://mfm.video.qq.com/danmu?otype=json&timestamp={0}&target_id={1}vid{2}&count=80'.format(page, target_id, vid)
    print("Fetching danmu at offset " + str(page) + "s")
    html = requests.get(url, headers=headers)
    bs = json.loads(html.text, strict=False)  # strict=False tolerates control characters inside strings
    time.sleep(1)  # pause between requests
    for i in bs['comments']:
        content = i['content']      # Danmu text
        upcount = i['upcount']      # likes
        user_degree = i['uservip_degree']  # VIP level
        timepoint = i['timepoint']  # timestamp
        comment_id = i['commentid'] # Danmu ID
        cache = pd.DataFrame({'弹幕':[content],'会员等级':[user_degree],'发布时间':[timepoint],'弹幕点赞':[upcount],'弹幕id':[comment_id]})
        df = pd.concat([df, cache])

df.to_csv('面试篇.csv', encoding='utf-8')
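
The spider grows the DataFrame with pd.concat on every comment, which gets slow as rows accumulate. A minimal sketch of an alternative, assuming the response has the same fields the spider reads: collect plain dicts per page and build the DataFrame once at the end. The helper name parse_danmu and the sample payload are hypothetical, used only for illustration.

```python
import json

import pandas as pd

# Field names copied from the spider above; the real payload may carry more.
FIELDS = {
    "content": "弹幕",
    "uservip_degree": "会员等级",
    "timepoint": "发布时间",
    "upcount": "弹幕点赞",
    "commentid": "弹幕id",
}


def parse_danmu(raw_text):
    """Turn one JSON response body into a list of row dicts."""
    payload = json.loads(raw_text, strict=False)
    rows = []
    for item in payload.get("comments", []):
        rows.append({column: item.get(key) for key, column in FIELDS.items()})
    return rows


# Made-up response body, only to show the shape of the result.
sample = ('{"comments": [{"content": "加油", "uservip_degree": 0, '
          '"timepoint": 15, "upcount": 3, "commentid": "1"}]}')
all_rows = []
all_rows.extend(parse_danmu(sample))   # in the spider, call this once per page
df_sample = pd.DataFrame(all_rows)     # build the DataFrame once, after the loop
```

In the spider, parse_danmu(html.text) would replace the inner loop, and the resulting frame would be written with to_csv as before.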

After crawling each episode, place the four CSV files into a single folder.

Data Cleaning

Merge Data

Combine the four CSV files using pandas.concat.

import pandas as pd
import numpy as np
df1 = pd.read_csv('/菜J学Python/弹幕/腾讯/令人心动的offer/面试篇.csv')
df1["期数"] = "面试篇"

df2 = pd.read_csv('/菜J学Python/弹幕/腾讯/令人心动的offer/第1期.csv')
df2["期数"] = "第1期"

df3 = pd.read_csv('/菜J学Python/弹幕/腾讯/令人心动的offer/第2期.csv')
df3["期数"] = "第2期"

df4 = pd.read_csv('/菜J学Python/弹幕/腾讯/令人心动的offer/第3期.csv')
df4["期数"] = "第3期"

df = pd.concat([df1, df2, df3, df4])
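
If more episodes are added later, the four read_csv blocks can be folded into a loop. The sketch below is one way to do it, assuming every CSV in the folder is an episode file and that the file name (without extension) is an acceptable episode label. The function name merge_episode_csvs and the demo file names are hypothetical.

```python
from pathlib import Path
import tempfile

import pandas as pd


def merge_episode_csvs(folder):
    """Read every CSV in `folder` and tag its rows with the file stem."""
    frames = []
    for path in sorted(Path(folder).glob("*.csv")):
        frame = pd.read_csv(path)
        frame["期数"] = path.stem   # "ep1.csv" -> "ep1"
        frames.append(frame)
    return pd.concat(frames, ignore_index=True)


# Self-contained demo with two made-up files in a temporary folder.
with tempfile.TemporaryDirectory() as tmp:
    pd.DataFrame({"弹幕": ["a", "b"]}).to_csv(Path(tmp) / "ep1.csv", index=False)
    pd.DataFrame({"弹幕": ["c"]}).to_csv(Path(tmp) / "ep2.csv", index=False)
    merged = merge_episode_csvs(tmp)
```

Passing ignore_index=True gives the merged frame a fresh 0..n-1 index instead of repeating each file's own row numbers.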

Rename Fields

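# The keys must match the headers actually present in the CSV files. The spider above
# writes 弹幕, 会员等级, 发布时间, 弹幕点赞 and 弹幕id, so adjust the mapping if your files differ.
df = df.rename(columns={'用户名':'用户昵称','内容':'弹幕内容','评论时间点':'发送时间','评论点赞':'弹幕点赞','期数':'所属期数'})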

Filter Fields

# Select columns needed for analysis
df = df[["用户昵称","弹幕内容","会员等级","发送时间","弹幕点赞","所属期数"]]

Missing Value Handling

df["用户昵称"] = df["用户昵称"].fillna("无名氏")

Time Conversion

def time_change(seconds):
    # Convert an offset in seconds into an H:MM:SS string
    m, s = divmod(seconds, 60)
    h, m = divmod(m, 60)
    ss_time = "%d:%02d:%02d" % (h, m, s)
    return ss_time

time_change(seconds=8888)  # returns '2:28:08'

df["发送时间"] = df["发送时间"].apply(time_change)

df['发送时间'] = pd.to_datetime(df['发送时间'])
df['发送时间'] = df['发送时间'].apply(lambda x: x.strftime('%H:%M:%S'))
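
For a column with over 100,000 rows, the same formatting can also be done on the whole Series at once. This is a sketch only, assuming the offsets are whole, non-negative seconds; seconds_to_hms is a hypothetical helper name.

```python
import pandas as pd


def seconds_to_hms(seconds):
    """Format a Series of second offsets as H:MM:SS strings."""
    seconds = pd.to_numeric(seconds).astype(int)
    hours = seconds // 3600
    minutes = seconds % 3600 // 60
    secs = seconds % 60
    return (hours.astype(str) + ":"
            + minutes.astype(str).str.zfill(2) + ":"
            + secs.astype(str).str.zfill(2))


offsets = pd.Series([15, 75, 8888])
formatted = seconds_to_hms(offsets)   # "0:00:15", "0:01:15", "2:28:08"
```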

Content Processing

# Convert object type to string
df["弹幕内容"] = df["弹幕内容"].astype("str")

# Mechanical compression to remove repeated substrings
def yasuo(st):
    for i in range(1, int(len(st)/2)+1):
        for j in range(len(st)):
            if st[j:j+i] == st[j+i:j+2*i]:
                k = j + i
                while st[k:k+i] == st[k+i:k+2*i] and k < len(st):
                    k = k + i
                st = st[:j] + st[k:]
    return st

yasuo(st="菜J学Python真的真的真的很菜很菜")  # returns '菜J学Python真的很菜'

# Apply compression
df["弹幕内容"] = df["弹幕内容"].apply(yasuo)

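# Keep the first run of Chinese characters in each danmu; rows with none
# (pure emoji, digits, or Latin text) become NaN and are dropped
df['弹幕内容'] = df['弹幕内容'].str.extract(r"([\u4e00-\u9fa5]+)", expand=False)
df = df.dropna()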
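
Note that str.extract returns only the first match, so a danmu such as "丁辉加油!!!冲冲冲" is cut down to "丁辉加油". If the text after the punctuation matters, one possible variant joins every Chinese run instead. This is a sketch rather than the author's method, and keep_chinese is a hypothetical helper.

```python
import re

CHINESE_RUN = re.compile(r"[\u4e00-\u9fa5]+")


def keep_chinese(text):
    """Join every run of Chinese characters; return None if there is none."""
    runs = CHINESE_RUN.findall(text)
    return "".join(runs) if runs else None


cleaned = [keep_chinese(t) for t in ["丁辉加油!!!冲冲冲", "hhh666", "律师"]]
```

On the DataFrame this would be applied with df["弹幕内容"].apply(keep_chinese), followed by the same dropna step.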

Data Analysis

Comment Count per Episode

The crawl covers four episodes of season 2. Episode 1 received the most danmu (42,422), while the interview episode received the fewest (17,332).
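
Per-episode totals like these can be obtained with a single value_counts call on the episode column. The toy frame below is made-up data; only the column name 所属期数 comes from the cleaning step.

```python
import pandas as pd

# Made-up rows standing in for the cleaned danmu table.
toy = pd.DataFrame({"所属期数": ["第1期", "第1期", "面试篇", "第2期", "第1期"]})

per_episode = toy["所属期数"].value_counts()   # sorted from most to fewest
busiest = per_episode.idxmax()
```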

Top Commenters

User "想太多de猫" posted 227 Danmu across episodes, far ahead of others.
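
A ranking of this kind could be computed roughly as follows. The sketch assumes the placeholder nickname 无名氏 (filled in during cleaning) should be excluded so that anonymous users are not counted as one person. The data and the output column names danmu_count and episodes are hypothetical.

```python
import pandas as pd

# Made-up rows; column names follow the cleaned table.
toy = pd.DataFrame({
    "用户昵称": ["A", "B", "A", "无名氏", "A", "B"],
    "所属期数": ["第1期", "第1期", "第2期", "第1期", "第2期", "第1期"],
})

named = toy[toy["用户昵称"] != "无名氏"]
top_users = (
    named.groupby("用户昵称")
    .agg(danmu_count=("所属期数", "count"), episodes=("所属期数", "nunique"))
    .sort_values("danmu_count", ascending=False)
)
```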

Membership Level Distribution

74.31% of viewers are non‑members, 5.6% are level 3 members, and 5.39% are level 1 members.
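
Percentages like these come from normalized counts of the 会员等级 column. A small sketch on made-up levels, assuming level 0 denotes a non-member. Computed this way, the shares are over danmu rows; to get shares over viewers, the table would first need one row per user.

```python
import pandas as pd

# Made-up VIP levels, one per danmu.
levels = pd.Series([0, 0, 0, 1, 3, 0, 0, 3], name="会员等级")

share = (levels.value_counts(normalize=True) * 100).round(2)
```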

Word Cloud of Comments

High‑frequency words include "丁辉", "律师", "喜欢", "加油", "徐律", "干饭", "撒老师".
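
The word cloud is driven by word frequencies. A minimal sketch of the counting step follows, with the tokenizer passed in as a parameter. The function name top_words, the stopword set, and the whitespace-split demo are assumptions for illustration; real danmu have no spaces and need a Chinese tokenizer.

```python
from collections import Counter


def top_words(comments, tokenize, stopwords=frozenset(), n=10):
    """Count tokens across comments, skipping stopwords and single characters."""
    counts = Counter()
    for text in comments:
        for word in tokenize(text):
            if len(word) > 1 and word not in stopwords:
                counts[word] += 1
    return counts.most_common(n)


# Whitespace splitting stands in for a real Chinese tokenizer in this demo.
demo = ["丁辉 加油", "丁辉 律师 加油", "喜欢 丁辉 的"]
ranking = top_words(demo, tokenize=str.split, stopwords={"的"}, n=2)
```

With the jieba package installed, jieba.lcut could be passed as tokenize, and the resulting counts (as a dict) could be handed to WordCloud.generate_from_frequencies from the wordcloud package, together with a font_path that supports Chinese glyphs.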

Intern Mentions

Intern "丁辉" was mentioned 9,298 times, far more than the others; "詹秋怡" 2,455 times; "刘煜成" only 526 times.
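
Mention counts can be approximated by checking which danmu contain each name. The sketch below counts danmu, not occurrences, and matches full names only, so nicknames would be missed. The helper count_mentions and the demo comments are hypothetical.

```python
def count_mentions(comments, names):
    """Return {name: number of comments that contain that name}."""
    return {name: sum(name in text for text in comments) for name in names}


demo = ["丁辉加油", "喜欢丁辉和詹秋怡", "刘煜成不错", "律师好难"]
mentions = count_mentions(demo, ["丁辉", "詹秋怡", "刘煜成"])
```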

Sentiment Analysis

Using Baidu's open‑source NLP model (senta_bilstm), the overall sentiment score of the second season exceeds 0.5, indicating a generally positive audience attitude. Higher‑level members tend to watch longer, and sentiment peaks at the start and end of each episode.

import paddlehub as hub
# Use Baidu's pretrained sentiment model
senta = hub.Module(name="senta_bilstm")
texts = df['弹幕内容'].tolist()
input_data = {'text': texts}
res = senta.sentiment_classify(data=input_data)
df['情感分值'] = [x['positive_probs'] for x in res]
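# Average the sentiment score in 15-minute bins. resample needs a DatetimeIndex,
# so the H:MM:SS strings are parsed back into datetimes first.
df.index = pd.to_datetime(df['发送时间'], format='%H:%M:%S')
data = df['情感分值'].resample('15min').mean().reset_index()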
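
The same 15-minute averaging can be done without a datetime index if the raw offset in seconds is still available. A sketch on made-up numbers, assuming a seconds column was kept before the time conversion; the column names seconds, score, and bucket are hypothetical.

```python
import pandas as pd

# Made-up offsets (seconds into the episode) and sentiment scores.
scored = pd.DataFrame({
    "seconds": [10, 600, 899, 900, 1500, 2000],
    "score": [0.9, 0.7, 0.8, 0.4, 0.6, 0.5],
})

scored["bucket"] = scored["seconds"] // 900   # 900 s = 15 minutes
trend = scored.groupby("bucket")["score"].mean()
```

Bucket 0 covers 0:00:00 to 0:14:59, bucket 1 the next quarter hour, and so on. Grouping by 所属期数 as well would keep the episodes separate.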
