Scraping iQiyi Bullet Comments and Generating a Word Cloud with Python

This article demonstrates how to scrape bullet comments from iQiyi for the first episode of a popular mystery series, decode the binary files, extract the text, and use Python's jieba and wordcloud libraries to clean the data and generate a visual word cloud of audience sentiments.

Python Programming Learning Circle

Jun 28, 2020

The popular mystery drama "The Hidden Corner" (Douban rating 9.0) is the subject of this analysis: the author crawled the bullet comments of its first episode from iQiyi and built a word cloud to visualize audience feedback.

The article is divided into two parts: (1) crawling the bullet comments from iQiyi, and (2) processing the comments and generating a word cloud.

iQiyi bullet files are harder to crawl because the downloaded files are compressed binary data that look garbled when opened directly. To find them, the author opens the browser's Network panel, searches for "bullet", and locates the binary files. Each episode loads a new bullet file for every 5 minutes of playback.

The URL pattern for bullet files is:

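https://cmts.iqiyi.com/bullet/{tvid[-4:-2]}/{tvid[-2:]}/{tvid}_300_{x}.z

Here tvid is the video ID. Judging by the URL used in the scraping code, the two path segments are the fourth- and third-from-last digits and the last two digits of the tvid (92 and 00 for tvid 9000000005439200). x is the file index, running from 1 to the ceiling of the total duration divided by 300 seconds (5‑minute intervals). For the first episode (77 minutes, or 4620 seconds) this results in 16 files.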
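
The file count and URL list can be sketched as below. This is a minimal illustration, not the author's code: the helper names segment_count and bullet_urls are hypothetical, and the slicing of the tvid is inferred from the single URL used in this article.

```python
import math

SEGMENT_SECONDS = 300  # each bullet file covers 5 minutes of playback


def segment_count(duration_seconds):
    """Number of bullet files needed for a video of the given length."""
    return math.ceil(duration_seconds / SEGMENT_SECONDS)


def bullet_urls(tvid, duration_seconds):
    """Build every bullet-file URL for one episode (hypothetical helper)."""
    tvid = str(tvid)
    pattern = "https://cmts.iqiyi.com/bullet/{}/{}/{}_300_{}.z"
    return [
        # Path segments: 4th/3rd-from-last digits, then the last two digits
        pattern.format(tvid[-4:-2], tvid[-2:], tvid, x)
        for x in range(1, segment_count(duration_seconds) + 1)
    ]


urls = bullet_urls("9000000005439200", 77 * 60)
print(len(urls))
print(urls[0])
```

For a 77-minute episode this yields 16 URLs, the first of which matches the one hard-coded in the scraping code.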

Scraping code (Python):

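import zlib
import requests

# Download the 16 five-minute bullet files for episode 1
for x in range(1, 17):
    url = 'https://cmts.iqiyi.com/bullet/92/00/9000000005439200_300_' + str(x) + '.z'
    response = requests.get(url)
    response.raise_for_status()  # stop on a failed download
    compressed = response.content  # compressed bytes, unreadable as text
    # wbits=15+32 lets zlib auto-detect a zlib or gzip header
    xml = zlib.decompress(compressed, 15 + 32).decode('utf-8')
    # 'w' avoids duplicated content on re-runs; the with block closes the file
    with open('./iqiyi' + str(x) + '.xml', 'w', encoding='utf-8') as f:
        f.write(xml)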
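
The decompression step can be tried offline. The sketch below assumes nothing about iQiyi's servers: it compresses a made-up XML string locally as a stand-in for the downloaded bytes, and decode_bullet is a hypothetical helper name. It shows that a wbits value of 15 + 32 accepts both zlib and gzip headers.

```python
import gzip
import zlib

# Made-up sample; the real files are assumed to be UTF-8 XML
sample_xml = "<danmu><entry><content>演技真好</content></entry></danmu>"


def decode_bullet(raw):
    """Decompress bullet-file bytes and decode them as UTF-8 text."""
    # 15 = largest window size, +32 = auto-detect zlib or gzip header
    return zlib.decompress(raw, 15 + 32).decode("utf-8")


zlib_bytes = zlib.compress(sample_xml.encode("utf-8"))
gzip_bytes = gzip.compress(sample_xml.encode("utf-8"))

print(decode_bullet(zlib_bytes) == sample_xml)
print(decode_bullet(gzip_bytes) == sample_xml)
```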

The resulting XML files contain content fields that hold the actual comments. To extract these, the following code is used:

from xml.dom.minidom import parse

# Open the output once and write one comment per line
with open("dan_mu.txt", mode="w", encoding="utf-8") as f:
    for x in range(1, 17):
        # Read the XML files saved by the scraping code
        dom_tree = parse("./iqiyi" + str(x) + ".xml")
        collection = dom_tree.documentElement
        entries = collection.getElementsByTagName("entry")
        for entry in entries:
            content = entry.getElementsByTagName('content')[0]
            if content.childNodes:  # skip empty comments
                f.write(content.childNodes[0].data)
                f.write("\n")

The extracted dan_mu.txt file contains all bullet comments, which are then processed for word‑cloud generation.
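
The extraction logic can be checked on a small inline sample before running it on the real files. In this sketch the root element name danmu and the sample comments are invented for illustration; only the entry and content tag names come from the article. The helper extract_comments is hypothetical.

```python
from xml.dom.minidom import parseString

# Invented sample; the real files may carry more fields per entry
sample_xml = """<danmu>
  <entry><content>演技真好</content></entry>
  <entry><content></content></entry>
  <entry><content>太真实了</content></entry>
</danmu>"""


def extract_comments(xml_text):
    """Return the text of every non-empty <content> element."""
    dom = parseString(xml_text)
    comments = []
    for node in dom.getElementsByTagName("content"):
        # An empty <content> has no child nodes, so guard before reading
        if node.firstChild is not None and node.firstChild.data.strip():
            comments.append(node.firstChild.data.strip())
    return comments


print(extract_comments(sample_xml))
```

Guarding against empty content elements avoids an IndexError or AttributeError if a comment has no text.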

Word‑cloud creation uses the wordcloud and jieba libraries. The code performs Chinese word segmentation, removes stop words, and generates the cloud:

from wordcloud import WordCloud
import jieba
import matplotlib.pyplot as plt

with open('./dan_mu.txt', encoding='utf-8', mode='r') as f:
    my_text = f.read()

# Segment the Chinese text into words
words = jieba.cut(my_text)

# Drop whitespace, single-character tokens, and a few common filler words
stop_words = {"这个", "不是", "这么", "怎么", "这是", "还是"}
words = [w for w in words if len(w.strip()) > 1 and w not in stop_words]
my_text = " ".join(words)

# font_path must point to a font that contains Chinese glyphs
wordcloud = WordCloud(background_color="white", font_path="simsun.ttf", height=300, width=400).generate(my_text)
plt.imshow(wordcloud)
plt.axis("off")
plt.show()
wordcloud.to_file("wordCloudMo.png")
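
To see which terms will dominate the cloud before rendering it, the same filter can be paired with a frequency count. This is a sketch using only the standard library; the token list is made up and stands in for the output of jieba.cut, and the helper names keep and word_frequencies are hypothetical.

```python
from collections import Counter

STOP_WORDS = {"这个", "不是", "这么", "怎么", "这是", "还是"}


def keep(token):
    """Keep tokens longer than one character that are not filler words."""
    return len(token.strip()) > 1 and token not in STOP_WORDS


def word_frequencies(tokens):
    """Count how often each kept token occurs."""
    return Counter(t for t in tokens if keep(t))


# Made-up tokens standing in for jieba.cut(my_text)
tokens = ["演技", "真实", "这个", "的", "孩子", "演技", " "]
freqs = word_frequencies(tokens)
print(freqs.most_common(3))
```

A mapping like freqs can also be passed to WordCloud.generate_from_frequencies instead of joining the words back into a single string, which skips wordcloud's own tokenization.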

The author notes that installing wordcloud may produce various errors; a linked CSDN article provides troubleshooting steps.

The final word cloud highlights frequent terms such as "真实" (real), "孩子" (child), and "演技" (acting), suggesting positive audience sentiment toward the drama.
