
Can Web‑Scraped Movie Reviews Predict Box Office? A Python Data‑Mining Case Study

Using Python to scrape over ten thousand Maoyan comments for the comedy film “The Billionaire” (西虹市首富), this article demonstrates data cleaning, geographic heat‑maps, city‑wise rating analysis, word‑cloud generation, and a simple box‑office forecast based on a comparable movie, illustrating practical web‑scraping and data‑mining techniques.


Preface

In recent years the Chinese comedy brand "Happy Mahua" has become a box‑office guarantee. From "Goodbye Mr. Loser" to "Never Say Die" and the latest "The Billionaire", the films have drawn huge audiences. This article analyzes whether "The Billionaire" is worth watching by examining more than ten thousand user comments scraped from Maoyan.

Data Crawling

We followed previously published Maoyan scraping methods, called its API, retrieved comment batches, removed duplicates, and finally obtained over ten thousand records. The essential crawling code is shown below.

import json
import random
import time

import pandas as pd
import requests

tomato = pd.DataFrame(columns=['date', 'score', 'city', 'comment', 'nick'])
for i in range(1000):
    j = random.randint(1, 1000)  # random offset into the comment stream
    print(f'{i} {j}')
    try:
        time.sleep(2)  # throttle requests to avoid being blocked
        url = ('http://m.maoyan.com/mmdb/comments/movie/1212592.json'
               '?_v_=yes&offset=' + str(j))
        html = requests.get(url=url).content
        data = json.loads(html.decode('utf-8'))['cmts']
        rows = []
        for item in data:
            rows.append({
                'date': item['time'].split(' ')[0],
                'city': item['cityName'],
                'score': item['score'],
                'comment': item['content'],
                'nick': item['nick'],
            })
        # DataFrame.append was removed in pandas 2.0; concatenate instead
        tomato = pd.concat([tomato, pd.DataFrame(rows)], ignore_index=True)
        tomato.to_csv('西虹市首富4.csv', index=False)
    except Exception:
        continue
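Because the random offsets return overlapping batches, the article notes that duplicates were removed before analysis, though the snippet above does not show that step. A minimal deduplication sketch with pandas, using a small hypothetical sample, might look like this:

```python
import pandas as pd

# Hypothetical sample in the same shape as the scraped data,
# with one exact duplicate record
tomato = pd.DataFrame({
    'date': ['2018-07-27', '2018-07-27', '2018-07-28'],
    'city': ['北京', '北京', '上海'],
    'score': [5, 5, 4],
    'comment': ['好看', '好看', '一般'],
    'nick': ['a', 'a', 'b'],
})

# Treat identical (nick, comment, date) triples as the same record
deduped = tomato.drop_duplicates(subset=['nick', 'comment', 'date'])
print(len(deduped))  # → 2
```

The choice of key columns is an assumption; any combination that uniquely identifies a comment (e.g. a comment ID, if the API exposes one) would work as well.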

Data Analysis

We first visualized the geographic distribution of comments with a heat map.

The heat map shows that traditional strong markets such as the Jing‑Jin‑Ji region, the Yangtze River Delta and the Pearl River Delta dominate, while the Northeast and Sichuan‑Chongqing also exhibit high activity.

Next we examined comment volume and average scores for major cities.

Harbin (Shen Teng's hometown) achieved the highest average score of 4.77, while Hefei and Zhengzhou received the lowest scores.

Sorting cities by average score reveals that four of the top seven cities are in the Northeast, whereas lower‑scoring cities are mainly in central China.
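The ranking above is a straightforward sort on the per-city aggregates. A minimal sketch, using illustrative numbers rather than the article's actual figures (only Harbin's 4.77 is taken from the text):

```python
import pandas as pd

# Hypothetical per-city aggregates in the same shape as city_com
city_com = pd.DataFrame({
    'city': ['哈尔滨', '合肥', '郑州', '沈阳'],
    'mean': [4.77, 4.40, 4.42, 4.70],
    'count': [300, 250, 280, 310],
})

# Sort by average score, highest first, to rank the cities
ranked = city_com.sort_values('mean', ascending=False).reset_index(drop=True)
print(ranked['city'].tolist())  # → ['哈尔滨', '沈阳', '郑州', '合肥']
```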

We projected the scores onto a map (red = high, blue = low).

Word‑cloud visualizations of the most frequent terms further illustrate audience sentiment.

Partial Code Samples

Heat‑map generation (pyecharts):

import pandas as pd
from pyecharts import Geo  # legacy pyecharts 0.x API

tomato_com = pd.read_excel('西虹市首富.xlsx')
grouped = tomato_com.groupby(['city'])
grouped_pct = grouped['score']
# Per-city average score and comment count
city_com = grouped_pct.agg(['mean', 'count'])
city_com.reset_index(inplace=True)
city_com['mean'] = round(city_com['mean'], 2)

data = [(city_com['city'][i], city_com['count'][i]) for i in range(city_com.shape[0])]
geo = Geo('《西虹市首富》全国热力图', title_color='#fff', title_pos='center',
          width=1200, height=600, background_color='#404a59')
attr, value = geo.cast(data)
geo.add('', attr, value, type='heatmap', visual_range=[0, 200],
        visual_text_color='#fff', symbol_size=10,
        is_visualmap=True, is_roam=False)
geo.render('西虹市首富全国热力图.html')

Combined line and bar chart for city comment count and average score:

from pyecharts import Bar, Line, Overlap  # legacy pyecharts 0.x API

# Top 20 cities by comment count
city_main = city_com.sort_values('count', ascending=False)[0:20]
attr = city_main['city']
v1 = city_main['count']
v2 = city_main['mean']
line = Line('主要城市评分')
line.add('城市', attr, v2, is_stack=True, xaxis_rotate=30, yaxis_min=4.2,
         mark_point=['min', 'max'], line_color='lightblue', line_width=4)
bar = Bar('主要城市评论数')
bar.add('城市', attr, v1, is_stack=True, xaxis_rotate=30)
overlap = Overlap()
overlap.add(bar)
# Attach the score line to a second y-axis so counts and scores keep separate scales
overlap.add(line, yaxis_index=1, is_add_yaxis=True)
overlap.render('主要城市评论数_平均分.html')

Word‑cloud creation (wordcloud, jieba, matplotlib):

from collections import Counter

import jieba
import matplotlib.pyplot as plt
from imageio import imread  # scipy.misc.imread has been removed from SciPy
from wordcloud import STOPWORDS, ImageColorGenerator, WordCloud

tomato_str = ' '.join(tomato_com['comment'])
# Fine-grained segmentation of the Chinese comment text
words_list = [w for w in jieba.cut_for_search(tomato_str) if len(w) > 1]

back_color = imread('西红柿.jpg')  # mask image: the cloud fills its shape
# Note: set.add() returns None, so build the extended stop-word set explicitly
stopwords = STOPWORDS | {'苟利国'}
wc = WordCloud(background_color='white', max_words=200, mask=back_color,
               max_font_size=300, stopwords=stopwords,
               font_path='C:/Windows/Fonts/STFANGSO.ttf', random_state=42)
tomato_count = Counter(words_list)
wc.generate_from_frequencies(tomato_count)
image_colors = ImageColorGenerator(back_color)  # recolor words from the mask image
plt.figure()
plt.imshow(wc.recolor(color_func=image_colors))
plt.axis('off')
plt.show()

Box Office Forecast

Using the comparable comedy "Never Say Die" as a benchmark, we estimated the final box office of "The Billionaire" at roughly 3 billion yuan (30亿), after adjusting for differences in release timing.
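The article does not spell out its scaling formula, but a comparable-film forecast of this kind typically scales the target film's early gross by the benchmark's opening-to-final ratio. A minimal arithmetic sketch, with illustrative figures that are assumptions rather than the article's actual inputs:

```python
# All figures below are hypothetical, for illustration only (units: yuan)
benchmark_total = 2_200_000_000    # assumed final gross of the benchmark film
benchmark_opening = 500_000_000    # assumed opening-window gross of the benchmark
target_opening = 680_000_000       # assumed opening-window gross of the target film

# Assume both comedies decay similarly after the opening window,
# then scale the target's opening by the benchmark's total/opening ratio
multiplier = benchmark_total / benchmark_opening
estimate = target_opening * multiplier
print(round(estimate / 1e8, 1), '亿')  # prints: 29.9 亿 (≈ 3 billion yuan)
```

A release-timing adjustment, as mentioned above, would modify the multiplier; the point here is only the shape of the calculation.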

Data set download URL: https://github.com/shujusenlin/tomato_film/blob/master/西虹市首富.xlsx

