Can Web‑Scraped Movie Reviews Predict Box Office? A Python Data‑Mining Case Study
Using Python to scrape over ten thousand Maoyan comments for the comedy film “The Billionaire” (西虹市首富), this article demonstrates data cleaning, geographic heat‑maps, city‑wise rating analysis, word‑cloud generation, and a simple box‑office forecast based on a comparable movie, illustrating practical web‑scraping and data‑mining techniques.
Preface
In recent years the Chinese comedy brand "Happy Mahua" has become a box‑office guarantee. From "Goodbye Mr. Loser" to "Never Say Die" and the latest "The Billionaire", the films have drawn huge audiences. This article analyzes whether "The Billionaire" is worth watching by examining more than ten thousand user comments scraped from Maoyan.
Data Crawling
We followed previously published Maoyan scraping methods, called its API, retrieved comment batches, removed duplicates, and finally obtained over ten thousand records. The essential crawling code is shown below.
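import json
import random
import time

import pandas as pd
import requests

columns = ['date', 'score', 'city', 'comment', 'nick']
rows = []
for i in range(0, 1000):
    # Pick a random offset; repeated offsets produce duplicates that are removed later.
    j = random.randint(1, 1000)
    print(str(i) + ' ' + str(j))
    try:
        time.sleep(2)  # pause between requests
        url = 'http://m.maoyan.com/mmdb/comments/movie/1212592.json?_v_=yes&offset=' + str(j)
        html = requests.get(url=url).content
        data = json.loads(html.decode('utf-8'))['cmts']
        for item in data:
            rows.append({
                'date': item['time'].split(' ')[0],
                'city': item['cityName'],
                'score': item['score'],
                'comment': item['content'],
                'nick': item['nick']
            })
        # Rewrite the CSV after every batch so progress survives a failure.
        # (DataFrame.append was removed in pandas 2.0, so rows are collected in a list.)
        tomato = pd.DataFrame(rows, columns=columns)
        tomato.to_csv('西虹市首富4.csv', index=False)
    except Exception:
        # Skip batches that fail to download or parse.
        continue
Data Analysis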
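Because the crawler samples random offsets, the same comment can be fetched more than once, so duplicates have to be dropped before any analysis. The article does not show that step. A minimal sketch, assuming that two rows with the same nick, date and comment are the same review (the key choice, the helper name drop_repeated_comments and the sample rows are illustrative assumptions, not taken from the original code):

```python
import pandas as pd

# Illustrative rows only; in practice this would be the crawled CSV.
sample = pd.DataFrame([
    {'date': '2018-07-28', 'score': 5.0, 'city': '北京', 'comment': '很好笑', 'nick': 'user_a'},
    {'date': '2018-07-28', 'score': 5.0, 'city': '北京', 'comment': '很好笑', 'nick': 'user_a'},
    {'date': '2018-07-29', 'score': 4.0, 'city': '上海', 'comment': '还不错', 'nick': 'user_b'},
])


def drop_repeated_comments(df):
    """Keep the first occurrence of each (nick, date, comment) triple."""
    return df.drop_duplicates(subset=['nick', 'date', 'comment']).reset_index(drop=True)


deduped = drop_repeated_comments(sample)
print(len(sample), '->', len(deduped))
```

Keying on several columns rather than on the comment text alone avoids merging short, generic comments written by different users.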
We first visualized the geographic distribution of comments with a heat map.
The heat map shows that comments are concentrated in the traditionally strong film markets: the Jing‑Jin‑Ji (Beijing‑Tianjin‑Hebei) region, the Yangtze River Delta and the Pearl River Delta. The Northeast and Sichuan‑Chongqing also show high activity.
Next we examined comment volume and average scores for major cities.
Harbin (Shen Teng's hometown) achieved the highest average score of 4.77, while Hefei and Zhengzhou received the lowest scores.
Sorting cities by average score reveals that four of the top seven cities are in the Northeast, whereas lower‑scoring cities are mainly in central China.
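The ranking described above can be sketched with a pandas group-by. This is a minimal illustration rather than the author's exact code: the helper name rank_cities, the min_count threshold (used so that cities with very few comments do not dominate the ranking) and the sample scores are all assumptions.

```python
import pandas as pd


def rank_cities(df, min_count=2):
    """Return per-city mean score and comment count, highest mean first."""
    stats = df.groupby('city')['score'].agg(['mean', 'count']).reset_index()
    stats['mean'] = stats['mean'].round(2)
    # Drop cities with too few comments for the mean to be meaningful.
    stats = stats[stats['count'] >= min_count]
    return stats.sort_values('mean', ascending=False).reset_index(drop=True)


# Made-up scores for three placeholder cities.
sample = pd.DataFrame({
    'city': ['city_a', 'city_a', 'city_b', 'city_b', 'city_c'],
    'score': [5.0, 4.5, 4.0, 4.5, 3.0],
})
ranking = rank_cities(sample)
print(ranking)
```

On the real data set the same function would be called on the deduplicated comment table, with min_count raised to whatever counts as a "major" city.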
We projected the scores onto a map (red = high, blue = low).
Word‑cloud visualizations of the most frequent terms further illustrate audience sentiment.
Partial Code Samples
Heat‑map generation (pyecharts):
# Written against the pyecharts 0.5.x API; later versions changed these interfaces.
import pandas as pd
from pyecharts import Geo

tomato_com = pd.read_excel('西虹市首富.xlsx')
# Mean score and comment count per city.
grouped = tomato_com.groupby(['city'])
grouped_pct = grouped['score']
city_com = grouped_pct.agg(['mean', 'count'])
city_com.reset_index(inplace=True)
city_com['mean'] = city_com['mean'].round(2)
# (city, comment count) pairs for the heat map.
data = list(zip(city_com['city'], city_com['count']))
geo = Geo('《西虹市首富》全国热力图', title_color="#fff", title_pos='center', width=1200, height=600, background_color='#404a59')
attr, value = geo.cast(data)
geo.add('', attr, value, type='heatmap', visual_range=[0, 200], visual_text_color="#fff", symbol_size=10, is_visualmap=True, is_roam=False)
geo.render('西虹市首富全国热力图.html')
Combined line and bar chart for city comment count and average score:
# Reuses city_com from the heat-map code (pyecharts 0.5.x API).
from pyecharts import Bar, Line, Overlap

# The 20 cities with the most comments.
city_main = city_com.sort_values('count', ascending=False)[0:20]
attr = city_main['city']
v1 = city_main['count']
v2 = city_main['mean']
line = Line('主要城市评分')
line.add('城市', attr, v2, is_stack=True, xaxis_rotate=30, yaxis_min=4.2, mark_point=['min', 'max'], line_color='lightblue', line_width=4)
bar = Bar('主要城市评论数')
bar.add('城市', attr, v1, is_stack=True, xaxis_rotate=30, yaxis_min=4.2)
# Overlay the score line on the count bars, with the line on a second y-axis.
overlap = Overlap()
overlap.add(bar)
overlap.add(line, yaxis_index=1, is_add_yaxis=True)
overlap.render('主要城市评论数_平均分.html')
Word‑cloud creation (wordcloud, jieba, matplotlib):
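# Reuses tomato_com from the heat-map code.
from collections import Counter

import jieba
import matplotlib.pyplot as plt
import numpy as np
from PIL import Image
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator

tomato_str = ' '.join(tomato_com['comment'].dropna().astype(str))
# set.add() returns None, so build the stop-word set before using it.
stopwords = set(STOPWORDS)
stopwords.add('苟利国')
# generate_from_frequencies() ignores the stopwords argument, so filter here:
# keep segmented words longer than one character that are not stop words.
words_list = [k for k in jieba.cut_for_search(tomato_str) if len(k) > 1 and k not in stopwords]
back_color = np.array(Image.open('西红柿.jpg'))  # tomato image used as mask and colour source
wc = WordCloud(background_color='white', max_words=200, mask=back_color, max_font_size=300,
               font_path='C:/Windows/Fonts/STFANGSO.ttf', random_state=42)
tomato_count = Counter(words_list)
wc.generate_from_frequencies(tomato_count)
image_colors = ImageColorGenerator(back_color)
plt.figure()
plt.imshow(wc.recolor(color_func=image_colors))
plt.axis('off')
plt.show()
Box Office Forecast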
Using the comparable comedy "Never Say Die" as a benchmark, and adjusting for differences in release timing, we estimated the final box office of "The Billionaire" at roughly 3 billion yuan (30亿).
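The article does not spell out the calculation behind this estimate. One simple way to use a comparable film is to scale its final gross by the ratio of the two films' early grosses over the same number of days, then apply a manual adjustment for release timing. The sketch below is only an assumption about how such a forecast could be structured: the function name forecast_by_benchmark, its parameters and the input numbers are hypothetical placeholders, not real box-office figures.

```python
def forecast_by_benchmark(benchmark_total, benchmark_early, target_early, timing_factor=1.0):
    """Scale the benchmark film's final gross by the ratio of early grosses.

    All money arguments share one unit; timing_factor is a manual adjustment
    for release-date differences (1.0 means no adjustment).
    """
    if benchmark_early <= 0:
        raise ValueError('benchmark_early must be positive')
    return benchmark_total * (target_early / benchmark_early) * timing_factor


# Placeholder numbers chosen only to show the arithmetic.
estimate = forecast_by_benchmark(benchmark_total=100.0, benchmark_early=40.0,
                                 target_early=50.0, timing_factor=0.9)
print(round(estimate, 1))
```

The timing factor is the weakest part of such a model, since it encodes a judgment about holidays and competing releases rather than anything measured from the comments.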
Data set download URL: https://github.com/shujusenlin/tomato_film/blob/master/西虹市首富.xlsx
