Python Web Scraping and Data Visualization of Maoyan Movie Rankings
This tutorial demonstrates how to use Python on Windows to crawl Maoyan movie rankings, extract details such as title, rating, genre, region, and duration, store them in a CSV file, and then perform comprehensive data cleaning, analysis, and visualization with pandas, matplotlib, and WordCloud.
The article explains a step‑by‑step workflow for collecting and visualizing movie data from the Maoyan website using Python on a Windows environment.
Preparation: the data source is https://maoyan.com/board/4?offset=1, the development environment is Windows 10 with Python 3.7, and the tools are PyCharm (IDE) and Chrome (browser).
Project idea: scrape all movies listed on Maoyan’s Top 100 board, capturing fields such as movie name, rating, link, genre, release location, and duration.
The scraper first parses the list pages to obtain detail page URLs and integer/fractional rating parts, then visits each detail page to extract the required information and writes a line to 猫眼.csv for later analysis.
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import requests
from fake_useragent import UserAgent
from lxml import etree
import time
ua = UserAgent()
headers = {
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
'Cookie': '__mta=244176442.1622872454168.1622876903037.1622877097390.7; uuid_n_v=v1; uuid=6FFF6D30C5C211EB8D61CF53B1EFE83FE91D3C40EE5240DCBA0A422050B1E8C0; _csrf=bff9b813020b795594ff3b2ea3c1be6295b7453d19ecd72f8beb9700c679dfb4; Hm_lvt_703e94591e87be68cc8da0da7cbd0be2=1622872443; _lxsdk_cuid=1770e9ed136c8-048c356e76a22b-7d677965-1fa400-1770e9ed136c8; _lxsdk=6FFF6D30C5C211EB8D61CF53B1EFE83FE91D3C40EE5240DCBA0A422050B1E8C0; ci=59; recentCis=59; __mta=51142166.1622872443578.1622872443578.1622876719906.2; Hm_lpvt_703e94591e87be68cc8da0da7cbd0be2=1622877097; _lxsdk_s=179dafd56bf-06d-403-d81||12',
'User-Agent': str(ua.random)
}
def RequestsTools(url):
    '''Request helper for the crawler.
    :param url: URL to request
    :return: parsed HTML tree, used for XPath extraction'''
    response = requests.get(url, headers=headers).content.decode('utf-8')
    html = etree.HTML(response)
    return html
def Index(page):
    '''Parse one page of the ranking list.
    :param page: value passed to the offset query parameter
    :return: None'''
    url = f'https://maoyan.com/board/4?offset={page}'
    html = RequestsTools(url)
    # Relative links to the detail pages
    urls_text = html.xpath('//a[@class="image-link"]/@href')
    # The rating is split across two elements: integer part and fractional part
    pingfen1 = html.xpath('//i[@class="integer"]/text()')
    pingfen2 = html.xpath('//i[@class="fraction"]/text()')
    for i, p1, p2 in zip(urls_text, pingfen1, pingfen2):
        pingfen = p1 + p2
        urs = 'https://maoyan.com' + i
        time.sleep(2)  # pause between requests
        Details(urs, pingfen)
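def Details(url, pingfen):
    '''Scrape one detail page and append a row to 猫眼.csv.'''
    html = RequestsTools(url)
    dianyan = html.xpath('//h1[@class="name"]/text()')  # movie title
    leixing = html.xpath('//li[@class="ellipsis"]/a/text()')  # genres
    diqu = html.xpath('/html/body/div[3]/div/div[2]/div[1]/ul/li[2]/text()')  # "region / duration" text
    timedata = html.xpath('/html/body/div[3]/div/div[2]/div[1]/ul/li[3]/text()')  # release date text
    # zip stops at the shortest list, so with a single title only the first genre is written
    for d, l, b, t in zip(dianyan, leixing, diqu, timedata):
        parts = b.replace('\n', '').split('/')
        countyr = parts[0].strip()
        shichang = parts[1].strip()
        # Write UTF-8 explicitly so the file matches the encoding used when it is read back
        with open('猫眼.csv', 'a', encoding='utf-8') as f:
            f.write(f'{d}, {pingfen}, {url}, {l}, {countyr}, {shichang}, {t}\n')
        print(d, pingfen, url, l, countyr, shichang, t)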
# The Top 100 board lists 10 movies per page, so the offsets are 0, 10, ..., 90
for page in range(0, 100, 10):
    Index(page)
After crawling, the data are saved in 猫眼.csv, which serves as the input for the visualization stage.
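The Details function builds each row with an f-string, so a title or genre that contains a comma would shift the columns, and the file has no header row even though the analysis code refers to columns by name. As one possible alternative (not the original author's method), the sketch below writes rows with the standard csv module. The names 电影, 评分, 时长(min), and 上映时间 come from the analysis code later in the article; 链接, 类型, and 地区 are hypothetical names, and the sample record, including its URL, is made up.

```python
import csv
import io

# '电影', '评分', '时长(min)' and '上映时间' appear in the analysis code;
# '链接', '类型' and '地区' are hypothetical names chosen for this sketch.
FIELDS = ['电影', '评分', '链接', '类型', '地区', '时长(min)', '上映时间']


def write_rows(stream, rows, write_header=True):
    """Write movie records to an open text stream as CSV."""
    writer = csv.writer(stream)
    if write_header:
        writer.writerow(FIELDS)
    for row in rows:
        # csv.writer quotes any field that contains a comma
        writer.writerow(row)


# Demonstration with an in-memory buffer and one made-up record
buffer = io.StringIO()
sample = [('Example, The Movie', '9.5', 'https://maoyan.com/films/0',
           '剧情', '中国大陆', '171', '1993')]
write_rows(buffer, sample)

# Read the data back to confirm the comma in the title did not split the field
buffer.seek(0)
parsed = list(csv.reader(buffer))
print(parsed[1][0])
```

To write to disk instead, the same function can be given a file opened with open('猫眼.csv', 'a', newline='', encoding='utf-8'), passing write_header=False on every call after the first.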
Imports for the data visualization stage:
import pandas as pd
import numpy as np
import jieba
from wordcloud import WordCloud
import matplotlib.pyplot as plt
# %matplotlib inline
The analysis script loads the CSV, removes rows with missing values and duplicate rows, and creates several plots: number of movies per year, average rating over time, genre distribution, duration‑vs‑rating scatter, and comparative charts between Chinese and worldwide movies.
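# Example of loading and cleaning data
# Note: the scraper above writes 猫眼.csv, so point path at the file you actually produced.
# The column names used below (电影, 评分, 上映时间, 时长(min)) must be present as a header row.
path = './maoyan.csv'
df = pd.read_csv(path, sep=',', encoding='utf-8', index_col=False)
df.drop(df.columns[0], axis=1, inplace=True)  # drop the first column
df.dropna(inplace=True)  # remove rows with missing values
df.drop_duplicates(inplace=True)  # remove duplicate rows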
# Plotting movies per year (before 2018)
fig, ax = plt.subplots(figsize=(9, 6), dpi=70)
df[df['上映时间'] < 2018]['上映时间'].value_counts().sort_index().plot(kind='line', ax=ax)
ax.set_xlabel('时间(年)')  # "Time (year)"
ax.set_ylabel('上映数量')  # "Number of releases"
ax.set_title('上映时间&上映的电影数目')  # "Release year vs. number of movies released"
# Scatter of duration vs rating
# Keep rated movies only, ordered by duration (filter and sort once, then reuse)
rated = df[df['评分'] > 0].sort_values(by='时长(min)')
x = rated['时长(min)'].values
y = rated['评分'].values
fig, ax = plt.subplots(figsize=(9, 6), dpi=70)
ax.scatter(x, y, alpha=0.6, marker='o')
ax.set_xlabel('时长(min)')  # "Duration (min)"
ax.set_ylabel('评分')  # "Rating"
ax.set_title('影片时长&评分分布图')  # "Movie duration vs. rating"
# Word cloud of movie titles
wl = ",".join(df['电影'][:15].values)
wc = WordCloud(background_color='white', font_path='C:\\Windows\\Fonts\\simkai.ttf', max_font_size=60, random_state=30)
myword = wc.generate(wl)
wc.to_file('result.jpg')
plt.imshow(myword)
plt.axis('off')
plt.show()
The resulting figures (included in the original article) illustrate trends such as the growth of movie releases over years, rating evolution, genre popularity, and differences between Chinese and global film markets.
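The average-rating and genre-distribution charts are described in the article, but their code is not included. A minimal sketch of how the underlying aggregates might be computed with pandas follows. It assumes a 类型 column holding comma-separated genres (a hypothetical name and format, not confirmed by the article) and uses made-up rows in place of the cleaned CSV.

```python
import pandas as pd

# Made-up rows for illustration only; real values would come from the cleaned CSV.
df = pd.DataFrame({
    '上映时间': [1994, 1994, 2010],               # release year
    '评分': [9.5, 9.0, 9.0],                      # rating
    '类型': ['剧情,犯罪', '剧情', '动作,科幻'],     # hypothetical genre column
})

# Average rating per release year
mean_rating = df.groupby('上映时间')['评分'].mean()

# Genre distribution: split each genre string, give every genre its own row, then count
genre_counts = df['类型'].str.split(',').explode().value_counts()

print(mean_rating)
print(genre_counts)
```

Either result is a pandas Series, so it can be drawn in the same way as the earlier plots, for example with mean_rating.plot(kind='line', ax=ax) or genre_counts.plot(kind='bar', ax=ax).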
Python Programming Learning Circle
A global community of Chinese Python developers offering technical articles, columns, original video tutorials, and problem sets. Topics include web full‑stack development, web scraping, data analysis, natural language processing, image processing, machine learning, automated testing, DevOps automation, and big data.