Python Web Scraping and Data Visualization of Maoyan Movie Rankings
This tutorial demonstrates how to use Python on Windows to crawl Maoyan's Top 100 movie board, extract details such as title, rating, genre, region, and duration, store them in a CSV file, and then clean, analyze, and visualize the data with pandas, matplotlib, and WordCloud.
Preparation: data source URL (https://maoyan.com/board/4?offset=1), development environment (Windows 10, Python 3.7), and tools (PyCharm as the IDE, Chrome as the browser).
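Before running the scripts below, the third-party packages they import need to be installed. A minimal setup sketch (package names inferred from the imports used later in this tutorial; run inside your Python 3.7 environment):

```shell
# Install the scraping and visualization dependencies used in this tutorial
pip install requests fake-useragent lxml pandas matplotlib wordcloud jieba
```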
Project idea: scrape all movies listed on Maoyan's Top 100 board, capturing fields such as movie name, rating, detail-page link, genre, release region, and duration.
The scraper first parses the list pages to obtain detail page URLs and integer/fractional rating parts, then visits each detail page to extract the required information and writes a line to 猫眼.csv for later analysis.
```python
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import time

import requests
from fake_useragent import UserAgent
from lxml import etree

ua = UserAgent()
headers = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'Cookie': '__mta=244176442.1622872454168.1622876903037.1622877097390.7; uuid_n_v=v1; uuid=6FFF6D30C5C211EB8D61CF53B1EFE83FE91D3C40EE5240DCBA0A422050B1E8C0; _csrf=bff9b813020b795594ff3b2ea3c1be6295b7453d19ecd72f8beb9700c679dfb4; Hm_lvt_703e94591e87be68cc8da0da7cbd0be2=1622872443; _lxsdk_cuid=1770e9ed136c8-048c356e76a22b-7d677965-1fa400-1770e9ed136c8; _lxsdk=6FFF6D30C5C211EB8D61CF53B1EFE83FE91D3C40EE5240DCBA0A422050B1E8C0; ci=59; recentCis=59; __mta=51142166.1622872443578.1622872443578.1622876719906.2; Hm_lpvt_703e94591e87be68cc8da0da7cbd0be2=1622877097; _lxsdk_s=179dafd56bf-06d-403-d81||12',
    'User-Agent': str(ua.random),
}


def RequestsTools(url):
    """Request helper.

    :param url: request URL
    :return: HTML object for XPath extraction
    """
    response = requests.get(url, headers=headers).content.decode('utf-8')
    html = etree.HTML(response)
    return html


def Index(page):
    """List-page function.

    :param page: page offset
    """
    url = f'https://maoyan.com/board/4?offset={page}'
    html = RequestsTools(url)
    urls_text = html.xpath('//a[@class="image-link"]/@href')
    pingfen1 = html.xpath('//i[@class="integer"]/text()')   # integer part of the rating
    pingfen2 = html.xpath('//i[@class="fraction"]/text()')  # fractional part of the rating
    for i, p1, p2 in zip(urls_text, pingfen1, pingfen2):
        pingfen = p1 + p2
        urs = 'https://maoyan.com' + i
        time.sleep(2)  # throttle requests to avoid being blocked
        Details(urs, pingfen)


def Details(url, pingfen):
    """Detail-page function: extract title, genre, region, duration, release date."""
    html = RequestsTools(url)
    dianyan = html.xpath('//h1[@class="name"]/text()')        # movie title
    leixing = html.xpath('//li[@class="ellipsis"]/a/text()')  # genre
    diqu = html.xpath('/html/body/div[3]/div/div[2]/div[1]/ul/li[2]/text()')      # "region / duration"
    timedata = html.xpath('/html/body/div[3]/div/div[2]/div[1]/ul/li[3]/text()')  # release date
    for d, l, b, t in zip(dianyan, leixing, diqu, timedata):
        countyr = b.replace('\n', '').split('/')[0]   # region
        shichang = b.replace('\n', '').split('/')[1]  # duration
        with open('猫眼.csv', 'a', encoding='utf-8') as f:
            f.write(f'{d}, {pingfen}, {url}, {l}, {countyr}, {shichang}, {t}\n')
        print(d, pingfen, url, l, countyr, shichang, t)


# The Top 100 board is paginated with offset = 0, 10, ..., 90
for page in range(0, 100, 10):
    Index(page)
```
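One caveat with the script above: it writes each record with a plain f-string, so a field containing a comma (a multi-genre string, for instance) would silently corrupt the CSV. A minimal sketch of a safer writer using Python's `csv` module (field order and the sample record are assumptions for illustration, not scraped data):

```python
import csv

def write_row(path, row):
    """Append one movie record, quoting fields so embedded commas stay intact."""
    with open(path, 'a', newline='', encoding='utf-8') as f:
        csv.writer(f).writerow(row)

# Hypothetical record: title, rating, URL, genre, region, duration, release date
write_row('猫眼.csv', ['霸王别姬', '9.5', 'https://maoyan.com/films/1203',
                       '剧情', '中国大陆', '171分钟', '1993-07-26'])
```

`csv.writer` wraps any field containing a delimiter in quotes, which `pd.read_csv` unwraps transparently at load time.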
After crawling, the data are saved to 猫眼.csv, which serves as the input for the visualization stage.
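Because the crawler writes no header row, the CSV is easiest to load with explicit column names. A small sketch using an in-memory sample in place of the real file (the column names are assumptions inferred from the field order the scraper writes):

```python
import io
import pandas as pd

# Column names assumed from the order in which the scraper writes each row
columns = ['电影', '评分', '链接', '类型', '地区', '时长(min)', '上映时间']

# Two hypothetical rows standing in for 猫眼.csv
sample = io.StringIO(
    '霸王别姬, 9.5, https://maoyan.com/films/1203, 剧情, 中国大陆, 171分钟, 1993-07-26\n'
    '肖申克的救赎, 9.5, https://maoyan.com/films/1297, 剧情, 美国, 142分钟, 1994-09-10\n'
)
# skipinitialspace drops the blank the scraper writes after each comma
df = pd.read_csv(sample, header=None, names=columns, skipinitialspace=True)
print(df.shape)  # (2, 7)
```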
Data visualization tools import:
```python
import pandas as pd
import numpy as np
import jieba
from wordcloud import WordCloud
import matplotlib.pyplot as plt
# %matplotlib inline  # uncomment when running in a Jupyter notebook
```
The analysis script loads the CSV, cleans missing or duplicate rows, and creates several plots: number of movies per year, average rating over time, genre distribution, duration‑vs‑rating scatter, and comparative charts between Chinese and worldwide movies.
```python
# Load and clean the data
path = './maoyan.csv'
df = pd.read_csv(path, sep=',', encoding='utf-8', index_col=False)
df.drop(df.columns[0], axis=1, inplace=True)  # drop the unneeded first column
df.dropna(inplace=True)           # remove rows with missing values
df.drop_duplicates(inplace=True)  # remove duplicate rows

# Number of movies released per year (before 2018)
fig, ax = plt.subplots(figsize=(9, 6), dpi=70)
df[df['上映时间'] < 2018]['上映时间'].value_counts().sort_index().plot(kind='line', ax=ax)
ax.set_xlabel('时间(年)')   # year
ax.set_ylabel('上映数量')   # number of releases
ax.set_title('上映时间&上映的电影数目')

# Scatter of duration vs rating
x = df[df['评分'] > 0].sort_values(by='时长(min)')['时长(min)'].values
y = df[df['评分'] > 0].sort_values(by='时长(min)')['评分'].values
fig, ax = plt.subplots(figsize=(9, 6), dpi=70)
ax.scatter(x, y, alpha=0.6, marker='o')
ax.set_xlabel('时长(min)')  # duration (min)
ax.set_ylabel('评分')       # rating
ax.set_title('影片时长&评分分布图')

# Word cloud of movie titles
wl = ",".join(df['电影'][:15].values)
wc = WordCloud(background_color='white',
               font_path='C:\\Windows\\Fonts\\simkai.ttf',  # a Chinese font is required
               max_font_size=60,
               random_state=30)
myword = wc.generate(wl)
wc.to_file('result.jpg')
plt.imshow(myword)
plt.axis('off')
plt.show()
```
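The snippet above omits the genre-distribution chart mentioned earlier. A minimal sketch of one way to build it, using a tiny hypothetical DataFrame in place of the real cleaned data and assuming one genre per row, as the scraper writes:

```python
import matplotlib
matplotlib.use('Agg')  # headless backend so the script runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical sample standing in for the cleaned 猫眼.csv DataFrame
df = pd.DataFrame({'类型': ['剧情', '剧情', '爱情', '动作', '剧情', '爱情']})

genre_counts = df['类型'].value_counts()
print(genre_counts.to_dict())  # {'剧情': 3, '爱情': 2, '动作': 1}

fig, ax = plt.subplots(figsize=(9, 6), dpi=70)
genre_counts.plot(kind='bar', ax=ax)
ax.set_xlabel('类型')   # genre
ax.set_ylabel('数量')   # count
ax.set_title('电影类型分布')
fig.savefig('genres.png')
```

Note that matplotlib's default fonts cannot render Chinese labels; setting `plt.rcParams['font.sans-serif']` to a Chinese font (e.g. SimHei on Windows) fixes the axis text.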
The resulting figures (included in the original article) illustrate trends such as the growth of movie releases over years, rating evolution, genre popularity, and differences between Chinese and global film markets.