Scraping and Analyzing Douban Top250 Movies with Python
This tutorial shows how to use Python to crawl Douban's Top250 movie list, handle anti-scraping measures, and extract fields such as rank, title, director, year, country, genre, rating, vote count, and short comment. The data is stored in Excel and then cleaned, analyzed, and visualized: year distribution, rating trends, director rankings, and a genre word cloud.
It first explains pagination, which works by modifying the start parameter in the URL, then discusses anti-scraping countermeasures: setting an appropriate User-Agent and Referer, and handling cookies or JavaScript-generated parameters.
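As a minimal sketch of that setup (the header values and the 25-per-page step come from the crawling code below; the session-based cookie handling is an assumption, not part of the original code), the page URLs and a cookie-carrying requests.Session can be built like this:

```python
import requests

# Persistent session: sends a browser-like User-Agent and Referer on every
# request, and replays any cookies Douban set on earlier responses, which
# helps against cookie-based anti-scraping checks.
session = requests.Session()
session.headers.update({
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/89.0.4343.0 Safari/537.36',
    'Referer': 'https://movie.douban.com/top250',
})

# Pagination: each page shows 25 movies, so start = 0, 25, ..., 225.
urls = ['https://movie.douban.com/top250?start={}&filter='.format(i)
        for i in range(0, 250, 25)]
```

A session is not strictly required for this site, but it keeps the headers in one place and costs nothing.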
Basic extraction code using requests and lxml.etree is provided:
# -*- coding: utf-8 -*-
# @Author: Kun
import requests
from lxml import etree
import pandas as pd

df = []
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4343.0 Safari/537.36',
           'Referer': 'https://movie.douban.com/top250'}
# rank, title, director, year, country, genre, rating, vote count, short comment
columns = ['排名', '电影名称', '导演', '上映年份', '制作国家', '类型', '评分', '评价分数', '短评']

def get_data(html):
    xp = etree.HTML(html)
    lis = xp.xpath('//*[@id="content"]/div/div[1]/ol/li')
    for li in lis:
        ranks = li.xpath('div/div[1]/em/text()')
        titles = li.xpath('div/div[2]/div[1]/a/span[1]/text()')
        directors = li.xpath('div/div[2]/div[2]/p[1]/text()')[0].strip().replace('\xa0\xa0\xa0', '\t').split('\t')
        infos = li.xpath('div/div[2]/div[2]/p[1]/text()')[1].strip().replace('\xa0', '').split('/')
        dates, areas, genres = infos[0], infos[1], infos[2]
        ratings = li.xpath('.//div[@class="star"]/span[2]/text()')[0]
        # strip the trailing "人评价" suffix from the vote count
        scores = li.xpath('.//div[@class="star"]/span[4]/text()')[0][:-3]
        quotes = li.xpath('.//p[@class="quote"]/span/text()')
        quote = quotes[0] if quotes else None  # a few entries have no short comment
        for rank, title, director in zip(ranks, titles, directors):
            df.append([rank, title, director, dates, areas, genres, ratings, scores, quote])
    # Rewrite the workbook after each page so partial results survive a crash
    pd.DataFrame(df, columns=columns).to_excel('Top250.xlsx', index=False)

for i in range(0, 250, 25):  # 10 pages of 25 movies each
    url = 'https://movie.douban.com/top250?start={}&filter='.format(i)
    res = requests.get(url, headers=headers)
    get_data(res.text)

A multithreaded version using threading and a queue is also shown to speed up crawling:
# -*- coding: utf-8 -*-
import time
import requests
import pandas as pd
from lxml import etree
from queue import Queue
from threading import Thread, Lock

class Movie:
    def __init__(self):
        self.df = []
        self.headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4343.0 Safari/537.36',
                        'Referer': 'https://movie.douban.com/top250'}
        self.columns = ['排名', '电影名称', '导演', '上映年份', '制作国家', '类型', '评分', '评价分数', '短评']
        self.lock = Lock()  # guards self.df and the Excel file across threads
        self.url_list = Queue()

    def get_url(self):
        url = 'https://movie.douban.com/top250?start={}&filter='
        for i in range(0, 250, 25):
            self.url_list.put(url.format(i))

    def get_html(self):
        # Each worker pulls URLs from the queue until it is drained
        while not self.url_list.empty():
            url = self.url_list.get()
            resp = requests.get(url, headers=self.headers)
            self.xpath_parse(resp.text)

    def xpath_parse(self, html):
        xp = etree.HTML(html)
        lis = xp.xpath('//*[@id="content"]/div/div[1]/ol/li')
        rows = []
        for li in lis:
            ranks = li.xpath('div/div[1]/em/text()')
            titles = li.xpath('div/div[2]/div[1]/a/span[1]/text()')
            directors = li.xpath('div/div[2]/div[2]/p[1]/text()')[0].strip().replace('\xa0\xa0\xa0', '\t').split('\t')
            infos = li.xpath('div/div[2]/div[2]/p[1]/text()')[1].strip().replace('\xa0', '').split('/')
            dates, areas, genres = infos[0], infos[1], infos[2]
            ratings = li.xpath('.//div[@class="star"]/span[2]/text()')[0]
            scores = li.xpath('.//div[@class="star"]/span[4]/text()')[0][:-3]
            quotes = li.xpath('.//p[@class="quote"]/span/text()')
            quote = quotes[0] if quotes else None
            for rank, title, director in zip(ranks, titles, directors):
                rows.append([rank, title, director, dates, areas, genres, ratings, scores, quote])
        with self.lock:  # self.df and the output file are shared by all workers
            self.df.extend(rows)
            pd.DataFrame(self.df, columns=self.columns).to_excel('douban.xlsx', index=False)

    def main(self):
        start_time = time.time()
        self.get_url()
        th_list = []
        for _ in range(5):  # 5 worker threads
            th = Thread(target=self.get_html)
            th.start()
            th_list.append(th)
        for th in th_list:
            th.join()
        print(time.time() - start_time)

if __name__ == '__main__':
    spider = Movie()
    spider.main()

After crawling, the data is read back with pandas and cleaned (e.g., normalizing the release year), then visualized with pyecharts: bar charts for year distribution, rating distribution, top-10 comment counts, and director rankings, plus a word cloud of genres.
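The year-normalization step can be sketched with the standard library alone. The sample strings below are hypothetical, but they mirror the kind of noise the scraped year field carries (region suffixes, multiple dates):

```python
import re
from collections import Counter

def normalize_year(raw):
    """Pull the first 4-digit year out of a scraped date string."""
    m = re.search(r'\d{4}', raw)
    return m.group(0) if m else None

# Hypothetical raw values of the release-year column
raw_dates = ['1994', '1993(中国大陆)', '2010 / 2011']
years = [normalize_year(d) for d in raw_dates]
year_counts = Counter(years)  # year -> movie count, input for a bar chart
```

The resulting counts map directly onto a year-distribution bar chart's x-axis (years) and y-axis (movie counts).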
Additional snippets illustrate how to delete dictionary entries in Python using pop , del , and clear .
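The three approaches differ in what they remove and return; a quick example with hypothetical data:

```python
ratings = {'肖申克的救赎': 9.7, '霸王别姬': 9.6, '阿甘正传': 9.5}

value = ratings.pop('阿甘正传')  # removes the key and returns its value (9.5)
del ratings['霸王别姬']          # removes the key in place, returns nothing
remaining = dict(ratings)        # only {'肖申克的救赎': 9.7} is left
ratings.clear()                  # removes every remaining entry
```

Note that pop accepts a default (`d.pop(key, None)`) to avoid a KeyError on a missing key, while del raises one unconditionally.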
Python Programming Learning Circle