Scraping and Analyzing Douban Top250 Movies with Python

This tutorial shows how to use Python to crawl Douban's Top250 movie list, handle anti‑scraping measures, extract detailed fields, store the data in Excel, and perform data cleaning, statistical analysis, and visualizations such as year distribution, rating trends, and genre word clouds.

Python Programming Learning Circle

This article demonstrates how to use Python to crawl Douban's Top250 movie list, extract fields such as rank, title, director, year, country, genre, rating, number of reviews, and short comments, and then analyze and visualize the data.

It first explains pagination: each page lists 25 movies, and the start parameter in the URL is the offset of the first one (0, 25, ..., 225). It then discusses anti‑scraping countermeasures, including setting appropriate User-Agent and Referer headers and handling cookies or JavaScript‑generated parameters.
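The pagination scheme can be sketched without any network access. The helper name build_page_urls and the constants below are illustrative assumptions, not part of the original article; only the URL pattern and the step of 25 come from the code that follows.

```python
from urllib.parse import urlencode

BASE_URL = "https://movie.douban.com/top250"
PAGE_SIZE = 25   # movies per page
TOTAL = 250      # size of the list


def build_page_urls(total=TOTAL, page_size=PAGE_SIZE):
    """Return one URL per page; `start` is the offset of the first movie on that page."""
    return [
        "{}?{}".format(BASE_URL, urlencode({"start": start, "filter": ""}))
        for start in range(0, total, page_size)
    ]


def build_headers(user_agent):
    """Headers that make the request look like a normal browser visit to the list page."""
    return {"User-Agent": user_agent, "Referer": BASE_URL}


urls = build_page_urls()
headers = build_headers("Mozilla/5.0 (Windows NT 10.0; Win64; x64)")
```

Each URL can then be passed to requests.get together with the headers dictionary. Using range(0, total, page_size) stops at start=225, so no request is spent on an empty eleventh page.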

Basic extraction code using requests and lxml.etree is provided:

# -*- coding: utf-8 -*-
# @Author: Kun
import requests
from lxml import etree
import pandas as pd

df = []
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4343.0 Safari/537.36',
           'Referer': 'https://movie.douban.com/top250'}
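# Column names: rank, title, director, release year, country, genre, rating, number of ratings, short comment
columns = ['排名','电影名称','导演','上映年份','制作国家','类型','评分','评价分数','短评']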

def get_data(html):
    xp = etree.HTML(html)
    lis = xp.xpath('//*[@id="content"]/div/div[1]/ol/li')
    for li in lis:
        ranks = li.xpath('div/div[1]/em/text()')
        titles = li.xpath('div/div[2]/div[1]/a/span[1]/text()')
        directors = li.xpath('div/div[2]/div[2]/p[1]/text()')[0].strip().replace("\xa0\xa0\xa0","\t").split("\t")
        infos = li.xpath('div/div[2]/div[2]/p[1]/text()')[1].strip().replace('\xa0','').split('/')
        dates, areas, genres = infos[0], infos[1], infos[2]
        ratings = li.xpath('.//div[@class="star"]/span[2]/text()')[0]
        scores = li.xpath('.//div[@class="star"]/span[4]/text()')[0][:-3]
        quotes = li.xpath('.//p[@class="quote"]/span/text()')
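        # Some movies have no short comment; resolve it once, outside the loop,
        # so the list is never overwritten by its own first element.
        quote = quotes[0] if quotes else None
        # Each <li> holds one movie, so zip yields a single row; the first
        # element of directors is the director part (the cast follows it).
        for rank, title, director in zip(ranks, titles, directors):
            df.append([rank, title, director, dates, areas, genres, ratings, scores, quote])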
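
# 10 pages of 25 movies: start = 0, 25, ..., 225 (start=250 would be an empty page)
for i in range(0, 250, 25):
    url = "https://movie.douban.com/top250?start={}&filter=".format(i)
    res = requests.get(url, headers=headers)
    html = res.text
    get_data(html)

# Write the workbook once, after every page has been parsed
d = pd.DataFrame(df, columns=columns)
d.to_excel('Top250.xlsx', index=False)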
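The string handling in the extraction code is easier to follow on sample input. The three strings below are illustrative values modelled on the text nodes the XPath expressions return; they are assumptions, not captured page output.

```python
# Illustrative inputs (assumed shape, not real page output)
director_line = "导演: 弗兰克·德拉邦特 Frank Darabont\xa0\xa0\xa0主演: 蒂姆·罗宾斯 Tim Robbins /..."
info_line = "\n    1994\xa0/\xa0美国\xa0/\xa0犯罪 剧情\n    "
votes_text = "1234567人评价"

# Three non-breaking spaces separate director and cast:
# swap them for a tab, split, and keep the first part.
director_parts = director_line.strip().replace("\xa0\xa0\xa0", "\t").split("\t")
director = director_parts[0]

# Drop the non-breaking spaces, then split year / country / genre on the slash.
year, country, genre = info_line.strip().replace("\xa0", "").split("/")

# The count ends with the three characters "人评价"; slicing them off leaves the digits.
votes = votes_text[:-3]
```

Entries whose info line contains more than three slash-separated parts (for example several release years or countries) would not unpack this way, which is one reason the release year needs cleaning later.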

A multithreaded version using threading and a queue is also shown to speed up crawling.

# -*- coding: utf-8 -*-
import time
from queue import Queue, Empty
from threading import Thread, Lock

import pandas as pd
import requests
from lxml import etree


class Movie:
    def __init__(self):
        self.df = []
        self.headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4343.0 Safari/537.36',
                        'Referer': 'https://movie.douban.com/top250'}
        # Column names: rank, title, director, release year, country, genre, rating, number of ratings, short comment
        self.columns = ['排名','电影名称','导演','上映年份','制作国家','类型','评分','评价分数','短评']
        self.lock = Lock()        # guards self.df, which every worker thread appends to
        self.url_list = Queue()

    def get_url(self):
        # Queue the 10 page URLs: start = 0, 25, ..., 225
        url = 'https://movie.douban.com/top250?start={}&filter='
        for i in range(0, 250, 25):
            self.url_list.put(url.format(i))

    def get_html(self):
        # Worker loop: take URLs until the queue is drained
        while True:
            try:
                # get_nowait() avoids the race between empty() and get(), where
                # another thread takes the last URL and this one blocks forever
                url = self.url_list.get_nowait()
            except Empty:
                break
            resp = requests.get(url, headers=self.headers)
            self.xpath_parse(resp.text)

    def xpath_parse(self, html):
        xp = etree.HTML(html)
        lis = xp.xpath('//*[@id="content"]/div/div[1]/ol/li')
        for li in lis:
            ranks = li.xpath('div/div[1]/em/text()')
            titles = li.xpath('div/div[2]/div[1]/a/span[1]/text()')
            directors = li.xpath('div/div[2]/div[2]/p[1]/text()')[0].strip().replace("\xa0\xa0\xa0","\t").split("\t")
            infos = li.xpath('div/div[2]/div[2]/p[1]/text()')[1].strip().replace('\xa0','').split('/')
            dates, areas, genres = infos[0], infos[1], infos[2]
            ratings = li.xpath('.//div[@class="star"]/span[2]/text()')[0]
            scores = li.xpath('.//div[@class="star"]/span[4]/text()')[0][:-3]
            quotes = li.xpath('.//p[@class="quote"]/span/text()')
            quote = quotes[0] if quotes else None  # some movies have no short comment
            for rank, title, director in zip(ranks, titles, directors):
                with self.lock:
                    self.df.append([rank, title, director, dates, areas, genres, ratings, scores, quote])

    def main(self):
        start_time = time.time()
        self.get_url()
        th_list = []
        for i in range(5):
            th = Thread(target=self.get_html)
            th.start()
            th_list.append(th)
        for th in th_list:
            th.join()
        # Pages finish in arbitrary order, so sort by rank and write the file once,
        # after all threads are done (the key argument needs pandas 1.1 or later)
        d = pd.DataFrame(self.df, columns=self.columns)
        d = d.sort_values('排名', key=lambda s: s.astype(int))
        d.to_excel('douban.xlsx', index=False)
        end_time = time.time()
        print(end_time - start_time)


if __name__ == '__main__':
    spider = Movie()
    spider.main()

After crawling, the data is read with pandas, cleaned (e.g., normalizing the release year), and visualized using pyecharts bar charts for year distribution, rating distribution, comment count top 10, and director ranking, as well as a word cloud for genres.
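A minimal sketch of two of these cleaning steps, written with the standard library only so it runs without the crawled file. The helper names normalize_year and count_genres and the sample values are assumptions for illustration; the original article's cleaning code is not reproduced in this summary.

```python
import re
from collections import Counter

YEAR_RE = re.compile(r"(\d{4})")


def normalize_year(raw):
    """Return the first four-digit year found in `raw` as an int, or None if there is none."""
    match = YEAR_RE.search(str(raw))
    return int(match.group(1)) if match else None


def count_genres(genre_cells):
    """Count individual genres from cells that hold space-separated genres."""
    counter = Counter()
    for cell in genre_cells:
        counter.update(str(cell).split())
    return counter


# Sample cells (assumed values): a clean year, a year with a region suffix, and a bad cell
years = [normalize_year(v) for v in ["1994", "1961(中国大陆)", "n/a"]]

genre_counts = count_genres(["犯罪 剧情", "剧情 爱情", "剧情"])
top_genres = genre_counts.most_common(2)
```

With the Excel file loaded into a pandas DataFrame, the same function could be applied with something like df['上映年份'].map(normalize_year), and the (genre, count) pairs from most_common are the kind of input a bar chart or word cloud would be built from.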

Additional snippets illustrate how to delete dictionary entries in Python using pop, del, and clear.
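The three deletion approaches can be sketched as follows. The dictionary contents are made-up sample values, since the original snippets are not included in this summary.

```python
movie = {"title": "Sample Movie", "year": 1994, "rating": 9.7, "quote": None}

# pop removes a key and returns its value
removed = movie.pop("quote")

# pop with a default does not raise KeyError when the key is missing
missing = movie.pop("votes", 0)

# del removes a key without returning anything; it raises KeyError if the key is absent
del movie["rating"]
remaining_keys = sorted(movie)

# clear empties the dictionary in place
movie.clear()
```

pop is the usual choice when the removed value is still needed, del when it is not, and clear when the same dictionary object should be reused empty.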
