
Scraping Douban Top 250 Movies with Python and Analyzing Yearly Distribution

This tutorial demonstrates how to use Python's requests and BeautifulSoup libraries to scrape the titles and release years of the 250 movies listed on Douban, clean the extracted data, output it for Excel, and then create a pivot table and chart to visualize the yearly distribution of top films.

Python Programming Learning Circle

The article begins by explaining that the downloaded HTML has to be parsed before any data can be extracted, and uses the BeautifulSoup module for this. The module is imported with from bs4 import BeautifulSoup, and a soup object is created from the response text via soup = BeautifulSoup(html.text, 'html.parser').
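The parsing step can be sketched as follows. This is a minimal illustration rather than the article's full script: html_text is a hypothetical stand-in for html.text (the body of the HTTP response), so the example runs without a network request.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for html.text, i.e. the body of the downloaded page.
html_text = (
    "<html><head><title>Douban Top 250</title></head>"
    "<body><div class='info'></div></body></html>"
)

# 'html.parser' is the parser bundled with Python, so no extra parser
# package is needed.
soup = BeautifulSoup(html_text, "html.parser")

# The soup object can now be navigated like a tree.
page_title = soup.title.string
print(page_title)
```

In a real run the string would presumably come from something like requests.get(url).text. Douban may also reject requests that lack a browser-like User-Agent header, which is worth checking if the response comes back empty.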

It then explains how to iterate over the 25 movie entries on a Douban page using for item in soup.find_all('div', 'info'):, where the second argument matches the CSS class of each entry's div. Inside the loop, each movie's title is extracted with title = item.div.a.span.string, and the line containing the year is located with yearline = item.find('div', 'bd').p.contents[2].string.
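The loop can be tried on a small fragment. The SAMPLE_HTML below is a simplified, hypothetical imitation of one Douban entry (the real page has more attributes and Chinese text). It is shaped only so that the article's navigation expressions resolve the same way.

```python
from bs4 import BeautifulSoup

# Simplified, hypothetical imitation of a single entry on the page.
SAMPLE_HTML = """
<div class="info">
  <div class="hd">
    <a href="#"><span class="title">The Shawshank Redemption</span></a>
  </div>
  <div class="bd">
    <p class="">
      Director: Frank Darabont<br>
      1994&nbsp;/&nbsp;USA&nbsp;/&nbsp;Crime Drama
    </p>
  </div>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

rows = []
# A string passed as the second argument to find_all is matched
# against the tag's CSS class.
for item in soup.find_all("div", "info"):
    # First <div> inside the entry, then its <a>, then the first <span>.
    title = item.div.a.span.string
    # The <p> holds: text, <br>, text. Index 2 is the text after <br>,
    # which starts with the release year.
    yearline = item.find("div", "bd").p.contents[2].string
    rows.append((title, yearline))

print(rows)
```

Indexing contents[2] depends on the exact layout of the paragraph, so an entry with a different structure would need a more defensive lookup.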

The year line still contains indentation and line breaks from the page source, so it is cleaned by removing spaces and newline characters with yearline = yearline.replace(' ', '') and yearline = yearline.replace('\n', ''). The first four characters are then sliced off to obtain the release year: year = yearline[0:4]. The result is printed with a tab separator: print(title, '\t', year).
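A small sketch of the cleaning step is shown here. The helper name extract_year and the raw sample string are assumptions made for illustration; the replace and slice calls are the ones described above.

```python
def extract_year(yearline: str) -> str:
    """Strip spaces and newlines, then keep the first four characters."""
    yearline = yearline.replace(" ", "")
    yearline = yearline.replace("\n", "")
    return yearline[0:4]


# Hypothetical raw text as it might come out of the page: leading
# indentation, non-breaking spaces (\xa0) around the slashes, and a
# trailing newline.
raw = "\n                1994\xa0/\xa0USA\xa0/\xa0Crime Drama\n            "

year = extract_year(raw)
print("The Shawshank Redemption", "\t", year)
```

Note that print(title, '\t', year) puts a space on either side of the tab, because print separates its arguments with a space by default. Calling print(title, year, sep='\t') would emit only the tab, which may paste more cleanly into a spreadsheet.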

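To collect all 250 movies, the script repeats the request for successive pages by adjusting the URL parameter start, which is the offset of the first movie on the page (e.g., https://movie.douban.com/top250?start=25 for the second page). This is done inside a loop, for n in range(0,10):, with start beginning at 0 and start += 25 applied after each request, so the ten pages cover offsets 0 through 225.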
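The paging logic can be sketched without touching the network by generating only the URLs. The function name page_urls and its parameters are hypothetical; the base URL and the start offsets follow the description above.

```python
BASE_URL = "https://movie.douban.com/top250"


def page_urls(pages=10, page_size=25):
    """Build the list of page URLs by stepping the start offset."""
    urls = []
    start = 0
    for n in range(0, pages):
        urls.append(f"{BASE_URL}?start={start}")
        # Move to the first movie of the next page.
        start += page_size
    return urls


urls = page_urls()
for url in urls:
    print(url)
```

In the full script, each of these URLs would be requested and parsed in turn, with the per-entry extraction running once per page.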

After copying the printed output into Excel, the guide shows how to insert a pivot table, drag the year field to rows and values, set the aggregation to count, and then generate a column chart to visualize the number of top movies released each year.
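The article does the counting in Excel. For readers who prefer to stay in Python, the same count-per-year aggregation can be approximated with collections.Counter. This is an alternative sketch and not the article's method, and the rows below are made-up placeholder data, not scraped results.

```python
from collections import Counter

# Placeholder (title, year) pairs standing in for the scraped rows.
rows = [
    ("Movie A", "1994"),
    ("Movie B", "1994"),
    ("Movie C", "2010"),
]

# Equivalent of a pivot table with year as the row field and
# "count" as the aggregation.
counts = Counter(year for _, year in rows)

# Sort by year so the output reads like the pivot table rows.
table = sorted(counts.items())
for year, n in table:
    print(year, n)
```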

Finally, the article interprets the chart, noting that the early 1990s saw a rise in high‑quality films, the mid‑1990s maintained a high level, 2010 had the most top movies, and recent years show a slight decline with 2017 being a low point.

Excel, Web Scraping, data-analysis, BeautifulSoup, Pivot Table, movies
Written by

Python Programming Learning Circle

A global community of Chinese Python developers offering technical articles, columns, original video tutorials, and problem sets. Topics include web full‑stack development, web scraping, data analysis, natural language processing, image processing, machine learning, automated testing, DevOps automation, and big data.
