How to Scrape Douban Movie Reviews in 12 Lines of Python
Learn to build a Python web scraper quickly with requests and XPath, extracting short reviews of the movie 'Black Panther' from Douban. The walkthrough covers setup, HTTP request analysis, data parsing, storage with pandas, and best practices such as polite crawling intervals, all demonstrated with a concise 12-line script.
Many students and analysts turn to web scraping when data acquisition becomes difficult; this article guides you through a simple 12‑line Python crawler that extracts short comments for the movie "Black Panther" from Douban.
Scraping Goal
The demo uses requests + XPath to fetch a portion of Douban movie short reviews. The code is presented first.
import requests
from lxml import etree
import pandas as pd
import time
import random
from tqdm import tqdm

name, score, comment = [], [], []

def danye_crawl(page):
    # Each page holds 20 comments, so the start offset is page * 20.
    url = 'https://movie.douban.com/subject/6390825/comments?start=%s&limit=20&sort=new_score&status=P&percent_type=' % (page*20)
    r = requests.get(url)  # fetch once and reuse the response
    response = etree.HTML(r.content.decode('utf-8'))
    print('\n', 'Page %s comments scraped successfully' % (page)) if r.status_code == 200 else print('\n', 'Page %s scrape failed' % (page))
    for i in range(1,21):
        name.append(response.xpath('//*[@id="comments"]/div[%s]/div[2]/h3/span[2]/a' % (i))[0].text)
        # The class value is expected to look like "allstar50 rating"; character 7 is the star count.
        score.append(response.xpath('//*[@id="comments"]/div[%s]/div[2]/h3/span[2]/span[2]' % (i))[0].attrib['class'][7])
        comment.append(response.xpath('//*[@id="comments"]/div[%s]/div[2]/p' % (i))[0].text)

for i in tqdm(range(11)):
    danye_crawl(i)
    time.sleep(random.uniform(6,9))  # random polite delay between pages
res = pd.DataFrame({'name':name, 'score':score, 'comment':comment}, columns=['name','score','comment'])
res.to_csv('豆瓣.csv')

The script successfully retrieves the data, as shown by the matching screenshots in the original post.
Tool Preparation
Chrome browser (for HTTP request analysis and packet capture)
Python 3 and modules: requests, lxml, pandas, time, random, tqdm
requests: simple HTTP requests
lxml: fast and powerful HTML parsing
pandas: data handling powerhouse
time: set crawl intervals to avoid being blocked
random: generate random delays
tqdm: display progress bar
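As a rough sketch of how time and random combine for polite crawling, the loop below pauses for a random interval between pages. The function name crawl_politely and its fetch and sleep parameters are hypothetical, introduced only so the delay logic can be tried without real network calls. The 6 to 9 second range is the one this demo uses.

```python
import random
import time


def crawl_politely(pages, fetch, min_delay=6.0, max_delay=9.0, sleep=time.sleep):
    """Call fetch(page) for each page, pausing a random interval in between.

    `fetch` and `sleep` are injected (hypothetical parameters) so the loop
    can be exercised with stand-ins instead of real requests and real waits.
    """
    pages = list(pages)
    results = []
    for index, page in enumerate(pages):
        results.append(fetch(page))
        # No need to wait after the final page.
        if index < len(pages) - 1:
            sleep(random.uniform(min_delay, max_delay))
    return results
```

In the demo, fetch would be danye_crawl and pages would be range(11). Wrapping pages in tqdm(...) adds the progress bar.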
Basic Steps
Analyze network requests
Parse webpage content
Read and store data
Key Concepts Covered
Robots.txt (crawler protocol)
HTTP request analysis
requests library usage
XPath syntax
Basic Python syntax
Pandas data processing
Crawler Protocol (robots.txt)
The robots.txt file under a site's root tells crawlers what can be fetched. The Crawl‑delay directive suggests a polite interval; this demo uses a random 6‑9 second delay.
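A minimal sketch of checking robots.txt rules with Python's standard urllib.robotparser module follows. The rules in SAMPLE_ROBOTS and the example.com URLs are made up for illustration and are not a copy of Douban's real file. The helper name load_rules is also hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""


def load_rules(robots_text):
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


rules = load_rules(SAMPLE_ROBOTS)
allowed = rules.can_fetch("*", "https://example.com/subject/1/comments")
blocked = rules.can_fetch("*", "https://example.com/private/data")
delay = rules.crawl_delay("*")  # None when no Crawl-delay is declared
print(allowed, blocked, delay)
```

Against a live site, the parser can instead be pointed at the real file with set_url(...) followed by read(). When crawl_delay returns None, falling back to a conservative interval such as the 6 to 9 seconds used here is a reasonable choice.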
HTTP Request Analysis
Using Chrome's Network panel on the Douban "Black Panther" short‑review page, the target URL was identified as:
https://movie.douban.com/subject/6390825/comments?start=0&limit=20&sort=new_score&status=P&percent_type=
Each subsequent page increments the start parameter by 20. After page 11, login is required, so the demo only crawls the first 11 pages.
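The paging rule described above can be sketched with the standard library as follows. The helper name build_page_url is hypothetical. The query parameters are taken from the URL identified in the Network panel.

```python
from urllib.parse import urlencode

BASE_URL = "https://movie.douban.com/subject/6390825/comments"


def build_page_url(page, page_size=20):
    """Build the comments URL for a zero-based page number."""
    params = {
        "start": page * page_size,  # page 0 -> 0, page 1 -> 20, ...
        "limit": page_size,
        "sort": "new_score",
        "status": "P",
        "percent_type": "",
    }
    return f"{BASE_URL}?{urlencode(params)}"


# The first 11 pages, which the demo can reach without logging in.
urls = [build_page_url(page) for page in range(11)]
print(urls[0])
print(urls[-1])
```

Building the query with urlencode rather than string formatting keeps each parameter visible and easy to change.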
Using requests
A GET request fetches the page; response.content.decode('utf-8') converts the raw bytes to text. The script then checks the status code (200 means success) and prints whether the page was fetched.
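One possible way to make the status check guard the parsing step is sketched below. The function name fetch_html, the session parameter, and the 10-second timeout are assumptions for illustration, not part of the original script.

```python
import requests


def fetch_html(url, session=None, timeout=10):
    """Return the decoded page text, or None when the status is not 200.

    `session` is a hypothetical injection point: pass any object with a
    requests-style get() method, or leave it as None to use requests.
    """
    if session is None:
        session = requests.Session()
    response = session.get(url, timeout=timeout)
    if response.status_code != 200:
        print(f"Request failed with status {response.status_code}: {url}")
        return None
    # content is raw bytes; decode it explicitly as UTF-8.
    return response.content.decode("utf-8")
```

Returning None on failure lets the caller skip parsing instead of running XPath queries against an error page.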
XPath Parsing
XPath provides a fast way to extract usernames, scores, and comments from the HTML. In Chrome DevTools, right-click an element in the Elements panel and choose Copy > Copy XPath to get its expression directly.
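The sketch below applies the same XPath paths as the script to a small HTML sample. The sample markup is hypothetical: it is shaped to fit those paths and is not copied from Douban. The function name parse_comments and the regular expression are likewise illustrative. The sketch also assumes that a comment with no rating has some other element in the score position, and records None in that case.

```python
import re

from lxml import etree

# Hypothetical markup shaped to match the article's XPath expressions.
SAMPLE_HTML = """
<div id="comments">
  <div class="comment-item">
    <div class="avatar"></div>
    <div class="comment">
      <h3>
        <span class="comment-vote"></span>
        <span class="comment-info">
          <a href="#">alice</a>
          <span>Watched</span>
          <span class="allstar40 rating"></span>
        </span>
      </h3>
      <p>Great visuals.</p>
    </div>
  </div>
  <div class="comment-item">
    <div class="avatar"></div>
    <div class="comment">
      <h3>
        <span class="comment-vote"></span>
        <span class="comment-info">
          <a href="#">bob</a>
          <span>Watched</span>
          <span class="comment-time">2018-03-09</span>
        </span>
      </h3>
      <p>Decent, a bit long.</p>
    </div>
  </div>
</div>
"""


def parse_comments(html_text):
    """Extract name, score, and comment from each block under #comments."""
    tree = etree.HTML(html_text)
    rows = []
    for item in tree.xpath('//*[@id="comments"]/div'):
        # Relative paths (starting with ./) search inside this block only.
        names = item.xpath('./div[2]/h3/span[2]/a/text()')
        classes = item.xpath('./div[2]/h3/span[2]/span[2]/@class')
        texts = item.xpath('./div[2]/p/text()')
        if not names:
            continue  # not a comment block
        match = re.search(r'allstar(\d)', classes[0]) if classes else None
        rows.append({
            'name': names[0].strip(),
            'score': match.group(1) if match else None,
            'comment': texts[0].strip() if texts else '',
        })
    return rows


for row in parse_comments(SAMPLE_HTML):
    print(row)
```

Looping over the matched blocks avoids hard-coding 20 items per page, and checking each result list before indexing avoids an IndexError when a page is short or an element is missing.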
Data Processing
Extracted data is stored in lists, converted to a dictionary, then to a pandas DataFrame, and finally saved as a CSV file.
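The list-to-CSV step can be sketched as below. The function name save_reviews, the numeric conversion of scores, and the index=False and utf-8-sig options are assumptions added for illustration. The original script calls to_csv with its defaults.

```python
import pandas as pd


def save_reviews(names, scores, comments, path):
    """Combine the three lists into a DataFrame and write it to a CSV file."""
    frame = pd.DataFrame(
        {"name": names, "score": scores, "comment": comments},
        columns=["name", "score", "comment"],
    )
    # Scores arrive as single-character strings; convert them to numbers
    # and turn anything unparseable (or missing) into NaN.
    frame["score"] = pd.to_numeric(frame["score"], errors="coerce")
    # index=False drops the row-number column; utf-8-sig writes a BOM,
    # which helps some spreadsheet programs display Chinese text correctly.
    frame.to_csv(path, index=False, encoding="utf-8-sig")
    return frame
```

With the script's lists, the call would be save_reviews(name, score, comment, '豆瓣.csv').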
Conclusion and Extras
This demo shows how requests + XPath can quickly scrape Douban movie short reviews, providing a solid data foundation for text analysis or other mining tasks. Future articles will explore advanced topics such as custom headers, cookies, login simulation, and distributed crawling.
Full script, expanded for readability:

import requests
from lxml import etree
import pandas as pd
import time
import random
from tqdm import tqdm

name, score, comment = [], [], []

def danye_crawl(page):
    # Each page holds 20 comments, so the start offset is page * 20.
    url = 'https://movie.douban.com/subject/6390825/comments?start=%s&limit=20&sort=new_score&status=P&percent_type=' % (page*20)
    r = requests.get(url)  # fetch once and reuse the response
    response = etree.HTML(r.content.decode('utf-8'))
    if r.status_code == 200:
        print('\n', 'Page %s comments scraped successfully' % (page))
    else:
        print('\n', 'Page %s scrape failed' % (page))
    for i in range(1,21):
        # Locate the username, rating, and comment nodes of the i-th comment.
        name_list = response.xpath('//*[@id="comments"]/div[%s]/div[2]/h3/span[2]/a' % (i))
        score_list = response.xpath('//*[@id="comments"]/div[%s]/div[2]/h3/span[2]/span[2]' % (i))
        comment_list = response.xpath('//*[@id="comments"]/div[%s]/div[2]/p' % (i))
        name_element = name_list[0].text
        # The class value is expected to look like "allstar50 rating"; character 7 is the star count.
        score_element = score_list[0].attrib['class'][7]
        comment_element = comment_list[0].text
        name.append(name_element)
        score.append(score_element)
        comment.append(comment_element)

for i in tqdm(range(11)):
    danye_crawl(i)
    time.sleep(random.uniform(6,9))  # random polite delay between pages
res = {'name':name, 'score':score, 'comment':comment}
res = pd.DataFrame(res, columns=['name','score','comment'])
res.to_csv('豆瓣.csv')
MaGe Linux Operations
