How to Scrape Douban TV Shows with Python: From Requests to CSV
This tutorial walks through building a Python web scraper that fetches Douban TV show titles, ratings, detail links, and images, parses JSON responses, handles anti‑scraping measures, and stores the results in a CSV file with step‑by‑step code examples.
Project Background
Douban Movie lists current film and TV information, reviews, showtimes, and ticketing. Users can mark titles as want to watch, watching, or watched, and can rate and review them.
Project Goal
For each show, collect the title, rating, and detail link, download the poster image, and save the records to a CSV file.
Libraries and Websites
URL template:
https://movie.douban.com/j/search_subjects?type=tv&tag=美剧&sort=recommend&page_limit=20&page_start={}
Libraries used: requests, fake_useragent, json, csv.
IDE: PyCharm.
Project Analysis
1. How to request multiple pages?
Increase the page_start parameter by 20 for each page and iterate with a for loop.
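A minimal sketch of that loop, assuming four pages. The helper name build_page_urls is hypothetical; the template is the one given under Libraries and Websites.

```python
# Sketch only: build_page_urls is a hypothetical helper, not part of the original script.
URL_TEMPLATE = (
    "https://movie.douban.com/j/search_subjects"
    "?type=tv&tag=美剧&sort=recommend&page_limit=20&page_start={}"
)


def build_page_urls(pages, page_size=20):
    """Return one URL per page; page_start advances by page_size each time."""
    return [URL_TEMPLATE.format(page * page_size) for page in range(pages)]


if __name__ == "__main__":
    for url in build_page_urls(4):
        print(url)
```

Keeping the page size as a parameter means the step stays in sync with page_limit if that value ever changes.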
2. How to get the real request URL?
Douban loads data via JavaScript; use browser dev tools (Network tab) to find the request URL and preview the JSON.
In the JSON, title is the movie name and rate is the rating.
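As a rough illustration of reading those two fields, the sketch below parses a hand-written sample. The sample values are made up, extract_shows is a hypothetical helper, and the top-level subjects key is the one the script indexes when it parses the response.

```python
import json

# Hand-written sample shaped like the response described above;
# the values are made up for illustration.
sample = (
    '{"subjects": ['
    '{"title": "Show A", "rate": "9.1"}, '
    '{"title": "Show B", "rate": "8.4"}'
    ']}'
)


def extract_shows(raw_json):
    """Return (title, rate) pairs from the 'subjects' list."""
    subjects = json.loads(raw_json)["subjects"]
    return [(item["title"], item["rate"]) for item in subjects]


shows = extract_shows(sample)
for title, rate in shows:
    print(title, rate)
```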
3. How to construct page URLs?
Example URLs:
https://movie.douban.com/j/search_subjects?type=tv&tag=美剧&sort=recommend&page_limit=20&page_start=0
https://movie.douban.com/j/search_subjects?type=tv&tag=美剧&sort=recommend&page_limit=20&page_start=20
https://movie.douban.com/j/search_subjects?type=tv&tag=美剧&sort=recommend&page_limit=20&page_start=40
https://movie.douban.com/j/search_subjects?type=tv&tag=美剧&sort=recommend&page_limit=20&page_start=60
Implementation
Class definition
import csv
import json

import requests
from fake_useragent import UserAgent


class Doban(object):
    def __init__(self):
        # URL template; page_start is filled in for each page
        self.url = "https://movie.douban.com/j/search_subjects?type=tv&tag=美剧&sort=recommend&page_limit=20&page_start={}"

    def main(self):
        pass


if __name__ == '__main__':
    spider = Doban()
    spider.main()
Random UserAgent
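The idea behind rotating the User-Agent can be sketched without third-party packages. The pool below is a hypothetical stand-in for what ua.random from fake_useragent supplies; the strings are illustrative examples, not a maintained list.

```python
import random

# Hypothetical, hand-picked pool; fake_useragent would supply a much larger one.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/17.0 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0",
]


def random_headers():
    """Build a headers dict with a randomly chosen User-Agent."""
    return {"User-Agent": random.choice(USER_AGENTS)}


headers = random_headers()
print(headers["User-Agent"])
```

Calling random_headers() once per request gives each page a different chance of presenting a different browser identity.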
ua = UserAgent()
for i in range(1, 50):
    # Pick a fresh random User-Agent for each page request
    self.headers = {
        'User-Agent': ua.random,
    }
Send request and get page
def get_page(self, url):
    # Fetch the URL and return the response body decoded as UTF-8
    res = requests.get(url=url, headers=self.headers)
    html = res.content.decode("utf-8")
    return html
Parse JSON
data = json.loads(html)['subjects']
# print(data[0])
Iterate and extract fields
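# etree comes from lxml: from lxml import etree
for item in data:
    name = item['title']       # show title
    goblin_herf = item['url']  # detail link; the 'url' key is assumed from the JSON preview
    print(name, goblin_herf)
    html2 = self.get_page(goblin_herf)  # second request: fetch the detail page
    parse_html2 = etree.HTML(html2)
    r = parse_html2.xpath('//div[@class="entry"]/p/text()')
Write to CSV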
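A minimal sketch of the CSV step, assuming UTF-8 output and English column names rather than the script's own; append_rows is a hypothetical helper. It writes the header only when the file is new or empty, so repeated runs in append mode do not duplicate it.

```python
import csv
import os
import tempfile


def append_rows(path, rows, header=("title", "rate", "url")):
    """Append rows to a CSV file, writing the header only if the file is new or empty."""
    is_new = not os.path.exists(path) or os.path.getsize(path) == 0
    # newline="" lets the csv module control line endings (avoids blank rows on Windows)
    with open(path, "a", encoding="utf-8", newline="") as f:
        writer = csv.writer(f)
        if is_new:
            writer.writerow(header)
        writer.writerows(rows)


# Demo in a temporary directory; the rows are made-up sample data.
demo_path = os.path.join(tempfile.mkdtemp(), "shows.csv")
append_rows(demo_path, [("Show A", "9.1", "https://example.com/a")])
append_rows(demo_path, [("Show B", "8.4", "https://example.com/b")])

with open(demo_path, encoding="utf-8", newline="") as f:
    print(list(csv.reader(f)))
```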
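# Open the CSV file in append mode; newline='' prevents blank rows on Windows
with open('scr.csv', 'a', encoding='gbk', newline='') as csv_file:
    csv_writer = csv.writer(csv_file)
    # Write header (movie, rating, detail page); do this once, not per show
    csv_writer.writerow(['电影', '评分', '详情页'])
    # Write data
    csv_writer.writerow([id, rate, urll])
Download images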
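Saving an image comes down to writing the response bytes to a file. The sketch below uses a hypothetical save_image helper and assumes the bytes have already been downloaded; it creates the target folder first and replaces characters that are not valid in file names on common systems.

```python
import os
import re
import tempfile


def save_image(folder, title, data):
    """Write image bytes to <folder>/<title>.jpg and return the path."""
    os.makedirs(folder, exist_ok=True)  # open() fails if the folder is missing
    # Replace characters that are invalid in file names on common systems
    safe_title = re.sub(r'[\\/:*?"<>|]', "_", title)
    path = os.path.join(folder, safe_title + ".jpg")
    with open(path, "wb") as f:
        f.write(data)
    return path


# Demo with placeholder bytes instead of a real download.
demo_folder = os.path.join(tempfile.mkdtemp(), "images")
saved = save_image(demo_folder, "Show A: Part 1/2", b"\xff\xd8\xff")
print(saved)
```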
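# Fetch the raw image bytes from urll
html2 = requests.get(url=urll, headers=self.headers).content
os.makedirs("./图", exist_ok=True)  # requires import os; the folder name means "images"
dirname = "./图/" + id + ".jpg"
with open(dirname, 'wb') as f:
    f.write(html2)
print("%s 【下载成功!!!!】" % id)  # message means "downloaded successfully"
Run workflow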
html = self.get_page(url)
self.parse_page(html)
Optimization
Add a delay between requests with time.sleep(1.4) (this requires import time), and use a variable u to track the page number.
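One way to combine the delay and the page counter is sketched below. The crawl function and its fetch argument are hypothetical; fetch stands in for whatever requests a page, so the loop can be tried without touching the network.

```python
import time


def crawl(fetch, pages, delay=1.4, page_size=20):
    """Call fetch(page_start) for each page, pausing between requests."""
    results = []
    u = 0  # current page number
    while u < pages:
        results.append(fetch(u * page_size))
        u += 1
        if u < pages:
            time.sleep(delay)  # pause before the next request
    return results


# Demo with a stub fetcher and no delay so it runs instantly.
print(crawl(lambda start: "page_start=%d" % start, pages=3, delay=0))
```

Skipping the sleep after the last page avoids a pointless wait at the end of the run.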
Results
Running the script prints progress, writes the CSV file, and downloads the poster images.
Conclusion
Do not scrape excessive data to avoid overloading the server. This tutorial covered the main challenges of parsing JSON, handling dynamic content, and avoiding anti‑scraping measures. It also demonstrated basic CSV handling, string formatting, and image downloading with Python.
Python Crawling & Data Mining