How to Scrape Douban Book Data and Analyze It with Python
This tutorial shows how to collect book metadata (publisher, publication date, ISBN, price, rating, and review count) from Douban for a list of titles stored in Excel. It uses requests to fetch the pages, lxml XPath to parse them, pandas to merge and analyze the results, and matplotlib to visualize them.
The author needed to obtain detailed information (publisher, publication date, ISBN, price, rating, number of ratings) for a large number of books listed in an Excel file by scraping Douban book pages.
Requirement Source
The task grew out of organizing a personal reading list. It required extracting book attributes in batch, and the author could not find an existing article that covered this.
Scraping Process
The search URL
https://book.douban.com/subject_search?search_text={0}&cat=1001
was tried first, but the returned HTML contained no data. The author then discovered that the suggestion endpoint returns a JSON payload with useful fields such as title, url, and pic. The first entry of this JSON is used for further processing.
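As a rough illustration of "use the first entry of the JSON", the sketch below builds the suggestion URL and picks the first item out of a payload. The helper names (build_suggest_url, first_suggestion) and the sample payload are hypothetical and only mirror the fields the article mentions (id, title, url, pic). No network request is made here.

```python
import json
from urllib.parse import quote

# Endpoint pattern taken from the article; {0} is the book title.
SUGGEST_URL = 'https://book.douban.com/j/subject_suggest?q={0}'


def build_suggest_url(book_name):
    """Return the suggestion URL with the title percent-encoded."""
    return SUGGEST_URL.format(quote(book_name))


def first_suggestion(payload_text):
    """Return the first suggestion as a dict, or None if there is none.

    Assumes the payload is a JSON list of objects, as described in the article.
    """
    try:
        items = json.loads(payload_text)
    except json.JSONDecodeError:
        return None
    if not isinstance(items, list) or not items:
        return None
    first = items[0]
    if not isinstance(first, dict):
        return None
    # Keep only the fields the article relies on; missing keys become None.
    return {key: first.get(key) for key in ('id', 'title', 'url', 'pic')}


# Hand-written sample payload (not real Douban output).
sample_payload = (
    '[{"id": "1234567", "title": "Example Book", '
    '"url": "https://book.douban.com/subject/1234567/", '
    '"pic": "https://example.invalid/cover.jpg"}]'
)
first = first_suggestion(sample_payload)
```

Returning None for an empty or malformed payload lets the caller skip a title instead of failing on rj[0] when Douban has no suggestion for it.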
Example request for a book name:
https://book.douban.com/j/subject_suggest?q={book_name}
Basic Code
import json
import requests
import pandas as pd
from lxml import etree

# Read the Excel file containing the list of book titles
bsdf = pd.read_excel('booklistfortest.xlsx')
blst = list(bsdf['书名'])  # list of titles (the 书名 column holds the book title)

# Parse the "info" block of a book page.
# binfo: result of xpath('//*[@id="info"]'); cc: the bare text nodes of that block.
def getBookInfo(binfo, cc):
    i = 0     # index of the next text node to consume
    rss = {}  # collected key/value pairs
    k = ''    # current key (label text)
    v = ''    # current value
    f = 0     # set to 1 when the value is expected in a following <a> tag
    # Keep every text node, in document order
    # (both branches of the original test append the node).
    clw = list(cc)
    for m in binfo[0]:
        if m.tag == 'span':
            mlst = list(m)  # child elements of this span
            if len(mlst) == 0:
                k = m.text.replace(':', '')
                if k in clw[i]:
                    f = 1
                else:
                    v = clw[i].replace('\n', '').replace(' ', '')
                i += 1
            elif len(mlst) > 0:
                # Nested layout: an inner <span> holds the label, an <a> holds the value
                for n in mlst:
                    if n.tag == 'span':
                        k = n.text.replace('\n', '').replace(' ', '')
                    elif n.tag == 'a':
                        v = n.text.replace('\n', '').replace(' ', '')
        elif m.tag == 'a':
            if f == 1:
                v = m.text.replace('\n', '').replace(' ', '')
                f = 0
        elif m.tag == 'br':
            # A <br> closes the current key/value pair
            if k == '':
                print(i, 'err')
            else:
                rss[k] = v
        else:
            print(m.tag, i)
    return rss

The main loop iterates over the book titles. For each one it fetches the suggestion JSON, takes the detail page URL from the first entry, parses the HTML with lxml.etree, extracts the book name and ID, calls getBookInfo to obtain a dictionary of attributes, and appends the result to a list.
rlst = []
for bn in blst:
    res = {'book_name_original': bn}
    r = requests.get(f'https://book.douban.com/j/subject_suggest?q={bn}')
    rj = json.loads(r.text)
    if not rj:
        # No suggestion for this title: keep a row with only the original name
        rlst.append(res)
        continue
    html = requests.get(rj[0]['url'])
    con = etree.HTML(html.text)
    bname = con.xpath('//*[@id="wrapper"]/h1/span/text()')[0]
    res['book_name'] = bname
    res['dbid'] = rj[0]['id']
    binfo = con.xpath('//*[@id="info"]')
    cc = con.xpath('//*[@id="info"]/text()')
    res.update(getBookInfo(binfo, cc))
    # Rating and number of reviews
    bmark = con.xpath('//*[@id="interest_sectl"]/div/div[2]/strong/text()')
    if bmark:
        res['rating'] = bmark[0]
        bmnum = con.xpath('//*[@id="interest_sectl"]/div/div[2]/div/div[2]/span/a/span/text()')[0]
        res['review_count'] = bmnum
    else:
        res['rating'] = 'No rating'
        res['review_count'] = 'Insufficient reviews'
    rlst.append(res)

outdf = pd.DataFrame(rlst)
outdf.to_excel('out_douban_binfo.xlsx', index=False)
Basic Data Statistics and Analysis
The scraped data is merged with the original Excel data:
# outdf stores the original title under 'book_name_original', so join on that column
bdf = bsdf.merge(outdf, left_on='书名', right_on='book_name_original', how='left')
print(f"Total books: {len(bdf)}, authors: {len(set(bdf['作者']))}, publishers: {len(set(bdf['出版社']))}")
Author and publisher frequencies are displayed with value_counts().head(7). Monthly reading counts are calculated by converting the reading date to %Y-%m and calling value_counts(), and a line chart of the monthly counts is plotted with matplotlib.
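The frequency and monthly-count steps can be sketched on a small made-up DataFrame. The column names author and read_date and the data itself are assumptions for illustration; the article's real sheet uses its own columns.

```python
import pandas as pd

# Hypothetical reading log; the real data comes from the merged DataFrame.
df = pd.DataFrame({
    'author': ['A', 'B', 'A', 'C', 'A'],
    'read_date': ['2021-01-05', '2021-01-20', '2021-02-11',
                  '2021-02-15', '2021-02-28'],
})

# Most frequent authors (at most 7 rows), ordered by count.
top_authors = df['author'].value_counts().head(7)

# Convert the reading date to a 'YYYY-MM' label and count books per month.
df['read_month'] = pd.to_datetime(df['read_date']).dt.strftime('%Y-%m')
# sort_index() puts the months in chronological order for a line chart.
monthly = df['read_month'].value_counts().sort_index()

print(top_authors)
print(monthly)
```

Sorting by index matters because value_counts() orders by frequency by default, which would scramble the time axis of a line chart. Calling monthly.plot() would then draw the chart with matplotlib.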
Pivot tables are used to examine reading patterns across years, and a box‑plot visualizes the distribution of Douban ratings.
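A minimal sketch of the pivot step, again on invented data with assumed column names (title, read_date, rating). It also converts the rating column to numbers first, since the scraping loop stores the string 'No rating' for unrated books and a box plot needs numeric values.

```python
import pandas as pd

# Hypothetical merged data; column names are assumptions for illustration.
df = pd.DataFrame({
    'title': ['t1', 't2', 't3', 't4', 't5'],
    'read_date': pd.to_datetime(['2020-03-01', '2020-03-15', '2020-07-09',
                                 '2021-03-02', '2021-11-30']),
    'rating': ['8.5', '7.9', 'No rating', '9.1', '8.0'],
})

df['year'] = df['read_date'].dt.year
df['month'] = df['read_date'].dt.month

# Books read per month (rows) and year (columns); empty cells become 0.
counts = df.pivot_table(index='month', columns='year', values='title',
                        aggfunc='count', fill_value=0)

# Ratings were scraped as strings; non-numeric entries become NaN.
df['rating_num'] = pd.to_numeric(df['rating'], errors='coerce')
rating_median = df['rating_num'].median()

print(counts)
print(df['rating_num'].describe())
```

With the numeric column in place, df['rating_num'].dropna().plot(kind='box') would be one way to draw the distribution.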
Data Output
The final DataFrame is exported to an Excel file for further use. The author notes that the approach relies on XPath parsing. The same task could be implemented with BeautifulSoup, and the underlying method of first analyzing the HTML tree applies either way.
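To make the idea of reading key/value pairs off the HTML tree concrete, here is a simplified sketch that is not the author's getBookInfo. It assumes each label sits in a span with class "pl" and that the value is either the text right after the label or the next a element. The HTML fragment is hand-written for illustration and the function name parse_info is hypothetical.

```python
from lxml import etree

# Hand-written fragment; real Douban markup may differ.
SAMPLE_HTML = '''
<div id="info">
  <span><span class="pl"> 作者</span>:
    <a href="#">Example Author</a></span><br/>
  <span class="pl">出版社:</span> <a href="#">Example Press</a><br/>
  <span class="pl">出版年:</span> 2012-1<br/>
  <span class="pl">ISBN:</span> 9780000000000<br/>
</div>
'''


def parse_info(html_text):
    """Return {label: value} for every span.pl inside the #info block."""
    root = etree.HTML(html_text)
    blocks = root.xpath('//*[@id="info"]')
    result = {}
    if not blocks:
        return result
    for label in blocks[0].xpath('.//span[@class="pl"]'):
        key = (label.text or '').strip().rstrip(':').strip()
        # First try the text that directly follows the label element.
        value = (label.tail or '').strip().lstrip(':').strip()
        if not value:
            # Otherwise fall back to the next sibling if it is a link.
            nxt = label.getnext()
            if nxt is not None and nxt.tag == 'a':
                value = (nxt.text or '').strip()
        if key:
            result[key] = value
    return result


info = parse_info(SAMPLE_HTML)
print(info)
```

This version keeps only the first link when a field has several (for example multiple authors), which is a limitation of the sketch rather than of the approach.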
The article concludes that the presented approach successfully extracts the required book information from Douban, and although the code is concise and lacks extensive error handling, it serves as a solid foundation for similar scraping tasks.
