
How to Scrape Douban Book Data and Analyze It with Python

This tutorial shows how to collect book metadata (publisher, publication date, ISBN, price, rating, and number of ratings) from Douban for a list of titles stored in Excel. It uses requests to fetch pages, lxml XPath to parse them, pandas to merge and analyze the results, and matplotlib to visualize them.

Python Crawling & Data Mining

Mar 24, 2019


The author needed to obtain detailed information (publisher, publication date, ISBN, price, rating, number of ratings) for a large number of books listed in an Excel file by scraping Douban book pages.

Requirement Source

The task grew out of organizing a personal reading list. The book attributes had to be extracted in batch, and the author found no existing article that covered this.

Scraping Process

Initially the search URL

https://book.douban.com/subject_search?search_text={0}&cat=1001

was tried, but the returned HTML contained no data. The author then discovered that the suggestion endpoint returns a JSON payload with useful fields such as title, url, and pic. The first entry of this JSON is used for further processing.

Example request for a book name:

https://book.douban.com/j/subject_suggest?q={book_name}
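
The lookup step can be sketched with the standard library alone, so it can be tried offline. The helper names `suggest_url` and `first_suggestion` and the sample payload are illustrative assumptions; only the `title`, `url`, `pic`, and `id` fields come from the article.

```python
import json
from urllib.parse import urlencode

SUGGEST_ENDPOINT = 'https://book.douban.com/j/subject_suggest'


def suggest_url(book_name):
    """Build the suggestion URL, percent-encoding the (often Chinese) title."""
    return SUGGEST_ENDPOINT + '?' + urlencode({'q': book_name})


def first_suggestion(payload_text):
    """Return the first suggestion as a dict, or None if there is no match."""
    try:
        entries = json.loads(payload_text)
    except json.JSONDecodeError:
        return None  # e.g. an HTML error page came back instead of JSON
    if isinstance(entries, list) and entries:
        return entries[0]
    return None


# Hypothetical payload, shaped like the fields described above
sample = ('[{"title": "活着", "url": "https://book.douban.com/subject/0000000/", '
          '"pic": "", "id": "0000000"}]')
hit = first_suggestion(sample)
print(suggest_url('活着'))
print(hit['url'] if hit else 'no match')
```

Returning None for an empty or non-JSON response lets the caller skip a title instead of failing on `rj[0]`.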

Basic Code

import json
import requests
import pandas as pd
from lxml import etree

# Read the Excel file containing the list of book titles
bsdf = pd.read_excel('booklistfortest.xlsx')
blst = list(bsdf['书名'])  # list of titles

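# Collapse runs of whitespace (including newlines and non-breaking spaces)
def cleanText(s):
    return ' '.join((s or '').split())

# Parse the "info" block of a book page into a {label: value} dictionary.
# binfo is the result of xpath('//*[@id="info"]'); cc holds the block's bare
# text nodes and is accepted so that existing calls keep working, but each
# value is read from the text that follows its label (the element's tail).
def getBookInfo(binfo, cc=None):
    rss = {}
    if not binfo:
        return rss
    k, v = '', ''  # label and value of the row being read
    f = 0          # 1 while the value is expected in a following <a>
    for m in binfo[0]:
        if m.tag == 'span':
            mlst = list(m)
            if len(mlst) == 0:
                # Flat row: <span>label:</span> value<br/>
                k = cleanText(m.text).rstrip(':：')
                v = cleanText(m.tail)
                f = 0 if v else 1
            else:
                # Nested row: <span><span>label</span>: <a>value</a></span><br/>
                names = []
                for n in mlst:
                    if n.tag == 'span':
                        k = cleanText(n.text).rstrip(':：')
                    elif n.tag == 'a':
                        names.append(cleanText(n.text))
                v = ' / '.join(names)
                f = 0
        elif m.tag == 'a':
            if f == 1:
                v = cleanText(m.text)
                f = 0
        elif m.tag == 'br':
            # A <br> closes the row: store the pair and start over
            if k:
                rss[k] = v
            k, v = '', ''
        else:
            print('unexpected tag:', m.tag)
    return rss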
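
To see why the parser has to look at both the elements and the loose text around them, it may help to run lxml on a small fragment. The markup below is a hypothetical imitation of the info block (labels in `span` elements, values as text or links after them), not a copy of a real Douban page.

```python
from lxml import etree

# Hypothetical fragment imitating the layout the parser expects
SAMPLE = '''
<div id="info">
  <span><span class="pl"> 作者</span>: <a href="#">某作者</a></span><br/>
  <span class="pl">出版社:</span> 某出版社<br/>
  <span class="pl">丛书:</span>&nbsp;<a href="#">某丛书</a><br/>
</div>
'''

info = etree.HTML(SAMPLE).xpath('//*[@id="info"]')[0]

# text() returns every bare text node, including whitespace-only ones
raw_texts = info.xpath('./text()')

# .tail is the text that directly follows an element's closing tag
tails = {}
for span in info.xpath('./span[@class="pl"]'):
    label = span.text.strip().rstrip(':')
    tails[label] = (span.tail or '').strip()

print(raw_texts)
print(tails)
```

In this fragment the publisher's value sits in the label's tail, while the series label has an empty tail because its value is inside the following link. Matching labels to values by position in the `text()` list is fragile for the same reason: whitespace-only nodes shift the indices.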

The main loop iterates over the book titles. For each one it fetches the suggestion JSON, takes the detail page URL from the first entry, parses the HTML with lxml.etree, extracts the book name and ID, calls getBookInfo to obtain a dictionary of attributes, and appends the result to a list.

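import time

# A browser-like User-Agent; Douban may reject the default one sent by requests
headers = {'User-Agent': 'Mozilla/5.0'}

rlst = []
for bn in blst:
    res = {'book_name_original': bn}
    # Look the title up through the suggestion endpoint and keep the first hit
    r = requests.get('https://book.douban.com/j/subject_suggest',
                     params={'q': bn}, headers=headers, timeout=10)
    rj = r.json() if r.ok else []
    if not rj:
        rlst.append(res)  # no match: keep the title so the row is not lost
        continue
    html = requests.get(rj[0]['url'], headers=headers, timeout=10)
    con = etree.HTML(html.text)
    bname = con.xpath('//*[@id="wrapper"]/h1/span/text()')
    res['book_name'] = bname[0] if bname else ''
    res['dbid'] = rj[0]['id']
    binfo = con.xpath('//*[@id="info"]')
    cc = con.xpath('//*[@id="info"]/text()')
    res.update(getBookInfo(binfo, cc))
    # Rating and number of ratings
    bmark = con.xpath('//*[@id="interest_sectl"]/div/div[2]/strong/text()')
    bmnum = con.xpath('//*[@id="interest_sectl"]/div/div[2]/div/div[2]/span/a/span/text()')
    if bmark and bmark[0].strip() and bmnum:
        res['rating'] = bmark[0].strip()
        res['review_count'] = bmnum[0]
    else:
        res['rating'] = 'No rating'
        res['review_count'] = 'Insufficient reviews'
    rlst.append(res)
    time.sleep(1)  # pause between books to keep the request rate low

outdf = pd.DataFrame(rlst)
outdf.to_excel('out_douban_binfo.xlsx', index=False)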

Basic Data Statistics and Analysis

The scraped data is merged with the original Excel data:

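# outdf stores the title from the Excel file under 'book_name_original'
bdf = bsdf.merge(outdf, left_on='书名', right_on='book_name_original', how='left')
print(f"Total books: {len(bdf)}, authors: {len(set(bdf['作者']))}, publishers: {len(set(bdf['出版社']))}")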
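
A left merge keeps every title from the reading list, so it can also show which titles produced no scraped data. The sketch below uses two small hypothetical frames in place of the real ones; `indicator=True` is a standard pandas merge option.

```python
import pandas as pd

# Hypothetical stand-ins for the reading list and the scraped results
books = pd.DataFrame({'书名': ['Book A', 'Book B', 'Book C']})
scraped = pd.DataFrame({
    'book_name_original': ['Book A', 'Book C'],
    'rating': ['8.5', '9.1'],
})

merged = books.merge(scraped, left_on='书名', right_on='book_name_original',
                     how='left', indicator=True)

# Rows marked 'left_only' are titles for which nothing was scraped
missing = merged.loc[merged['_merge'] == 'left_only', '书名'].tolist()
print(missing)
```

Titles listed in `missing` can then be looked up by hand or retried.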

Author and publisher frequency are displayed with value_counts().head(7), and monthly reading counts are calculated by converting the reading date to %Y-%m and using value_counts(). A line chart of monthly counts is plotted with matplotlib.
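
These counts can be sketched as follows. The sample data is hypothetical, and `read_date` is an assumed name for the reading-date column, which the article does not give.

```python
import pandas as pd

# Hypothetical sample; 'read_date' is an assumed column name
bdf = pd.DataFrame({
    '作者': ['A', 'B', 'A', 'C', 'A'],
    'read_date': ['2018-01-05', '2018-01-20', '2018-02-11',
                  '2018-02-28', '2018-04-02'],
})

# Most frequent authors (the same call works for the publisher column)
top_authors = bdf['作者'].value_counts().head(7)

# Books read per month, in chronological order
months = pd.to_datetime(bdf['read_date']).dt.strftime('%Y-%m')
monthly = months.value_counts().sort_index()


def plot_monthly(counts):
    """Draw the monthly counts as a line chart (needs matplotlib)."""
    import matplotlib.pyplot as plt
    ax = counts.plot(kind='line', marker='o')
    ax.set_xlabel('month')
    ax.set_ylabel('books read')
    plt.show()


print(top_authors)
print(monthly)
```

Note that `value_counts()` only lists months in which at least one book was read, so a month with no reading is absent from the chart rather than shown as zero.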

Pivot tables are used to examine reading patterns across years, and a box‑plot visualizes the distribution of Douban ratings.
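
A minimal sketch of both steps, again on hypothetical data with an assumed `read_date` column. Because unrated books carry the placeholder string 'No rating', the rating column is converted to numbers before plotting.

```python
import pandas as pd

# Hypothetical sample; 'read_date' is an assumed column name
bdf = pd.DataFrame({
    '书名': ['A', 'B', 'C', 'D'],
    'read_date': ['2017-03-01', '2017-03-15', '2018-03-02', '2018-07-09'],
    'rating': ['8.5', 'No rating', '9.0', '7.2'],
})

dates = pd.to_datetime(bdf['read_date'])
bdf['year'] = dates.dt.year
bdf['month'] = dates.dt.month

# Rows are months, columns are years, cells count the books read
by_year = bdf.pivot_table(index='month', columns='year', values='书名',
                          aggfunc='count', fill_value=0)

# 'No rating' becomes NaN and is dropped, leaving numeric ratings only
ratings = pd.to_numeric(bdf['rating'], errors='coerce').dropna()


def plot_ratings(values):
    """Draw a box plot of the ratings (needs matplotlib)."""
    import matplotlib.pyplot as plt
    plt.boxplot(values)
    plt.ylabel('Douban rating')
    plt.show()


print(by_year)
print(ratings.describe())
```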

Data Output

The final DataFrame is exported to an Excel file for further use. The author notes that the approach relies on XPath parsing; the same task could be implemented with BeautifulSoup, and the method of first analyzing the HTML tree and then extracting fields from it applies either way.

The article concludes that the presented approach successfully extracts the required book information from Douban, and although the code is concise and lacks extensive error handling, it serves as a solid foundation for similar scraping tasks.

