How to Scrape Book Data from epubit.com with Python: A Step‑by‑Step Guide

This tutorial walks you through analyzing a JavaScript‑driven website, discovering the hidden API, configuring request headers with Postman, and writing a Python scraper that extracts book titles, authors, codes, and prices from epubit.com.

Python Crawling & Data Mining

Jul 19, 2021

1) Initial Exploration

We start by creating a new Python file and sending a GET request to https://www.epubit.com/books with requests. The response contains only the page skeleton, with no book data, because the site separates its front end from its back end and loads the data with JavaScript after the page has rendered.
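The exploratory request described above can be sketched as follows. The helper names fetch_html and has_book_data are hypothetical, and the probe title is only an example; the idea is to download the page and check whether a book title you can see in the browser actually appears in the raw HTML.

```python
import requests


def fetch_html(url, timeout=10):
    """Download a page and return its HTML text."""
    res = requests.get(url, timeout=timeout)
    res.raise_for_status()  # fail loudly on 4xx/5xx responses
    return res.text


def has_book_data(html, probe):
    """Return True if the probe string (e.g. a known book title) is in the HTML."""
    return probe in html


# Usage (performs a real network request, so it is left commented out):
# html = fetch_html('https://www.epubit.com/books')
# print(has_book_data(html, 'C++ Primer Plus'))
```

If the check comes back False for a title that is visible in the browser, the data is most likely being loaded by a later request.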

Two ways to obtain the data are: (1) analyze the subsequent AJAX request URL and parameters and replicate it, or (2) use a browser‑automation tool such as Selenium.

2) Analyze Subsequent Requests

Open Chrome DevTools, go to the Network tab, filter by XHR, and locate the request that returns the book list (here, getUbookList). The request URL looks like:

https://www.epubit.com/pubcloud/content/front/portal/getUbookList?page=1&row=20&startPrice=&endPrice=&tagId=

Parameters: page is the page number and row is the number of items per page. The price filters (startPrice, endPrice) and tagId are left empty.
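As a small illustration of how these parameters combine into the URL, the sketch below rebuilds it with the standard library. The function name build_url and its keyword arguments are hypothetical; only the parameter names and the base URL come from the request observed in DevTools.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

BASE_URL = 'https://www.epubit.com/pubcloud/content/front/portal/getUbookList'


def build_url(page=1, row=20, start_price='', end_price='', tag_id=''):
    """Assemble the book-list URL from its query parameters."""
    query = urlencode({
        'page': page,
        'row': row,
        'startPrice': start_price,
        'endPrice': end_price,
        'tagId': tag_id,
    })
    return f'{BASE_URL}?{query}'


url = build_url(page=1)
print(url)

# Parse the query string back; keep_blank_values keeps the empty filters visible
print(parse_qs(urlsplit(url).query, keep_blank_values=True))
```

Changing page or row in the call is then enough to request a different slice of the catalogue.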

3) Test with Postman

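Opening the URL directly in a browser returns the error response below, because the server rejects requests that do not carry the headers the site's own pages send:

{
    "code": "-7",
    "data": null,
    "msg": "系统临时开小差，请稍后再试~",
    "success": false
}

The msg field reads roughly "The system is temporarily unavailable, please try again later". Replaying the request in Postman with the proper request headers, in particular Origin-Domain: www.epubit.com, makes it succeed. Other headers such as User-Agent and Cookie may also be required.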
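A scraper can recognize the error payload shown in this section by checking the success field before touching data. The sketch below is one way to do that; extract_data is a hypothetical helper, and ERROR_TEXT is a copy of the error response from this section.

```python
import json

# Copy of the error response returned when the expected headers are missing
ERROR_TEXT = '{"code": "-7", "data": null, "msg": "系统临时开小差，请稍后再试~", "success": false}'


def extract_data(json_text):
    """Return the data field, or raise if the API reported a failure."""
    payload = json.loads(json_text)
    if not payload.get('success'):
        raise RuntimeError(f"API error {payload.get('code')}: {payload.get('msg')}")
    return payload['data']


try:
    extract_data(ERROR_TEXT)
except RuntimeError as exc:
    print(exc)
```

Raising early like this keeps a missing header from surfacing later as a confusing TypeError on a null data field.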

4) Write the Scraper

With the correct URL and headers, the Python code becomes simple:

import requests

def get_page(page=1):
    '''Fetch the given page and return the response body as JSON text'''
    url = f'https://www.epubit.com/pubcloud/content/front/portal/getUbookList?page={page}&row=20&startPrice=&endPrice=&tagId='
    # The server rejects requests that lack this header
    headers = {'Origin-Domain': 'www.epubit.com'}
    res = requests.get(url, headers=headers, timeout=10)
    return res.text

print(get_page(5))
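To check what will be sent before contacting the server, the request can be built without sending it. This is a sketch under a few assumptions: build_request is a hypothetical helper, and the query is passed through the params argument of requests instead of being written into the URL by hand.

```python
import requests

API_URL = 'https://www.epubit.com/pubcloud/content/front/portal/getUbookList'


def build_request(page=1, row=20):
    """Build (but do not send) the book-list request so it can be inspected."""
    params = {'page': page, 'row': row, 'startPrice': '', 'endPrice': '', 'tagId': ''}
    headers = {'Origin-Domain': 'www.epubit.com'}
    return requests.Request('GET', API_URL, params=params, headers=headers).prepare()


prepared = build_request(page=5)
print(prepared.url)                       # final URL including the query string
print(prepared.headers['Origin-Domain'])  # header the server checks for
```

The prepared request can later be sent with requests.Session().send(prepared, timeout=10), which makes it easy to compare the outgoing headers with what Postman sends.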

5) Analyze JSON Data

The response is a JSON object with fields code, data, msg, and success. The data object contains current, pages, and a records array where each element represents a book with fields such as authors, code, name, price, etc.

{
    "code": "0",
    "data": {
        "current": 1,
        "pages": 144,
        "records": [
            {
                "authors": "[美] Stephen Prata",
                "code": "UB7209840d845c9",
                "name": "C++ Primer Plus 第6版 中文版",
                "price": 100.30
            },
            ...
        ],
        "size": 20,
        "total": 2871
    },
    "msg": "成功",
    "success": true
}
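The pagination fields in this sample are consistent with pages being total divided by size, rounded up: 2871 / 20 = 143.55, which rounds up to 144. A minimal sketch of that arithmetic follows; page_count and page_numbers are hypothetical helper names, and the relationship is inferred from this one sample, not documented by the site.

```python
import math


def page_count(total, size):
    """Number of pages needed to hold `total` records at `size` per page."""
    return math.ceil(total / size)


def page_numbers(total, size):
    """1-based page numbers covering every record."""
    return range(1, page_count(total, size) + 1)


pages = page_count(2871, 20)
print(pages)  # 144, matching the "pages" field in the sample response
```

Reading pages from the first response is a simple way to decide how many pages a full crawl has to request.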

6) Complete the Program

Define a Book class to store name, code, author, and price, and a parse_book function that loads the JSON, iterates over records, creates Book objects, and returns a list. Loop over the required pages, print each book, and pause between requests to avoid overloading the server.

import json
import time

class Book:
    '''A single book record returned by the API'''
    def __init__(self, name, code, author, price):
        self.name = name
        self.code = code
        self.author = author
        self.price = price

    def __str__(self):
        return f'Title: {self.name}, Author: {self.author}, Price: {self.price}, Code: {self.code}'

def parse_book(json_text):
    '''Parse the JSON string and return a list of Book objects'''
    books = []
    book_json = json.loads(json_text)
    records = book_json['data']['records']
    for r in records:
        author = r['authors']
        name = r['name']
        code = r['code']
        price = r['price']
        books.append(Book(name, code, author, price))
    return books

all_books = []
for i in range(1, 10):  # pages 1 through 9
    print(f'====== Fetching page {i} ======')
    # get_page(i) must return the JSON text (res.text), not just print it
    json_text = get_page(i)
    books = parse_book(json_text)
    for b in books:
        print(b)
    all_books.extend(books)
    print('Page done, sleeping for 5 seconds...')
    time.sleep(5)
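The parsing step can be tried without any network access by feeding it a hand-written payload. The sketch below is an illustration only: it uses a dataclass in place of the hand-written Book class, names the function parse_books to keep it distinct from the article's parse_book, and SAMPLE is a trimmed copy of the response shown in section 5.

```python
import json
from dataclasses import dataclass


@dataclass
class Book:
    name: str
    code: str
    author: str
    price: float


def parse_books(json_text):
    """Turn the API's JSON text into a list of Book objects."""
    records = json.loads(json_text)['data']['records']
    return [Book(r['name'], r['code'], r['authors'], r['price']) for r in records]


# Trimmed copy of the sample response, containing a single record
SAMPLE = '''
{
    "code": "0",
    "data": {
        "current": 1,
        "pages": 144,
        "records": [
            {
                "authors": "[美] Stephen Prata",
                "code": "UB7209840d845c9",
                "name": "C++ Primer Plus 第6版 中文版",
                "price": 100.30
            }
        ],
        "size": 20,
        "total": 2871
    },
    "msg": "成功",
    "success": true
}
'''

books = parse_books(SAMPLE)
print(books[0])
```

Testing the parser against a fixed payload first makes it easier to tell a parsing bug apart from a blocked or failed request.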

