How to Scrape Book Data from epubit.com with Python: A Step‑by‑Step Guide
This tutorial walks you through analyzing a JavaScript‑driven website, discovering the hidden API, configuring request headers with Postman, and writing a Python scraper that extracts book titles, authors, codes, and prices from epubit.com.
1) Exploratory Research
We start by creating a new Python file and sending a GET request to https://www.epubit.com/books using requests. The response contains only the page skeleton, with no book data, because the site separates its front end from its back end: the HTML is an empty shell, and the book list is loaded afterwards via JavaScript.
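The check described above can be sketched in a few lines. This is a minimal sketch, assuming the page is fetched with requests; the helper names fetch_html and mentions_any are illustrative and not part of the original article, and the title used in the commented usage is just one that appears later in this tutorial.

```python
def fetch_html(url="https://www.epubit.com/books", timeout=10):
    """Download the raw HTML of the listing page (no JavaScript is executed)."""
    # Third-party dependency; imported here so the helper below runs without it.
    import requests

    res = requests.get(url, timeout=timeout)
    res.raise_for_status()
    return res.text


def mentions_any(html, needles):
    """Return True if any of the given strings occurs in the HTML text."""
    return any(needle in html for needle in needles)


# Hypothetical usage: pick a title that is visible in the browser and
# check whether it also appears in the raw HTML.
# html = fetch_html()
# print(mentions_any(html, ["C++ Primer Plus"]))
```

If a title that is clearly visible in the browser is missing from the raw HTML, the data is arriving through a later request.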
There are two ways to obtain the data: (1) find the follow-up AJAX request, work out its URL and parameters, and replicate it; or (2) drive a real browser with an automation tool such as Selenium. This tutorial takes the first approach.
2) Analyze Subsequent Requests
Open Chrome DevTools, go to the Network tab, filter by XHR, and locate the request that returns the book list (here, getUbookList). The request URL looks like:
https://www.epubit.com/pubcloud/content/front/portal/getUbookList?page=1&row=20&startPrice=&endPrice=&tagId=
Parameters: page is the page number and row is the number of items per page; the price filters (startPrice, endPrice) and tagId are left empty.
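The query string above can also be assembled from a dictionary instead of being typed by hand. A minimal sketch using only the standard library; the function name build_list_url and its keyword arguments are illustrative, while the parameter names come from the URL above.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.epubit.com/pubcloud/content/front/portal/getUbookList"


def build_list_url(page=1, row=20, start_price="", end_price="", tag_id=""):
    """Build the book-list URL; empty strings leave a filter unset."""
    params = {
        "page": page,
        "row": row,
        "startPrice": start_price,
        "endPrice": end_price,
        "tagId": tag_id,
    }
    return f"{BASE_URL}?{urlencode(params)}"


print(build_list_url(page=3))
```

Keeping the parameters in a dictionary makes it easy to change the page number in a loop without string surgery.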
3) Test with Postman
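Opening the URL directly, without the headers that the site's own pages send, returns an error instead of the book list, apparently because the server does not recognize the request as coming from the site's front end. Adding the proper request headers in Postman, especially Origin-Domain: www.epubit.com, makes the request succeed. Other headers such as User-Agent and Cookie may also be required. The error response looks like this (the msg text translates roughly to "The system is temporarily unavailable, please try again later"):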
{
    "code": "-7",
    "data": null,
    "msg": "系统临时开小差,请稍后再试~",
    "success": false
}
4) Write the Scraper
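Before writing the full scraper, the Postman experiment from the previous step can be reproduced in Python by checking the success flag of the payload. This is a minimal sketch, assuming the server answers a rejected request with the JSON error shown above; is_success and probe are hypothetical helper names.

```python
import json


def is_success(json_text):
    """Return True when the API payload reports success."""
    try:
        payload = json.loads(json_text)
    except json.JSONDecodeError:
        # Not JSON at all (for example an HTML error page).
        return False
    return isinstance(payload, dict) and payload.get("success") is True


def probe(url, headers=None, timeout=10):
    """Send one GET request and report whether the payload signals success."""
    # Third-party dependency; imported here so is_success runs without it.
    import requests

    res = requests.get(url, headers=headers, timeout=timeout)
    return is_success(res.text)


# Hypothetical usage, comparing a bare request with one that sets the header:
# url = "https://www.epubit.com/pubcloud/content/front/portal/getUbookList?page=1&row=20&startPrice=&endPrice=&tagId="
# print(probe(url))
# print(probe(url, headers={"Origin-Domain": "www.epubit.com"}))
```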
With the correct URL and headers, the Python code becomes simple:
import requests

def get_page(page=1):
    '''Fetch the JSON text of the specified page'''
    url = f'https://www.epubit.com/pubcloud/content/front/portal/getUbookList?page={page}&row=20&startPrice=&endPrice=&tagId='
    # The server rejects the request without this header
    headers = {'Origin-Domain': 'www.epubit.com'}
    res = requests.get(url, headers=headers, timeout=10)
    # Return the body so later steps can parse it
    return res.text

print(get_page(5))
5) Analyze JSON Data
The response is a JSON object with the fields code, data, msg, and success (the msg value 成功 means "success"). The data object contains the paging fields current, pages, size, and total, plus a records array in which each element represents one book with fields such as authors, code, name, and price.
{
    "code": "0",
    "data": {
        "current": 1,
        "pages": 144,
        "records": [
            {
                "authors": "[美] Stephen Prata",
                "code": "UB7209840d845c9",
                "name": "C++ Primer Plus 第6版 中文版",
                "price": 100.30
            },
            ...
        ],
        "size": 20,
        "total": 2871
    },
    "msg": "成功",
    "success": true
}
6) Complete the Program
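The pages field in the response above means the number of pages to fetch does not have to be hard-coded. A minimal sketch, assuming the first response has the shape shown above; page_numbers and its max_pages cap are hypothetical names introduced for illustration.

```python
import json


def page_numbers(json_text, max_pages=None):
    """Return the range of page numbers reported by the API's data.pages field."""
    payload = json.loads(json_text)
    total_pages = int(payload["data"]["pages"])
    if max_pages is not None:
        # Optional cap, useful while testing so only a few pages are fetched.
        total_pages = min(total_pages, max_pages)
    return range(1, total_pages + 1)


sample = (
    '{"code": "0", "data": {"current": 1, "pages": 144, "records": [], '
    '"size": 20, "total": 2871}, "msg": "ok", "success": true}'
)
print(len(page_numbers(sample)))                # 144
print(list(page_numbers(sample, max_pages=3)))  # [1, 2, 3]
```

The numbers in the sample are consistent with each other: 2871 records at 20 per page need 144 pages, since 143 pages hold only 2860.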
Define a Book class to store name, code, author, and price, and a parse_book function that loads the JSON, iterates over records, creates Book objects, and returns a list. Loop over the required pages, print each book, and pause between requests to avoid overloading the server.
import json
import time

class Book:
    '''Holds the fields we keep for each book'''
    def __init__(self, name, code, author, price):
        self.name = name
        self.code = code
        self.author = author
        self.price = price

    def __str__(self):
        return f'Title: {self.name}, Author: {self.author}, Price: {self.price}, Code: {self.code}'

def parse_book(json_text):
    '''Parse the JSON string and return a list of Book objects'''
    books = []
    book_json = json.loads(json_text)
    records = book_json['data']['records']
    for r in records:
        author = r['authors']
        name = r['name']
        code = r['code']
        price = r['price']
        books.append(Book(name, code, author, price))
    return books

all_books = []
for i in range(1, 10):  # pages 1 through 9
    print(f'====== Fetching page {i} ======')
    # get_page(i) must return the JSON text (res.text), not just print it
    json_text = get_page(i)
    books = parse_book(json_text)
    for b in books:
        print(b)
    all_books.extend(books)
    print('Page done, sleeping for 5 seconds...')
    # Pause between requests to avoid overloading the server
    time.sleep(5)
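Once all_books has been collected, the results can be written out for later analysis. The original article stops at printing, so the following is only a sketch of one possible next step, using the standard csv module; write_books_csv is a hypothetical helper, and BookRow is a stand-in for the Book class above so that the snippet runs on its own.

```python
import csv
import io
from collections import namedtuple

# Stand-in for the Book class above; any object with these attributes works.
BookRow = namedtuple("BookRow", ["name", "code", "author", "price"])


def write_books_csv(books, fileobj):
    """Write one header row, then one row per book, to an open text file."""
    writer = csv.writer(fileobj)
    writer.writerow(["name", "code", "author", "price"])
    for b in books:
        writer.writerow([b.name, b.code, b.author, b.price])


# Demo with an in-memory buffer and the sample record from the JSON above.
buffer = io.StringIO()
sample_books = [
    BookRow("C++ Primer Plus 第6版 中文版", "UB7209840d845c9", "[美] Stephen Prata", 100.30)
]
write_books_csv(sample_books, buffer)
print(buffer.getvalue())
```

To write a real file, pass a handle opened with open('books.csv', 'w', newline='', encoding='utf-8') in place of the buffer, and all_books in place of sample_books.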
Python Crawling & Data Mining
