How to Scrape Dangdang Bestselling Books with Selenium and Python

This tutorial walks you through installing Selenium and ChromeDriver, configuring the environment, and using Python code to automatically navigate Dangdang's bestseller pages, extract book details with pyquery, and save the results into a CSV file for further analysis.

Python Crawling & Data Mining

Jul 2, 2021

Introduction

In the previous article we crawled a news site; this article demonstrates using Selenium to scrape Dangdang's bestseller list, showing how to automate browser actions and extract book information.

Preparation

Install the Selenium and pyquery libraries (the scraping code uses both), and make sure your Chrome and ChromeDriver versions match.

pip install selenium pyquery

pip install your_driver.whl

ChromeDriver Installation

Check your Chrome version under Help → About Google Chrome, download the ChromeDriver release with the same major version from the official site, and place the executable in a directory that is on the system PATH (e.g., Python's Scripts folder).
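As a rough way to double-check the setup described above, the sketch below looks for chromedriver on the PATH and compares major version numbers. The helper names `major_version` and `versions_match` are hypothetical, and the version strings are made-up examples of what you might paste in from the About page and from running `chromedriver --version`.

```python
import re
import shutil


def major_version(version_text):
    """Return the major version number found in a version string, or None."""
    match = re.search(r'(\d+)\.\d+', version_text)
    return int(match.group(1)) if match else None


def versions_match(chrome_text, driver_text):
    """True when both strings carry the same major version."""
    chrome_major = major_version(chrome_text)
    driver_major = major_version(driver_text)
    return chrome_major is not None and chrome_major == driver_major


if __name__ == '__main__':
    # shutil.which returns the full path if chromedriver is on PATH, else None.
    print('chromedriver found at:', shutil.which('chromedriver'))
    # Hypothetical version strings, pasted in by hand for illustration.
    print(versions_match('Google Chrome 91.0.4472.124',
                         'ChromeDriver 91.0.4472.101'))
```

If `shutil.which` prints `None`, the executable is not on the PATH that Python sees, which is usually the first thing to fix.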

Scraping Process

Use Selenium to open each bestseller page, retrieve the page source, and parse it with pyquery to extract rank, title, image URL, price, comments, and other metadata.

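import csv

from pyquery import PyQuery as pq
from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

# Column order shared by the header row and every data row.
FIELDS = ['rank', 'title', 'image', 'comments', 'recommendation',
          'author', 'date', 'list_price', 'discount', 'ebook_price']

browser = webdriver.Chrome()
wait = WebDriverWait(browser, 10)

def index_page(page, retries=3):
    """Load one bestseller page and scrape it, retrying a few times on timeout."""
    print('Scraping page', page)
    url = 'http://bang.dangdang.com/books/bestsellers/01.00.00.00.00.00-24hours-0-0-1-' + str(page)
    try:
        browser.get(url)
        # Wait until the book list is present before reading the page source.
        wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, '.bang_list li')))
        get_booklist()
    except TimeoutException:
        # Retry a bounded number of times instead of recursing forever.
        if retries > 0:
            index_page(page, retries - 1)
        else:
            print('Giving up on page', page)

def get_booklist():
    """Parse the rendered page source with pyquery and save every book."""
    doc = pq(browser.page_source)
    for item in doc('.bang_list li').items():
        book = {
            'rank': item.find('.list_num').text(),
            'title': item.find('.name').text(),
            'image': item.find('.pic img').attr('src'),
            'comments': item.find('.star a').text(),
            'recommendation': item.find('.tuijian').text(),
            'author': item.find('.publisher_info a').text(),
            'date': item.find('.publisher_info span').text(),
            'list_price': item.find('.price_r').text().replace('¥', ''),
            'discount': item.find('.price_s').text(),
            # '电子书：' is the e-book label as it appears on the page, so it stays in Chinese.
            'ebook_price': item.find('.price_e').text().replace('电子书：', '').replace('¥', '')
        }
        saving_book(book)

# Write the header once per run; mode 'w' starts a fresh file each time.
with open('data.csv', 'w', newline='', encoding='utf-8-sig') as csvfile:
    csv.writer(csvfile).writerow(FIELDS)

def saving_book(book):
    """Append one book as a row, in the same column order as the header."""
    with open('data.csv', 'a', newline='', encoding='utf-8') as csvfile:
        csv.writer(csvfile).writerow([book.get(field) for field in FIELDS])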
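The scraped values are plain strings, so ranks, comment counts, and prices usually need converting before any numeric analysis. The following is a minimal sketch of that cleanup, not part of the original script: `first_number` and `clean_book` are hypothetical helpers, and the sample values only assume the general shape of the text the selectors return.

```python
import re


def first_number(text):
    """Return the first integer or decimal found in text as a float, or None."""
    match = re.search(r'\d+(?:\.\d+)?', text or '')
    return float(match.group()) if match else None


def clean_book(raw, numeric_keys):
    """Convert the listed fields to numbers; every other field passes through."""
    cleaned = dict(raw)
    for key in numeric_keys:
        if key in cleaned:
            cleaned[key] = first_number(cleaned[key])
    return cleaned


# Hypothetical raw values; the key names are chosen for this example only.
sample = {'rank': '1.', 'title': 'Example Book',
          'comments': '12345条评论', 'price': '¥45.00'}
print(clean_book(sample, ('rank', 'comments', 'price')))
```

Passing the key names in as `numeric_keys` keeps the helper independent of whatever column names the scraper uses. Values with thousands separators are not handled by this simple pattern.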

Iterating Pages

Loop over the desired page numbers and call index_page for each. The example uses range(1, 3), which covers pages 1 and 2 because the upper bound is exclusive.

if __name__ == '__main__':
    try:
        for page in range(1, 3):
            index_page(page)
    finally:
        # Close the browser even if a page fails.
        browser.quit()
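One possible variation on the loop above is to build the page URLs up front and pause briefly between pages so the site is not hit in rapid succession. This is only a sketch: `page_urls` and `visit_pages` are hypothetical helpers, the delay value is an arbitrary assumption, and `print` stands in for the real page handler.

```python
import time

# URL prefix taken from the scraping code; the page number is appended to it.
BASE_URL = 'http://bang.dangdang.com/books/bestsellers/01.00.00.00.00.00-24hours-0-0-1-'


def page_urls(first_page, last_page):
    """Build the bestseller URL for every page from first_page to last_page inclusive."""
    return [BASE_URL + str(page) for page in range(first_page, last_page + 1)]


def visit_pages(urls, visit, delay_seconds=1.0):
    """Call visit(url) for each URL, sleeping between consecutive requests."""
    for position, url in enumerate(urls):
        if position > 0:
            time.sleep(delay_seconds)
        visit(url)


# print is a stand-in for a function that loads and scrapes one page.
visit_pages(page_urls(1, 2), print, delay_seconds=0.5)
```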

Result

The script writes each book's details to data.csv, which can be opened in spreadsheet software for further analysis.
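If you would rather stay in Python than open a spreadsheet, the standard csv module can read the file back. The sketch below is an illustration under assumptions: `top_rows` is a hypothetical helper, and the in-memory sample with `title` and `price` columns stands in for data.csv, whose real column names depend on the header the script wrote.

```python
import csv
import io


def top_rows(csv_file, column, limit=3):
    """Return the rows with the largest numeric values in the given column."""
    usable = []
    for row in csv.DictReader(csv_file):
        try:
            usable.append((float(row[column]), row))
        except (KeyError, TypeError, ValueError):
            continue  # Skip rows with a missing or non-numeric value.
    usable.sort(key=lambda pair: pair[0], reverse=True)
    return [row for _, row in usable[:limit]]


# Hypothetical sample data; one row has an empty price and is skipped.
sample = io.StringIO('title,price\nBook A,45.00\nBook B,\nBook C,68.50\n')
for row in top_rows(sample, 'price', limit=2):
    print(row['title'], row['price'])
```

To run it on the real output, pass a file opened with `open('data.csv', newline='', encoding=...)`, using the same encoding the file was written with.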
