How to Scrape Dangdang Bestselling Books with Selenium and Python
This tutorial walks you through installing Selenium and ChromeDriver, configuring the environment, and using Python code to automatically navigate Dangdang's bestseller pages, extract book details with pyquery, and save the results into a CSV file for further analysis.
Introduction
In the previous article we crawled a news site; this article demonstrates using Selenium to scrape Dangdang's bestseller list, showing how to automate browser actions and extract book information.
Preparation
Install the Selenium and pyquery libraries, and make sure your ChromeDriver version matches your installed Chrome version.
pip install selenium
pip install pyquery
pip install your_driver.whl
ChromeDriver Installation
Check your Chrome version under Help → About Google Chrome, download the matching ChromeDriver from the official site, and place the executable in a directory that is on the system PATH (for example, Python's Scripts folder).
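Before running the scraper it can help to confirm both conditions described above. The sketch below uses only the standard library; the helper names find_chromedriver and same_major_version are hypothetical, the version strings are placeholders, and comparing only the major version is an assumption about what counts as a match.

```python
import shutil


def find_chromedriver(name="chromedriver"):
    """Return the full path of the driver executable found on PATH, or None."""
    # shutil.which searches the directories listed in PATH, like the shell does.
    return shutil.which(name)


def same_major_version(chrome_version, driver_version):
    """Compare the first dotted component of two version strings."""
    return chrome_version.split(".")[0] == driver_version.split(".")[0]


if __name__ == "__main__":
    path = find_chromedriver()
    if path is None:
        print("chromedriver was not found on PATH")
    else:
        print("chromedriver found at", path)
    # Placeholder version strings; substitute the ones reported on your machine.
    print(same_major_version("114.0.5735.90", "114.0.5735.198"))
```

If find_chromedriver returns None, the directory holding the executable is probably not on PATH yet.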
Scraping Process
Use Selenium to open each bestseller page, retrieve the page source, and parse it with pyquery to extract rank, title, image URL, price, comments, and other metadata.
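The bestseller pages differ only in the trailing page number of the URL. A minimal sketch of building those URLs, using the URL prefix from the script in this article; the helper names bestseller_url and bestseller_urls, and the check that page numbers start at 1, are assumptions made for illustration.

```python
# URL prefix taken from the article's script; the page number is appended to it.
BASE_URL = "http://bang.dangdang.com/books/bestsellers/01.00.00.00.00.00-24hours-0-0-1-"


def bestseller_url(page):
    """Return the URL of one bestseller page (assumes pages are numbered from 1)."""
    if page < 1:
        raise ValueError("page numbers start at 1")
    return BASE_URL + str(page)


def bestseller_urls(first, last):
    """Return the URLs for pages first..last, inclusive."""
    return [bestseller_url(page) for page in range(first, last + 1)]


if __name__ == "__main__":
    for url in bestseller_urls(1, 2):
        print(url)
```

Keeping the URL construction in one place makes it easier to adjust if the site changes its address scheme.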
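import csv

from pyquery import PyQuery as pq
from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

# CSV columns, in the order they are written
FIELDS = ['rank', 'title', 'image', 'comments', 'recommendation',
          'author', 'date', 'list_price', 'discount', 'ebook_price']

browser = webdriver.Chrome()
wait = WebDriverWait(browser, 10)

def index_page(page, retries=3):
    """Open one bestseller page and scrape it, retrying a few times on timeout."""
    print('Scraping page', page)
    try:
        url = 'http://bang.dangdang.com/books/bestsellers/01.00.00.00.00.00-24hours-0-0-1-' + str(page)
        browser.get(url)
        # Wait until the list items are present before reading the page source
        wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, '.bang_list li')))
        get_booklist()
    except TimeoutException:
        # Retry a limited number of times instead of recursing forever
        if retries > 0:
            index_page(page, retries - 1)

def get_booklist():
    """Parse the current page with pyquery and save every book entry."""
    html = browser.page_source
    doc = pq(html)
    items = doc('.bang_list li').items()
    for item in items:
        book = {
            'rank': item.find('.list_num').text(),
            'title': item.find('.name').text(),
            'image': item.find('.pic img').attr('src'),
            'comments': item.find('.star a').text(),
            'recommendation': item.find('.tuijian').text(),
            'author': item.find('.publisher_info a').text(),
            'date': item.find('.publisher_info span').text(),
            'list_price': item.find('.price_r').text().replace('¥', ''),
            'discount': item.find('.price_s').text(),
            # '电子书:' is the "E-book:" label shown on the page
            'ebook_price': item.find('.price_e').text().replace('电子书:', '').replace('¥', '')
        }
        saving_book(book)

# Write the header row once; 'w' starts a fresh file on every run,
# and utf-8-sig lets spreadsheet software detect the encoding
with open('data.csv', 'w', newline='', encoding='utf-8-sig') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(FIELDS)

def saving_book(book):
    """Append one book to data.csv, in the same column order as the header."""
    with open('data.csv', 'a', newline='', encoding='utf-8') as csvfile:
        writer = csv.writer(csvfile)
        writer.writerow([book.get(field) for field in FIELDS])
Iterating Pages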
Loop over the desired page numbers and call index_page for each one; range(1, 3) in the example covers pages 1 and 2.
if __name__ == '__main__':
    for page in range(1, 3):
        index_page(page)
    # Close the browser once all pages are done
    browser.quit()
Result
The script writes each book's details to data.csv, which can be opened in spreadsheet software for further analysis.
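For analysis in Python rather than a spreadsheet, the file can be read back with the standard csv module. This is a minimal sketch, not part of the original script: load_rows and parse_price are hypothetical helper names, and the utf-8-sig default is an assumption about how the file was encoded, so pass whatever encoding your run actually used.

```python
import csv


def parse_price(text):
    """Convert a price string such as '45.00' or '¥45.00' to a float.

    Returns None when the field is blank or not a number.
    """
    cleaned = text.replace("¥", "").strip()
    try:
        return float(cleaned)
    except ValueError:
        return None


def load_rows(path, encoding="utf-8-sig"):
    """Read a CSV file with a header row into a list of dicts."""
    # utf-8-sig also reads plain UTF-8 files that have no byte order mark.
    with open(path, newline="", encoding=encoding) as csvfile:
        return list(csv.DictReader(csvfile))
```

Each row comes back as a dict keyed by the header row, so a price column can be converted with parse_price before computing averages or sorting.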