
Master BeautifulSoup: Quick Guide to Web Scraping with Python

This article introduces the BeautifulSoup library, explains how to install it, demonstrates core parsing methods such as find, find_all, select, and relationship navigation, and provides a complete example of scraping novel titles from Qidian using Python requests.

Python Crawling & Data Mining

Apr 6, 2021


Introduction

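BeautifulSoup (imported as bs4) is a Python library for parsing HTML and XML. It lets you search a document by tag name, by attribute filters, and by CSS selectors. It does not support XPath; use a library such as lxml if you need XPath queries.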

Installation

Install the package with pip:

pip install beautifulsoup4
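
To confirm the install worked, a quick check along these lines can help. It is only a minimal sketch; the one thing to note is that the import name is bs4 rather than the package name used with pip.

```python
import bs4

# The package installed as "beautifulsoup4" is imported as `bs4`.
version = bs4.__version__
print("Beautiful Soup version:", version)
```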

Basic Usage

Typically you fetch a page with requests, then create a BeautifulSoup object to parse the HTML.
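
As a minimal sketch of that flow, the snippet below parses a small, made-up HTML string instead of a live page, so it runs without network access. In a real script the string would be the text of a requests response (for example rep.text); the sample markup and variable names here are illustrative assumptions.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for `requests.get(url).text`.
html = """<html><head><title>Demo page</title></head>
<body><p class="intro">Hello, <b>world</b>!</p></body></html>"""

# "html.parser" is the parser bundled with Python; no extra install needed.
soup = BeautifulSoup(html, "html.parser")

title_text = soup.title.string   # string inside the <title> tag
intro_text = soup.p.get_text()   # all text nested inside the first <p>

print(title_text)
print(intro_text)
```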

1. Access an element directly by tag name, for example soup.title or soup.div. This returns the first tag with that name.

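2. Use the find and find_all methods. find returns the first match (or None if nothing matches); find_all returns a list of all matches. Both accept the same filters: find(name, attrs, recursive, string, **kwargs). Older releases called the string argument text. Because class is a reserved word in Python, filter on a CSS class with class_="value" or attrs={"class": "value"}.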

3. Use select with CSS selectors:

soup.select('div')  # all <div> elements

soup.select('#aa')  # elements with id="aa"

soup.select('.oo')  # elements with class="oo"

soup.select('div p')  # all <p> elements anywhere inside a <div>

soup.select('div > p')  # <p> elements that are direct children of a <div>

soup.select('input[name]')  # all <input> elements that have a name attribute

soup.select('input[type="button"]')  # all <input> elements with type="button"

Examples:

soup.select('a')[0].get_text()  # get text of first <a>

soup.select('a')[0].attrs['href']  # get href of first <a>
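
The three approaches above can be compared side by side on a small document. This is a sketch on made-up markup (the ids, classes and hrefs are illustrative, reusing the aa and oo names from the selector list), not output from a real site.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup for illustration only.
html = """
<div id="aa" class="oo">
  <p>first</p>
  <a href="/one">One</a>
  <span><p>nested</p></span>
</div>
<a href="/two" class="oo">Two</a>
"""
soup = BeautifulSoup(html, "html.parser")

# 1. Tag-name access returns the first matching tag.
first_link = soup.a

# 2. find / find_all with name and attribute filters.
all_links = soup.find_all("a")
oo_link = soup.find("a", class_="oo")              # class_ avoids the keyword
same_link = soup.find("a", attrs={"class": "oo"})  # equivalent attrs form
missing = soup.find("table")                       # None when nothing matches

# 3. CSS selectors: descendant vs. direct-child combinators.
descendants = soup.select("div p")    # <p> at any depth inside a <div>
children = soup.select("div > p")     # only <p> directly under a <div>

print(first_link["href"], len(all_links), oo_link.get_text())
print([p.get_text() for p in descendants], [p.get_text() for p in children])
```

Note that find returns None on a miss, so it is worth checking the result before chaining further calls on it.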

Relationship Navigation

find_parents()  # list of all ancestor nodes

find_parent()   # direct parent node

print(soup.title.find_parent())

print(soup.title.find_parent().find_all('link')[1])

print(soup.title.find_parents())

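find_next_sibling() / find_next_siblings()  # next sibling / all following siblings

find_previous_sibling() / find_previous_siblings()  # previous sibling / all preceding siblings

find_next() / find_all_next()  # first match / all matches later in the document

find_previous() / find_all_previous()  # first match / all matches earlier in the document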
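
To see how these methods relate to each other, here is a small sketch on a made-up list. The markup is written without whitespace between tags so that siblings are tags rather than text nodes; the id "mid" and the variable names are assumptions for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup; no whitespace between tags on purpose.
html = ("<html><head><title>Demo</title></head><body><ul>"
        "<li>a</li><li id='mid'>b</li><li>c</li></ul></body></html>")
soup = BeautifulSoup(html, "html.parser")

mid = soup.find("li", id="mid")

parent = mid.find_parent("ul")                    # nearest <ul> ancestor
ancestors = [t.name for t in mid.find_parents()]  # nearest ancestor first
next_li = mid.find_next_sibling("li")             # sibling after `mid`
prev_li = mid.find_previous_sibling("li")         # sibling before `mid`
later = mid.find_all_next("li")                   # <li> tags later in the document
title = mid.find_previous("title")                # walks backwards through the document

print(parent.name, ancestors)
print(prev_li.get_text(), next_li.get_text(), [t.get_text() for t in later])
print(title.get_text())
```

The sibling methods only look at nodes sharing the same parent, while find_next and find_previous walk the whole document in parse order.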

Object Types

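A parsed document is made up of four kinds of objects: Tag, NavigableString, BeautifulSoup, and Comment. The snippet below prints examples of the first three from a live page.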

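import requests
from bs4 import BeautifulSoup

rep = requests.get('https://book.qidian.com/info/1014243481#Catalog', timeout=3)
soup = BeautifulSoup(rep.text, 'html.parser')
print(type(soup), soup.name)  # BeautifulSoup object; its name is '[document]'
tr = soup.div  # first <div> in the page
print(type(tr), tr)  # Tag object
print(tr.get_attribute_list('class'))  # class values as a list
print(tr.a.string)  # NavigableString, or None if the <a> does not wrap a single string
soup.a.string.replace_with('fdf')  # replace the link text in place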
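
The snippet above does not show a Comment. A self-contained sketch on made-up markup can cover all four types without a network request; the sample HTML and variable names are assumptions for illustration.

```python
from bs4 import BeautifulSoup, Comment, NavigableString, Tag

# Hypothetical sample markup containing a text node and an HTML comment.
html = "<p class='note'>visible text<!-- hidden remark --></p>"
soup = BeautifulSoup(html, "html.parser")

p = soup.p
text_node, comment_node = p.contents  # the two children of <p>

print(type(soup).__name__, soup.name)    # the document object itself
print(type(p).__name__, p["class"])      # class is multi-valued, so a list
print(type(text_node).__name__, repr(str(text_node)))
print(type(comment_node).__name__, repr(str(comment_node)))

# Comment is a subclass of NavigableString, so test for Comment first
# when you need to tell ordinary text and comments apart.
is_comment_a_string = isinstance(comment_node, NavigableString)
```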

Example: Scrape Qidian Novel List

Fetch the first page of Qidian's novel list and extract titles and links.

import requests
from bs4 import BeautifulSoup

rep = requests.get('https://www.qidian.com/all', timeout=3)
soup = BeautifulSoup(rep.text, 'html.parser')

# A string passed as the second argument filters on the class attribute.
ul = soup.find_all('ul', 'all-img-list cf')
for y in ul:
    for z in y.find_all('div', 'book-mid-info'):
        for x in z.find_all('h4'):
            for v in x.find_all('a'):
                # The code assumes protocol-relative hrefs (//host/path), so it adds the scheme.
                print(v.get_text(), 'https:' + v.attrs['href'])

The script prints each novel title and its full URL.
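
The nested loops can also be written as a single CSS selector. The sketch below is one possible variant, not the original author's method: extract_books is a hypothetical helper, the sample markup only imitates the class names used above, and the example.com links are placeholders. It uses urljoin from the standard library to resolve protocol-relative links instead of concatenating 'https:' by hand.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def extract_books(html, base_url="https://www.qidian.com/all"):
    """Return (title, absolute_url) pairs from a list page (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    # One descendant selector replaces the four nested find_all loops.
    links = soup.select("ul.all-img-list div.book-mid-info h4 a")
    return [
        (a.get_text(strip=True), urljoin(base_url, a["href"]))
        for a in links
        if a.has_attr("href")  # skip anchors without a link target
    ]


# Hypothetical sample markup that imitates the structure assumed above.
sample = """
<ul class="all-img-list cf">
  <li><div class="book-mid-info">
    <h4><a href="//book.example.com/info/1">Book One</a></h4>
  </div></li>
  <li><div class="book-mid-info">
    <h4><a href="//book.example.com/info/2">Book Two</a></h4>
  </div></li>
</ul>
"""

books = extract_books(sample)
for title, url in books:
    print(title, url)
```

Against the live site you would pass rep.text to the helper. The selector depends on the page keeping those class names, so it may need adjusting if the markup changes.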

Conclusion

This guide covered installing BeautifulSoup, its basic parsing methods, navigation techniques, object types, and a practical scraping example, showing how the library can greatly accelerate web data extraction tasks.

