Master BeautifulSoup: Quick Guide to Web Scraping with Python

This article introduces the BeautifulSoup library, explains how to install it, demonstrates core parsing methods such as find, find_all, select, and relationship navigation, and provides a complete example of scraping novel titles from Qidian using Python requests.

Python Crawling & Data Mining

Introduction

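BeautifulSoup (bs4) is a Python library for parsing HTML and XML. It locates elements by tag name, by attributes, and by CSS selectors. It does not support XPath; for XPath queries, use a library such as lxml.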

Installation

Install the package with pip. The package is named beautifulsoup4 and is imported as bs4:

pip install beautifulsoup4

Basic Usage

Typically you fetch a page with requests, then create a BeautifulSoup object to parse the HTML.
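As a minimal sketch of that workflow, the snippet below builds a BeautifulSoup object from an HTML string. The markup is a made-up stand-in for a response body, so the example runs without network access; in a real script the string would come from something like requests.get(url, timeout=3).text.

```python
from bs4 import BeautifulSoup

# Hypothetical page source standing in for a downloaded response body
html = """
<html>
  <head><title>Demo page</title></head>
  <body><p>Hello, <b>world</b>!</p></body>
</html>
"""

# 'html.parser' ships with Python, so no extra parser has to be installed
soup = BeautifulSoup(html, "html.parser")

page_title = soup.title.string   # text of the <title> tag
first_para = soup.p.get_text()   # text of the first <p>, nested tags included

print(page_title)
print(first_para)
```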

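1. Retrieve elements directly by tag name. For example, soup.title returns the first <title> tag in the document.

2. Use the find and find_all methods. find returns the first match (or None if nothing matches), and find_all returns every match. The signature is find(name, attrs, recursive, string, **kwargs); older releases call the string argument text. Because class is a Python keyword, filter by CSS class with class_="value" or attrs={"class": "value"}.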

3. Use select with CSS selectors:

soup.select('div')                   # all <div> elements
soup.select('#aa')                   # elements with id="aa"
soup.select('.oo')                   # elements with class="oo"
soup.select('div p')                 # all <p> elements inside a <div>
soup.select('div > p')               # <p> elements that are direct children of a <div>
soup.select('input[name]')           # all <input> elements that have a name attribute
soup.select('input[type="button"]')  # all <input> elements with type="button"

Examples:

soup.select('a')[0].get_text()  # get text of first <a>
soup.select('a')[0].attrs['href']  # get href of first <a>
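The three approaches above can be compared side by side on one small document. The HTML here is invented for illustration; it reuses the id and class values from the selector list so each call has something to match.

```python
from bs4 import BeautifulSoup

# Made-up markup for illustration only
html = """
<div id="aa" class="oo">
  <p>first</p>
  <section><p>nested</p></section>
  <a href="/one">One</a>
  <input name="q" type="text">
  <input type="button" value="Go">
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# 1. Tag-name access returns the first matching tag
first_p = soup.p.get_text()

# 2. find returns one Tag (or None); find_all returns every match
by_class = soup.find("div", class_="oo")
by_attrs = soup.find("div", attrs={"class": "oo"})
all_p = [p.get_text() for p in soup.find_all("p")]

# 3. select takes a CSS selector and always returns a list
descendants = [p.get_text() for p in soup.select("div p")]  # any depth
children = [p.get_text() for p in soup.select("div > p")]   # direct children only
named_inputs = soup.select("input[name]")
buttons = soup.select('input[type="button"]')
link = soup.select("a")[0]

print(first_p, all_p, descendants, children)
print(link.get_text(), link.attrs["href"])
```

Because select always returns a list, indexing it with [0] raises IndexError when nothing matches, whereas find returns None in that case.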

Relationship Navigation

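find_parents()  # list of all ancestor nodes
find_parent()   # direct parent node
find_next_sibling() / find_next_siblings()          # next sibling / all following siblings
find_previous_sibling() / find_previous_siblings()  # previous sibling / all preceding siblings
find_next() / find_all_next()                       # next match / all later matches in document order
find_previous() / find_all_previous()               # previous match / all earlier matches in document order

For example:

print(soup.title.find_parent())                      # parent of <title>, normally <head>
print(soup.title.find_parent().find_all('link')[1])  # second <link> in that parent; IndexError if there are fewer than two
print(soup.title.find_parents())                     # every ancestor of <title>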
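A short sketch of these navigation calls on a made-up document follows. A tag name is passed to the sibling and next/previous methods so that only tags, not the text between them, are returned.

```python
from bs4 import BeautifulSoup

# Invented markup: a <head> with two <link> tags and a three-item list
html = (
    '<html><head><title>T</title>'
    '<link href="a.css"><link href="b.css"></head>'
    '<body><ul><li>one</li><li>two</li><li>three</li></ul></body></html>'
)
soup = BeautifulSoup(html, "html.parser")

# Upward: direct parent and the chain of ancestors
parent_name = soup.title.find_parent().name
ancestor_names = [t.name for t in soup.title.find_parents()]
second_link = soup.title.find_parent().find_all("link")[1]["href"]

# Sideways: siblings share the same parent
items = soup.find_all("li")
first, last = items[0], items[-1]
next_sib = first.find_next_sibling("li").get_text()
later_sibs = [li.get_text() for li in first.find_next_siblings("li")]
prev_sib = last.find_previous_sibling("li").get_text()

# Document order: matches that come after or before this element in the markup
next_li = first.find_next("li").get_text()
all_later = [li.get_text() for li in first.find_all_next("li")]
prev_li = last.find_previous("li").get_text()

print(parent_name, ancestor_names, second_link)
print(next_sib, later_sibs, prev_sib)
```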

Object Types

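BeautifulSoup exposes four main object types: Tag, NavigableString, BeautifulSoup, and Comment.

import requests
from bs4 import BeautifulSoup

rep = requests.get('https://book.qidian.com/info/1014243481#Catalog', timeout=3)
soup = BeautifulSoup(rep.text, 'html.parser')
print(soup.name)  # '[document]', the name of the BeautifulSoup object itself
tr = soup.div
print(type(tr), tr)  # Tag object
print(tr.get_attribute_list('class'))  # the tag's class values as a list
print(tr.a.string)  # NavigableString (None if the <a> does not wrap a single string)
soup.a.string.replace_with('fdf')  # replace the string inside the first <a>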
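The same four types can be inspected offline. The sketch below uses an invented one-line document that contains an HTML comment, so that a Comment object also appears; the markup and variable names are assumptions for illustration.

```python
from bs4 import BeautifulSoup, Comment, NavigableString, Tag

# Made-up markup: a tag with two classes, a link, and an HTML comment
html = '<div class="box main"><a href="/x">link text</a><!-- a note --></div>'
soup = BeautifulSoup(html, "html.parser")

doc_name = soup.name                       # the BeautifulSoup object names itself '[document]'
div = soup.div                             # Tag
classes = div.get_attribute_list("class")  # class values as a list
text_node = div.a.string                   # NavigableString
comment = div.contents[1]                  # Comment, the second child of the <div>

print(type(soup).__name__, type(div).__name__)
print(type(text_node).__name__, type(comment).__name__)

# replace_with swaps the string in place inside the tree
text_node.replace_with("new text")
replaced = div.a.get_text()
print(replaced)
```

Comment is a subclass of NavigableString, so code that walks text nodes may need an isinstance check to skip comments.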

Example: Scrape Qidian Novel List

Fetch the first page of Qidian's novel list and extract titles and links.

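import requests
from bs4 import BeautifulSoup

rep = requests.get('https://www.qidian.com/all', timeout=3)
soup = BeautifulSoup(rep.text, 'html.parser')

# A string passed as the second argument filters by CSS class
ul = soup.find_all('ul', 'all-img-list cf')
for y in ul:
    # Each book entry keeps its title link inside div.book-mid-info > h4 > a
    for z in y.find_all('div', 'book-mid-info'):
        for x in z.find_all('h4'):
            for v in x.find_all('a'):
                # The script assumes href starts with '//', so it prepends the scheme
                print(v.get_text(), 'https:' + v.attrs['href'])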

The script prints each novel title and its full URL.
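The nested loops can also be expressed as a single CSS selector. The sketch below is one possible variant, not the original author's code: extract_books is a hypothetical helper, the sample HTML and its URLs are invented to mirror the class names used above, and urljoin replaces the manual 'https:' prefix. The live page's markup may differ.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Invented sample that mirrors the class names used in the script above
sample_html = """
<ul class="all-img-list cf">
  <li><div class="book-mid-info"><h4><a href="//book.qidian.com/info/1">Book One</a></h4></div></li>
  <li><div class="book-mid-info"><h4><a href="//book.qidian.com/info/2">Book Two</a></h4></div></li>
</ul>
"""


def extract_books(html, base_url="https://www.qidian.com/all"):
    """Return (title, absolute_url) pairs; hypothetical helper for illustration."""
    soup = BeautifulSoup(html, "html.parser")
    links = soup.select("ul.all-img-list div.book-mid-info h4 a")
    # urljoin resolves protocol-relative and relative hrefs against base_url
    return [(a.get_text(strip=True), urljoin(base_url, a["href"])) for a in links]


books = extract_books(sample_html)
for title, url in books:
    print(title, url)
```

In a real run, the string passed to extract_books would be rep.text from the request shown earlier.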

Conclusion

This guide covered installing BeautifulSoup, its basic parsing methods, navigation techniques, and object types, and closed with a practical example that combines them with requests to extract titles and links from a live page.

Written by

Python Crawling & Data Mining
