Master BeautifulSoup: Quick Guide to Web Scraping with Python
This article introduces the BeautifulSoup library, explains how to install it, demonstrates core parsing methods such as find, find_all, select, and relationship navigation, and provides a complete example of scraping novel titles from Qidian using Python requests.
Introduction
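BeautifulSoup (bs4) is an HTML and XML parsing library for Python. It lets you search a document by tag name, by attributes, or with CSS selectors through select(). It does not support XPath; use a library such as lxml if you need XPath queries.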
Installation
Install bs via pip or easy_install:
pip install beautifulsoup4
Basic Usage
Typically you fetch a page with requests, then create a BeautifulSoup object to parse the HTML.
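As a minimal sketch of that fetch-then-parse flow: fetch_soup below is a hypothetical helper name, not part of either library, and the inline HTML string is invented so the parsing step can be tried without network access.

```python
import requests
from bs4 import BeautifulSoup


def fetch_soup(url, timeout=3):
    """Download a page and return it parsed (hypothetical helper)."""
    resp = requests.get(url, timeout=timeout)
    resp.raise_for_status()  # fail early on 4xx/5xx responses
    return BeautifulSoup(resp.text, "html.parser")


# The constructor also accepts a plain HTML string, which is handy offline.
sample_html = "<html><head><title>Demo</title></head><body><p>Hello</p></body></html>"
soup = BeautifulSoup(sample_html, "html.parser")

page_title = soup.title.string       # text inside <title>
first_paragraph = soup.p.get_text()  # text of the first <p>
print(page_title, first_paragraph)
```

Calling raise_for_status() is a design choice here, so that an error page is not silently parsed as if it were the content you asked for.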
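1. Access elements directly by tag name, for example soup.title or soup.div. This returns the first tag with that name.
2. Use the find and find_all methods. find returns the first match (or None if nothing matches); find_all returns a list of all matches. Signature: find(name, attrs, recursive, string, **kwargs). The string argument was called text in older releases. Because class is a Python keyword, filter by class with class_="value" or attrs={"class": "value"}.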
3. Use select with CSS selectors:
soup.select('div'): all <div> elements
soup.select('#aa'): elements with id="aa"
soup.select('.oo'): elements with class="oo"
soup.select('div p'): all <p> elements inside a <div>
soup.select('div > p'): <p> elements that are direct children of a <div>
soup.select('input[name]'): all <input> elements that have a name attribute
soup.select('input[type="button"]'): all <input> elements with type="button"
Examples:
soup.select('a')[0].get_text()     # get text of first <a>
soup.select('a')[0].attrs['href']  # get href of first <a>
Relationship Navigation
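To make the navigation methods in this section concrete, here is a small offline sketch. The HTML snippet, ids, and file names are invented for illustration only.

```python
from bs4 import BeautifulSoup

# Invented markup, used only for illustration.
html = (
    "<html><head><title>Demo</title>"
    "<link href='a.css'><link href='b.css'></head>"
    "<body><ul><li id='one'>1</li><li id='two'>2</li><li id='three'>3</li></ul>"
    "</body></html>"
)
soup = BeautifulSoup(html, "html.parser")

title = soup.title
parent_name = title.find_parent().name                   # direct parent
ancestor_names = [t.name for t in title.find_parents()]  # nearest ancestor first
second_link = title.find_parent().find_all("link")[1]["href"]

two = soup.find("li", id="two")
next_id = two.find_next_sibling("li")["id"]      # sibling after <li id="two">
prev_id = two.find_previous_sibling("li")["id"]  # sibling before it
later_ids = [li["id"] for li in soup.find("li").find_next_siblings("li")]

# find_next() walks forward in document order, not only among siblings.
after_title = title.find_next("link")["href"]

print(parent_name, second_link, next_id, prev_id, later_ids, after_title)
```

The sibling methods stay on one level of the tree, while find_next and find_previous follow the order in which elements appear in the document.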
find_parent()   # direct parent node
find_parents()  # list of all ancestor nodes
print(soup.title.find_parent())
print(soup.title.find_parent().find_all('link')[1])
print(soup.title.find_parents())
find_next_sibling() / find_next_siblings()          # following sibling(s) at the same level
find_previous_sibling() / find_previous_siblings()  # preceding sibling(s) at the same level
find_next() / find_all_next()                       # following element(s) in document order
find_previous() / find_all_previous()               # preceding element(s) in document order
Object Types
BeautifulSoup works with four object types: Tag, NavigableString, BeautifulSoup, and Comment.
import requests
from bs4 import BeautifulSoup

rep = requests.get('https://book.qidian.com/info/1014243481#Catalog', timeout=3)
soup = BeautifulSoup(rep.text, 'html.parser')
print(soup.name)                       # BeautifulSoup object; its name is '[document]'
tr = soup.div                          # first <div> in the page
print(type(tr), tr)                    # Tag object
print(tr.get_attribute_list('class'))  # class values as a list
print(tr.a.string)                     # NavigableString
soup.a.string.replace_with('fdf')      # replace the string inside the first <a>
Example: Scrape Qidian Novel List
Fetch the first page of Qidian's novel list and extract titles and links.
import requests
from bs4 import BeautifulSoup

rep = requests.get('https://www.qidian.com/all', timeout=3)
soup = BeautifulSoup(rep.text, 'html.parser')
# A string passed as the second argument filters by class.
ul = soup.find_all('ul', 'all-img-list cf')
for y in ul:
    for z in y.find_all('div', 'book-mid-info'):
        for x in z.find_all('h4'):
            for v in x.find_all('a'):
                print(v.get_text(), 'https:' + v.attrs['href'])
The script prints each novel title and its full URL.
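The nested find_all loops above can also be written as a single CSS selector. The sketch below runs offline on invented markup that only mimics the class names used in the script; the real page structure may differ or change, and the example.com URLs are placeholders.

```python
from bs4 import BeautifulSoup

# Invented markup that mimics the class names used in the script above.
html = """
<ul class="all-img-list cf">
  <li><div class="book-mid-info"><h4><a href="//book.example.com/info/1">First Title</a></h4></div></li>
  <li><div class="book-mid-info"><h4><a href="//book.example.com/info/2">Second Title</a></h4></div></li>
  <li><div class="book-mid-info"><h4>No link here</h4></div></li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

books = []
# One descendant selector replaces the four nested loops.
for a in soup.select("ul.all-img-list.cf div.book-mid-info h4 a"):
    href = a.get("href")  # .get() returns None instead of raising KeyError
    if not href:
        continue
    # The hrefs are protocol-relative (//host/path), so prefix the scheme.
    books.append((a.get_text(strip=True), "https:" + href))

for title, url in books:
    print(title, url)
```

Collecting the results into a list before printing keeps the extraction step separate from the output step, which makes the scraper easier to test and reuse.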
Conclusion
This guide covered installing BeautifulSoup, its basic parsing methods, navigation techniques, object types, and a practical scraping example, showing how the library can greatly accelerate web data extraction tasks.
Python Crawling & Data Mining