Master Web Scraping with Beautiful Soup: A Step‑by‑Step Python Guide
This article introduces Beautiful Soup, a Python library for parsing HTML/XML into a navigable tree, covering installation, object initialization, tag navigation, attribute handling, searching techniques like find_all, CSS selectors, and practical code examples for effective web data extraction.
Beautiful Soup is a Python library that extracts data from HTML or XML files by parsing them into a tree structure, making it easy to access specific tags and their attributes.
With Beautiful Soup, you can pull out specific elements by passing class or id values to its search methods.
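At the time of writing, the latest version is 4.4.0; Beautiful Soup 3 is no longer maintained. That release of Beautiful Soup 4 (BS4) works with Python 2.7 and Python 3.x, and the examples in this article use Python 2.7. More recent BS4 releases support Python 3 only.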
Installation on macOS:
sudo easy_install beautifulsoup4
Where pip is available, pip install beautifulsoup4 does the same job.
After installation, test the library:
from bs4 import BeautifulSoup
If no error occurs, the library is installed correctly.
Start
The examples use the webpage http://reeoo.com as a demonstration.
BeautifulSoup Object Initialization
Pass a document string to the constructor to obtain a soup object. Example code fetching the URL:
#coding:utf-8
from bs4 import BeautifulSoup
import urllib2
url = 'http://reeoo.com'
request = urllib2.Request(url)
response = urllib2.urlopen(request, timeout=20)
content = response.read()
soup = BeautifulSoup(content, 'html.parser')
The second argument specifies the parser. If it is omitted, Beautiful Soup picks the best parser available on the system and emits a warning.
You can also initialize from an open file:
soup = BeautifulSoup(open('reo.html'))
Printing soup outputs the document's HTML; internally, Beautiful Soup has converted it into a tree of Python objects.
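The two ways of building a soup object described above can be sketched as follows. This is a minimal illustration written for Python 3, not the article's original Python 2.7 code; the sample markup and the temporary file name `sample.html` are assumptions made so the snippet runs without network access.

```python
import os
import tempfile

from bs4 import BeautifulSoup

# Hypothetical markup standing in for a downloaded page.
html = "<html><head><title>Sample page</title></head><body><p>Hello</p></body></html>"

# 1. Build a soup object from a string, naming the parser explicitly
#    so that no "best available parser" warning is raised.
soup_from_string = BeautifulSoup(html, "html.parser")

# 2. Build a soup object from an open file handle.
path = os.path.join(tempfile.mkdtemp(), "sample.html")
with open(path, "w", encoding="utf-8") as f:
    f.write(html)
with open(path, encoding="utf-8") as f:
    soup_from_file = BeautifulSoup(f, "html.parser")

print(soup_from_string.title)
print(soup_from_file.p.string)
```

Naming the parser in both calls keeps the result the same across machines that have different parsers installed.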
Tag
Tag objects correspond to HTML tags and can be accessed by name:
tag = soup.title
print tag
Output:
<title>Reeoo - web design inspiration and website gallery</title>
Name
Retrieve a tag’s name with the name attribute:
print tag.name # title
Attributes
Tags may have attributes like class or id. Access them like a dictionary:
tag = soup.article
c = tag['class']
print c # [u'box']
attrs = tag.attrs
print attrs # {u'class': [u'box']}
Note that class is a multi‑valued attribute, so its value is a list.
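As a small sketch of the name and attribute access described above (written for Python 3, with a made-up HTML snippet rather than the reeoo.com page), the following shows the difference between a multi-valued attribute such as class and a single-valued one such as id. The `.get()` call is included as a way to read an attribute that may be absent.

```python
from bs4 import BeautifulSoup

# Hypothetical markup used only for illustration.
html = '<article class="box featured" id="main"><p>text</p></article>'
soup = BeautifulSoup(html, "html.parser")
tag = soup.article

tag_name = tag.name                   # name of the tag
classes = tag["class"]                # multi-valued attribute: a list
tag_id = tag["id"]                    # single-valued attribute: a string
missing = tag.get("data-original")    # None instead of a KeyError

print(tag_name)   # article
print(classes)    # ['box', 'featured']
print(tag_id)     # main
print(missing)    # None
```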
Tag String
Get the text inside a tag with the string property:
tag = soup.title
s = tag.string
print s # Reeoo - web design inspiration and website gallery
Document Tree Traversal
Tags can contain other tags or strings as child nodes. Use properties like .contents, .children, and .descendants to explore them.
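To make the difference between these three properties concrete, here is a minimal Python 3 sketch on an assumed HTML fragment (not taken from the demonstration site). It also shows how the child structure affects the string property: a tag with exactly one text child has a string, while a tag with several children does not.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; there is no whitespace between tags,
# so every child node is either a tag or a piece of real text.
html = "<ul><li>one</li><li>two <b>bold</b></li></ul>"
soup = BeautifulSoup(html, "html.parser")
ul = soup.ul

direct_list = ul.contents                               # list of direct children
direct_names = [child.name for child in ul.children]    # iterator over the same nodes
all_nodes = list(ul.descendants)                        # children, grandchildren, and so on

first_li, second_li = direct_list
print(direct_names)        # ['li', 'li']
print(len(all_nodes))      # 6
print(first_li.string)     # one
print(second_li.string)    # None, because this <li> has two children
```

When the source HTML is indented, the whitespace between tags shows up as extra string nodes in all three properties, so real pages usually yield more nodes than this compact fragment does.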
Example of accessing children:
tag = soup.article.div.ul
contents = tag.contents
children = tag.children
for child in children:
    print child
Use .parent to get a tag’s parent (e.g., the parent of article is body) and .parents to iterate over all of its ancestors.
tag = soup.article
print tag.parent.name # body
for p in tag.parents:
    print p.name
Sibling nodes are accessed via .next_sibling and .previous_sibling.
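The upward and sideways navigation described above can be tried out with the following Python 3 sketch. The markup is an assumption chosen to mirror the article / ul / li nesting used in this guide, written without whitespace between tags so that siblings are tags rather than whitespace strings.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that mimics the nesting used in the examples.
html = "<body><article><ul><li>one</li><li>two</li><li>three</li></ul></article></body>"
soup = BeautifulSoup(html, "html.parser")

article = soup.article
ul = article.ul

parent_name = article.parent.name            # direct parent of <article>
ancestors = [p.name for p in ul.parents]     # nearest ancestor first; the last
                                             # entry is the soup object itself

first_li = ul.li
second_li = first_li.next_sibling            # the <li> that follows
nothing_before = first_li.previous_sibling   # None: first_li has no earlier sibling

print(parent_name)         # body
print(ancestors[:2])       # ['article', 'body']
print(second_li.string)    # two
```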
Document Tree Search
The most common operation in web scraping is searching the tree.
find_all()
Signature:
find_all(name, attrs, recursive, string, limit, **kwargs)
Name Parameter
Find all tags with a given name:
soup.find_all('title')
soup.find_all('footer')
Keyword Parameter
Any keyword argument whose name is not one of find_all's own parameters is treated as a filter on the attribute of that name:
soup.find_all(id='footer')
For the class attribute, use class_ because class is a reserved word in Python.
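The name and keyword filters described above can be sketched as follows in Python 3. The markup, including the ids and class names, is made up for the illustration. Note how class_ matches a tag when any one of its classes equals the given value.

```python
from bs4 import BeautifulSoup

# Hypothetical markup used only for illustration.
html = """
<div id="header" class="bar top">h</div>
<div class="thumb">a</div>
<div class="thumb large">b</div>
<footer id="footer">f</footer>
"""
soup = BeautifulSoup(html, "html.parser")

divs = soup.find_all("div")               # filter by tag name
by_id = soup.find_all(id="footer")        # filter by the id attribute
thumbs = soup.find_all(class_="thumb")    # filter by one CSS class

print(len(divs))                          # 3
print([t.name for t in by_id])            # ['footer']
print([t.string for t in thumbs])         # ['a', 'b']
```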
True/False Parameter
Find tags that have (or do not have) a specific attribute:
soup.find_all(target=True)
soup.find_all(target=False)
attrs Parameter
Pass a dictionary to search for attributes whose names are not valid Python keyword arguments (e.g., data-original):
import re
soup.find_all(attrs={'data-original': True})
soup.find_all(attrs={'data-original': re.compile(r'reeoo\.com')})
string Parameter
Search the document's strings instead of its tags; the result is a list of the matching strings, not the tags that contain them:
soup.find_all(string=re.compile('Reeoo'))
limit Parameter
Stop searching after a certain number of results:
soup.find_all('div', class_='thumb', limit=3)
recursive Parameter
With recursive=False, find_all examines only the tag's direct children instead of all of its descendants.
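The remaining find_all filters (a True value, the attrs dictionary, string, limit, and recursive) can be compared side by side in one Python 3 sketch. Everything in the sample markup, including the ids t1 to t3 and the image URL, is an assumption made for the illustration.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical markup: two thumbs are direct children of #root,
# and one thumb (t2) is nested inside a <section>.
html = (
    '<div id="root">'
    '<div class="thumb" id="t1"><a href="/1" target="_blank">Reeoo one</a></div>'
    '<section><div class="thumb" id="t2"><a href="/2">two</a></div></section>'
    '<div class="thumb" id="t3"><img data-original="http://reeoo.com/3.png"></div>'
    '</div>'
)
soup = BeautifulSoup(html, "html.parser")
root = soup.find(id="root")

# True matches every tag that has the attribute, whatever its value.
with_target = soup.find_all(target=True)

# attrs takes a dictionary, so attribute names such as data-original work.
lazy_images = soup.find_all(attrs={"data-original": re.compile(r"reeoo\.com")})

# string searches text nodes and returns the strings themselves.
texts = soup.find_all(string=re.compile("Reeoo"))

# limit stops after the given number of matches, in document order.
first_two = soup.find_all("div", class_="thumb", limit=2)

# recursive=False looks only at the direct children of root.
direct = root.find_all("div", class_="thumb", recursive=False)

print([a["href"] for a in with_target])    # ['/1']
print([str(s) for s in texts])             # ['Reeoo one']
print([d["id"] for d in first_two])        # ['t1', 't2']
print([d["id"] for d in direct])           # ['t1', 't3']
```

The last two results differ because limit counts matches anywhere below the starting point, while recursive=False skips the thumb nested inside the section.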
find()
Works like find_all with limit=1, except that it returns the first match itself (or None when nothing matches) rather than a one-element list.
soup.find('div', class_='thumb')
CSS Selectors
Use select() with CSS syntax to locate tags:
soup.select('article ul li')
soup.select('.thumb')
soup.select('#sponsor')
soup.select('li[id]')
soup.select('li[id="sponsor"]')
Other Search Methods
Additional methods include find_parents, find_next_siblings, find_previous_siblings, etc., which behave similarly to find_all and find.
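A short Python 3 sketch of these directional variants follows. The markup and the ids outer and mid are hypothetical; the point is only that each method takes the same kind of filters as find_all but walks parents or siblings instead of descendants.

```python
from bs4 import BeautifulSoup

# Hypothetical markup used only for illustration.
html = '<div id="outer"><ul><li>a</li><li id="mid">b</li><li>c</li><li>d</li></ul></div>'
soup = BeautifulSoup(html, "html.parser")
mid = soup.find(id="mid")

enclosing_list = mid.find_parent("ul")          # nearest matching ancestor
enclosing_divs = mid.find_parents("div")        # all matching ancestors
after = mid.find_next_siblings("li")            # later siblings, in document order
before = mid.find_previous_siblings("li")       # earlier siblings

print(enclosing_list.name)                      # ul
print([d["id"] for d in enclosing_divs])        # ['outer']
print([str(li.string) for li in after])         # ['c', 'd']
print([str(li.string) for li in before])        # ['a']
```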
Modifying the document tree is rarely needed for scraping; refer to the official Beautiful Soup documentation for more details.
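To tie together the search styles covered in this guide, the following Python 3 sketch runs find_all(), find(), and select() against the same assumed fragment and compares what each returns. The markup is made up, and select_one() is shown as the CSS counterpart of find().

```python
from bs4 import BeautifulSoup

# Hypothetical markup that mirrors the article / ul / li structure used earlier.
html = (
    '<article><ul>'
    '<li id="sponsor" class="thumb">S</li>'
    '<li class="thumb">T</li>'
    '<li>U</li>'
    '</ul></article>'
)
soup = BeautifulSoup(html, "html.parser")

# Lists of matches: find_all() and select() agree on this fragment.
via_find_all = soup.find_all("li", class_="thumb")
via_select = soup.select("article ul li.thumb")

# Single matches: find() and select_one() return a tag, or None if nothing matches.
first_find = soup.find("li", class_="thumb")
first_select = soup.select_one("li.thumb")
missing = soup.find("table")

# Attribute selectors from the CSS section.
with_id = soup.select("li[id]")
sponsor = soup.select('li[id="sponsor"]')

print([str(li.string) for li in via_find_all])   # ['S', 'T']
print([str(li.string) for li in via_select])     # ['S', 'T']
print(first_find["id"], first_select["id"])      # sponsor sponsor
print(missing)                                   # None
```

Checking for None before using the result of find() or select_one() avoids an AttributeError when a page does not contain the expected element.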
MaGe Linux Operations
