Master Web Scraping with Beautiful Soup: A Step‑by‑Step Python Guide

This article introduces Beautiful Soup, a Python library for parsing HTML/XML into a navigable tree, covering installation, object initialization, tag navigation, attribute handling, searching techniques like find_all, CSS selectors, and practical code examples for effective web data extraction.

MaGe Linux Operations

Feb 3, 2019

Beautiful Soup is a Python library that extracts data from HTML or XML files by parsing them into a tree structure, making it easy to access specific tags and their attributes.

With Beautiful Soup, you can locate elements directly by passing their class or id values as search parameters.

At the time of writing, the latest version was 4.4.0; Beautiful Soup 3 is no longer maintained. That release of Beautiful Soup 4 (BS4) works with Python 2.7 and Python 3.x; the examples in this article use Python 2.7.

Installation on macOS:

sudo easy_install beautifulsoup4

If pip is available, pip install beautifulsoup4 installs the same package. After installation, test the library:

from bs4 import BeautifulSoup

If no error occurs, the library is installed correctly.

Start

The examples use the webpage http://reeoo.com as a demonstration.

BeautifulSoup Object Initialization

Pass a document string to the constructor to obtain a soup object. Example code fetching the URL:

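#coding:utf-8
from bs4 import BeautifulSoup
import urllib2  # Python 2 only; Python 3 uses urllib.request

url = 'http://reeoo.com'
request = urllib2.Request(url)
response = urllib2.urlopen(request, timeout=20)  # give up after 20 seconds
content = response.read()  # raw HTML of the page
soup = BeautifulSoup(content, 'html.parser')  # parse with the built-in parser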
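
The listing above targets Python 2.7, where urllib2 exists. As a rough Python 3 equivalent, the sketch below wraps the same steps in a function. fetch_soup is a hypothetical helper name, not part of Beautiful Soup, and urllib.request stands in for urllib2.

```python
from urllib.request import Request, urlopen

from bs4 import BeautifulSoup


def fetch_soup(url, timeout=20):
    """Download a page and parse it with the built-in html.parser.

    fetch_soup is an illustrative helper, not a Beautiful Soup API.
    """
    # In Python 3, urllib2's Request and urlopen live in urllib.request.
    request = Request(url)
    with urlopen(request, timeout=timeout) as response:
        content = response.read()  # raw bytes of the page
    return BeautifulSoup(content, "html.parser")
```

Calling fetch_soup('http://reeoo.com') should return a soup object comparable to the one built above, assuming the site is reachable.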

The second argument names the parser. If it is omitted, Beautiful Soup picks the best parser installed on the system and emits a warning, because the same code may then behave differently on another machine.

You can also initialize from an open file:

soup = BeautifulSoup(open('reo.html'), 'html.parser')

The soup object represents the document as a tree of Python objects; printing soup outputs the document's HTML.
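
A minimal sketch of both initialization routes, written in Python 3 syntax. The HTML string and the demo.html file name are made up for illustration; the file is written to a temporary directory so the example runs without network access.

```python
import os
import tempfile

from bs4 import BeautifulSoup

# Made-up document used only for this illustration.
html = "<html><head><title>Demo page</title></head><body><p>Hello</p></body></html>"

# 1. Parse a string that is already in memory.
soup_from_string = BeautifulSoup(html, "html.parser")

# 2. Parse an open file handle; the with-block closes the file afterwards.
path = os.path.join(tempfile.mkdtemp(), "demo.html")
with open(path, "w", encoding="utf-8") as f:
    f.write(html)
with open(path, encoding="utf-8") as f:
    soup_from_file = BeautifulSoup(f, "html.parser")

print(soup_from_string.title)    # <title>Demo page</title>
print(soup_from_file.p.string)   # Hello
```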

Tag

Tag objects correspond to HTML tags and can be accessed by name:

tag = soup.title
print tag

Output:

<title>Reeoo - web design inspiration and website gallery</title>

Name

Retrieve a tag’s name with the name attribute:

print tag.name  # title
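
One detail worth seeing in isolation: access by tag name returns only the first matching tag, and returns None when the document has no such tag. The sketch below uses Python 3 syntax and a made-up HTML string rather than the reeoo.com page.

```python
from bs4 import BeautifulSoup

# Made-up document used only for this illustration.
html = ("<html><head><title>Demo gallery</title></head>"
        "<body><p>First</p><p>Second</p></body></html>")
soup = BeautifulSoup(html, "html.parser")

title = soup.title     # the <title> tag
first_p = soup.p       # only the FIRST <p>, even though there are two
missing = soup.table   # no <table> in this document

print(title.name)      # title
print(first_p)         # <p>First</p>
print(missing)         # None
```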

Attributes

Tags may have attributes like class or id. Access them like a dictionary:

tag = soup.article
c = tag['class']
print c  # [u'box']
attrs = tag.attrs
print attrs  # {u'class': [u'box']}

Note that class is a multi‑value attribute, so its value is a list.
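
The dictionary-style access described above can be sketched as follows, in Python 3 syntax. The markup and the data-x attribute name are invented for the example; tag.get() is shown as the safer lookup, since indexing a missing attribute raises KeyError.

```python
from bs4 import BeautifulSoup

# Made-up markup used only for this illustration.
html = '<article class="box featured" id="item-1"><a href="/site">Site</a></article>'
soup = BeautifulSoup(html, "html.parser")
tag = soup.article

classes = tag["class"]        # multi-valued attribute: a list of class names
tag_id = tag["id"]            # single-valued attribute: a plain string
absent = tag.get("data-x")    # .get() returns None instead of raising KeyError

print(classes)   # ['box', 'featured']
print(tag_id)    # item-1
print(absent)    # None
```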

Tag String

If a tag contains exactly one string, the .string property returns it:

tag = soup.title
s = tag.string
print s  # Reeoo - web design inspiration and website gallery
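
When a tag mixes text and child tags, .string is None, and get_text() is the usual way to collect all of the text. A small Python 3 sketch on a made-up fragment:

```python
from bs4 import BeautifulSoup

# Made-up fragment used only for this illustration.
html = "<div><h1>Reeoo</h1><p>Web <b>design</b> gallery</p></div>"
soup = BeautifulSoup(html, "html.parser")

heading = soup.h1.string      # <h1> holds a single string
mixed = soup.p.string         # <p> holds text plus a <b> tag, so this is None
full_text = soup.p.get_text() # concatenates every string inside <p>

print(heading)     # Reeoo
print(mixed)       # None
print(full_text)   # Web design gallery
```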

Document Tree Traversal

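Tags can contain other tags or strings as child nodes. Three properties expose them: .contents returns the direct children as a list, .children iterates over the same direct children, and .descendants iterates over every nested node.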

Example of accessing children:

tag = soup.article.div.ul
contents = tag.contents
children = tag.children
for child in children:
    print child
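
To make the difference between direct children and all descendants concrete, here is a Python 3 sketch on a made-up list. The isinstance check against Tag is one way to skip string nodes; it is an illustrative choice, not something the article prescribes.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Made-up fragment used only for this illustration (no whitespace between tags).
html = "<ul><li><a href='/a'>A</a></li><li><a href='/b'>B</a></li></ul>"
soup = BeautifulSoup(html, "html.parser")
ul = soup.ul

contents = ul.contents                             # list of the two <li> tags
child_names = [child.name for child in ul.children]

# .descendants walks the whole subtree: tags and the strings inside them.
all_nodes = list(ul.descendants)
descendant_tags = [node.name for node in all_nodes if isinstance(node, Tag)]

print(len(contents))      # 2
print(child_names)        # ['li', 'li']
print(len(all_nodes))     # 6
print(descendant_tags)    # ['li', 'a', 'li', 'a']
```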

Use .parent to get a tag's parent (e.g., the parent of article is body) and .parents to iterate over all of its ancestors.

tag = soup.article
print tag.parent.name  # body
for p in tag.parents:
    print p.name
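
The same walk up the tree, sketched in Python 3 on a made-up document. Note that the last ancestor is the BeautifulSoup object itself, whose name is '[document]'.

```python
from bs4 import BeautifulSoup

# Made-up document used only for this illustration.
html = "<html><body><article><div><ul><li>x</li></ul></div></article></body></html>"
soup = BeautifulSoup(html, "html.parser")
tag = soup.li

parent_name = tag.parent.name
ancestors = [p.name for p in tag.parents]

print(parent_name)   # ul
print(ancestors)     # ['ul', 'div', 'article', 'body', 'html', '[document]']
```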

Sibling nodes are accessed via .next_sibling and .previous_sibling.
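
A common surprise with sibling access is that whitespace between tags is itself a node. The Python 3 sketch below, on a made-up list with a newline between the items, shows that the first sibling of a tag may be a string rather than the next tag.

```python
from bs4 import BeautifulSoup

# Made-up fragment; note the newline characters between the <li> tags.
html = "<ul><li>one</li>\n<li>two</li>\n<li>three</li></ul>"
soup = BeautifulSoup(html, "html.parser")

first = soup.li
whitespace = first.next_sibling      # the "\n" string, not the next <li>
second = whitespace.next_sibling     # the second <li>
back = second.previous_sibling       # the "\n" string again
last = soup.find_all("li")[-1]

print(second)              # <li>two</li>
print(last.next_sibling)   # None
```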

Document Tree Search

The most common operation in web scraping is searching the tree.

find_all()

Signature:

find_all(name, attrs, recursive, string, limit, **kwargs)

Name Parameter

Find all tags with a given name:

soup.find_all('title')
soup.find_all('footer')
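
A short Python 3 sketch of the name filter on a made-up document. Passing a list of names, which matches any of them, is shown as an extra that the calls above do not use.

```python
from bs4 import BeautifulSoup

# Made-up document used only for this illustration.
html = ("<html><head><title>T</title></head>"
        "<body><div>a</div><footer>f</footer><div>b</div></body></html>")
soup = BeautifulSoup(html, "html.parser")

divs = soup.find_all("div")                   # every <div>, in document order
either = soup.find_all(["title", "footer"])   # a list matches any listed name

print(len(divs))                  # 2
print([t.name for t in either])   # ['title', 'footer']
```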

Keyword Parameter

Any keyword argument that is not one of find_all's own parameters is treated as a filter on the attribute of that name:

soup.find_all(id='footer')

For the class attribute, use class_, because class is a reserved word in Python.
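
The keyword filters can be tried out on a small made-up fragment, here in Python 3 syntax. The example also shows that class_ matches a tag when the given name is any one of its classes.

```python
from bs4 import BeautifulSoup

# Made-up fragment used only for this illustration.
html = ('<div id="footer">f</div>'
        '<div class="thumb big">1</div>'
        '<div class="thumb">2</div>'
        '<p class="thumb">3</p>')
soup = BeautifulSoup(html, "html.parser")

by_id = soup.find_all(id="footer")
by_class = soup.find_all(class_="thumb")          # any tag carrying the class
div_thumbs = soup.find_all("div", class_="thumb") # name and class combined

print(len(by_id))        # 1
print(len(by_class))     # 3
print(len(div_thumbs))   # 2
```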

True/False Parameter

Find tags that have (or do not have) a specific attribute:

soup.find_all(target=True)
soup.find_all(target=False)
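
A Python 3 sketch of attribute-presence filtering on made-up links. For the "attribute absent" case it swaps in a function filter built on Tag.has_attr instead of target=False; that is simply a more explicit alternative chosen for this illustration.

```python
from bs4 import BeautifulSoup

# Made-up fragment used only for this illustration.
html = ('<a href="/a" target="_blank">A</a>'
        '<a href="/b">B</a>'
        '<a target="_self">C</a>')
soup = BeautifulSoup(html, "html.parser")

# True matches any tag that has the attribute, whatever its value.
with_target = soup.find_all("a", target=True)

# A function filter receives each tag and returns True to keep it.
without_target = soup.find_all(
    lambda tag: tag.name == "a" and not tag.has_attr("target")
)

print([a.string for a in with_target])      # ['A', 'C']
print([a.string for a in without_target])   # ['B']
```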

attrs Parameter

Pass a dictionary to search for attributes whose names are not valid Python keyword arguments (e.g., data-original):

import re

soup.find_all(attrs={'data-original': True})
soup.find_all(attrs={'data-original': re.compile(r'reeoo\.com')})
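
The attrs dictionary can be exercised on a few made-up image tags, here in Python 3 syntax. The image URLs are invented for the example, and the dot in the pattern is escaped so that it matches a literal dot only.

```python
import re

from bs4 import BeautifulSoup

# Made-up fragment; the image URLs are invented for this illustration.
html = ('<img data-original="http://reeoo.com/a.png" src="x.gif">'
        '<img data-original="http://example.org/b.png">'
        '<img src="c.png">')
soup = BeautifulSoup(html, "html.parser")

# Any <img> that carries a data-original attribute at all.
lazy_images = soup.find_all(attrs={"data-original": True})

# Only those whose data-original value contains "reeoo.com".
reeoo_images = soup.find_all(attrs={"data-original": re.compile(r"reeoo\.com")})

print(len(lazy_images))                    # 2
print(reeoo_images[0]["data-original"])    # http://reeoo.com/a.png
```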

string Parameter

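Search the document's strings instead of its tags. Used on its own, string makes find_all return the matching text nodes, not the tags that contain them (re is Python's standard regular-expression module and must be imported first):

soup.find_all(string=re.compile('Reeoo'))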
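
The return type of a string search is easy to misread, so here is a Python 3 sketch on a made-up fragment. Combining string with a tag name, shown second, returns tags whose .string matches.

```python
import re

from bs4 import BeautifulSoup

# Made-up fragment used only for this illustration.
html = "<title>Reeoo gallery</title><p>About Reeoo</p><p>Other</p>"
soup = BeautifulSoup(html, "html.parser")

# string alone: the results are text nodes, not tags.
strings = soup.find_all(string=re.compile("Reeoo"))
owners = [s.parent.name for s in strings]   # .parent leads back to the tag

# string plus a tag name: the results are tags whose .string matches.
paragraphs = soup.find_all("p", string=re.compile("Reeoo"))

print(owners)            # ['title', 'p']
print(paragraphs[0])     # <p>About Reeoo</p>
```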

limit Parameter

Stop searching after a certain number of results:

soup.find_all('div', class_='thumb', limit=3)
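
A quick Python 3 check of limit on five made-up thumb blocks: the search stops once three matches have been collected, in document order.

```python
from bs4 import BeautifulSoup

# Five made-up <div class="thumb"> blocks numbered 0 to 4.
html = "".join('<div class="thumb">%d</div>' % i for i in range(5))
soup = BeautifulSoup(html, "html.parser")

first_three = soup.find_all("div", class_="thumb", limit=3)

print([d.string for d in first_three])   # ['0', '1', '2']
```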

recursive Parameter

By default, find_all examines all descendants of a tag. Pass recursive=False to restrict the search to its direct children.
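
The effect of recursive=False is easiest to see side by side. A Python 3 sketch on a made-up fragment with one direct and one nested paragraph:

```python
from bs4 import BeautifulSoup

# Made-up fragment used only for this illustration.
html = "<div><p>direct</p><section><p>nested</p></section></div>"
soup = BeautifulSoup(html, "html.parser")
outer = soup.div

all_paragraphs = outer.find_all("p")                     # whole subtree
direct_only = outer.find_all("p", recursive=False)       # direct children only

print([p.string for p in all_paragraphs])   # ['direct', 'nested']
print([p.string for p in direct_only])      # ['direct']
```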

find()

Works like find_all with limit=1, except that it returns the first matching tag itself rather than a list, or None when nothing matches.

soup.find('div', class_='thumb')
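
Because find returns None when there is no match, chained access such as soup.find(...).string can fail with AttributeError. A Python 3 sketch on a made-up fragment contrasts the two return styles:

```python
from bs4 import BeautifulSoup

# Made-up fragment used only for this illustration.
html = '<div class="thumb">1</div><div class="thumb">2</div>'
soup = BeautifulSoup(html, "html.parser")

first = soup.find("div", class_="thumb")   # a single tag
missing = soup.find("table")               # None, not an empty list
missing_all = soup.find_all("table")       # an empty list

print(first)               # <div class="thumb">1</div>
print(missing)             # None
print(len(missing_all))    # 0

# Guard before using the result of find().
if missing is not None:
    print(missing.string)
```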

CSS Selectors

Use select() with CSS syntax to locate tags:

soup.select('article ul li')
soup.select('.thumb')
soup.select('#sponsor')
soup.select('li[id]')
soup.select('li[id="sponsor"]')
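
The selectors above can be run against a small made-up fragment, here in Python 3 syntax. In recent Beautiful Soup releases select() relies on the soupsieve package, which pip normally installs alongside beautifulsoup4; select_one(), shown last, returns only the first match.

```python
from bs4 import BeautifulSoup

# Made-up fragment used only for this illustration.
html = """
<article><ul>
<li id="sponsor" class="thumb">S</li>
<li class="thumb">A</li>
<li>B</li>
</ul></article>
"""
soup = BeautifulSoup(html, "html.parser")

nested = soup.select("article ul li")        # descendant combinator
by_class = soup.select(".thumb")             # class selector
by_id = soup.select("#sponsor")              # id selector
has_id = soup.select("li[id]")               # attribute presence
exact_id = soup.select('li[id="sponsor"]')   # attribute value
first_thumb = soup.select_one("li.thumb")    # first match only, or None

print(len(nested), len(by_class), len(by_id))   # 3 2 1
print(first_thumb.string)                       # S
```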

Other Search Methods

Additional methods include find_parents, find_next_siblings, find_previous_siblings, etc. They accept the same filters as find_all and find but search a tag's ancestors or siblings instead of its descendants.
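
A Python 3 sketch of a few of these methods on a made-up list, starting from the middle item. Each call takes a tag name as its filter, just as find_all would.

```python
from bs4 import BeautifulSoup

# Made-up fragment used only for this illustration.
html = "<div><ul><li>one</li><li>two</li><li>three</li></ul></div>"
soup = BeautifulSoup(html, "html.parser")
second = soup.find_all("li")[1]

enclosing_divs = second.find_parents("div")     # ancestors named <div>
nearest_list = second.find_parent("ul")         # closest <ul> ancestor
after = second.find_next_siblings("li")         # later <li> siblings
before = second.find_previous_siblings("li")    # earlier <li> siblings

print(len(enclosing_divs))         # 1
print(nearest_list.name)           # ul
print([t.string for t in after])   # ['three']
print([t.string for t in before])  # ['one']
```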

Modifying the document tree is rarely needed for scraping; refer to the official Beautiful Soup documentation for more details.
