Master Web Scraping with Python: Regex, BeautifulSoup & Selenium

This guide demonstrates how to combine Python's regex, BeautifulSoup, and Selenium (including Chrome and headless PhantomJS) for powerful web scraping, covering tag matching, handling Ajax, iFrames, cookie management, and practical code examples for extracting and interacting with dynamic web content.

21CTO

Dec 15, 2017

Using Regular Expressions

Previously we discussed regex for matching common patterns such as email, URL, and phone numbers. BeautifulSoup also supports regex, allowing you to match specific tags. Example: find all img tags whose src matches a pattern.

import re
# res is the BeautifulSoup object for a page that has already been parsed
tags = res.find_all("img", {"src": re.compile(r"\.\./uploads/photo_.*\.png")})

This code returns every img tag whose src matches paths like ../uploads…, demonstrating the power of combining BeautifulSoup with regular expressions.
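The lookup above can be tried without a network request. The following is a minimal sketch, assuming the bs4 package is installed; the HTML snippet and file names are made up for illustration, and the built-in "html.parser" backend is used so that no extra parser is needed.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup standing in for a downloaded page
html = """
<div>
  <img src="../uploads/photo_1.png">
  <img src="../uploads/photo_2.png">
  <img src="../static/logo.png">
  <img src="../uploads/photo_3.jpg">
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Raw string so the backslashes reach the regex engine unchanged
pattern = re.compile(r"\.\./uploads/photo_.*\.png")

# BeautifulSoup searches each src value for the compiled pattern
matches = soup.find_all("img", {"src": pattern})
srcs = [tag["src"] for tag in matches]
print(srcs)  # ['../uploads/photo_1.png', '../uploads/photo_2.png']
```

The logo and the .jpg photo are skipped because their src values do not satisfy the pattern.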

Scraping JavaScript-Rendered Pages

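When a page loads its content via Ajax or a framework such as React, the data is not in the HTML the server first returns, and the URL may not change as the content updates, so fetching the page with a plain HTTP request misses it. You need a browser that can execute JavaScript. Selenium is a browser-automation tool with Python bindings that drives real browsers (Chrome, Firefox, Safari, Edge) as well as headless ones.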

Using Selenium to Scrape Web Pages

Install Selenium:

$ pip install selenium

Download the appropriate ChromeDriver and add it to your system PATH.

Example code to open Chrome, load a page, and extract navigation text:

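from selenium import webdriver
from selenium.webdriver.common.by import By

browser = webdriver.Chrome()
browser.get("https://www.python.org/")
# find_element(By.ID, ...) replaces find_element_by_id, which Selenium 4.3 removed
nav = browser.find_element(By.ID, "mainnav")
print(nav.text)
browser.quit()  # close the browser and stop the driver process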

Headless Scraping with PhantomJS

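PhantomJS runs without opening a visible browser window, which is useful for background tasks. After downloading it and placing the executable on your system PATH, you can use it as a Selenium driver. Note that webdriver.PhantomJS is available only in Selenium 3.x and earlier; Selenium 4 removed it.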

from selenium import webdriver
from selenium.webdriver.common.by import By

browser = webdriver.PhantomJS()
browser.get("https://www.python.org/")
print(browser.find_element(By.CLASS_NAME, "introduction").text)
browser.quit()  # also terminates the PhantomJS process

Selenium can locate elements by id, CSS selector, link text, or name, among other strategies. Each lookup also has a plural form that returns a list of all matching elements instead of only the first.

Scraping iFrame Content

To access an iframe, switch Selenium’s context to the frame and retrieve its source or URL.

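from selenium import webdriver
from selenium.webdriver.common.by import By

browser = webdriver.PhantomJS()
browser.get("https://developer.mozilla.org/en-US/docs/Web/HTML/Element/iframe")
iframe = browser.find_element(By.TAG_NAME, "iframe")
browser.switch_to.frame(iframe)  # later lookups now run inside the iframe
iframe_source = browser.page_source
print(iframe_source)
print(browser.current_url)
browser.switch_to.default_content()  # return to the top-level page
browser.quit()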

Alternatively, use BeautifulSoup to fetch the iframe’s src attribute and request it directly.

from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen("https://developer.mozilla.org/en-US/docs/Web/HTML/Element/iframe")
soup = BeautifulSoup(html.read(), "html5lib")
tag = soup.find("iframe")
print(tag['src'])
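An iframe's src is often a relative path, so it usually has to be resolved against the page URL before it can be requested directly. The following is a minimal sketch of that step, assuming bs4 is installed; the function name iframe_urls, the sample markup, and the example.com addresses are hypothetical, and the built-in "html.parser" backend is used in place of html5lib.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup


def iframe_urls(page_url, html):
    """Return an absolute URL for every iframe in html that has a src."""
    soup = BeautifulSoup(html, "html.parser")
    urls = []
    for tag in soup.find_all("iframe"):
        src = tag.get("src")
        if src:  # skip iframes without a src attribute
            urls.append(urljoin(page_url, src))
    return urls


# Hypothetical markup with root-relative, relative, and absolute sources
sample_html = """
<iframe src="/embed/player.html"></iframe>
<iframe src="widgets/map.html"></iframe>
<iframe src="https://example.org/widget"></iframe>
<iframe title="no source"></iframe>
"""

for url in iframe_urls("https://example.com/docs/page.html", sample_html):
    print(url)
```

Each returned URL can then be passed to urlopen in the same way as the page itself.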

Handling Ajax Calls

After triggering an Ajax request with a button click, you can wait for specific text to appear in an element using WebDriverWait with a condition from expected_conditions, then capture a screenshot.

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
browser = webdriver.PhantomJS()
browser.get("https://resttesttest.com/")
browser.find_element(By.ID, "submitajax").click()
# Block for up to 10 seconds until the status element shows the response line
WebDriverWait(browser, 10).until(
    EC.text_to_be_present_in_element((By.ID, "statuspre"), "HTTP 200 OK")
)
browser.get_screenshot_as_file("image.png")
browser.quit()

Cookie Management

Cookies can be retrieved with get_cookies() and cleared with delete_all_cookies().

from selenium import webdriver

browser = webdriver.PhantomJS()
browser.get("https://www.21cto.com/")
print(browser.get_cookies())  # list of dicts, one per cookie
browser.delete_all_cookies()
browser.quit()
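get_cookies() returns a list of dictionaries, each with at least a name and a value. One possible use, sketched here with made-up cookie data, is to turn that list into a Cookie header for a plain HTTP client; the function name cookie_header is hypothetical.

```python
def cookie_header(cookies):
    """Build a Cookie request-header value from Selenium-style cookie dicts."""
    return "; ".join(f"{c['name']}={c['value']}" for c in cookies)


# Hypothetical data in the shape that get_cookies() returns
sample_cookies = [
    {"name": "sessionid", "value": "abc123", "domain": "example.com", "path": "/"},
    {"name": "theme", "value": "dark", "domain": "example.com", "path": "/"},
]

print(cookie_header(sample_cookies))  # sessionid=abc123; theme=dark
```

The sketch ignores domain, path, and expiry, so it only suits requests sent to the same site the cookies came from.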

Conclusion

This article covered Python web‑scraping techniques, including regex with BeautifulSoup, Selenium‑driven browsers (Chrome and PhantomJS), handling iFrames, Ajax, and cookies, illustrating how to collect and parse web data similarly to search‑engine crawlers.
