Master Web Scraping with Python: Regex, BeautifulSoup & Selenium
This guide shows how to combine Python's regular expressions, BeautifulSoup, and Selenium (driving Chrome and the headless PhantomJS browser) to scrape web pages. It covers tag matching, Ajax handling, iframes, and cookie management, with code examples for extracting and interacting with dynamic web content.
Using Regular Expressions
Previously we discussed regex for matching common patterns such as email, URL, and phone numbers. BeautifulSoup also supports regex, allowing you to match specific tags. Example: find all img tags whose src matches a pattern.
import re
# res is assumed to be an already-parsed BeautifulSoup object
tags = res.find_all("img", {"src": re.compile(r"\.\./uploads/photo_.*\.png")})
This code returns every img tag whose src matches paths like ../uploads…, demonstrating the power of combining BeautifulSoup with regular expressions.
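As a self-contained illustration of matching tags by a regex on their src, the sketch below runs a pattern for ../uploads/photo_ paths against a few made-up img tags. The sample markup and the names PHOTO_SRC and find_photo_sources are assumptions for this example, and the built-in html.parser backend is used so no extra parser package is needed.

```python
import re

# Matches src values that contain ../uploads/photo_<anything>.png
PHOTO_SRC = re.compile(r"\.\./uploads/photo_.*\.png")

# Hypothetical markup, invented for this example.
SAMPLE_HTML = """
<img src="../uploads/photo_001.png">
<img src="../uploads/banner.png">
<img src="/static/photo_002.png">
<img src="../uploads/photo_cat.png">
"""


def find_photo_sources(html):
    """Return the src of every img tag whose src matches PHOTO_SRC."""
    # Imported here so the pattern itself can be tried without bs4 installed.
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, "html.parser")
    # BeautifulSoup applies a compiled regex to the attribute value with search().
    return [tag["src"] for tag in soup.find_all("img", {"src": PHOTO_SRC})]
```

Calling find_photo_sources(SAMPLE_HTML) should return the two photo_ paths under ../uploads and skip both the banner and the /static image.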
Scraping JavaScript-Driven Pages
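When a page loads its content through Ajax or a framework such as React, the data is not in the HTML that the server first returns, and the URL may not change as the content updates, so fetching and parsing the raw HTML alone fails. You need a browser that can execute JavaScript. Selenium is a browser-automation tool with Python bindings that drives real browsers (Chrome, Firefox, Safari, Edge) as well as headless ones.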
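To see why parsing the raw HTML falls short, here is a small sketch that uses only the standard library. The page markup, the app id, and the helper names are all invented for illustration; the point is only that text written by a script never shows up when the HTML is parsed without running JavaScript.

```python
from html.parser import HTMLParser

# Hypothetical page: <div id="app"> stays empty until a browser runs the script.
RAW_HTML = """
<html><body>
  <div id="app"></div>
  <script>
    document.getElementById("app").textContent = "Loaded by JavaScript";
  </script>
</body></html>
"""


class AppTextCollector(HTMLParser):
    """Collects the text found inside <div id="app">."""

    def __init__(self):
        super().__init__()
        self.inside_app = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "div" and dict(attrs).get("id") == "app":
            self.inside_app = True

    def handle_endtag(self, tag):
        if tag == "div":
            self.inside_app = False

    def handle_data(self, data):
        if self.inside_app:
            self.chunks.append(data)


def static_app_text(html):
    """Return the text a non-JavaScript parser sees inside the app div."""
    parser = AppTextCollector()
    parser.feed(html)
    return "".join(parser.chunks).strip()


# The script is never executed, so the div's text is empty.
print(repr(static_app_text(RAW_HTML)))
```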
Using Selenium to Scrape Web Pages
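Install Selenium:
$ pip install selenium
Download the ChromeDriver build that matches your Chrome version and add it to your system PATH. Selenium 4.6 and later can also locate or download a matching driver automatically through Selenium Manager.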
Example code to open Chrome, load a page, and extract navigation text:
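from selenium import webdriver
from selenium.webdriver.common.by import By

browser = webdriver.Chrome()
browser.get("https://www.python.org/")
# find_element_by_id() was removed in Selenium 4.3; pass a By locator instead
nav = browser.find_element(By.ID, "mainnav")
print(nav.text)
browser.quit()  # close the browser and end the driver session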
Headless Scraping with PhantomJS
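PhantomJS runs without opening a visible browser window, which is useful for background tasks. After downloading it and placing the executable on your system PATH, you can use it as a Selenium driver. PhantomJS development was suspended in 2018 and Selenium 4 no longer ships webdriver.PhantomJS, so the PhantomJS examples here require Selenium 3.x.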
from selenium import webdriver
from selenium.webdriver.common.by import By

browser = webdriver.PhantomJS()  # available in Selenium 3.x only
browser.get("https://www.python.org/")
print(browser.find_element(By.CLASS_NAME, "introduction").text)
browser.quit()
Elements can be located by id, CSS selector, link text, name, and other strategies; the plural form, find_elements, returns a list of every match instead of only the first.
Scraping iFrame Content
To access an iframe, switch Selenium’s context to the frame and retrieve its source or URL.
from selenium import webdriver
from selenium.webdriver.common.by import By

browser = webdriver.PhantomJS()  # available in Selenium 3.x only
browser.get("https://developer.mozilla.org/en-US/docs/Web/HTML/Element/iframe")
iframe = browser.find_element(By.TAG_NAME, "iframe")
browser.switch_to.frame(iframe)  # later lookups now run inside the iframe
iframe_source = browser.page_source
print(iframe_source)
print(browser.current_url)
browser.quit()
Alternatively, use BeautifulSoup to read the iframe's src attribute and request that URL directly.
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("https://developer.mozilla.org/en-US/docs/Web/HTML/Element/iframe")
soup = BeautifulSoup(html.read(), "html5lib")  # requires the html5lib package
tag = soup.find("iframe")
print(tag["src"])  # URL of the framed document
Handling Ajax Calls
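Content loaded through Ajax arrives some time after the action that triggers it, so a scraper has to keep checking until the expected content appears or a time limit runs out. The following is a minimal, Selenium-free sketch of that polling idea; wait_until and fake_status_text are hypothetical names made up for illustration, not part of any library.

```python
import time


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Call condition() until it returns a truthy value or timeout seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll_interval)


# Stand-in for an Ajax response that only "arrives" on the third check.
attempts = {"count": 0}


def fake_status_text():
    attempts["count"] += 1
    return "HTTP 200 OK" if attempts["count"] >= 3 else ""


print(wait_until(fake_status_text, timeout=2.0, poll_interval=0.01))
```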
After triggering an Ajax button, you can wait for a specific element's text to appear by using WebDriverWait together with the expected_conditions module, then capture a screenshot.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

browser = webdriver.PhantomJS()  # available in Selenium 3.x only
browser.get("https://resttesttest.com/")
browser.find_element(By.ID, "submitajax").click()
# Wait up to 10 seconds for the status text to show up in the page
WebDriverWait(browser, 10).until(
    EC.text_to_be_present_in_element((By.ID, "statuspre"), "HTTP 200 OK")
)
browser.get_screenshot_as_file("image.png")
browser.quit()
Cookie Management
Cookies can be retrieved with get_cookies() and cleared with delete_all_cookies().
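from selenium import webdriver
browser = webdriver.PhantomJS()  # available in Selenium 3.x only
browser.get("https://www.21cto.com/")
print(browser.get_cookies())  # list of dicts, one per cookie
browser.delete_all_cookies()
browser.quit()
Conclusion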
This article covered Python web‑scraping techniques, including regex with BeautifulSoup, Selenium‑driven browsers (Chrome and PhantomJS), handling iFrames, Ajax, and cookies, illustrating how to collect and parse web data similarly to search‑engine crawlers.
21CTO
21CTO (21CTO.com) offers developers community, training, and services, making it your go‑to learning and service platform.