Python Web Scraping Tutorial Using Selenium and BeautifulSoup

This article explains how to build a Python web scraper that logs into a site with Selenium, extracts the page HTML, parses it with BeautifulSoup and html5lib, and processes table data. It also covers the required libraries and techniques for working around anti‑scraping measures.


Web scraping is increasingly used by large e‑commerce companies to collect competitor data and research new products, and it involves extracting information from websites by traversing the HTML tree.

Data can be collected through a site's public API or, when no API is available, by web scraping, which may involve fetching and processing thousands of pages.

The article defines web scraping as extracting information from HTML pages, where tags form a nested tree rooted at <html>, and explains that the goal is to convert unstructured HTML into structured data for storage.
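To make the tree idea concrete, here is a minimal sketch of turning a small hand‑written HTML snippet into structured data with BeautifulSoup. The snippet and its values are illustrative, and it uses bs4's built‑in html.parser so no extra parser package is needed:

```python
from bs4 import BeautifulSoup

# A tiny illustrative page: tags form a nested tree rooted at <html>
html = """
<html>
  <body>
    <table>
      <tr><td>Team A</td><td>1.85</td></tr>
      <tr><td>Team B</td><td>2.10</td></tr>
    </table>
  </body>
</html>
"""

soup = BeautifulSoup(html, "html.parser")
# Walk the tree and convert unstructured HTML into structured rows
rows = [[td.text for td in tr.find_all("td")] for tr in soup.find_all("tr")]
print(rows)  # [['Team A', '1.85'], ['Team B', '2.10']]
```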

Typical steps include obtaining the target page URL, downloading its HTML, processing the HTML to retrieve needed data, and optionally logging into the site before scraping.

The tutorial uses the Python libraries BeautifulSoup for parsing HTML and Selenium for automating browser interactions.

# Import libraries
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup

# Chrome driver path
chromedriver = '/usr/local/bin/chromedriver'
options = webdriver.ChromeOptions()
options.add_argument('--headless')  # run a headless browser
browser = webdriver.Chrome(service=Service(chromedriver), options=options)

# Navigate to the login page and locate the form fields
browser.get('http://playsports365.com/default.aspx')
email = browser.find_element(By.NAME, 'ctl00$MainContent$ctlLogin$_UserName')
password = browser.find_element(By.NAME, 'ctl00$MainContent$ctlLogin$_Password')
login = browser.find_element(By.NAME, 'ctl00$MainContent$ctlLogin$BtnSubmit')

# Submit credentials
email.send_keys('********')
password.send_keys('*******')
login.click()

# After login, go to the target page and grab its HTML
browser.get('http://playsports365.com/wager/OpenBets.aspx')
requiredHtml = browser.page_source

# Parse the HTML with BeautifulSoup and html5lib
soup = BeautifulSoup(requiredHtml, 'html5lib')
table = soup.find_all('table')[0]

# Iterate over the table rows and print each cell value
rows = table.find_all('tr')
for row in rows:
    cells = row.find_all('td')
    for cell in cells:
        print(cell.text)

To run the program, install the dependencies with pip (pip install selenium beautifulsoup4 html5lib), make sure ChromeDriver is available at the configured path, and execute the script with python <script_name>. The output is printed to the console.

The article also discusses how to work around anti‑scraping defenses, for example by rotating user‑agent strings and routing requests through proxies, Tor, or commercial proxy services to avoid HTTP 403 responses and IP bans; it recommends commercial proxy providers for large‑scale data collection.
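As a rough sketch of one such countermeasure, the following rotates the User‑Agent header across requests. The user‑agent strings here are hypothetical placeholders, and the commented‑out request line assumes the reader supplies a real HTTP client (e.g. requests) and proxy URL:

```python
import itertools

# Hypothetical pool of browser user-agent strings (placeholder values)
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36",
]

def rotating_headers(user_agents):
    """Yield a fresh headers dict with a different User-Agent on each call."""
    for ua in itertools.cycle(user_agents):
        yield {"User-Agent": ua}

headers_iter = rotating_headers(USER_AGENTS)
# Each scraped page would use the next headers dict, e.g.:
# requests.get(url, headers=next(headers_iter),
#              proxies={"http": "http://proxy.example:8080"})
print(next(headers_iter)["User-Agent"])
```

Combined with a proxy pool, this spreads requests across identities so no single fingerprint triggers rate limits.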

Tags: web-scraping, data-extraction, html5lib
Written by

Python Programming Learning Circle

A global community of Chinese Python developers offering technical articles, columns, original video tutorials, and problem sets. Topics include web full‑stack development, web scraping, data analysis, natural language processing, image processing, machine learning, automated testing, DevOps automation, and big data.
