Backend Development · 10 min read

Master Python Web Scraping: From Requests to Selenium and Scrapy

Learn how to efficiently scrape web pages using Python by exploring multiple approaches—including simple requests with BeautifulSoup, fast parsing with lxml, dynamic content extraction with Selenium, and large‑scale crawling with Scrapy—complete with installation steps, code snippets, and detailed explanations.

Python Programming Learning Circle

Web Scraping with Python

Web scraping is a core skill in data science and automated data collection. Python is a popular choice for this work because its powerful third‑party libraries simplify HTML parsing and data extraction.

1. Using requests and BeautifulSoup

1.1 Install dependencies

<code>pip install requests beautifulsoup4</code>

1.2 Basic usage

<code>import requests
from bs4 import BeautifulSoup

# Send HTTP GET request
url = "https://www.example.com"
response = requests.get(url)

# Parse HTML content
soup = BeautifulSoup(response.text, 'html.parser')

# Extract title
title = soup.title.text
print("网页标题:", title)

# Extract all links
links = soup.find_all('a')
for link in links:
    href = link.get('href')
    print("链接:", href)</code>

requests.get(url) sends a GET request and returns a response object; BeautifulSoup(response.text, 'html.parser') parses the HTML into a navigable tree. soup.title.text returns the page title, and soup.find_all('a') returns every anchor tag.
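Besides find_all, BeautifulSoup also accepts CSS selectors through select(), which is often more concise when you need to combine tag, class, and nesting conditions. A minimal sketch, run against an inline HTML snippet (made up for illustration) so it works without a network request:

```python
from bs4 import BeautifulSoup

# A small inline HTML sample, so the example runs offline
html_doc = """
<html><body>
  <div class="quote"><span class="text">Hello</span>
    <a href="/page1">More</a></div>
  <div class="quote"><span class="text">World</span>
    <a href="/page2">More</a></div>
</body></html>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# select() takes a CSS selector, like "div.quote span.text"
texts = [span.text for span in soup.select("div.quote span.text")]
hrefs = [a["href"] for a in soup.select("div.quote a")]
print(texts)  # ['Hello', 'World']
print(hrefs)  # ['/page1', '/page2']
```

The same selectors work unchanged on a soup built from response.text.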

2. Using requests and lxml

2.1 Install dependencies

<code>pip install requests lxml</code>

2.2 Basic usage

<code>import requests
from lxml import html

url = "https://quotes.toscrape.com/"
response = requests.get(url)

tree = html.fromstring(response.text)

quotes = tree.xpath('//div[@class="quote"]')
for quote in quotes:
    text = quote.xpath('.//span[@class="text"]/text()')[0]
    author = quote.xpath('.//small[@class="author"]/text()')[0]
    print(f"名言: {text}, 作者: {author}")</code>

lxml provides full XPath support, which allows flexible element selection and is especially useful for complex page structures.
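XPath goes well beyond exact matches: functions like contains() and positional predicates handle pages where class attributes carry several values or where you only want the nth match. A short sketch on an inline HTML fragment (invented for illustration, so no network is needed):

```python
from lxml import html

# Inline HTML fragment standing in for a fetched page
fragment = """
<div>
  <div class="quote"><span class="text">Quote A</span>
    <small class="author">Author A</small></div>
  <div class="quote"><span class="text">Quote B</span>
    <small class="author">Author B</small></div>
</div>
"""

tree = html.fromstring(fragment)

# contains() matches even when the class attribute has extra values
authors = tree.xpath('//div[contains(@class, "quote")]//small/text()')

# Parenthesised positional predicate: take only the first match
first_quote = tree.xpath('(//span[@class="text"])[1]/text()')[0]

print(authors)      # ['Author A', 'Author B']
print(first_quote)  # Quote A
```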

3. Using Selenium for dynamic pages

3.1 Install dependencies

<code>pip install selenium</code>

3.2 Basic usage

<code>from selenium import webdriver
from selenium.webdriver.common.by import By
import time

driver = webdriver.Chrome(executable_path='/path/to/chromedriver')
driver.get("https://quotes.toscrape.com/js/")
time.sleep(2)

quotes = driver.find_elements(By.CLASS_NAME, 'quote')
for quote in quotes:
    text = quote.find_element(By.CLASS_NAME, 'text').text
    author = quote.find_element(By.CLASS_NAME, 'author').text
    print(f"名言: {text}, 作者: {author}")

driver.quit()</code>

Selenium can capture content rendered by JavaScript, simulating real browser actions such as clicking, scrolling, and form filling.

4. Using Scrapy framework

4.1 Install Scrapy

<code>pip install scrapy</code>

4.2 Create a Scrapy project

<code>scrapy startproject myspider</code>

4.3 Write a Scrapy Spider

<code>import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ['https://quotes.toscrape.com/']

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get(),
            }
        next_page = response.css('li.next a::attr(href)').get()
        if next_page:
            yield response.follow(next_page, self.parse)</code>

4.4 Run the spider

<code>scrapy crawl quotes</code>

Scrapy is a powerful framework for large‑scale crawling, supporting concurrency, automatic cookie handling, retries, pagination, and data export to JSON, CSV, or databases.
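The export mentioned above needs no extra code: Scrapy's built‑in feed exports serialize whatever the spider yields. A CLI sketch, run from inside the project directory (flags as in Scrapy 2.x, where -O overwrites the file and -o appends):

```shell
# Write all yielded items to a JSON file, replacing any previous run
scrapy crawl quotes -O quotes.json

# Append items to a CSV file instead
scrapy crawl quotes -o quotes.csv
```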

5. Other scraping methods

5.1 pyquery

<code>pip install pyquery
from pyquery import PyQuery as pq

url = "https://quotes.toscrape.com/"
doc = pq(url)
for quote in doc('.quote').items():
    text = quote('.text').text()
    author = quote('.author').text()
    print(f"名言: {text}, 作者: {author}")</code>

5.2 requests-html

<code>pip install requests-html
from requests_html import HTMLSession

session = HTMLSession()
url = "https://quotes.toscrape.com/js/"
response = session.get(url)
response.html.render()
quotes = response.html.find('.quote')
for quote in quotes:
    text = quote.find('.text', first=True).text
    author = quote.find('.author', first=True).text
    print(f"名言: {text}, 作者: {author}")</code>

Python offers multiple powerful web‑scraping techniques suitable for different types of sites: requests + BeautifulSoup for static pages, Selenium for JavaScript‑driven content, and Scrapy for large‑scale projects. Choosing the right tool enables efficient data collection for analysis, content aggregation, and many other applications.

Tags: Python, web scraping, Scrapy, Selenium, Requests, BeautifulSoup
Written by

Python Programming Learning Circle

A global community of Chinese Python developers offering technical articles, columns, original video tutorials, and problem sets. Topics include web full‑stack development, web scraping, data analysis, natural language processing, image processing, machine learning, automated testing, DevOps automation, and big data.
