Master Python Web Scraping: From Requests to Selenium and Scrapy
Learn how to efficiently scrape web pages using Python by exploring multiple approaches—including simple requests with BeautifulSoup, fast parsing with lxml, dynamic content extraction with Selenium, and large‑scale crawling with Scrapy—complete with installation steps, code snippets, and detailed explanations.
Web Scraping with Python
Web scraping is a core technique in data science and automated data collection. Python is a popular choice because its third‑party libraries simplify HTML parsing and data extraction.
1. Using requests and BeautifulSoup
1.1 Install dependencies
<code>pip install requests beautifulsoup4</code>
1.2 Basic usage
<code>import requests
from bs4 import BeautifulSoup
# Send HTTP GET request
url = "https://www.example.com"
response = requests.get(url)
# Parse HTML content
soup = BeautifulSoup(response.text, 'html.parser')
# Extract title
title = soup.title.text
print("Page title:", title)
# Extract all links
links = soup.find_all('a')
for link in links:
    href = link.get('href')
    print("Link:", href)</code>
requests.get(url) sends a GET request and returns a response object. BeautifulSoup(response.text, 'html.parser') parses the HTML. soup.title.text gets the page title. soup.find_all('a') finds all anchor tags.
2. Using requests and lxml
2.1 Install dependencies
<code>pip install requests lxml</code>
2.2 Basic usage
<code>import requests
from lxml import html
url = "https://quotes.toscrape.com/"
response = requests.get(url)
tree = html.fromstring(response.text)
quotes = tree.xpath('//div[@class="quote"]')
for quote in quotes:
    text = quote.xpath('.//span[@class="text"]/text()')[0]
    author = quote.xpath('.//small[@class="author"]/text()')[0]
    print(f"Quote: {text}, Author: {author}")</code>
lxml provides XPath support, allowing flexible element selection, which is especially useful for complex page structures.
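The standard library's xml.etree.ElementTree supports a useful subset of this XPath syntax (including the `.//tag[@attr="value"]` form used above), so the same selection pattern can be tried on well-formed markup without installing lxml. The fragment below is a made-up stand-in for the quotes page:

```python
import xml.etree.ElementTree as ET

# Well-formed stand-in for one quote block on the page (illustrative only)
doc = """<div>
  <div class="quote">
    <span class="text">To be or not to be</span>
    <small class="author">Shakespeare</small>
  </div>
</div>"""

tree = ET.fromstring(doc)
# Same predicates as the lxml version: select by tag and class attribute
for quote in tree.findall('.//div[@class="quote"]'):
    text = quote.find('.//span[@class="text"]').text
    author = quote.find('.//small[@class="author"]').text
    print(f"Quote: {text}, Author: {author}")
```

Note that ElementTree requires well-formed XML and supports only limited XPath, so for real, messy HTML lxml's `html.fromstring` and full XPath remain the right tool.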
3. Using Selenium for dynamic pages
3.1 Install dependencies
<code>pip install selenium</code>
3.2 Basic usage
<code>from selenium import webdriver
from selenium.webdriver.common.by import By
import time
driver = webdriver.Chrome(executable_path='/path/to/chromedriver')
driver.get("https://quotes.toscrape.com/js/")
time.sleep(2)
quotes = driver.find_elements(By.CLASS_NAME, 'quote')
for quote in quotes:
    text = quote.find_element(By.CLASS_NAME, 'text').text
    author = quote.find_element(By.CLASS_NAME, 'author').text
    print(f"Quote: {text}, Author: {author}")
driver.quit()</code>
Selenium can capture content rendered by JavaScript, simulating real browser actions such as clicking, scrolling, and form filling.
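A fixed time.sleep(2) is fragile: it wastes time on fast pages and breaks on slow ones. Selenium's own answer is an explicit wait, e.g. WebDriverWait(driver, 10).until(EC.presence_of_all_elements_located((By.CLASS_NAME, 'quote'))). The polling idea behind it can be sketched with the stdlib alone (the function and its toy condition below are illustrative, not Selenium's API):

```python
import time

def wait_until(condition, timeout=10.0, poll=0.5):
    """Poll `condition` until it returns a truthy value or `timeout` expires.
    Mirrors the idea behind Selenium's WebDriverWait.until()."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        result = condition()
        if result:
            return result          # condition satisfied: hand back its value
        time.sleep(poll)           # not yet: sleep briefly and retry
    raise TimeoutError(f"condition not met within {timeout:.1f}s")

# Toy condition standing in for "the quote elements are present on the page"
def fake_render():
    return ["quote1", "quote2"]

print(wait_until(fake_render))  # ['quote1', 'quote2']
```

With a real driver, the condition would be a lambda calling driver.find_elements, and a TimeoutError-style exception signals that the page never produced the expected elements.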
4. Using Scrapy framework
4.1 Install Scrapy
<code>pip install scrapy</code>
4.2 Create a Scrapy project
<code>scrapy startproject myspider</code>
4.3 Write a Scrapy Spider
<code>import scrapy
class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ['https://quotes.toscrape.com/']

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get(),
            }
        next_page = response.css('li.next a::attr(href)').get()
        if next_page:
            yield response.follow(next_page, self.parse)</code>
4.4 Run the spider
<code>scrapy crawl quotes</code>
Scrapy is a powerful framework for large‑scale crawling, supporting concurrency, automatic cookie handling, retries, pagination, and data export to JSON, CSV, or databases.
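The heart of the spider above is that parse() both yields items and follows the li.next link, so pagination is handled by recursion on responses. That control flow can be sketched as a plain generator over stubbed pages, with no network or Scrapy installation (the page data below is made up):

```python
# Stubbed "site": each page yields some quotes and may link to a next page.
PAGES = {
    "/page/1/": {"quotes": ["q1", "q2"], "next": "/page/2/"},
    "/page/2/": {"quotes": ["q3"], "next": None},
}

def crawl(start_url):
    """Yield one item per quote and follow 'next' links until exhausted,
    mimicking Spider.parse() plus response.follow()."""
    url = start_url
    while url:
        page = PAGES[url]          # stands in for downloading and parsing a page
        for q in page["quotes"]:   # yield one item dict per quote
            yield {"text": q}
        url = page["next"]         # follow pagination, if any

items = list(crawl("/page/1/"))
print(items)  # [{'text': 'q1'}, {'text': 'q2'}, {'text': 'q3'}]
```

Scrapy adds what this sketch lacks: concurrent downloads, retries, deduplication of already-seen URLs, and export (e.g. `scrapy crawl quotes -O quotes.json`).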
5. Other scraping methods
5.1 pyquery
<code>pip install pyquery</code>
<code>from pyquery import PyQuery as pq
url = "https://quotes.toscrape.com/"
doc = pq(url=url)
for quote in doc('.quote').items():
    text = quote('.text').text()
    author = quote('.author').text()
    print(f"Quote: {text}, Author: {author}")</code>
5.2 requests-html
<code>pip install requests-html</code>
<code>from requests_html import HTMLSession
session = HTMLSession()
url = "https://quotes.toscrape.com/js/"
response = session.get(url)
response.html.render()  # downloads Chromium on first run, then executes the page's JavaScript
quotes = response.html.find('.quote')
for quote in quotes:
    text = quote.find('.text', first=True).text
    author = quote.find('.author', first=True).text
    print(f"Quote: {text}, Author: {author}")</code>
Python offers multiple powerful web‑scraping techniques suitable for different types of sites: requests + BeautifulSoup for static pages, Selenium for JavaScript‑driven content, and Scrapy for large‑scale projects. Choosing the right tool enables efficient data collection for analysis, content aggregation, and many other applications.
Python Programming Learning Circle
A global community of Chinese Python developers offering technical articles, columns, original video tutorials, and problem sets. Topics include web full‑stack development, web scraping, data analysis, natural language processing, image processing, machine learning, automated testing, DevOps automation, and big data.