Master Python Web Scraping: From Requests to Selenium and Scrapy
Learn how to efficiently scrape web pages using Python by exploring multiple approaches—including simple requests with BeautifulSoup, fast parsing with lxml, dynamic content extraction with Selenium, and large‑scale crawling with Scrapy—complete with installation steps, code snippets, and detailed explanations.
Web Scraping with Python
Web scraping is a core technique for collecting data in data science and web crawling. Python is a popular choice for it because its third‑party libraries simplify HTTP requests, HTML parsing, and data extraction.
1. Using requests and BeautifulSoup
1.1 Install dependencies
pip install requests beautifulsoup4
1.2 Basic usage
import requests
from bs4 import BeautifulSoup

# Send an HTTP GET request
url = "https://www.example.com"
response = requests.get(url, timeout=10)
response.raise_for_status()  # stop early on 4xx/5xx responses

# Parse the HTML content
soup = BeautifulSoup(response.text, 'html.parser')

# Extract the title (soup.title is None when the page has no <title>)
title = soup.title.text if soup.title else ""
print("Page title:", title)

# Extract all links
links = soup.find_all('a')
for link in links:
    href = link.get('href')
    print("Link:", href)

requests.get(url) sends a GET request and returns a response object. BeautifulSoup(response.text, 'html.parser') parses the HTML. soup.title.text gets the page title. soup.find_all('a') finds all anchor tags.
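link.get('href') returns None for anchors that have no href attribute, and many of the values it does return are relative paths. The following is a minimal sketch of one way to clean such values up with the standard library's urllib.parse.urljoin. The helper name absolute_links and the sample hrefs are hypothetical and only serve as an illustration.

```python
from urllib.parse import urljoin


def absolute_links(base_url, hrefs):
    """Resolve hrefs against base_url, skipping empty values and duplicates."""
    seen = set()
    result = []
    for href in hrefs:
        if not href or href.startswith("#"):
            # Anchors without an href yield None; "#..." only jumps within the page.
            continue
        absolute = urljoin(base_url, href)
        if absolute not in seen:
            seen.add(absolute)
            result.append(absolute)
    return result


# Hypothetical hrefs, in the form link.get('href') might return them.
sample = ["/about", None, "#top", "https://www.iana.org/domains/example", "/about"]
links = absolute_links("https://www.example.com", sample)
print(links)
```

With the BeautifulSoup objects from the example above, the same helper could be called as absolute_links(url, (a.get('href') for a in soup.find_all('a'))).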
2. Using requests and lxml
2.1 Install dependencies
pip install requests lxml
2.2 Basic usage
import requests
from lxml import html

url = "https://quotes.toscrape.com/"
response = requests.get(url, timeout=10)
tree = html.fromstring(response.text)

# Each quote sits in a <div class="quote"> element
quotes = tree.xpath('//div[@class="quote"]')
for quote in quotes:
    # xpath() returns a list; [0] takes the first match
    text = quote.xpath('.//span[@class="text"]/text()')[0]
    author = quote.xpath('.//small[@class="author"]/text()')[0]
    print(f"Quote: {text}, Author: {author}")

lxml provides XPath support, allowing flexible element selection, which is especially useful for complex page structures.
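Because xpath() always returns a list, indexing it with [0] raises IndexError as soon as one element is missing from the page. The following is a minimal sketch of a defensive alternative. The helper name first_text is hypothetical, and the plain lists stand in for real XPath results so that the snippet runs without lxml or network access.

```python
def first_text(nodes, default=""):
    """Return the first non-empty text result, stripped, or default if none matched."""
    for node in nodes:
        text = str(node).strip()
        if text:
            return text
    return default


# Plain lists standing in for the results of quote.xpath('.../text()').
print(first_text(["  Albert Einstein \n"]))
print(first_text([], default="unknown"))
```

In the loop above, author = first_text(quote.xpath('.//small[@class="author"]/text()'), "unknown") would then keep the scraper running when an author element is absent.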
3. Using Selenium for dynamic pages
3.1 Install dependencies
pip install selenium
3.2 Basic usage
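from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
import time

# Selenium 4 takes the driver path through a Service object;
# the old executable_path argument to webdriver.Chrome is no longer accepted.
driver = webdriver.Chrome(service=Service(executable_path='/path/to/chromedriver'))
driver.get("https://quotes.toscrape.com/js/")
time.sleep(2)  # give the page's JavaScript time to render the quotes

quotes = driver.find_elements(By.CLASS_NAME, 'quote')
for quote in quotes:
    text = quote.find_element(By.CLASS_NAME, 'text').text
    author = quote.find_element(By.CLASS_NAME, 'author').text
    print(f"Quote: {text}, Author: {author}")
driver.quit()

Selenium drives a real browser, so it can capture content rendered by JavaScript and simulate user actions such as clicking, scrolling, and form filling.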
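A fixed time.sleep(2) is either longer than necessary or too short on a slow connection. Selenium ships its own explicit-wait helper, WebDriverWait in selenium.webdriver.support.ui, for this purpose. The sketch below is not that class; it only illustrates the underlying polling idea in plain Python so that it runs without a browser. The names wait_until and fake_find_quotes are hypothetical.

```python
import time


def wait_until(condition, timeout=10.0, interval=0.5):
    """Call condition() until it returns a truthy value or timeout seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(interval)


# Stand-in for a page whose elements only appear on the third poll.
attempts = {"count": 0}


def fake_find_quotes():
    attempts["count"] += 1
    return ["quote-1", "quote-2"] if attempts["count"] >= 3 else []


found = wait_until(fake_find_quotes, timeout=2.0, interval=0.01)
print(found, attempts["count"])
```

With a real driver, a call such as wait_until(lambda: driver.find_elements(By.CLASS_NAME, 'quote')) could take the place of the fixed sleep, since find_elements returns an empty list until the elements exist.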
4. Using Scrapy framework
4.1 Install Scrapy
pip install scrapy
4.2 Create a Scrapy project
scrapy startproject myspider
4.3 Write a Scrapy Spider
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"  # the name used by "scrapy crawl"
    start_urls = ['https://quotes.toscrape.com/']

    def parse(self, response):
        # Yield one item per quote on the current page
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get(),
            }

        # Follow the "Next" link, if any, and parse that page the same way
        next_page = response.css('li.next a::attr(href)').get()
        if next_page:
            yield response.follow(next_page, self.parse)
4.4 Run the spider
scrapy crawl quotes

Scrapy is a powerful framework for large‑scale crawling, supporting concurrency, automatic cookie handling, retries, pagination, and data export to JSON, CSV, or databases.
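Most of the features listed above are controlled through the project's settings.py. The excerpt below is a sketch of how such a configuration might look. The setting names are Scrapy's own, but every value and the output file name quotes.json are assumptions chosen for illustration, not recommendations from the original article.

```python
# settings.py (excerpt) -- illustrative values only
ROBOTSTXT_OBEY = True      # respect the site's robots.txt
CONCURRENT_REQUESTS = 8    # upper bound on simultaneous requests
DOWNLOAD_DELAY = 1.0       # seconds to wait between requests to the same site
RETRY_ENABLED = True
RETRY_TIMES = 2            # extra attempts after the first failed download
COOKIES_ENABLED = True     # cookie middleware (already on by default)

# Write every yielded item to a JSON file, replacing the file on each run.
FEEDS = {
    "quotes.json": {"format": "json", "encoding": "utf8", "overwrite": True},
}
```

Changing "format" to "csv" (and the file name accordingly) would export the same items as CSV instead.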
5. Other scraping methods
5.1 pyquery
pip install pyquery
from pyquery import PyQuery as pq

url = "https://quotes.toscrape.com/"
doc = pq(url=url)  # fetch the page and parse it
# Select elements with jQuery-style CSS selectors
for quote in doc('.quote').items():
    text = quote('.text').text()
    author = quote('.author').text()
    print(f"Quote: {text}, Author: {author}")
5.2 requests-html
pip install requests-html
from requests_html import HTMLSession

session = HTMLSession()
url = "https://quotes.toscrape.com/js/"
response = session.get(url)
response.html.render()  # run the page's JavaScript; downloads Chromium on first use
quotes = response.html.find('.quote')
for quote in quotes:
    text = quote.find('.text', first=True).text
    author = quote.find('.author', first=True).text
    print(f"Quote: {text}, Author: {author}")

Python offers multiple powerful web‑scraping techniques suitable for different types of sites: requests + BeautifulSoup for static pages, Selenium for JavaScript‑driven content, and Scrapy for large‑scale projects. Choosing the right tool enables efficient data collection for analysis, content aggregation, and many other applications.
Python Programming Learning Circle
A global community of Chinese Python developers offering technical articles, columns, original video tutorials, and problem sets. Topics include web full‑stack development, web scraping, data analysis, natural language processing, image processing, machine learning, automated testing, DevOps automation, and big data.
