Introduction to Python Web Scraping: Basics, HTTP/HTTPS, Requests Library, Proxies, and Data Extraction
This article introduces Python web scraping: the basic idea of a spider, the HTTP and HTTPS protocols, the Requests library (custom headers, proxies, and cookies), and data extraction with JSON parsing, XPath, and regular expressions.
A web scraper (also called a spider or crawler) is a program that sends requests to websites, retrieves resources such as HTML, JSON, or binary data, and extracts useful information for further processing.
1. Basic Idea of a Spider
Obtain a web page via URL or file.
Analyze the location of the target content.
Use element selectors to quickly extract raw target content.
Process the extracted content, usually assembling it into JSON.
Store the processed data in a database (e.g., MongoDB) or a file.
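The five steps above can be sketched end to end with only the standard library. This is a minimal illustration, not a production crawler: the hard-coded SAMPLE_HTML stands in for a downloaded page, and the names TitleLinkParser, extract_items, and save_items, as well as the output file name, are hypothetical choices made for this example.

```python
import json
import tempfile
from html.parser import HTMLParser
from pathlib import Path

# Step 1 (stand-in): a hard-coded page instead of a live download (assumption).
SAMPLE_HTML = """
<html><body>
  <a class="title" href="/post/1">First post</a>
  <a class="title" href="/post/2">Second post</a>
  <a href="/about">About</a>
</body></html>
"""


class TitleLinkParser(HTMLParser):
    """Steps 2-3: collect the href and text of every <a class="title">."""

    def __init__(self):
        super().__init__()
        self.items = []
        self._current = None  # the link currently being read, or None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("class") == "title":
            self._current = {"url": attrs.get("href"), "title": ""}

    def handle_data(self, data):
        if self._current is not None:
            self._current["title"] += data

    def handle_endtag(self, tag):
        if tag == "a" and self._current is not None:
            self._current["title"] = self._current["title"].strip()
            self.items.append(self._current)
            self._current = None


def extract_items(html):
    """Step 4: return the extracted links as JSON-ready dictionaries."""
    parser = TitleLinkParser()
    parser.feed(html)
    return parser.items


def save_items(items, path):
    """Step 5: store the processed data in a JSON file."""
    text = json.dumps(items, ensure_ascii=False, indent=2)
    Path(path).write_text(text, encoding="utf-8")


items = extract_items(SAMPLE_HTML)
out_path = Path(tempfile.gettempdir()) / "spider_items.json"
save_items(items, out_path)
```

In a real spider the sample string would be replaced by a downloaded page, and the file write could be swapped for a database insert.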
2. Robots Protocol
Websites publish a robots.txt file (the Robots Exclusion Protocol) to indicate which paths crawlers may fetch. Compliance is voluntary, so it is an ethical constraint rather than a technical one.
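A crawler can check these rules itself before requesting a page. The sketch below uses the standard library's urllib.robotparser on an inline rule set so that it runs without network access; the rules, the example.com URLs, the is_allowed helper, and the "my-crawler" user agent are all made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Example rules (assumption): everything is allowed except /private/.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def is_allowed(url, user_agent="my-crawler"):
    """Return True if the parsed rules permit this user agent to fetch url."""
    return parser.can_fetch(user_agent, url)


public_ok = is_allowed("https://example.com/index.html")
private_ok = is_allowed("https://example.com/private/data.html")
```

For a live site, calling parser.set_url() with the site's robots.txt address followed by parser.read() loads the real rules instead of the inline string.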
3. Common Uses of Crawlers
Ticket grabbing (e.g., 12306).
SMS bombing.
Online voting.
Data monitoring.
Downloading images, novels, videos, music, etc.
4. HTTP and HTTPS
HTTP is the HyperText Transfer Protocol (default port 80). It sends data in plain text, which avoids encryption overhead but leaves traffic readable and modifiable in transit. HTTPS runs HTTP over SSL/TLS (default port 443) to encrypt the connection, and it is the norm for modern websites and APIs.
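The default ports matter when a URL does not spell one out. The following small sketch resolves the port a client would connect to; the effective_port helper and the sample URLs are hypothetical and exist only for this example.

```python
from urllib.parse import urlsplit

# Default ports for the two schemes discussed above.
DEFAULT_PORTS = {"http": 80, "https": 443}


def effective_port(url):
    """Return the explicit port in url, or the scheme's default port."""
    parts = urlsplit(url)
    if parts.port is not None:
        return parts.port
    return DEFAULT_PORTS.get(parts.scheme)


http_port = effective_port("http://www.baidu.com")
https_port = effective_port("https://example.com/api")
custom_port = effective_port("http://localhost:8080/")
```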
5. Chrome Request Analysis
Understanding request headers, response status codes, and other details is essential for building effective crawlers.
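When reading the status column in the browser's Network panel, it helps to know which class a code belongs to. The sketch below maps a numeric code to its standard phrase and class using the standard library's http.HTTPStatus; the describe_status helper is a hypothetical name introduced for this example.

```python
from http import HTTPStatus

# The first digit of a status code identifies its class.
STATUS_CLASSES = {
    1: "informational",
    2: "success",
    3: "redirection",
    4: "client error",
    5: "server error",
}


def describe_status(code):
    """Return a label such as '404 Not Found (client error)'."""
    try:
        phrase = HTTPStatus(code).phrase
    except ValueError:
        phrase = "Unknown"  # code is not a registered HTTP status
    status_class = STATUS_CLASSES.get(code // 100, "non-standard")
    return f"{code} {phrase} ({status_class})"


ok_label = describe_status(200)
missing_label = describe_status(404)
```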
6. Using the Requests Library
Install with pip install requests. Below are common usage patterns.
# Import the module
import requests
# Define the request URL
url = 'http://www.baidu.com'
# Send a GET request and get the response
response = requests.get(url)
# Get the HTML content as a string
html = response.text
Common response attributes:
response.text – response body as a string.
response.content – response body as bytes.
response.status_code – HTTP status code.
response.request.headers – request headers.
response.headers – response headers.
response.cookies – cookies object.
# Get byte data and decode to string
content = response.content
html = content.decode('utf-8')
Custom Request Headers
# Define custom headers
headers = {
"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36"
}
GET Request with Parameters
# Define query parameters
params = {"kw": "hello"}
response = requests.get(url, headers=headers, params=params)
html = response.text
POST Request
# Define POST data
data = {"kw": "hello"}
response = requests.post(url, headers=headers, data=data)
html = response.text
7. Using Proxies
A proxy hides the client's real IP address from the target site, and rotating across several proxies spreads requests over multiple addresses.
# Define proxy servers (replace IP_ADDRESS and PORT with real values)
proxies = {
"http": "http://IP_ADDRESS:PORT",
"https": "https://IP_ADDRESS:PORT"
}
response = requests.get(url, headers=headers, proxies=proxies)
html = response.text
8. Sending Cookies
Cookies carry session state, so sending them lets a request act as a logged-in user.
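# Include the Cookie header directly (placeholder value; use the "name=value; name2=value2" string copied from the browser)
headers["Cookie"] = "COOKIE_VALUE"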
# Or use a cookies dict
cookies = {"xx": "yy"}
response = requests.get(url, headers=headers, cookies=cookies)
html = response.text
9. Data Extraction
After fetching pages, extract needed data using various methods.
JSON
Use the built-in json module:
json.loads() – parse a JSON string into Python objects.
json.dumps() – serialize Python objects to a JSON string.
json.load() – read JSON from a file.
json.dump() – write Python objects to a file (use ensure_ascii=False for Chinese characters and indent for pretty printing).
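The four json functions can be seen together in a short round trip. This is a sketch with made-up sample data; the file name record.json and the variable names are arbitrary choices for the example.

```python
import json
import tempfile
from pathlib import Path

# Sample JSON text, as a scraper might receive it from an API (made-up data).
raw = '{"name": "爬虫", "tags": ["python", "requests"], "pages": 3}'

# json.loads: JSON string -> Python dict
record = json.loads(raw)

# json.dumps: by default non-ASCII characters are escaped as \uXXXX
escaped = json.dumps(record)

# ensure_ascii=False keeps Chinese characters readable; indent pretty-prints
readable = json.dumps(record, ensure_ascii=False, indent=2)

# json.dump / json.load: write to and read back from a file
path = Path(tempfile.gettempdir()) / "record.json"
with open(path, "w", encoding="utf-8") as f:
    json.dump(record, f, ensure_ascii=False, indent=2)
with open(path, encoding="utf-8") as f:
    restored = json.load(f)
```

Opening the file with encoding="utf-8" matters once ensure_ascii=False is used, because the output then contains non-ASCII characters.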
XPath
XPath is a query language for selecting nodes in XML/HTML documents. Install the lxml library with pip install lxml, parse the page into a tree, and apply XPath expressions to select nodes.
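To show what selecting nodes with an expression looks like, here is a minimal sketch. Note the substitution: it uses the standard library's xml.etree.ElementTree rather than lxml so that it runs without extra installs. ElementTree supports only a limited subset of XPath and requires well-formed markup, whereas lxml parses real-world HTML and accepts full XPath 1.0 through its xpath() method. The sample document and variable names are made up for illustration.

```python
import xml.etree.ElementTree as ET

# A small, well-formed sample page (made-up data).
DOC = """
<html>
  <body>
    <ul id="books">
      <li class="book"><a href="/b/1">Python Basics</a></li>
      <li class="book"><a href="/b/2">Web Scraping</a></li>
      <li class="ad"><a href="/promo">Sponsored</a></li>
    </ul>
  </body>
</html>
"""

root = ET.fromstring(DOC.strip())

# Select every <a> that sits directly inside an <li class="book">.
links = root.findall(".//li[@class='book']/a")

# Turn the selected nodes into plain dictionaries.
books = [{"title": a.text, "url": a.get("href")} for a in links]
```

The attribute predicate filters out the advertisement entry, which is the typical reason to prefer XPath over plain string searching.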
Regular Expressions
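# Import the re module
import re
# Example values (placeholders for your own pattern and input string)
pattern = r"\d+"
text = "2024 items"
# re.match only matches at the start of the string and returns None on failure
result = re.match(pattern, text)
# Extract the matched text, guarding against a failed match
matched_text = result.group() if result else None
These techniques together form a complete workflow for building Python web crawlers.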
Python Programming Learning Circle
A global community of Chinese Python developers offering technical articles, columns, original video tutorials, and problem sets. Topics include web full‑stack development, web scraping, data analysis, natural language processing, image processing, machine learning, automated testing, DevOps automation, and big data.