Introduction to Python Web Scraping: Basics, HTTP/HTTPS, Requests Library, Proxies, and Data Extraction

This article introduces Python web scraping: the basic idea of a spider, the HTTP and HTTPS protocols, the Requests library (custom headers, proxies, and cookies), and data extraction with JSON parsing, XPath, and regular expressions.

Python Programming Learning Circle

A web scraper (also called a spider or crawler) is a program that sends requests to websites, retrieves resources such as HTML, JSON, or binary data, and extracts useful information for further processing.

1. Basic Idea of a Spider

Obtain a web page via URL or file.

Analyze the location of the target content.

Use element selectors to quickly extract raw target content.

Process the extracted content, usually assembling it into JSON.

Store the processed data in a database (e.g., MongoDB) or a file.
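
The five steps above can be sketched end to end. This is only a minimal illustration using the standard library: the inline PAGE string stands in for a fetched page, and the names LinkCollector, extract_links, and save are assumptions made for this example rather than part of any library.

```python
import json
from html.parser import HTMLParser

# Step 1 (assumed input): a page that would normally be fetched by URL or read from a file.
PAGE = """
<html><body>
  <a href="/news/1">First story</a>
  <a href="/news/2">Second story</a>
</body></html>
"""


class LinkCollector(HTMLParser):
    """Steps 2-3: locate <a> tags and collect their href and text."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._current = None  # the link currently being read, if any

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._current = {"href": dict(attrs).get("href"), "text": ""}

    def handle_data(self, data):
        if self._current is not None:
            self._current["text"] += data

    def handle_endtag(self, tag):
        if tag == "a" and self._current is not None:
            self.links.append(self._current)
            self._current = None


def extract_links(html):
    parser = LinkCollector()
    parser.feed(html)
    return parser.links


def save(payload, path):
    """Step 5: store the JSON text in a file (a database insert could go here instead)."""
    with open(path, "w", encoding="utf-8") as f:
        f.write(payload)


# Step 4: assemble the extracted content into JSON.
links = extract_links(PAGE)
payload = json.dumps(links, ensure_ascii=False, indent=2)
print(payload)
```

Calling save(payload, "links.json") would write the result to disk; the filename is a placeholder.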

2. Robots Protocol

Websites publish a robots.txt file (the Robots protocol) to indicate which pages crawlers may fetch; compliance is voluntary, so it is a moral restriction rather than a technical one.
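
A spider can check these rules before requesting a page. The sketch below uses the standard library's urllib.robotparser; the robots.txt content, the example.com URLs, and the "my-spider" user agent are hypothetical, and a real crawler would download the file from the target site (for example with set_url() and read()) instead of parsing an inline string.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the example needs no network access.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(url, user_agent="my-spider"):
    """Return True if the parsed rules permit user_agent to fetch url."""
    return parser.can_fetch(user_agent, url)


print(allowed("https://example.com/index.html"))    # True
print(allowed("https://example.com/private/data"))  # False
```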

3. Common Uses of Crawlers

Ticket grabbing (e.g., 12306).

SMS bombing.

Online voting.

Data monitoring.

Downloading images, novels, videos, music, etc.

4. HTTP and HTTPS

HTTP, the HyperText Transfer Protocol (default port 80), sends data in plaintext, so it has slightly less overhead but is insecure. HTTPS wraps HTTP in SSL/TLS encryption (default port 443) for secure data transmission and is now the norm for websites and APIs.
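
The default ports can be illustrated with a small helper. This is a sketch only: effective_port and DEFAULT_PORTS are names invented for the example, and it handles just the two schemes discussed here.

```python
from urllib.parse import urlsplit

DEFAULT_PORTS = {"http": 80, "https": 443}


def effective_port(url):
    """Return the explicit port in url, or the default port for its scheme."""
    parts = urlsplit(url)
    return parts.port or DEFAULT_PORTS[parts.scheme]


print(effective_port("http://www.baidu.com"))        # 80
print(effective_port("https://www.baidu.com"))       # 443
print(effective_port("https://example.com:8443/x"))  # 8443
```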

5. Chrome Request Analysis

Understanding request headers, response status codes, and other details is essential for building effective crawlers.
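
When reading status codes in the browser's network panel, it can help to map each code to its standard phrase and class. The helper below is an illustrative sketch built on the standard library's http.HTTPStatus; describe_status is a hypothetical name, and it raises ValueError for codes that HTTPStatus does not define.

```python
from http import HTTPStatus

STATUS_CLASSES = {
    1: "informational",
    2: "success",
    3: "redirection",
    4: "client error",
    5: "server error",
}


def describe_status(code):
    """Return a label such as '404 Not Found (client error)'."""
    phrase = HTTPStatus(code).phrase
    return f"{code} {phrase} ({STATUS_CLASSES[code // 100]})"


for code in (200, 302, 403, 404, 500):
    print(describe_status(code))
```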

6. Using the Requests Library

Install with pip install requests. Below are common usage patterns.

# Import the module
import requests
# Define the request URL
url = 'http://www.baidu.com'
# Send a GET request and get the response
response = requests.get(url)
# Get the HTML content as a string
html = response.text

Common response attributes:

response.text – response body as a string.

response.content – response body as bytes.

response.status_code – HTTP status code.

response.request.headers – request headers.

response.headers – response headers.

response.cookies – cookies object.

# Get byte data and decode to string
content = response.content
html = content.decode('utf-8')
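
Decoding with a fixed 'utf-8' fails if the site uses another encoding. A defensive variant, sketched here under the assumption that the page is either UTF-8 or GBK (common on Chinese sites), tries each candidate in turn; decode_body is a hypothetical helper, not part of Requests.

```python
def decode_body(content, encodings=("utf-8", "gbk")):
    """Decode response bytes, trying each candidate encoding in order."""
    for encoding in encodings:
        try:
            return content.decode(encoding)
        except UnicodeDecodeError:
            continue
    # Last resort: decode with the first encoding and mark undecodable bytes.
    return content.decode(encodings[0], errors="replace")


print(decode_body("你好".encode("utf-8")))  # 你好
print(decode_body("你好".encode("gbk")))    # 你好
```

With a real response this would be called as decode_body(response.content).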

Custom Request Headers

# Define custom headers
headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36"
}
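
If every request should carry the same headers, one option is to set them once on a requests.Session. The sketch below prepares a request without sending it, so the outgoing headers can be inspected offline; the "my-spider/0.1" User-Agent and the example.com URL are placeholders.

```python
import requests

session = requests.Session()
# Headers set here are merged into every request made through this session.
session.headers.update({"User-Agent": "my-spider/0.1"})  # placeholder value

# Build the request without sending it, to see what would go on the wire.
prepared = session.prepare_request(requests.Request("GET", "https://example.com/"))
print(prepared.headers["User-Agent"])  # my-spider/0.1
```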

GET Request with Parameters

# Define query parameters
params = {"kw": "hello"}
response = requests.get(url, headers=headers, params=params)
html = response.text
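
To see how params ends up in the URL, the request can be prepared without being sent. This is an offline sketch; the example.com endpoint and the page parameter are assumptions added for illustration.

```python
import requests

url = "https://example.com/search"   # hypothetical endpoint
params = {"kw": "hello", "page": 2}   # "page" is an extra illustrative parameter

# prepare() builds the final request, including the encoded query string.
prepared = requests.Request("GET", url, params=params).prepare()
print(prepared.url)  # https://example.com/search?kw=hello&page=2
```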

POST Request

# Define POST data
data = {"kw": "hello"}
response = requests.post(url, headers=headers, data=data)
html = response.text
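
With data=, Requests form-encodes the dictionary into the request body. The following offline sketch prepares a POST to a hypothetical example.com endpoint and prints what would be sent.

```python
import requests

url = "https://example.com/translate"  # hypothetical endpoint
data = {"kw": "hello"}

# prepare() encodes the form fields and sets the matching Content-Type header.
prepared = requests.Request("POST", url, data=data).prepare()
print(prepared.headers["Content-Type"])  # application/x-www-form-urlencoded
print(prepared.body)                     # kw=hello
```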

7. Using Proxies

A proxy hides the client's real IP address from the target site, and rotating across several proxies spreads requests over multiple addresses.

# Define proxy servers
proxies = {
    "http": "http://IP_ADDRESS:PORT",
    "https": "https://IP_ADDRESS:PORT"
}
response = requests.get(url, headers=headers, proxies=proxies)
html = response.text
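
One simple way to distribute requests is to rotate through a pool of proxies. The sketch below assumes a hypothetical pool (the 192.0.2.x addresses are documentation-only placeholders) and a helper named next_proxies; it builds the proxies dict only and sends nothing.

```python
from itertools import cycle

# Hypothetical proxy endpoints; replace them with proxies you actually control.
PROXY_POOL = ["http://192.0.2.10:8080", "http://192.0.2.11:8080"]
_rotation = cycle(PROXY_POOL)


def next_proxies():
    """Return a proxies dict that routes both schemes through the next proxy."""
    proxy = next(_rotation)
    return {"http": proxy, "https": proxy}


first = next_proxies()
second = next_proxies()
third = next_proxies()  # the pool has two entries, so this wraps around to the first
print(first["http"], second["http"], third["http"])
```

Each request would then be sent as requests.get(url, headers=headers, proxies=next_proxies()).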

8. Sending Cookies

Cookies carry session state, such as a login, from one request to the next.

# Include Cookie in headers
headers["Cookie"] = "cookie_value"
# Or use a cookies dict
cookies = {"xx": "yy"}
response = requests.get(url, headers=headers, cookies=cookies)
html = response.text
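
To keep cookies across several requests, a requests.Session stores them in its cookie jar and attaches them automatically. The offline sketch below sets a cookie by hand (in practice a login response would set it) and inspects the header a later request would carry; the URL is a placeholder.

```python
import requests

session = requests.Session()
# Normally the server sets this through a login response; here it is set manually.
session.cookies.set("xx", "yy")

# Prepare a request without sending it, to see the Cookie header it would carry.
prepared = session.prepare_request(
    requests.Request("GET", "https://example.com/profile")  # hypothetical URL
)
print(prepared.headers["Cookie"])  # xx=yy
```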

9. Data Extraction

After fetching pages, extract needed data using various methods.

JSON

Use the built-in json module:

json.loads() – parse a JSON string into Python objects.

json.dumps() – serialize Python objects to a JSON string.

json.load() – read JSON from a file.

json.dump() – write Python objects to a file (use ensure_ascii=False for Chinese characters and indent for pretty printing).
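
A short round trip shows the four functions together. The record below is made-up sample data, and an in-memory io.StringIO buffer stands in for a real file.

```python
import io
import json

record = {"title": "你好", "tags": ["python", "spider"], "views": 3}  # sample data

# By default non-ASCII characters are escaped as \uXXXX sequences.
compact = json.dumps(record)
# ensure_ascii=False keeps Chinese characters readable; indent pretty-prints.
readable = json.dumps(record, ensure_ascii=False, indent=2)
print(compact)
print(readable)

# json.dump / json.load work on file objects; StringIO stands in for a file here.
buffer = io.StringIO()
json.dump(record, buffer, ensure_ascii=False, indent=2)
buffer.seek(0)
restored = json.load(buffer)
```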

XPath

XPath is used to navigate XML/HTML documents. Install with pip install lxml and apply expressions to select nodes.
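
As a minimal sketch of XPath with lxml, the example below parses a made-up HTML fragment and selects link text and href attributes; the markup and the class name "book" are assumptions for illustration.

```python
from lxml import etree

# Made-up HTML fragment standing in for a fetched page.
HTML = """
<ul id="books">
  <li class="book"><a href="/b/1">Dive Into Python</a></li>
  <li class="book"><a href="/b/2">Fluent Python</a></li>
</ul>
"""

tree = etree.HTML(HTML)  # parse the HTML into an element tree

# text() selects the text nodes; @href selects the attribute values.
titles = tree.xpath('//li[@class="book"]/a/text()')
hrefs = tree.xpath('//li[@class="book"]/a/@href')
print(titles)  # ['Dive Into Python', 'Fluent Python']
print(hrefs)   # ['/b/1', '/b/2']
```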

Regular Expressions

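# Import the re module
import re
# Example pattern and input string; replace both with your own
pattern = r"\d+"
text = "2024 report"
# re.match only matches at the start of the string and returns None on failure
result = re.match(pattern, text)
# Extract the matched text if there was a match
matched_text = result.group() if result else None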
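
Because re.match only matches at the beginning of the string, re.search and re.findall are often a better fit for pulling values out of HTML. The snippet below is a small sketch on a made-up string.

```python
import re

html = '<a href="/news/1">First</a> <a href="/news/2">Second</a>'  # sample input

# match() anchors at the start of the string, so it finds nothing here.
at_start = re.match(r"/news/\d+", html)
# search() scans the whole string and returns the first match.
first = re.search(r"/news/\d+", html).group()
# findall() returns every match; with one group, it returns the group contents.
all_links = re.findall(r'href="(/news/\d+)"', html)

print(at_start)   # None
print(first)      # /news/1
print(all_links)  # ['/news/1', '/news/2']
```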

These techniques together form a complete workflow for building Python web crawlers.
