Master Python’s Requests: From Basics to Advanced Web Scraping Techniques

This tutorial introduces Python’s Requests library, covering installation, core methods like GET, POST, PUT, PATCH, DELETE, detailed parameters, session handling, exception management, header customization, proxy usage, and practical code examples to empower effective web scraping.

Python Crawling & Data Mining

Jul 8, 2020

Requests is a Python HTTP library that simplifies sending requests such as GET and POST. It is built on top of urllib3 rather than the standard library's urllib module.

Install it with pip install requests. Older guides also mention easy_install requests, but easy_install is deprecated.

Basic Usage

Requests provides several convenient methods:

requests.request() : Build and send a request with any HTTP method; the helpers below are shortcuts for it.

requests.get() : Send a GET request and receive a response.

requests.head() : Retrieve only the response headers.

requests.post() : Submit data to the server, often used for form submissions.

requests.put() : Replace the target document with new data.

requests.patch() : Apply partial updates to a resource.

requests.delete() : Request the server to delete a specified resource.
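As a rough illustration of how these helpers relate, the sketch below prepares one request per HTTP method without sending anything. The URL is a made-up placeholder and build() is a hypothetical helper, not part of Requests. A shortcut such as requests.get(url) corresponds to requests.request("GET", url).

```python
import requests

# Hypothetical endpoint used only for illustration; nothing is sent.
URL = "https://example.com/api/items"


def build(method, **kwargs):
    """Prepare (but do not send) a request so it can be inspected offline."""
    return requests.Request(method, URL, **kwargs).prepare()


methods = ("GET", "HEAD", "POST", "PUT", "PATCH", "DELETE")
prepared = {method: build(method) for method in methods}

for method, req in prepared.items():
    print(method, req.url)
```

Preparing a request this way is handy for checking what would go over the wire before pointing the code at a real server.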

request() Method Parameters

The request() method accepts many arguments, such as url, params, timeout, headers, auth, verify, proxies, cookies, allow_redirects, stream, and cert. These control the request URL, query parameters, timeout, custom headers, authentication, SSL verification, proxy settings, cookie handling, redirect behavior, streaming, and client certificates.
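The sketch below shows how a few of these arguments shape a request. The URL, header value, and cookie value are assumptions made up for the example. Arguments that describe the request itself (params, headers, cookies) are visible on the prepared request, while the others only take effect when it is sent.

```python
import requests

# Hypothetical URL and values, used only for illustration.
request = requests.Request(
    "GET",
    "https://example.com/search",
    params={"q": "python", "page": 2},          # appended to the URL as a query string
    headers={"User-Agent": "demo-client/0.1"},  # custom request headers
    cookies={"session": "abc123"},              # sent in the Cookie header
)
prepared = request.prepare()
print(prepared.url)
print(dict(prepared.headers))

# Options that govern how the request is sent rather than what it contains.
send_options = {
    "timeout": 5,             # seconds to wait for the server before giving up
    "allow_redirects": True,  # follow 3xx responses
    "verify": True,           # validate the server's TLS certificate
    "stream": False,          # download the body immediately
}
# To actually send it (requires network access):
# with requests.Session() as session:
#     response = session.send(prepared, **send_options)
```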

GET Method

GET is typically used to retrieve data. It returns a Response object with these useful attributes:

response.url : The final URL, after any redirects.

response.status_code : The HTTP status code.

response.encoding : The encoding used to decode response.text, taken from the response headers when available.

response.cookies : Cookies set by the server.

response.headers : Response headers.

response.content : The body as raw bytes.

response.text : The body decoded to a string.

response.json() : The body parsed as JSON, usually a dictionary or a list.
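To try these attributes without depending on a real website, the following sketch starts a tiny local HTTP server and sends a GET to it. DemoHandler, the /items path, and the page parameter are all assumptions invented for the example. The session sets trust_env to False only so that proxy environment variables do not interfere with the local call.

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

import requests


class DemoHandler(BaseHTTPRequestHandler):
    """Local stand-in server that echoes the requested path as JSON."""

    def do_GET(self):
        body = json.dumps({"path": self.path}).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # silence per-request logging


server = HTTPServer(("127.0.0.1", 0), DemoHandler)  # port 0: pick any free port
threading.Thread(target=server.serve_forever, daemon=True).start()
base_url = f"http://127.0.0.1:{server.server_port}"

with requests.Session() as session:
    session.trust_env = False  # ignore proxy settings from the environment
    response = session.get(f"{base_url}/items", params={"page": 1}, timeout=5)

print(response.url)                      # final URL, including the query string
print(response.status_code)              # HTTP status code
print(response.encoding)                 # charset taken from Content-Type
print(response.headers["Content-Type"])  # header lookup is case-insensitive
print(response.content)                  # raw bytes
print(response.text)                     # decoded string
print(response.json())                   # parsed JSON

server.shutdown()
server.server_close()
```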

POST Method

POST is commonly used for form submissions, file uploads, and sending JSON payloads.
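The three uses can be sketched as follows. The endpoint, field names, and file contents are placeholders invented for the example, and the requests are only prepared, not sent, so the resulting bodies and Content-Type headers can be inspected.

```python
import json

import requests

URL = "https://example.com/submit"  # hypothetical endpoint; nothing is sent

# Form submission: `data` is URL-encoded into the body.
form = requests.Request("POST", URL, data={"user": "alice", "lang": "py"}).prepare()

# JSON payload: `json` is serialized and the Content-Type header is set for you.
payload = requests.Request("POST", URL, json={"user": "alice"}).prepare()

# File upload: `files` builds a multipart/form-data body.
upload = requests.Request(
    "POST", URL, files={"file": ("hello.txt", b"hello world")}
).prepare()

print(form.headers["Content-Type"], form.body)
print(payload.headers["Content-Type"], json.loads(payload.body))
print(upload.headers["Content-Type"])
```

With a real server, the same keyword arguments are passed straight to requests.post(url, data=...), requests.post(url, json=...), or requests.post(url, files=...).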

PUT Method

PUT replaces the content of a specified document on the server with data from the client.

PATCH Method

PATCH submits partial updates to a URL.

DELETE Method

DELETE requests the server to remove the specified resource.
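A minimal sketch of the difference between PUT, PATCH, and DELETE, assuming a hypothetical REST-style resource at the URL shown. The field names are made up, and the requests are prepared rather than sent.

```python
import json

import requests

ITEM_URL = "https://example.com/api/items/42"  # hypothetical resource

# PUT: send the complete new representation of the resource.
put = requests.Request("PUT", ITEM_URL, json={"name": "widget", "price": 10}).prepare()

# PATCH: send only the fields that change.
patch = requests.Request("PATCH", ITEM_URL, json={"price": 12}).prepare()

# DELETE: usually needs no body at all.
delete = requests.Request("DELETE", ITEM_URL).prepare()

for req in (put, patch, delete):
    body = json.loads(req.body) if req.body else None
    print(req.method, req.url, body)
```

Against a live API these map to requests.put(), requests.patch(), and requests.delete() with the same arguments.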

Advanced Operations

Session Persistence

# Simulate Taobao login
import requests
url='https://login.taobao.com/member/login.jhtml?redirectURL=https%3A%2F%2Fai.taobao.com%2F%3Fpid%3Dmm_26632323_6762370_25910879'
headers={'user-agent':'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36'}
formdata={'TPL_username':'fsdafdfasf','TPL_password':'fsadfasf'}
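se = requests.Session()  # a Session keeps cookies and connections across requests
ss = se.post(url=url, headers=headers, data=formdata)
# A 200 status only shows that the HTTP request succeeded;
# a real login check should also inspect the response body or cookies.
if ss.status_code == 200:
    print('Login succeeded')
else:
    print('Login failed')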
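What a session persists can be seen without contacting any server. In this sketch the header value, cookie name, and URLs are assumptions for illustration. Two requests prepared through the same session both carry the session's headers and cookies.

```python
import requests

session = requests.Session()
session.headers.update({"User-Agent": "demo-client/0.1"})  # hypothetical value
session.cookies.set("sessionid", "abc123")  # stands in for a cookie set at login

# Prepare two requests through the session without sending them.
first = session.prepare_request(
    requests.Request("GET", "https://example.com/profile")
)
second = session.prepare_request(
    requests.Request("GET", "https://example.com/orders")
)

for req in (first, second):
    print(req.url, req.headers["User-Agent"], req.headers["Cookie"])
```

In real use the cookies would come from the server's login response, and later calls such as session.get() would send them back automatically.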

Exception Handling

Common exceptions include Timeout, ConnectionError, and TooManyRedirects. All explicit exceptions inherit from requests.exceptions.RequestException.

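A minimal sketch of catching these exceptions. The safe_get() helper and its return shape are hypothetical. The specific exceptions are handled first, with RequestException as the catch-all. The demonstration uses a URL with no scheme, which Requests rejects before any network traffic occurs.

```python
import requests


def safe_get(url, timeout=5):
    """Return (response, error_name); exactly one of the two is None."""
    try:
        response = requests.get(url, timeout=timeout)
        response.raise_for_status()  # turn 4xx/5xx statuses into HTTPError
        return response, None
    except requests.exceptions.Timeout:
        return None, "Timeout"
    except requests.exceptions.ConnectionError:
        return None, "ConnectionError"
    except requests.exceptions.TooManyRedirects:
        return None, "TooManyRedirects"
    except requests.exceptions.RequestException as exc:
        # Catch-all for every other error raised by Requests.
        return None, type(exc).__name__


# A URL without a scheme fails validation, so no request is sent.
print(safe_get("not-a-url"))
```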

Certificate Verification

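import requests
import urllib3

# Suppress the InsecureRequestWarning that verify=False would otherwise trigger
urllib3.disable_warnings()
# verify=False skips TLS certificate validation; use it only for testing
rep = requests.get("https://www.baidu.com", verify=False)
print(rep.status_code)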

Cookie Parsing

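import requests

# Raw Cookie header value copied from the browser (elided here)
cookie_str = '_NTES_PASSPORT=...'
cookie = {}
for item in cookie_str.split(';'):
    # Split on the first '=' only, because cookie values may contain '='
    k, v = item.strip().split('=', 1)
    cookie[k] = v
for k, v in cookie.items():
    print(k, ':', v)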
# Convert dict to CookieJar and back
cookiesJar = requests.utils.cookiejar_from_dict(cookie, cookiejar=None, overwrite=True)
print(requests.utils.dict_from_cookiejar(cookiesJar))

Browser Emulation (Headers)

Common request headers and their purposes:

Accept : Content types the client can handle (e.g., text/html, application/xml).

Accept-Encoding : Compression algorithms supported (e.g., gzip, deflate).

Accept-Language : Preferred languages (e.g., zh-CN, en-US).

User-Agent : Identifies the client software, OS, and browser version.

Connection : Indicates whether to keep the TCP connection alive.

Host : The target server’s domain name.

Referer : The URL of the page that linked to the requested resource.
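A short sketch of passing such headers to Requests. The header values reuse the examples from the list above and the User-Agent string from this article. The URLs are placeholders. Host and Connection are normally filled in for you, so they are left out here.

```python
import requests

# Example values only; the URLs are placeholders and nothing is sent.
browser_headers = {
    "Accept": "text/html,application/xml",
    "Accept-Encoding": "gzip, deflate",
    "Accept-Language": "zh-CN,en-US",
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36"
    ),
    "Referer": "https://example.com/",
}

prepared = requests.Request(
    "GET", "https://example.com/page", headers=browser_headers
).prepare()

for name, value in prepared.headers.items():
    print(f"{name}: {value}")
```

The same dictionary can be passed as headers= to requests.get(), or stored once with session.headers.update() so every request in a session uses it.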

Using Proxy Servers

import urllib.request
import http.cookiejar
url = "https://www.baidu.com"
headers = {
    'Accept':'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Encoding':'gzip, deflate, br',
    'Accept-Language':'zh-CN,zh;q=0.9',
    'Cache-Control':'max-age=0',
    'Connection':'keep-alive',
    'Cookie':'BAIDUID=...; ...',
    'Host':'www.baidu.com',
    'Sec-Fetch-Mode':'navigate',
    'Sec-Fetch-Site':'cross-site',
    'Sec-Fetch-User':'?1',
    'Upgrade-Insecure-Requests':'1',
    'User-Agent':'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36'
}
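jar = http.cookiejar.CookieJar()
# Map each URL scheme to the proxy; the target URL is https, so an 'https' entry is needed for the proxy to be used
proxy = urllib.request.ProxyHandler({'http': "127.0.0.1:8000", 'https': "127.0.0.1:8000"})
opener = urllib.request.build_opener(proxy, urllib.request.HTTPHandler, urllib.request.HTTPCookieProcessor(jar))
opener.addheaders = list(headers.items())  # addheaders expects a list of (name, value) tuples
urllib.request.install_opener(opener)  # make this opener the default for urlopen()
data = urllib.request.urlopen(url).read()
# urllib does not decompress the body; with the Accept-Encoding header above, data may still be compressed
with open(r"C:\Users\Administrator\Desktop\et.html", "wb") as f:
    f.write(data)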
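The example above uses the standard library's urllib. With Requests, a comparable setup might look like the sketch below, which reuses the same local proxy address. The fetch_via_proxy() helper is hypothetical and is only defined, not called, because it needs a proxy that is actually listening. select_proxy() from requests.utils shows which proxy would be chosen for a URL.

```python
import requests
from requests.utils import select_proxy

# Same local proxy address as in the urllib example; adjust to your own setup.
proxies = {
    "http": "http://127.0.0.1:8000",
    "https": "http://127.0.0.1:8000",
}
url = "https://www.baidu.com"

# Show which proxy Requests would pick for this URL (no request is sent).
print(select_proxy(url, proxies))


def fetch_via_proxy(target, timeout=10):
    """Send a GET through the proxy; needs a proxy actually listening there."""
    return requests.get(target, proxies=proxies, timeout=timeout)
```

Omitting the proxies argument sends the request directly, unless proxy environment variables are set.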

Without Proxy Server

import urllib.request
import http.cookiejar
url = "https://www.baidu.com"
headers = { ... same as above ... }
jar = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPHandler, urllib.request.HTTPCookieProcessor(jar))
opener.addheaders = list(headers.items())  # addheaders expects a list of (name, value) tuples
urllib.request.install_opener(opener)
data = urllib.request.urlopen(url).read()
with open(r"C:\Users\Administrator\Desktop\et.html", "wb") as f:
    f.write(data)

Conclusion

This article examined seven commonly used methods of the requests library, providing code snippets and explanations to help readers effectively perform web scraping and HTTP interactions with Python.
