Master the Most Common HTTP Request Headers for Web Scraping

This guide explains the essential HTTP request header fields (Accept, Accept-Encoding, Accept-Language, User-Agent, Connection, Host, and Referer): what each one means, what values it typically carries, and how setting them helps a Python crawler look like an ordinary browser and fetch web pages reliably.

Python Crawling & Data Mining

Jul 18, 2020

When learning web crawling, you often press F12 or right-click → Inspect to open the browser's developer tools and look at the request headers. Those headers matter because a crawler that sends the same ones can pass for a browser and retrieve page data without drawing attention. The field names and values are terse English, though, and can be confusing at first.

Common Field (1): Accept

Accept: text/html, application/xhtml+xml, application/xml;q=0.9, */*;q=0.8

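The Accept header lists the media (MIME) types the client can handle. Each type may carry a quality factor q between 0 and 1 that ranks preference; a type without q defaults to 1. In the example above, HTML and XHTML are preferred, XML comes next at 0.9, and */* (anything else) is the fallback at 0.8.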
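To make the q ranking concrete, here is a minimal sketch that splits an Accept value into (media type, q) pairs and orders them by preference. The function name parse_accept is made up for this illustration, and the parsing is deliberately simplified: it ignores parameters other than q and does not validate the input.

```python
def parse_accept(header: str) -> list:
    """Return (media_type, q) pairs, most preferred first (simplified)."""
    entries = []
    for item in header.split(","):
        item = item.strip()
        if not item:
            continue
        # "application/xml;q=0.9" -> media type plus optional parameters
        media_type, *params = [piece.strip() for piece in item.split(";")]
        q = 1.0  # a type without an explicit q counts as q=1
        for param in params:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                q = float(value)
        entries.append((media_type, q))
    # sorted() is stable, so types with equal q keep their header order.
    return sorted(entries, key=lambda entry: entry[1], reverse=True)


ranked = parse_accept(
    "text/html, application/xhtml+xml, application/xml;q=0.9, */*;q=0.8"
)
for media_type, q in ranked:
    print(f"{q:.1f}  {media_type}")
```

A crawler rarely needs to parse this header itself; the sketch is only meant to show how a server would read the value a crawler sends.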

Common Field (2): Accept-Encoding

Accept-Encoding: gzip, deflate

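This header tells the server which compression encodings the client can decode, such as gzip and deflate. The server may then compress the response body and names the encoding it chose in the Content-Encoding response header.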
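One practical consequence for a crawler: if it advertises gzip or deflate, it has to be ready to decompress the body. Python's urllib.request does not do this automatically, so a sketch like the following may be needed. The helper name decode_body is hypothetical, and the fallback to raw deflate is an assumption based on the fact that some servers omit the zlib wrapper.

```python
import gzip
import zlib


def decode_body(body: bytes, content_encoding: str) -> bytes:
    """Undo the compression named in a Content-Encoding response header."""
    encoding = content_encoding.strip().lower()
    if encoding == "gzip":
        return gzip.decompress(body)
    if encoding == "deflate":
        try:
            return zlib.decompress(body)  # zlib-wrapped deflate
        except zlib.error:
            # Assumption: the server sent a raw deflate stream instead.
            return zlib.decompress(body, -zlib.MAX_WBITS)
    return body  # "identity" or no encoding: nothing to undo


page = b"<html>hello</html>"
print(decode_body(gzip.compress(page), "gzip"))
print(decode_body(zlib.compress(page), "deflate"))
```

If handling compression is not worth the trouble, a crawler can also leave Accept-Encoding out and receive the body uncompressed.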

Common Field (3): Accept-Language

Accept-Language: zh-CN, zh;q=0.8, en-US;q=0.5, en;q=0.3

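The Accept-Language header lists the languages the client prefers, here Simplified Chinese (zh-CN), generic Chinese (zh), US English (en-US), and generic English (en). The q values rank them the same way as in Accept, so zh-CN (implicit q=1) comes first and en (q=0.3) last.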
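Going the other direction, a crawler that wants to vary its language preferences can build the header value from a list. The sketch below assumes a hypothetical helper build_accept_language that takes (language tag, q) pairs and omits q when it is 1, which is the conventional way to write the value.

```python
def build_accept_language(preferences) -> str:
    """Join (language_tag, q) pairs into an Accept-Language value."""
    parts = []
    for tag, q in preferences:
        # q=1 is the default, so it is normally left out of the header.
        parts.append(tag if q >= 1 else f"{tag};q={q:g}")
    return ", ".join(parts)


accept_language = build_accept_language(
    [("zh-CN", 1.0), ("zh", 0.8), ("en-US", 0.5), ("en", 0.3)]
)
print(accept_language)
```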

Common Field (4): User-Agent

User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:47.0) Gecko/20100101 Firefox/47.0

The User-Agent string identifies the client: browser name and version, operating system, and rendering engine. Crawlers often set it to a real browser's value, because some sites treat the default User-Agent of an HTTP library as a sign of automated traffic.
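As a minimal sketch of how this looks in Python, the standard library's urllib.request lets you attach a User-Agent when building a request; without it, urllib identifies itself as Python-urllib followed by the Python version. The helper name build_request is made up here, and the URL simply reuses the youku.com host from this article. The request is only constructed, not sent.

```python
import urllib.request

# The Firefox 47 string from the example above.
BROWSER_UA = (
    "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:47.0) "
    "Gecko/20100101 Firefox/47.0"
)


def build_request(url: str) -> urllib.request.Request:
    """Create a request that presents itself as a desktop browser."""
    return urllib.request.Request(url, headers={"User-Agent": BROWSER_UA})


request = build_request("https://www.youku.com/")
# urllib stores header names capitalized, hence "User-agent" for the lookup.
print(request.get_header("User-agent"))
# urllib.request.urlopen(request) would actually send it.
```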

Common Field (5): Connection

Connection: keep-alive

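The Connection header controls whether the network connection stays open after the current request/response exchange. keep-alive asks for a persistent connection that later requests can reuse, while close tells the other side to close the connection once the response has been sent. In HTTP/1.1, connections are persistent by default.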
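The rule for deciding whether a connection can be reused fits in a few lines. The following is a simplified sketch of the HTTP/1.0 and HTTP/1.1 behavior only; the function name connection_is_reusable is hypothetical, and real clients consider more than this (timeouts, errors, server limits).

```python
from typing import Optional


def connection_is_reusable(http_version: str,
                           connection_header: Optional[str]) -> bool:
    """Simplified HTTP/1.x rule for keeping a connection open."""
    tokens = {
        token.strip().lower()
        for token in (connection_header or "").split(",")
        if token.strip()
    }
    if "close" in tokens:
        return False  # either side may ask to close after this exchange
    if http_version == "HTTP/1.1":
        return True  # persistent by default in HTTP/1.1
    return "keep-alive" in tokens  # HTTP/1.0 needs an explicit opt-in


print(connection_is_reusable("HTTP/1.1", None))
print(connection_is_reusable("HTTP/1.1", "close"))
print(connection_is_reusable("HTTP/1.0", "keep-alive"))
```

In practice the HTTP library applies this rule for you. For a crawler, the useful part is reusing one connection for several requests to the same host, which saves a new TCP (and TLS) handshake each time.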

Common Field (6): Host and Referer

Host: www.youku.com

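The Host header gives the domain name (and port, if it is not the default) of the server being requested, which lets one IP address serve several sites. The Referer header (not shown in the example above) carries the URL of the page from which the request originated; some sites check it to reject requests that do not appear to come from their own pages.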
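The relationship between a URL and these two headers can be sketched as follows. The helper name host_and_referer and the path /some/page are invented for the illustration; only the youku.com host comes from this article.

```python
from urllib.parse import urlsplit


def host_and_referer(url: str, referer: str) -> dict:
    """Derive the Host value from the target URL and attach a Referer."""
    return {
        # netloc is the host[:port] part of the URL; this sketch assumes
        # the URL carries no user:password@ prefix.
        "Host": urlsplit(url).netloc,
        "Referer": referer,
    }


headers = host_and_referer(
    "https://www.youku.com/some/page",  # hypothetical target page
    "https://www.youku.com/",           # page the request claims to come from
)
print(headers)
```

Python's HTTP clients normally fill in Host from the URL on their own, so in most crawlers only Referer needs to be set by hand.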

Conclusion

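This article covered the HTTP request header fields you will see most often in a browser's developer tools: Accept, Accept-Encoding, Accept-Language, User-Agent, Connection, and Host, plus Referer. Setting them sensibly helps a Python web crawler look like an ordinary browser and fetch data more reliably.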

