What Is a Web Crawler? Definitions, Types, and How It Works
This article explains what web crawlers are, how they are classified, their typical use cases, and their step‑by‑step workflow. It then covers the robots protocol and the fundamentals of HTTP and HTTPS: request and response structure, common methods, headers, status codes, and the security trade‑offs of HTTPS.
Web Crawler Definition, Types, and Process
Definition
Web crawlers (also called spiders or bots) simulate a browser: they send network requests, receive responses, and automatically fetch information from the Internet according to predefined rules. The more closely a crawler mimics a real browser, the harder it is to detect. In principle, a crawler can do anything a browser can do.
Classification
General crawler: crawls the web broadly; typically used by search engines.
Focused crawler: targets a specific website or topic and collects only the data it needs.
Uses
News aggregation (e.g., Toutiao)
Music platforms (e.g., NetEase Cloud Music)
Ticket booking (e.g., 12306)
Website voting
SMS bombing
etc.
Process
Send a request to the start URL and obtain the response.
Extract data from the response.
If a URL is extracted, send another request to retrieve its response.
If data is extracted, save the data.
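The four steps above can be sketched as a small breadth-first loop. This is a minimal illustration, not a production crawler: the `crawl` function, the `LinkParser` class, and the in-memory `PAGES` dictionary (which stands in for real network requests to a hypothetical `example.test` host) are all assumptions made for the example.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag


class LinkParser(HTMLParser):
    """Collects href values from <a> tags and the text of <title>."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.title = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def crawl(start_url, fetch, max_pages=10):
    """Breadth-first crawl; fetch(url) returns HTML text, or None on failure."""
    queue = deque([start_url])
    seen = {start_url}
    saved = {}
    while queue and len(saved) < max_pages:
        url = queue.popleft()
        html = fetch(url)                      # step 1: request, get response
        if html is None:
            continue
        parser = LinkParser()
        parser.feed(html)                      # step 2: extract from response
        saved[url] = parser.title.strip()      # step 4: save extracted data
        for href in parser.links:              # step 3: follow extracted URLs
            absolute, _fragment = urldefrag(urljoin(url, href))
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return saved


# Hypothetical pages used instead of the network.
PAGES = {
    "http://example.test/": '<title>Home</title><a href="/a">A</a><a href="/b#top">B</a>',
    "http://example.test/a": '<title>Page A</title><a href="/">home</a>',
    "http://example.test/b": "<title>Page B</title>",
}
result = crawl("http://example.test/", PAGES.get)
```

Passing `fetch` in as a parameter keeps the loop independent of how pages are downloaded; a real crawler would supply a function that performs an HTTP request and would also add rate limiting and error handling.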
Robots Protocol
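Websites use the robots protocol (a robots.txt file at the site root) to tell crawlers, search engines in particular, which pages may and may not be crawled. It is an ethical convention rather than a technical barrier: nothing in the protocol enforces it (e.g., Taobao's robots.txt).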
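A well-behaved crawler can check these rules before requesting a page. The sketch below uses Python's standard `urllib.robotparser`; the rules in `ROBOTS_TXT`, the `example-crawler` user agent, and the `allowed` helper are made up for illustration, and a real crawler would download the site's actual robots.txt instead.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules; a real crawler would fetch <site>/robots.txt.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Allow: /
"""

robots = RobotFileParser()
robots.parse(ROBOTS_TXT.splitlines())


def allowed(url, user_agent="example-crawler"):
    """Return True if the parsed rules permit user_agent to fetch url."""
    return robots.can_fetch(user_agent, url)
```

With these rules, `allowed("https://example.test/index.html")` is `True` and `allowed("https://example.test/private/data")` is `False`. The check is voluntary: it only helps if the crawler chooses to call it.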
HTTP and HTTPS Concepts
HTTP
HTTP (Hypertext Transfer Protocol) is an application‑layer client/server communication protocol composed of requests and responses and is stateless. The protocol defines the data format that both parties must follow.
HTTP Request Flow
Browser resolves the domain name via DNS to get the IP address.
Browser sends a request to the IP and receives a response.
The response HTML contains URLs for CSS, JS, images, and AJAX endpoints; the browser then requests each of these resources.
Each received response is rendered, and scripts may trigger further requests.
The whole sequence constitutes the browser’s rendering process.
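A crawler that fetches only the first HTML response does not see the follow-up requests described above, so it can help to list the sub-resources a page refers to. The following is a rough sketch under assumptions: `ResourceParser`, the sample `HTML`, and the `example.test` URLs are hypothetical, and only `link`, `script`, and `img` tags are considered.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ResourceParser(HTMLParser):
    """Collects absolute URLs of stylesheets, scripts, and images."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.resources = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "link" and attrs.get("href"):
            ref = attrs["href"]
        elif tag in ("script", "img") and attrs.get("src"):
            ref = attrs["src"]
        else:
            return
        # Relative references are resolved against the page URL.
        self.resources.append(urljoin(self.base_url, ref))


HTML = (
    '<link rel="stylesheet" href="/css/site.css">'
    '<script src="js/app.js"></script>'
    '<img src="https://cdn.example.test/logo.png">'
)
page = ResourceParser("http://example.test/news/index.html")
page.feed(HTML)
```

Requests triggered later by scripts (for example AJAX calls) do not appear in the static HTML, which is one reason some crawlers drive a real browser engine instead.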
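Five‑Layer Network Model
From bottom to top, the five layers are: physical, data link, network, transport, and application. HTTP belongs to the application layer and normally runs over TCP at the transport layer.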
URL Format
Format: scheme://host[:port]/path/…/[?query‑string][#anchor]
scheme – protocol (e.g., http, https, ftp)
host – server IP or domain name
port – server port (default 80 for HTTP, 443 for HTTPS)
path – resource path
query‑string – parameters sent to the server
anchor – fragment identifier
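To see how the components map onto a concrete URL, the standard `urllib.parse` module can split one apart. The URL below and the `effective_port` helper are invented for the example.

```python
from urllib.parse import parse_qs, urlsplit

url = "https://www.example.com:8443/search/results?q=crawler&page=2#top"
parts = urlsplit(url)

scheme = parts.scheme            # "https"
host = parts.hostname            # "www.example.com"
port = parts.port                # 8443
path = parts.path                # "/search/results"
query = parse_qs(parts.query)    # {"q": ["crawler"], "page": ["2"]}
anchor = parts.fragment          # "top"

DEFAULT_PORTS = {"http": 80, "https": 443}


def effective_port(some_url):
    """Return the explicit port, or the scheme's default when none is given."""
    pieces = urlsplit(some_url)
    return pieces.port or DEFAULT_PORTS.get(pieces.scheme)
```

Note that the fragment (anchor) is handled by the client and is not sent to the server, so crawlers usually strip it before deduplicating URLs.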
HTTP Request Methods
The HTTP standard defines a set of request methods, each indicating the action the client wants performed on the target resource.
HTTP/1.0 defines GET, POST, HEAD.
HTTP/1.1 adds OPTIONS, PUT, DELETE, TRACE, CONNECT.
Common methods:
GET : retrieve the specified resource; the response includes a body.
HEAD : like GET but returns only headers.
POST : submit data to be processed (e.g., form submission).
PUT : replace the target document with the supplied data.
DELETE : request deletion of the specified resource.
CONNECT : used by proxies to establish a tunnel.
OPTIONS : query server capabilities.
TRACE : echo the received request for diagnostics.
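As an illustration of how the method is chosen on the client side, the sketch below builds (but does not send) requests with the standard `urllib.request.Request` class. The `build_request` helper and the example URLs are assumptions for this example.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_request(url, method="GET", form=None):
    """Create an unsent request; form data, if any, becomes the request body."""
    data = urlencode(form).encode("utf-8") if form else None
    return Request(url, data=data, method=method)


get_req = build_request("https://www.example.com/articles/1")
head_req = build_request("https://www.example.com/articles/1", method="HEAD")
post_req = build_request(
    "https://www.example.com/login",
    method="POST",
    form={"user": "alice", "lang": "en"},
)
```

A HEAD request is a cheap way for a crawler to check headers such as Content-Type or Content-Length before deciding whether to download the body with GET.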
Common Request Headers
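Cookie : cookies previously set by the server, sent back with the request
User-Agent : identifies the client software (e.g., browser name and version)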
Referer : page that linked to the request
Host : host and port
Connection : connection type
Upgrade-Insecure-Requests : signals that the client prefers an encrypted (HTTPS) response
Accept : accepted media types
Accept-Encoding : supported content encodings (e.g., gzip)
x-requested-with: XMLHttpRequest : conventionally marks an AJAX request
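Crawlers typically set several of these headers explicitly so that the request resembles one sent by a browser. The following sketch attaches headers to an unsent `urllib.request.Request`; every header value and the URL are placeholders chosen for the example.

```python
from urllib.request import Request

# Placeholder values; adjust them to the client you want to imitate.
HEADERS = {
    "User-Agent": "Mozilla/5.0 (compatible; example-crawler/0.1)",
    "Referer": "https://www.example.com/",
    "Accept": "text/html,application/xhtml+xml",
    "Accept-Encoding": "identity",
    "X-Requested-With": "XMLHttpRequest",
}

req = Request("https://www.example.com/api/items", headers=HEADERS)

# Header names are case-insensitive, so compare them in lower case.
sent = {name.lower(): value for name, value in req.header_items()}
```

`Accept-Encoding: identity` asks for an uncompressed body, which keeps this example simple because `urllib` does not decompress gzip responses automatically.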
HTTP Response
An HTTP response consists of a status line, response headers, a blank line, and the response body.
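The four parts can be made visible by splitting a raw response by hand. This is a simplified sketch: the `RAW` bytes and the `parse_response` function are made up for illustration, and the code ignores details a real client must handle, such as repeated headers and chunked bodies.

```python
RAW = (
    b"HTTP/1.1 200 OK\r\n"
    b"Content-Type: text/html; charset=utf-8\r\n"
    b"Content-Length: 13\r\n"
    b"\r\n"
    b"<p>hello</p>\n"
)


def parse_response(raw):
    """Split raw bytes into (version, code, reason, headers, body)."""
    # The blank line (CRLF CRLF) separates the head from the body.
    head, _sep, body = raw.partition(b"\r\n\r\n")
    status_line, *header_lines = head.decode("iso-8859-1").split("\r\n")
    pieces = status_line.split(" ", 2)
    version, code = pieces[0], int(pieces[1])
    reason = pieces[2] if len(pieces) > 2 else ""
    headers = {}
    for line in header_lines:
        name, _colon, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return version, code, reason, headers, body


version, code, reason, headers, body = parse_response(RAW)
```

In practice a library such as the standard `http.client` does this parsing; the manual version is only meant to show where each part begins and ends.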
Response Headers
Location : used with 3xx redirects (e.g., 301, 302) to indicate the redirect target.
Set-Cookie : asks the client to store a cookie and send it back on later requests.
Content-Type : MIME type of the returned data.
Server : server software information.
Content-Length : length of the response body in bytes.
Connection : whether to keep the connection alive.
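Two of these headers matter especially to crawlers: Location, which may hold a relative URL, and Set-Cookie, which carries session state. A minimal sketch of handling both with the standard library follows; the helper names and the sample header values are assumptions, and only a single Set-Cookie header is considered.

```python
from http.cookies import SimpleCookie
from urllib.parse import urljoin


def redirect_target(request_url, headers):
    """Resolve the Location header against the URL that was requested."""
    location = headers.get("Location")
    return urljoin(request_url, location) if location else None


def cookies_from(headers):
    """Return cookie names and values from one Set-Cookie header."""
    jar = SimpleCookie()
    jar.load(headers.get("Set-Cookie", ""))
    return {name: morsel.value for name, morsel in jar.items()}


# Hypothetical headers from a 302 response.
response_headers = {
    "Location": "/new/page",
    "Set-Cookie": "sessionid=abc123; Path=/; HttpOnly",
}
target = redirect_target("http://example.test/old/page", response_headers)
cookies = cookies_from(response_headers)
```

Here `target` is `"http://example.test/new/page"` and `cookies` is `{"sessionid": "abc123"}`; attributes such as `Path` and `HttpOnly` are cookie metadata rather than separate cookies.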
HTTP Status Codes
Status codes are three‑digit numbers; the first digit defines the class.
1xx – informational
2xx – success
3xx – redirection
4xx – client error
5xx – server error
Common codes: 200 (OK), 301 (Moved Permanently), 404 (Not Found), 500 (Internal Server Error).
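Because the first digit determines the class, a crawler can decide how to react to a response with simple integer arithmetic. The sketch below uses the standard `http.HTTPStatus` enum for the reason phrases; the `describe` function and the `CLASSES` table are illustrative names.

```python
from http import HTTPStatus

CLASSES = {
    1: "informational",
    2: "success",
    3: "redirection",
    4: "client error",
    5: "server error",
}


def describe(code):
    """Return a string such as '404 Not Found (client error)'."""
    category = CLASSES.get(code // 100, "unknown")
    try:
        phrase = HTTPStatus(code).phrase
    except ValueError:
        # Codes that are not registered in HTTPStatus.
        phrase = "Unknown"
    return f"{code} {phrase} ({category})"
```

For example, `describe(301)` returns `"301 Moved Permanently (redirection)"`. A crawler might retry on 5xx, follow the Location header on 3xx, and drop the URL on 404.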
HTTPS
1 - HTTP + SSL (Secure Sockets Layer), i.e., HTTP over SSL (today over TLS, the successor to SSL)
2 - Default port: 443
HTTPS encrypts data during transmission so that intermediate devices cannot read or tamper with it.
HTTP is faster because it does not encrypt data, but it is less secure. HTTPS provides higher security at the cost of additional encryption overhead.
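On the crawler side, the security of HTTPS depends on certificate verification staying enabled. As a sketch, Python's `ssl.create_default_context()` returns a context that verifies certificates and host names; the `fetch_https` helper below is a hypothetical wrapper and is defined without being called, so the example needs no network access.

```python
import ssl
from urllib.request import urlopen

# The default context verifies the server certificate and host name.
context = ssl.create_default_context()


def fetch_https(url, timeout=10):
    """Download url over HTTPS with certificate verification enabled."""
    with urlopen(url, timeout=timeout, context=context) as response:
        return response.status, response.read()
```

Disabling verification makes certificate errors disappear, but it also removes the protection against tampering that HTTPS is meant to provide.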
Note: HTTPS is becoming the mainstream; interfaces serving WeChat Mini Programs, iOS, and Android clients must support HTTPS.
21CTO
21CTO (21CTO.com) offers developers community, training, and services, making it your go‑to learning and service platform.