What Is a Web Crawler? Definitions, Types, and How It Works

This article explains what web crawlers are, how they are classified, what they are typically used for, and how they work step by step. It then covers the robots protocol and the fundamentals of HTTP and HTTPS: request and response structure, common methods, headers, status codes, and the security trade-offs of HTTPS.

21CTO

May 22, 2019

Web Crawler Definition, Types, and Process

Definition

Web crawlers (also called spiders or bots) simulate a browser by sending network requests, receiving responses, and automatically fetching Internet information according to certain rules. The more a crawler mimics a real browser, the harder it is to detect. In principle, any action a browser can perform, a crawler can perform.

Classification

General crawler: crawls the web broadly; typically used by search engines.

Focused crawler: targets a specific website or topic.

Uses

News aggregation (e.g., Toutiao)

Music platforms (e.g., NetEase Cloud Music)

Ticket booking (e.g., 12306)

Website voting

SMS bombing

etc.

Process

Send a request to the start URL and obtain the response.

Extract data from the response.

If a URL is extracted, send another request to retrieve its response.

If data is extracted, save the data.
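The four steps above can be sketched as a small breadth-first loop. This is only an illustration under stated assumptions: the names LinkParser, fetch_page, and crawl are hypothetical, the "data" saved is just each page's title, and the fetch function is passed in as a parameter so the loop can be tried against in-memory pages instead of a live site.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin
from urllib.request import urlopen


class LinkParser(HTMLParser):
    """Collects href values from <a> tags and the text inside <title>."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.title = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def fetch_page(url):
    """Default fetcher (hypothetical helper): download a page as text."""
    with urlopen(url, timeout=10) as resp:
        return resp.read().decode("utf-8", errors="replace")


def crawl(start_url, fetch=fetch_page, max_pages=10):
    """Breadth-first crawl; returns a dict mapping URL to page title."""
    queue = deque([start_url])
    seen = {start_url}
    results = {}
    while queue and len(results) < max_pages:
        url = queue.popleft()
        html = fetch(url)                    # step 1: request, get response
        parser = LinkParser()
        parser.feed(html)                    # step 2: extract from response
        results[url] = parser.title.strip()  # step 4: save extracted data
        for href in parser.links:            # step 3: queue extracted URLs
            link, _fragment = urldefrag(urljoin(url, href))
            if link.startswith(("http://", "https://")) and link not in seen:
                seen.add(link)
                queue.append(link)
    return results


# Offline usage: a dict of made-up pages stands in for the network.
PAGES = {
    "http://example.test/": '<title>Home</title><a href="/a">A</a><a href="/b#x">B</a>',
    "http://example.test/a": '<title>A</title><a href="/">home</a>',
    "http://example.test/b": "<title>B</title>",
}
print(crawl("http://example.test/", fetch=PAGES.__getitem__))
```

The seen set keeps the crawler from requesting the same URL twice, and max_pages bounds the run; a real crawler would also need error handling and rate limiting.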

Robots Protocol

Websites use the robots protocol (a robots.txt file at the site root) to tell crawlers which pages may and may not be crawled. It is a convention that depends on the crawler's cooperation, not a technical barrier (see, e.g., Taobao's robots.txt).
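A cooperative crawler can check these rules before requesting a page. The sketch below uses Python's standard urllib.robotparser; the robots.txt content, the "my-crawler" user agent, and the example.test URLs are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real crawler would download it
# from https://<host>/robots.txt first.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Allow: /
"""


def build_robot_parser(robots_text):
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


robots = build_robot_parser(ROBOTS_TXT)
print(robots.can_fetch("my-crawler", "https://example.test/index.html"))
print(robots.can_fetch("my-crawler", "https://example.test/private/data"))
```

Nothing enforces the answer: the crawler has to choose to skip URLs for which can_fetch returns False.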

HTTP and HTTPS Concepts

HTTP

HTTP (Hypertext Transfer Protocol) is a stateless, application-layer client/server protocol built on request/response exchanges. It defines the message format that both parties must follow.

HTTP Request Flow

Browser resolves the domain name via DNS to get the IP address.

Browser sends a request to the IP and receives a response.

The response HTML references further URLs for CSS, JS, images, and AJAX calls; the browser requests each of these resources, typically several in parallel.

Each received response is rendered, and scripts may trigger further requests.

The whole sequence constitutes the browser’s rendering process.
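The first two steps of this flow can be sketched with plain sockets. This is a simplified illustration, not how a browser is implemented: resolve_ip, build_get_request, and fetch_raw are hypothetical helper names, example.test is a placeholder host, and only the request-building function is exercised below because the others need network access.

```python
import socket


def resolve_ip(host):
    """Step 1: ask the OS resolver (DNS) for an IPv4 address."""
    return socket.gethostbyname(host)


def build_get_request(host, path="/"):
    """Step 2: the bytes a client writes to the TCP connection for a GET."""
    lines = [
        f"GET {path} HTTP/1.1",  # request line: method, path, version
        f"Host: {host}",         # mandatory header in HTTP/1.1
        "Connection: close",     # ask the server to close after replying
        "",                      # the two empty items yield the blank
        "",                      # line that ends the header section
    ]
    return "\r\n".join(lines).encode("ascii")


def fetch_raw(host, path="/", port=80):
    """Send the request over plain TCP and return the raw response bytes."""
    with socket.create_connection((resolve_ip(host), port), timeout=10) as sock:
        sock.sendall(build_get_request(host, path))
        chunks = []
        while True:
            chunk = sock.recv(4096)
            if not chunk:  # server closed the connection
                break
            chunks.append(chunk)
    return b"".join(chunks)


print(build_get_request("example.test", "/index.html").decode("ascii"))
```

Calling fetch_raw against a real host would return the status line, headers, and body as one byte string, which is what the browser parses before deciding which further resources to request.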

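Five-Layer Network Model

From bottom to top, the layers are physical, data link, network (IP), transport (TCP/UDP), and application. HTTP is an application-layer protocol and normally runs on top of TCP.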

URL Format

Format: scheme://host[:port]/path/…/[?query‑string][#anchor]

scheme – protocol (e.g., http, https, ftp)

host – server IP or domain name

port – server port (default 80 for HTTP, 443 for HTTPS)

path – resource path

query‑string – parameters sent to the server

anchor – fragment identifier (used by the browser; not sent to the server)
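The components listed above map directly onto what Python's standard urllib.parse returns. In this sketch the URL and the describe_url helper are made up for illustration, and the default-port table covers only http and https.

```python
from urllib.parse import parse_qs, urlsplit

DEFAULT_PORTS = {"http": 80, "https": 443}


def describe_url(url):
    """Split a URL into the components named in the format above."""
    parts = urlsplit(url)
    return {
        "scheme": parts.scheme,
        "host": parts.hostname,
        # Fall back to the scheme's default port when none is given.
        "port": parts.port or DEFAULT_PORTS.get(parts.scheme),
        "path": parts.path,
        "query": parse_qs(parts.query),
        "anchor": parts.fragment,
    }


info = describe_url("https://example.test:8443/docs/page.html?lang=en&page=2#intro")
for name, value in info.items():
    print(f"{name:7} {value}")
```

parse_qs returns each parameter as a list because a query string may repeat a key.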

HTTP Request Methods

The HTTP specifications define the following request methods.

HTTP/1.0 defines GET, POST, HEAD.

HTTP/1.1 adds OPTIONS, PUT, DELETE, TRACE, CONNECT.

Common methods:

GET : retrieve the specified resource; the response includes a body.

HEAD : like GET, but the response contains only headers and no body.

POST : submit data to be processed (e.g., form submission).

PUT : replace the target document with the supplied data.

DELETE : request deletion of the specified resource.

CONNECT : used by proxies to establish a tunnel.

OPTIONS : query server capabilities.

TRACE : echo the received request for diagnostics.
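In code, the method is simply a field of the request. A minimal sketch with the standard urllib.request.Request follows; make_request is a hypothetical helper, the example.test URLs are placeholders, and nothing is sent over the network.

```python
from urllib.request import Request


def make_request(url, method=None, data=None):
    """Build (but do not send) a request with an explicit HTTP method."""
    return Request(url, data=data, method=method)


head = make_request("https://example.test/", method="HEAD")
put = make_request("https://example.test/doc", method="PUT", data=b"new content")

# Without an explicit method, urllib picks GET, or POST when a body is given.
default_get = make_request("https://example.test/")
default_post = make_request("https://example.test/form", data=b"name=value")

for req in (head, put, default_get, default_post):
    print(req.get_method(), req.full_url)
```

Passing one of these objects to urllib.request.urlopen would actually send it.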

Common Request Headers

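User-Agent : identifies the client software (e.g., browser name and version)

Referer : page that linked to the request

Host : domain name (and port) of the target server

Connection : connection type (e.g., keep-alive or close)

Upgrade-Insecure-Requests : signals that the client prefers an HTTPS version of the page

Accept : accepted media types

Accept-Encoding : supported content encodings (e.g., gzip)

X-Requested-With : the value XMLHttpRequest conventionally marks an AJAX request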
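A crawler that wants to resemble a browser usually sends a similar set of headers. The sketch below is an assumption-laden example: the header values, the ExampleCrawler name, and the build_request helper are invented, and Accept-Encoding is set to identity because urllib does not decompress gzip responses on its own.

```python
from urllib.request import Request

# Illustrative values only; real browsers send longer, version-specific strings.
BROWSER_LIKE_HEADERS = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) ExampleCrawler/0.1",
    "Accept": "text/html,application/xhtml+xml;q=0.9,*/*;q=0.8",
    "Accept-Encoding": "identity",
    "Referer": "https://example.test/",
    "Connection": "close",
}


def build_request(url, extra_headers=None):
    """Build (but do not send) a request carrying browser-like headers."""
    headers = dict(BROWSER_LIKE_HEADERS)
    headers.update(extra_headers or {})
    return Request(url, headers=headers)


ajax = build_request(
    "https://example.test/api/items",
    {"X-Requested-With": "XMLHttpRequest"},
)
# urllib stores header names capitalized ("User-agent"); HTTP header
# names are case-insensitive, so this does not change their meaning.
for name, value in ajax.header_items():
    print(f"{name}: {value}")
```

The Host header is not set here because urllib adds it from the URL when the request is sent.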

HTTP Response

An HTTP response consists of a status line, response headers, a blank line, and the response body.
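To make that structure concrete, here is a small parser for a raw response. It is a sketch for illustration only: parse_response is a hypothetical helper, the sample bytes are made up, and repeated headers (such as several Set-Cookie lines) are not handled because the last value overwrites earlier ones.

```python
def parse_response(raw):
    """Split raw HTTP/1.x response bytes into status line, headers, body."""
    # The first blank line (CRLF CRLF) separates the head from the body.
    head, _sep, body = raw.partition(b"\r\n\r\n")
    status_line, *header_lines = head.decode("iso-8859-1").split("\r\n")

    # Status line: version, three-digit code, optional reason phrase.
    pieces = status_line.split(" ", 2)
    version, code = pieces[0], int(pieces[1])
    reason = pieces[2] if len(pieces) > 2 else ""

    headers = {}
    for line in header_lines:
        name, _colon, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()

    return {
        "version": version,
        "status": code,
        "reason": reason,
        "headers": headers,
        "body": body,
    }


SAMPLE = (
    b"HTTP/1.1 200 OK\r\n"
    b"Content-Type: text/html\r\n"
    b"Content-Length: 5\r\n"
    b"\r\n"
    b"hello"
)
parsed = parse_response(SAMPLE)
print(parsed["status"], parsed["reason"], parsed["headers"], parsed["body"])
```

In practice a library such as http.client does this parsing, including chunked bodies and repeated headers.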

Response Headers

Location : redirect target, used with 3xx responses such as 301 and 302.

Set-Cookie : asks the client to store a cookie.

Content-Type : MIME type of the returned data.

Server : server software information.

Content-Length : length of the response body in bytes.

Connection : whether to keep the connection alive.
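Two of these headers matter especially to a crawler: Location (where to go next) and Set-Cookie (state to send back later). A minimal sketch using only the standard library follows; the helper names, URLs, and cookie string are made up for illustration.

```python
from http.cookies import SimpleCookie
from urllib.parse import urljoin


def redirect_target(request_url, location):
    """Resolve a Location header, which may be relative, to an absolute URL."""
    return urljoin(request_url, location)


def cookie_values(set_cookie_header):
    """Extract name/value pairs from a single Set-Cookie header value."""
    jar = SimpleCookie()
    jar.load(set_cookie_header)
    # Attributes such as Path or HttpOnly are kept on each morsel and
    # are not returned as separate cookies.
    return {name: morsel.value for name, morsel in jar.items()}


print(redirect_target("https://example.test/old/page", "/new"))
print(redirect_target("https://example.test/old/page", "next"))
print(cookie_values("sessionid=abc123; Path=/; HttpOnly"))
```

A crawler that keeps a login session would send the extracted pair back in a Cookie request header on later requests.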

HTTP Status Codes

Status codes are three‑digit numbers; the first digit defines the class.

1xx – informational

2xx – success

3xx – redirection

4xx – client error

5xx – server error

Common codes: 200 (OK), 301 (Moved Permanently), 404 (Not Found), 500 (Internal Server Error).
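Because the first digit defines the class, a crawler can decide how to react with simple integer division. The sketch below pairs that with the reason phrases from Python's http.HTTPStatus; describe_status is a hypothetical helper.

```python
from http import HTTPStatus

STATUS_CLASSES = {
    1: "informational",
    2: "success",
    3: "redirection",
    4: "client error",
    5: "server error",
}


def describe_status(code):
    """Return 'code phrase (class)' for a three-digit HTTP status code."""
    category = STATUS_CLASSES.get(code // 100, "unknown")
    try:
        phrase = HTTPStatus(code).phrase
    except ValueError:  # code is not registered in http.HTTPStatus
        phrase = "Unknown"
    return f"{code} {phrase} ({category})"


for code in (200, 301, 404, 500):
    print(describe_status(code))
# 200 OK (success)
# 301 Moved Permanently (redirection)
# 404 Not Found (client error)
# 500 Internal Server Error (server error)
```

Classifying by the first digit also gives a sensible fallback for codes the client does not recognize.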

HTTPS

HTTP + SSL/TLS (Secure Sockets Layer, since superseded by Transport Layer Security), i.e., HTTP over an encrypted channel

Default port: 443

HTTPS encrypts data in transit and verifies the server's identity, which prevents intermediate devices from reading or tampering with the traffic.

HTTP has slightly less overhead because it skips the TLS handshake and encryption, but it sends data in plaintext and is therefore less secure. HTTPS provides higher security at the cost of that additional overhead.
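On the client side, the extra work of HTTPS is a TLS handshake before the first HTTP byte is sent. The sketch below uses Python's standard ssl module; make_tls_context and open_tls_connection are hypothetical helpers, and only the context is created here because opening a connection needs network access.

```python
import socket
import ssl


def make_tls_context():
    """Client-side TLS settings with certificate and hostname checks on."""
    return ssl.create_default_context()


def open_tls_connection(host, port=443):
    """Open a TCP connection and run the TLS handshake on top of it."""
    ctx = make_tls_context()
    sock = socket.create_connection((host, port), timeout=10)
    try:
        # server_hostname is used for SNI and for hostname verification.
        return ctx.wrap_socket(sock, server_hostname=host)
    except Exception:
        sock.close()
        raise


ctx = make_tls_context()
print(ctx.verify_mode == ssl.CERT_REQUIRED, ctx.check_hostname)
```

After the handshake, the same plain-text HTTP request is written to the wrapped socket; the encryption is transparent to the HTTP layer.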

Note: HTTPS is becoming the mainstream; backend interfaces for WeChat Mini Programs, iOS, and Android clients are required to support HTTPS.


