Tagged articles

robots.txt

13 articles

Mar 30, 2026 · Industry Insights

How to Optimize Your Content for GEO and Get Cited by DeepSeek, Doubao, and ChatGPT

This guide explains what Generative Engine Optimization (GEO) is and why AI‑driven search traffic converts far better than traditional SEO traffic, then provides concrete writing, platform‑specific, and technical steps (including robots.txt, llms.txt, and Schema markup) to help your content get reliably cited by Chinese AI search engines and global models.

AI SEO · Chinese AI · Content Optimization

0 likes · 22 min read

Java Tech Enthusiast

Feb 26, 2026 · Fundamentals

Why the 30‑Year‑Old robots.txt Is Crumbling in the AI Era

From a 1993 accidental DoS attack that sparked the creation of robots.txt to modern AI crawlers ignoring the protocol, this article traces the history, purpose, and challenges of the robots exclusion standard and explores new proposals to adapt it for AI-driven web scraping.

AI ethics · Search Engine · protocol

0 likes · 9 min read

IT Services Circle

Jan 31, 2026 · Information Security

Why the Humble robots.txt Is Facing an Existential Crisis in the AI Era

The article recounts a personal experiment that unintentionally launched a DoS attack, explains how that incident spurred the creation of the robots.txt protocol, and examines how AI‑driven data scraping, legal battles, and new licensing proposals are challenging its relevance today.

AI data scraping · Search Engine · internet standards

0 likes · 10 min read

Architecture and Beyond

Jul 1, 2023 · Industry Insights

Web Crawlers Unveiled: History, Value, and How to Tackle Their Challenges

This article traces the development of web crawlers from their 1990s origins to modern implementations, examines their multifaceted value in search, data analysis, and archiving, outlines technical, ethical, and legal challenges for both crawler creators and target sites, and presents practical strategies to mitigate malicious crawling.

anti-scraping · data extraction · robots.txt

0 likes · 24 min read

Sohu Tech Products

Oct 6, 2021 · Frontend Development

Front‑End SEO Technical Optimization Guide

This article presents a comprehensive front‑end SEO checklist, covering passive and active optimization techniques such as site structure, meta tags, semantic links, speed improvements, external traffic acquisition, sitemaps, robots.txt, and search‑engine‑specific configurations to help developers enhance website visibility and ranking.

Meta Tags · SEO · Web Optimization

0 likes · 13 min read

Python Crawling & Data Mining

Sep 7, 2021 · Information Security

Why Web Scraping Isn’t Illegal—Legal Risks, Ethics, and Best Practices

This article explains the legal and ethical pitfalls of Python web scraping, clarifies what truly counts as a crawler, discusses robots.txt and service agreements, warns against profiting from scraped data, and offers practical advice for responsible and low‑risk data collection.

Python · data privacy · legal risk

0 likes · 8 min read

Python Crawling & Data Mining

Dec 12, 2020 · Fundamentals

Master Python’s urllib: From Basics to Advanced Web Scraping

Learn how to use Python’s built-in urllib library for web requests, handling GET/POST, adding headers, managing proxies, processing cookies, handling errors, parsing URLs, and respecting robots.txt, with clear code examples and a practical case of scraping a novel site.

HTTP requests · Python · cookies

0 likes · 12 min read

Python Programming Learning Circle

Jan 2, 2020 · Backend Development

How to Crawl Responsibly: Avoid Legal Risks and Server Overload

This guide outlines responsible web‑crawling practices, covering robots.txt compliance, legal pitfalls such as unauthorized personal data and copyrighted content, recommended request intervals, and relevant Chinese data‑security regulations, helping developers avoid server overloads and potential lawsuits.

Backend Development · Data Ethics · Scrapy

0 likes · 4 min read

MaGe Linux Operations

Dec 25, 2019 · Backend Development

Master Web Crawling in Python: From urllib to requests and Robots.txt

This guide explains the fundamentals of web crawling, covering crawler types, the Robots.txt protocol, Python's urllib and urllib3 modules, the requests library, handling HTTP methods, user‑agents, HTTPS certificates, and practical code examples for extracting data from websites.

Python · requests · robots.txt

0 likes · 18 min read

21CTO

May 22, 2019 · Fundamentals

What Is a Web Crawler? Definitions, Types, and How It Works

This article explains web crawlers—what they are, their classifications, typical use cases, and step‑by‑step workflow—covers the robots protocol, then delves into HTTP and HTTPS fundamentals, request/response structures, common methods, headers, status codes, and the security trade‑offs of HTTPS.

HTTP · Status Codes · Web Crawler

0 likes · 10 min read

Python Crawling & Data Mining

Apr 29, 2019 · Backend Development

Boost Your Scrapy Debugging: Master robots.txt Settings and Shell Tricks

Learn how to disable robots.txt compliance in Scrapy, use the Scrapy shell for rapid URL debugging, and apply XPath selectors directly in the shell to efficiently extract data, dramatically speeding up development and avoiding repeated full-crawl executions.

Python · Scrapy · XPath

0 likes · 4 min read

MaGe Linux Operations

Dec 5, 2017 · Information Security

How to Defend Your Website Against Web Crawlers: Techniques & Tools

This article explores why web content needs protection, explains common server‑side and client‑side anti‑crawling methods—including User‑Agent checks, token cookies, headless‑browser detection, fingerprinting, captchas, and robots.txt—and offers practical guidance for raising the cost of unauthorized scraping.

Browser Fingerprinting · Headless Browser · anti‑crawling

0 likes · 12 min read

21CTO

Nov 13, 2016 · Backend Development

How to Build a Simple PHP Web Crawler: From Robots.txt to cURL

This guide explains the fundamentals of creating a PHP web crawler, covering server communication basics, interpreting robots.txt and sitemap files, and providing practical code examples using file_get_contents and cURL for efficient content retrieval.

Backend Development · PHP · Web Crawler

0 likes · 6 min read
