Tagged articles
104 articles
IT Services Circle
Feb 28, 2026 · Artificial Intelligence

Unlock Adaptive Crawling, AI Agent Memory, and Remote Claude Code with Open‑Source Tools

This article introduces four open‑source projects—Scrapling for self‑adjusting web crawling, Agent‑Skills‑for‑Context‑Engineering for AI agent memory management, claude‑code‑telegram for remote Claude Code access via Telegram, and Hugging Face Skills for versatile AI task automation—detailing their core features, popularity, and installation steps.

AI agents · Hugging Face · Python
7 min read
Java Tech Enthusiast
Feb 26, 2026 · Fundamentals

Why the 30‑Year‑Old robots.txt Is Crumbling in the AI Era

From a 1993 accidental DoS attack that sparked the creation of robots.txt to modern AI crawlers ignoring the protocol, this article traces the history, purpose, and challenges of the robots exclusion standard and explores new proposals to adapt it for AI-driven web scraping.

AI ethics · Web Crawling · protocol
9 min read
IT Services Circle
Jan 31, 2026 · Information Security

Why the Humble robots.txt Is Facing an Existential Crisis in the AI Era

The article recounts a personal experiment that unintentionally launched a DoS attack, explains how that incident spurred the creation of the robots.txt protocol, and examines how AI‑driven data scraping, legal battles, and new licensing proposals are challenging its relevance today.

AI data scraping · Web Crawling · internet standards
10 min read
php Courses
Nov 7, 2025 · Backend Development

Build a Simple PHP Web Crawler on Linux: Step-by-Step Guide

This article explains how to create a basic PHP web crawler on a Linux system, covering prerequisite installations, script development with cURL and DOMDocument, execution commands, and sample output, while emphasizing legal and ethical considerations for web scraping.

DOMDocument · Linux · PHP
4 min read
Nightwalker Tech
Mar 14, 2025 · Backend Development

Overview and Installation Guide for Various MCP Services and Their Use with Sequential Thinking for Manus‑like Effects

This article introduces several Model Context Protocol (MCP) services—including Sequential Thinking, Firecrawl, Fetch, Hot News, Playwright, Magic, and Brave Search—provides their GitHub links, detailed Mac and Windows installation commands, and explains how to combine them with a Sequential Thinking prompt to achieve a Manus‑style AI agent workflow.

AI · Automation · Installation
9 min read
Python Programming Learning Circle
Dec 21, 2024 · Backend Development

Comprehensive List of Python Libraries for Web Crawling, Data Processing, and Web Development

This article provides an extensive overview of Python libraries and frameworks for web crawling, data extraction, parsing, storage, browser automation, asynchronous programming, and popular web development frameworks, helping readers choose appropriate tools for their projects.

Web Crawling · Web Development · data-processing
9 min read
Rare Earth Juejin Tech Community
Nov 7, 2024 · Backend Development

Integrating XXL‑Job for Scheduled Hot‑Search Crawlers in a Java Backend

This tutorial explains how to replace the basic @Scheduled annotation with the flexible XXL‑Job distributed scheduler, covering repository download, admin deployment, database initialization, Spring‑Boot executor configuration, job registration for Douyin and Bilibili hot‑search crawling, and a Vue front‑end component for displaying ranked results with real‑time update timestamps.

Backend Development · Java · Scheduled Tasks
14 min read
Rare Earth Juejin Tech Community
Oct 29, 2024 · Backend Development

Storing Douyin and Baidu Hot Search Data with MySQL, MyBatis Generator, and Java Crawlers

This tutorial explains how to design a MySQL table for hot‑search records, generate Java entity and mapper classes using MyBatis Generator, create unique IDs for each entry, and implement scheduled Java crawlers for Douyin and Baidu hot‑search data that persist the results via Spring‑Boot services.

Backend · Database design · Java
19 min read
Open Source Tech Hub
Oct 5, 2024 · Backend Development

Boost Your PHP Crawling with PHPCreeper: A Complete Step‑by‑Step Guide

PHPCreeper is a high‑performance PHP crawler built on Workerman that supports asynchronous I/O, multi‑process execution, distributed deployment, and headless browsers. This guide covers installation via Composer, the core architecture, producer/downloader/parser implementation, Redis configuration, and how to start the service to fetch dynamic pages such as weather forecasts.

Composer · PHP · PHPCreeper
13 min read
Python Programming Learning Circle
Jan 24, 2024 · Backend Development

Running Scrapy Crawlers: Command‑Line, CrawlerProcess, and CrawlerRunner Approaches

This tutorial demonstrates how to execute Scrapy spiders from the command line, run them within Python files using cmdline, and manage single or multiple spiders with CrawlerProcess and CrawlerRunner, highlighting configuration steps, limitations, and best‑practice recommendations.

Backend Development · CrawlerProcess · CrawlerRunner
3 min read
Python Programming Learning Circle
Jan 23, 2024 · Backend Development

Comprehensive Guide to Python Libraries for Web Crawling, Web Development, and Asynchronous Programming

This article provides an extensive overview of Python libraries and frameworks for web crawling, data extraction, asynchronous networking, browser automation, and popular web development frameworks, helping developers choose the right tools for backend projects and avoid common misconceptions when selecting a framework.

Python · Web Crawling · Web Frameworks
9 min read
Architecture and Beyond
Jul 1, 2023 · Industry Insights

Web Crawlers Unveiled: History, Value, and How to Tackle Their Challenges

This article traces the development of web crawlers from their 1990s origins to modern implementations, examines their multifaceted value in search, data analysis, and archiving, outlines technical, ethical, and legal challenges for both crawler creators and target sites, and presents practical strategies to mitigate malicious crawling.

Data Extraction · Security · Web Crawling
24 min read
Big Data Technology Architecture
Feb 11, 2023 · Backend Development

Understanding Scrapy and Twisted: Architecture, Components, and Debugging Techniques

This article explains Scrapy's comprehensive crawling framework and Twisted's event‑driven networking engine, detailing their core concepts, workflow, code execution process, and how to debug Scrapy spiders using breakpoint tracing, providing a deep technical overview for backend developers.

Backend Development · Event-driven · Python
15 min read
FunTester
Nov 18, 2022 · Backend Development

Master Java Web Crawling: From Data Scraping to Image Storage

This guide walks beginners through building a Java web crawler that fetches bestseller book cover images, covering data scraping, HTML parsing with jsoup or regex, and saving images locally, illustrated step‑by‑step with code examples and a tiered learning roadmap.

Backend Development · Image Download · Java
5 min read
Architecture Digest
Sep 24, 2022 · Information Security

Web Crawling and Anti‑Crawling Techniques: Principles, Implementation, and Countermeasures

This article explains the technical principles and implementation steps of web crawlers, introduces common crawling frameworks, provides a Python example for extracting app store rankings, and then details various anti‑crawling methods such as CSS offset, image camouflage, custom fonts, dynamic rendering, captchas, request signing, and honeypots, followed by counter‑strategies for each.

Python · Scrapy · Web Crawling
24 min read
vivo Internet Technology
Sep 14, 2022 · Information Security

Web Crawling, Anti‑Crawling, and Anti‑Anti‑Crawling Techniques: Principles, Frameworks, and Code Examples

The article explains web‑crawling basics with Python and Scrapy examples, surveys common anti‑crawling defenses such as CSS offsets, image camouflage, custom fonts, dynamic rendering, captchas, request signatures and honeypots, and presents anti‑anti‑crawling countermeasures (CSS‑offset reversal, font decoding, headless‑browser rendering, and YOLOv5‑based captcha cracking), while stressing legal compliance.

Captcha · Python · Scrapy
25 min read
Python Programming Learning Circle
Jul 13, 2022 · Backend Development

Comprehensive Scrapy Tutorial: Architecture, XPath Basics, Installation, Project Setup, and Advanced Features

This article provides a detailed walkthrough of Scrapy, covering its event‑driven architecture, component interactions, XPath parsing fundamentals, installation steps, project creation, sample spider code, item pipelines, middleware customization, and essential configuration settings for effective web crawling in Python.

Pipeline · Scrapy · Spider
12 min read
IT Services Circle
Jul 5, 2022 · Backend Development

Optimizing feapder Spider with Gevent: Reducing CPU Usage and Thread Count

This article demonstrates how adding two gevent monkey‑patch lines to a feapder spider reduces CPU usage from 121% to 99% while changing the effective thread count from 36 to 12, and discusses the underlying principle, performance trade‑offs, and future directions for coroutine support.

CPU optimization · Python · Web Crawling
6 min read
IT Services Circle
Feb 25, 2022 · Backend Development

Detecting and Handling Gzip Bombs in Web Crawling with Python Requests

This article explains how to identify gzip‑compressed responses that may be gzip bombs, how to inspect HTTP headers and raw response data using Python's requests library, and provides command‑line and code examples for measuring compressed and uncompressed sizes without triggering decompression.

Gzip · Web Crawling · compression
5 min read
Architecture Digest
Feb 19, 2022 · Information Security

Case Study: Illegal Web Crawling and Criminal Conviction in China

This article recounts how a corporate web‑crawling tool designed to automate housing‑loan data collection overloaded a municipal residence‑permit system, triggered a large‑scale denial‑of‑service attack, and led to the CTO and programmer being prosecuted for damaging a computer information system.

Web Crawling · computer crime · cyberlaw
8 min read
Java High-Performance Architecture
Feb 18, 2022 · Information Security

When Web Crawlers Cross the Line: A Legal Case Study on Unauthorized Data Scraping

This article recounts how a Chinese fintech company's automated web‑crawler, built to query a municipal residence‑permit system, overloaded the server, triggered police action, led to criminal charges for the CTO and programmer, and offers lessons on the legal risks of large‑scale data scraping.

Web Crawling · cloud computing · computer crime
9 min read
21CTO
Feb 5, 2022 · Information Security

When Web Crawlers Turn Criminal: A Real‑World Data Scraping Case Study

This article recounts how a fintech company's automated web‑scraping tool overloaded a municipal residence‑permit system, leading to massive data leakage, legal prosecution of its CTO and programmer, and highlights the severe legal risks of unchecked crawling practices.

Web Crawling · computer crime · data-scraping
9 min read
Python Programming Learning Circle
Aug 20, 2021 · Backend Development

Python Crawler for Scraping Baidu Baike Articles

This article presents a complete Python web crawler example that extracts Baidu Baike entries, detailing the implementation of URL management, page downloading, HTML parsing with BeautifulSoup, data collection, and output generation, along with sample code and usage instructions.

Baike · Scraping · Web Crawling
9 min read
Programmer DD
Jul 31, 2021 · Backend Development

Build a Spring Boot Web Crawler with WebMagic, MyBatis, and MySQL

This tutorial demonstrates how to combine Spring Boot, WebMagic, and MyBatis to crawl Zhihu pages, configure Maven dependencies, set up MySQL data sources, define entity and mapper classes, and schedule the crawler to run periodically, providing a complete Java web‑crawling scaffold.

Java · MyBatis · Scheduler
14 min read
Tencent Cloud Developer
Jan 21, 2021 · Big Data

A Beginner's Guide to Using Scrapy for Web Crawling

This beginner‑friendly guide walks readers through installing Scrapy, creating a project and spider, running and debugging crawlers, implementing parsing with CSS/XPath, and overcoming common hurdles such as JavaScript rendering, user‑agent spoofing, and proxy rotation via configurable middlewares, enabling quick start of web‑crawling projects.

Data Extraction · Proxy · Python
13 min read
Huawei Cloud Developer Alliance
Jan 7, 2021 · Big Data

When Web Crawlers Cross the Legal Line: Data‑Driven Case Analysis

This article examines the rise of web crawler technology in big‑data contexts, clarifies the distinction between legitimate data collection and illegal intrusion, presents statistical analysis of recent court cases involving crawlers, and offers practical legal guidelines for developers and data professionals to avoid criminal liability.

Legal Analysis · Software Compliance · Web Crawling
8 min read
FunTester
Dec 15, 2020 · Backend Development

Run All Scrapy Spiders Together and Fix Video Download Errors

This guide shows how to create a custom Scrapy command to launch every spider at once, separate each spider's settings for better modularity, and resolve video download problems by adjusting request headers and handling file saving correctly.

Backend · Custom Command · Python
5 min read
Top Architect
Oct 31, 2020 · Big Data

Building a Zhihu User Data Crawler and Large‑Scale Analysis with SpringBoot, SeimiCrawler, RabbitMQ, ElasticSearch, and Kibana

This article describes how to build a Java‑based crawler to collect millions of Zhihu user profiles, handle anti‑crawling measures with rotating user‑agents and a proxy pool, deduplicate data using a Bloom filter, import the results into ElasticSearch, and analyze the dataset with Kibana and ECharts visualizations.

Big Data · Elasticsearch · Java
15 min read
ITPUB
Oct 23, 2020 · Fundamentals

How General Search Engines Work: From Crawlers to Ranking

This article provides a comprehensive overview of general search engines, covering their classification, core workflow, key modules such as web crawlers, content processing, storage, user query handling, ranking strategies like TF‑IDF and PageRank, as well as anti‑cheat measures and user intent understanding.

PageRank · TF-IDF · Web Crawling
16 min read
FunTester
Jul 1, 2020 · Operations

Curated List of Performance Testing, Bug Cases, and Web Crawling Articles

This collection provides a curated set of links to articles covering performance testing strategies, notable bug analyses, and practical web crawling implementations, offering valuable insights for software testers and engineers seeking to improve testing practices and data extraction techniques.

Backend · Bug Analysis · Software Testing
6 min read
Python Programming Learning Circle
Jun 3, 2020 · Information Security

Anti‑Crawling Techniques: Server‑Side and Client‑Side Detection Strategies

The article examines why web content needs protection, explains common server‑side header checks, describes client‑side JavaScript fingerprinting and headless‑browser detection methods, and outlines practical anti‑crawling measures such as CAPTCHAs and robots.txt, highlighting the ongoing cat‑and‑mouse game between crawlers and defenders.

Captcha · HTTP header inspection · Web Crawling
12 min read
Java Architecture Diary
Mar 25, 2020 · Backend Development

Mastering mica-http v1.1.7: Advanced Crawling Techniques and Proxy Management

This article continues the mica-http guide, showcasing how version v1.1.7 introduces enhanced proxy handling, retry mechanisms, page crawling, model integration, and result processing, while providing documentation links, example projects, and open‑source tool recommendations for building lightweight Java crawlers.

Backend Development · Java · Proxy
3 min read
Python Programming Learning Circle
Jan 2, 2020 · Backend Development

How to Crawl Responsibly: Avoid Legal Risks and Server Overload

This guide outlines responsible web‑crawling practices, covering robots.txt compliance, legal pitfalls such as unauthorized personal data and copyrighted content, recommended request intervals, and relevant Chinese data‑security regulations, helping developers avoid server overloads and potential lawsuits.

Backend Development · Data Ethics · Scrapy
4 min read
MaGe Linux Operations
Dec 27, 2019 · Backend Development

Master Scrapy: Build Powerful Python Web Crawlers Step‑by‑Step

This guide introduces the Scrapy framework, explains its architecture—including engine, scheduler, downloader, spiders, pipelines, and middlewares—covers installation, project setup, item definition, spider coding, pipeline handling, pagination, and provides practical code examples for extracting data from Douban books.

Data Extraction · Item Pipeline · Python
18 min read
21CTO
Dec 9, 2019 · Big Data

China’s Big Data Crackdown: Legal Risks Every Developer Should Know

The article examines the sweeping regulatory crackdown on China’s big‑data and financial‑risk companies, detailing the dissolution of major crawler firms, new legal restrictions on data collection, and practical guidance on what data‑scraping activities are illegal and how to protect personal information.

Big Data · Legal Compliance · Web Crawling
11 min read
Programmer DD
Dec 7, 2019 · Backend Development

Why Choose Java Over Python for Web Crawling? A Practical Guide

The article shares the author's journey from manual data collection to mastering Java web crawlers, explains why Java is preferred over Python, outlines the five-step crawling workflow, covers essential Java basics, HTTP fundamentals, and provides code examples for URL queuing, time parsing, and timestamp conversion.

Backend Development · Data Extraction · HTTP
12 min read
21CTO
Nov 16, 2019 · Fundamentals

From Early Crawlers to ByteDance: A History of Web Scraping

This article traces the evolution of web crawlers—from early Perl scripts to modern ByteDance agents—explaining their role in search engines, business models, anti‑crawling measures, and the impact on content creation and competition.

Web Crawling · content aggregation · data-scraping
6 min read
Java Backend Technology
Oct 31, 2019 · Operations

When Does Programming Cross the Legal Line? A Developer's Risk Guide

This article explains how common programming activities such as web crawling, developing gambling or adult sites, P2P platforms, and game cheats can violate Chinese laws, outlines the legal criteria for each case, and offers practical advice for developers to protect themselves from criminal liability.

Legal Compliance · Web Crawling · data privacy
18 min read
FunTester
Oct 19, 2019 · Backend Development

Building a Fast Historical‑Today Crawler with Java and MySQL

This article presents an open‑source Java crawler that fetches "today in history" events from a public API. It details three practical challenges (GET request length limits, ambiguous JSON value types, and month string construction) and includes a full code example and a link to the GitHub repository.

Data Extraction · GitHub · HTTP
5 min read
Python Crawling & Data Mining
Oct 11, 2019 · Backend Development

How to Build a Python Web Crawler to Map 2019 Chinese National Day Travel Hotspots

This article walks through the complete process of designing, implementing, and visualizing a Python web crawler that extracts tourism hotspot data from ticketing sites for China's 2019 National Day holiday, covering requirement analysis, URL and element inspection, data collection, cleaning, and geographic heat‑map presentation using Pyecharts.

Pyecharts · Python · Tourism
11 min read
FunTester
Sep 15, 2019 · Backend Development

How to Build a Java HttpClient Spider for Scraping Movie Details and Download Links

This article explains how to update and use a Java HttpClient‑based spider that removes duplicate links, handles legacy page formats, extracts movie metadata and download URLs (magnet, ed2k, Baidu Pan), and stores the results in a MySQL database, with complete source code examples.

Data Extraction · HttpClient · Java
12 min read
21CTO
Aug 16, 2019 · Backend Development

Master Scrapy: Build, Deploy, and Scale a Python Web Crawler Platform

This guide walks through designing a full‑featured web‑crawler platform, covering rule maintenance, job scheduling, async and real‑time crawling with Scrapy, project setup, item pipelines, settings, local execution, custom parameters, server deployment via Scrapyd, API usage, and fast real‑time crawling with Requests, BeautifulSoup, Flask, and multithreading.

Async · Flask · Python
16 min read
Architecture Digest
Aug 15, 2019 · Backend Development

Design and Implementation of a Scrapy‑Based Web Crawling Platform

This article explains how to design a flexible web‑crawling platform using Scrapy, covering rule maintenance, job scheduling, asynchronous and real‑time crawlers, project setup, code structure, settings, local execution, deployment with scrapyd, API usage, and examples of Flask‑based real‑time services.

Async · Deployment · Flask
16 min read
360 Tech Engineering
May 31, 2019 · Information Security

Dynamic Web Crawling Techniques for Vulnerability Scanning with Pyppeteer

This article details the practical implementation of a dynamic web crawler for vulnerability scanning, covering Chrome headless setup, browser initialization, JavaScript hook injection for DOM events, navigation locking, form handling, link collection, deduplication, and task scheduling using pyppeteer.

Browser Automation · Dynamic analysis · Web Crawling
30 min read
58 Tech
May 8, 2019 · Information Security

Overview of Web Crawling, Anti‑Crawling Techniques, and 58 Anti‑Crawling System

This article introduces the fundamentals of web crawlers, typical crawling methods, and a comprehensive set of anti‑crawling strategies—including IP control, browser and device simulation, CAPTCHA cracking, and traffic analysis—while detailing the architecture and capabilities of the 58 anti‑crawling platform.

Traffic analysis · Web Crawling · anti‑crawling
17 min read
JavaEdge
Mar 21, 2019 · Backend Development

Master Web Crawling with Scrapy: From Tech Choices to Powerful Regex Extraction

This guide walks through selecting Scrapy over Requests + BeautifulSoup, explains web page types, outlines crawler use‑cases, details regular‑expression syntax and non‑greedy matching, demonstrates practical regex patterns with images, compares depth‑first and breadth‑first crawling, and covers URL deduplication and string‑encoding pitfalls in Python.

Python · Scrapy · Web Crawling
11 min read
Python Crawling & Data Mining
Feb 6, 2019 · Backend Development

Master Scrapy: Build Powerful Python Web Crawlers in Minutes

This article introduces the Scrapy framework, explains its architecture and five core components, guides you through creating a Scrapy project, configuring spiders, pipelines, and middlewares, and demonstrates how to run the crawler to efficiently collect and process web data using Python.

Backend Development · Python · Scrapy
7 min read
MaGe Linux Operations
Jan 13, 2019 · Backend Development

How to Build a High‑Performance Novel Site Crawler with MongoDB

This article walks through building a novel‑site crawler using MongoDB, detailing how to extract category links, manage URL states across multiple processes, and handle deduplication, while sharing screenshots of the framework, database logic, and final results.

MongoDB · Web Crawling · multithreading
3 min read
Python Crawling & Data Mining
Jan 13, 2019 · Backend Development

How to Fix Common Scrapy Installation Errors on Windows

This guide walks you through step‑by‑step solutions for typical Scrapy installation problems on Windows, covering missing libxml2/lxml wheels, Visual C++ requirements, and Twisted wheel compatibility, so you can get the framework up and running smoothly.

Python · Scrapy · Web Crawling
7 min read
UC Tech Team
Nov 5, 2018 · Artificial Intelligence

News Page Identification Using Machine Learning: Feature Engineering, Model Selection, and Evaluation

To accurately distinguish news pages from other web page types, this study formulates the task as a binary classification problem, extracts 19 engineered features from HTML, evaluates logistic regression and SVM models with cross‑validation, and achieves over 90% precision, recall, and F1‑score using logistic regression trained with Newton's method.

Web Crawling · binary classification · feature engineering
13 min read
21CTO
Sep 7, 2018 · Backend Development

Why Scaling Web Crawlers Is Harder Than You Think: Lessons from 100 Billion Pages

This article outlines the major challenges of large‑scale e‑commerce product data extraction (ever‑changing site formats, scalable architecture, performance throughput, anti‑bot defenses, and data quality) and shares the hard‑won lessons Scrapinghub gained after crawling over 100 billion product pages.

Data Extraction · Scale · Scrapy
15 min read
Qunar Tech Salon
Jul 25, 2018 · Information Security

Understanding Web Crawlers: Definitions, Types, Traffic, and Harm

This article introduces web crawlers, classifies them by technology and intent, presents statistics on crawler traffic across industries and regions, and analyzes the various harms they cause, laying the groundwork for future discussions on anti‑crawling strategies.

Traffic analysis · Web Crawling · anti‑crawling
10 min read
MaGe Linux Operations
May 16, 2018 · Backend Development

Essential Python Libraries for Web Crawling and Web Development

This guide outlines the core steps of a web request, then presents a comprehensive catalog of Python libraries for crawling, parsing, text processing, automation, concurrency, cloud execution, and popular web frameworks, helping developers choose the right tools for backend projects.

Web Crawling · frameworks · libraries
10 min read
MaGe Linux Operations
Dec 5, 2017 · Information Security

How to Defend Your Website Against Web Crawlers: Techniques & Tools

This article explores why web content needs protection, explains common server‑side and client‑side anti‑crawling methods—including User‑Agent checks, token cookies, headless‑browser detection, fingerprinting, captchas, and robots.txt—and offers practical guidance for raising the cost of unauthorized scraping.

Browser Fingerprinting · Captcha · Headless Browser
12 min read
MaGe Linux Operations
Nov 20, 2017 · Backend Development

Mastering Web Crawlers: Core Principles, Architecture, and Modern Challenges

This article explains how web crawlers work—from initial URL seeding and request handling to flow control, content extraction, and handling dynamic pages—while covering essential modules, HTTP details, common obstacles like JavaScript rendering, anti‑scraping measures, and strategies for large‑scale, distributed crawling.

Data Extraction · Distributed Systems · HTTP
0 likes · 14 min read
MaGe Linux Operations
Aug 10, 2017 · Backend Development

Explore the Ultimate Python Library Collection for Web Crawling and Data Processing

This comprehensive guide lists essential Python libraries for network operations, asynchronous programming, web crawling frameworks, HTML/XML parsing, text handling, data conversion, slug creation, office document manipulation, PDF processing, markdown rendering, YAML handling, CSS utilities, feed parsing, SQL tools, HTTP clients, microformats, executable analysis, PSD handling, natural language processing, browser automation, headless tools, multiprocessing, queues, cloud execution, email handling, URL manipulation, web content extraction, video downloading, wiki archiving, WebSocket communication, DNS queries, computer vision, proxy services, and miscellaneous utilities.

Python · Web Crawling · data-processing
0 likes · 17 min read
21CTO
Jun 24, 2017 · Information Security

Why 95% of Web Traffic Is Bots: Inside the Crawling Arms Race

The article explores the hidden, high‑traffic world of web crawlers and anti‑crawling measures, revealing why most online requests are bots, how companies decide to crawl or block, the technical and organizational challenges involved, and what the future may hold for this perpetual cat‑and‑mouse game.

Backend · Web Crawling · anti‑crawling
0 likes · 22 min read
Baidu Intelligent Testing
Jun 20, 2017 · Big Data

Design and Challenges of Web Crawlers and Link Scheduling for Knowledge Graph Construction

The article explains how web crawlers (spiders) collect data for knowledge graphs, covering core tasks, major challenges, crawler features, new‑link expansion, storage design, link‑selection scheduling strategies, and the role of large‑scale data mining and machine learning in optimizing crawl efficiency.

Big Data · Knowledge Graph · Spider
0 likes · 17 min read
MaGe Linux Operations
Jun 3, 2017 · Information Security

The Dark Side of Web Crawling: Industry Secrets, Technical Battles, and Future Trends

This article explores the hidden, often unglamorous world of web crawling and anti‑crawling, detailing why companies need these technologies, the massive traffic they generate, the technical arms race between crawlers and defenders, and the evolving strategies and challenges that shape the industry today.

Web Crawling · anti‑crawling · e‑commerce
0 likes · 21 min read
Ctrip Technology
May 22, 2017 · Information Security

The Dark Side of Web Crawling and Anti‑Crawling: Industry Realities and Technical Strategies

This article examines the hidden, often unglamorous world of web crawling and anti‑crawling, revealing why companies deploy aggressive scraping and defensive measures, the technical arms race between crawlers and defenders, the impact on engineers' careers, and future trends in this contested space.

Web Crawling · anti‑crawling · data-scraping
0 likes · 21 min read
MaGe Linux Operations
Mar 28, 2017 · Backend Development

Master Scrapy: Step-by-Step Guide to Install the Powerful Python Web Crawler

This article walks you through the complete installation process for Scrapy, the Python web crawling framework: setting up Python, installing required dependencies such as lxml, setuptools, zope.interface, Twisted, pyOpenSSL, and pywin32, and verifying the installation so that you are ready for large‑scale data extraction tasks.

Data Extraction · Installation · Python
0 likes · 4 min read
21CTO
Nov 20, 2016 · Backend Development

Mastering Web Crawlers: Strategies, Tools, and Practical Code Samples

This article explores the fundamentals and advanced techniques of building web crawlers, covering crawler types, essential features, RSS/ATOM harvesting, custom scraping methods, PHP header manipulation, regex extraction, and concurrency, providing actionable code examples for backend developers.

Backend Development · Data Extraction · RSS
0 likes · 9 min read
21CTO
Nov 9, 2016 · Backend Development

Unlocking the Power of Web Crawlers: How to Harvest Data Efficiently

This article explains what web crawlers are, why they’re essential for content recommendation systems, the technical approaches across languages, practical use‑cases like price monitoring and news aggregation, and best practices for building efficient, ethical crawlers.

Backend Development · Data Extraction · Web Crawling
0 likes · 5 min read
21CTO
Jun 9, 2016 · Backend Development

Mastering Web Crawlers: From a 3‑Line Script to Scalable Distributed Scrapers

This article explains what a web crawler is, shows a minimal three‑line Python example, expands it into a functional crawler, identifies common shortcomings, and presents practical solutions such as parallelism, priority queues, DNS caching, Bloom‑filter deduplication, storage choices, and inter‑process communication for building robust distributed scrapers.

Parallelism · Web Crawling · deduplication
0 likes · 9 min read