Tagged articles

web crawling

108 articles · Page 1 of 2

Jun 19, 2026 · Artificial Intelligence

Mastering Data Acquisition for AI Agents: From Crawler Pitfalls to MCP Browser Control

The article distills three Bright Data webinars, detailing how to overcome traditional web‑crawling challenges with an adaptive Crawler API, integrate the Model Context Protocol (MCP) for human‑like browser control, and build a LangGraph‑powered AI search engine while addressing compliance, billing, and scaling considerations.

AI Agents · API billing · Bright Data

0 likes · 15 min read

DataFunTalk

Jun 17, 2026 · Artificial Intelligence

Mastering Data Harvesting in the Agent Era: From Crawler Pitfalls to MCP Browser Control

The article walks through the challenges of large‑scale web crawling, explains Bright Data’s adaptive Crawler API and the Model Context Protocol (MCP), discusses compliance and proxy strategies, and shows how to build a next‑generation AI search engine with LangGraph and Python tool integration.

AI Agents · Bright Data · LangGraph

0 likes · 15 min read

IT Services Circle

Feb 28, 2026 · Artificial Intelligence

Unlock Adaptive Crawling, AI Agent Memory, and Remote Claude Code with Open‑Source Tools

This article introduces four open‑source projects—Scrapling for self‑adjusting web crawling, Agent‑Skills‑for‑Context‑Engineering for AI agent memory management, claude‑code‑telegram for remote Claude Code access via Telegram, and Hugging Face Skills for versatile AI task automation—detailing their core features, popularity, and installation steps.

AI Agents · Hugging Face · Python

0 likes · 7 min read

Java Tech Enthusiast

Feb 26, 2026 · Fundamentals

Why the 30‑Year‑Old robots.txt Is Crumbling in the AI Era

From a 1993 accidental DoS attack that sparked the creation of robots.txt to modern AI crawlers ignoring the protocol, this article traces the history, purpose, and challenges of the robots exclusion standard and explores new proposals to adapt it for AI-driven web scraping.

AI ethics · Search Engine · protocol

0 likes · 9 min read

Golang Shines

Feb 13, 2026 · Backend Development

Using Go’s Standard Library to Crawl with an HTTP Proxy

This guide demonstrates building a simple Go crawler that fetches a webpage using only the standard library, then extends it to route requests through an HTTP proxy, covering proxy parsing, custom client configuration, error handling, and essential Go best practices such as deferring response closure.

Go · HTTP proxy · Network Programming

0 likes · 5 min read

Golang Shines

Feb 10, 2026 · Backend Development

Python vs Go: A Complete Guide to Choosing the Right Language for Web Crawling

The article compares Python and Go across syntax, libraries, concurrency, memory usage, readability, data processing, and deployment, concluding that Go suits large‑scale, high‑concurrency crawlers while Python excels when rich data‑analysis tools and rapid development are needed.

Python · concurrency · data processing

0 likes · 6 min read

IT Services Circle

Jan 31, 2026 · Information Security

Why the Humble robots.txt Is Facing an Existential Crisis in the AI Era

The article recounts a personal experiment that unintentionally launched a DoS attack, explains how that incident spurred the creation of the robots.txt protocol, and examines how AI‑driven data scraping, legal battles, and new licensing proposals are challenging its relevance today.

AI data scraping · Search Engine · internet standards

0 likes · 10 min read

php Courses

Nov 7, 2025 · Backend Development

Build a Simple PHP Web Crawler on Linux: Step-by-Step Guide

This article explains how to create a basic PHP web crawler on a Linux system, covering prerequisite installations, script development with cURL and DOMDocument, execution commands, and sample output, while emphasizing legal and ethical considerations for web scraping.

DOMDocument · Linux · PHP

0 likes · 4 min read

Nightwalker Tech

Mar 14, 2025 · Backend Development

Overview and Installation Guide for Various MCP Services and Their Use with Sequential Thinking for Manus‑like Effects

This article introduces several Model Context Protocol (MCP) services—including Sequential Thinking, Firecrawl, Fetch, Hot News, Playwright, Magic, and Brave Search—provides their GitHub links, detailed Mac and Windows installation commands, and explains how to combine them with a Sequential Thinking prompt to achieve a Manus‑style AI agent workflow.

AI · Automation · Installation

0 likes · 9 min read

Python Programming Learning Circle

Dec 21, 2024 · Backend Development

Comprehensive List of Python Libraries for Web Crawling, Data Processing, and Web Development

This article provides an extensive overview of Python libraries and frameworks for web crawling, data extraction, parsing, storage, browser automation, asynchronous programming, and popular web development frameworks, helping readers choose appropriate tools for their projects.

Web Development · data processing · libraries

0 likes · 9 min read

Python Crawling & Data Mining

Nov 19, 2024 · Backend Development

Mastering Python’s requests stream=True: Fast, Efficient Web Crawling

This article walks through using Python’s requests library with the stream=True parameter to efficiently filter valid URLs during web crawling, presenting two methods, code examples, execution time comparisons, and a clear explanation of the stream option’s role.

Network Programming · Python · Stream

0 likes · 5 min read

Rare Earth Juejin Tech Community

Nov 7, 2024 · Backend Development

Integrating XXL‑Job for Scheduled Hot‑Search Crawlers in a Java Backend

This tutorial explains how to replace the basic @Scheduled annotation with the flexible XXL‑Job distributed scheduler, covering repository download, admin deployment, database initialization, Spring‑Boot executor configuration, job registration for Douyin and Bilibili hot‑search crawling, and a Vue front‑end component for displaying ranked results with real‑time update timestamps.

Backend Development · Java · Spring Boot

0 likes · 14 min read

Rare Earth Juejin Tech Community

Oct 29, 2024 · Backend Development

Storing Douyin and Baidu Hot Search Data with MySQL, MyBatis Generator, and Java Crawlers

This tutorial explains how to design a MySQL table for hot‑search records, generate Java entity and mapper classes using MyBatis Generator, create unique IDs for each entry, and implement scheduled Java crawlers for Douyin and Baidu hot‑search data that persist the results via Spring‑Boot services.

Database Design · Java · MyBatis

0 likes · 19 min read

Open Source Tech Hub

Oct 5, 2024 · Backend Development

Boost Your PHP Crawling with PHPCreeper: A Complete Step‑by‑Step Guide

PHPCreeper is a high‑performance PHP crawler built on Workerman that leverages asynchronous I/O, multi‑process, distributed deployment and headless‑browser support; this guide covers installation via Composer, core architecture, producer/downloader/parser implementation, Redis configuration and how to start the service to fetch dynamic pages such as weather forecasts.

Composer · PHP · PHPCreeper

0 likes · 13 min read

FunTester

Jan 29, 2024 · Operations

Curated Index of Technical Articles on Testing, Bugs, Web Crawling, UI Automation, and Selenium

This collection provides a curated list of technical articles covering performance testing strategies, bug case studies, web crawling implementations, UI automation techniques, UiAutomator usage, Selenium best practices, and mobile app performance monitoring, each with its original title and publication date.

Automation · Bug Analysis · performance

0 likes · 8 min read

Python Programming Learning Circle

Jan 24, 2024 · Backend Development

Running Scrapy Crawlers: Command‑Line, CrawlerProcess, and CrawlerRunner Approaches

This tutorial demonstrates how to execute Scrapy spiders from the command line, run them within Python files using cmdline, and manage single or multiple spiders with CrawlerProcess and CrawlerRunner, highlighting configuration steps, limitations, and best‑practice recommendations.

Backend Development · CrawlerProcess · CrawlerRunner

0 likes · 3 min read

Python Programming Learning Circle

Jan 23, 2024 · Backend Development

Comprehensive Guide to Python Libraries for Web Crawling, Web Development, and Asynchronous Programming

This article provides an extensive overview of Python libraries and frameworks for web crawling, data extraction, asynchronous networking, browser automation, and popular web development frameworks, helping developers choose the right tools for backend projects and avoid common misconceptions when selecting a framework.

Python · Web Frameworks · async programming

0 likes · 9 min read

Architecture and Beyond

Jul 1, 2023 · Industry Insights

Web Crawlers Unveiled: History, Value, and How to Tackle Their Challenges

This article traces the development of web crawlers from their 1990s origins to modern implementations, examines their multifaceted value in search, data analysis, and archiving, outlines technical, ethical, and legal challenges for both crawler creators and target sites, and presents practical strategies to mitigate malicious crawling.

anti-scraping · data extraction · robots.txt

0 likes · 24 min read

Big Data Technology Architecture

Feb 11, 2023 · Backend Development

Understanding Scrapy and Twisted: Architecture, Components, and Debugging Techniques

This article explains Scrapy's comprehensive crawling framework and Twisted's event‑driven networking engine, detailing their core concepts, workflow, code execution process, and how to debug Scrapy spiders using breakpoint tracing, providing a deep technical overview for backend developers.

Backend Development · Python · Scrapy

0 likes · 15 min read

FunTester

Nov 18, 2022 · Backend Development

Master Java Web Crawling: From Data Scraping to Image Storage

This guide walks beginners through building a Java web crawler that fetches bestseller book cover images, covering data scraping, HTML parsing with jsoup or regex, and saving images locally, illustrated step‑by‑step with code examples and a tiered learning roadmap.

Backend Development · Image Download · Java

0 likes · 5 min read

Architecture Digest

Sep 24, 2022 · Information Security

Web Crawling and Anti‑Crawling Techniques: Principles, Implementation, and Countermeasures

This article explains the technical principles and implementation steps of web crawlers, introduces common crawling frameworks, provides a Python example for extracting app store rankings, and then details various anti‑crawling methods such as CSS offset, image camouflage, custom fonts, dynamic rendering, captchas, request signing, and honeypots, followed by counter‑strategies for each.

Python · Scrapy · anti‑crawling

0 likes · 24 min read

Python Programming Learning Circle

Sep 23, 2022 · Backend Development

Understanding Static and Dynamic Web Pages for Effective Web Crawling

This article explains what web crawlers are, compares static and dynamic web pages, outlines their characteristics, advantages, and challenges, and provides practical tips for extracting data from both types of pages using tools like browser developer consoles and packet‑capture utilities.

AJAX · Dynamic Pages · HTTP

0 likes · 5 min read

Python Crawling & Data Mining

Sep 15, 2022 · Fundamentals

How to Decode URL Parameters in Python Web Crawlers: A Step‑by‑Step Guide

This article explains how to use Python's urllib library to decode URL‑encoded strings encountered in web crawling, walks through a real example with code, and shows the resulting decoded URL, helping developers troubleshoot common encoding issues.

URL encoding · string decoding · urllib

0 likes · 3 min read
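
For the decoding step this card summarizes, a minimal sketch using the standard library's `urllib.parse` might look like the following. The sample URL is invented for illustration and is not taken from the article.

```python
from urllib.parse import quote, unquote

# Hypothetical percent-encoded URL, as it might appear in a crawled link.
encoded = "https://example.com/search?q=web%20crawling&tag=%E7%88%AC%E8%99%AB"

# unquote() turns %XX escapes back into characters (UTF-8 by default).
decoded = unquote(encoded)
print(decoded)

# quote() goes the other way, which is handy when building request URLs.
re_encoded_tag = quote("爬虫")
print(re_encoded_tag)
```

If a site encodes its parameters with a different charset, `unquote` accepts an `encoding` argument; which charset applies depends on the target site.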

vivo Internet Technology

Sep 14, 2022 · Information Security

Web Crawling, Anti‑Crawling, and Anti‑Anti‑Crawling Techniques: Principles, Frameworks, and Code Examples

The article explains web‑crawling basics with Python and Scrapy examples, then surveys common anti‑crawling defenses such as CSS offsets, image camouflage, custom fonts, dynamic rendering, captchas, request signatures and honeypots, and finally presents anti‑anti‑crawling countermeasures (CSS‑offset reversal, font decoding, headless‑browser rendering and YOLOv5‑based captcha cracking) while stressing legal compliance.

Python · Scrapy · anti‑crawling

0 likes · 25 min read

Java Architect Essentials

Aug 12, 2022 · Information Security

Case Study: Illegal Web Crawling Causing System Outage and Criminal Conviction

This article recounts the 2018 legal case in which a company's automated web crawler overloaded a municipal residence‑permit system, causing service disruption and data leakage, leading to the CTO and programmer’s conviction for damaging computer information systems.

computer crime · information security · legal case

0 likes · 8 min read

Python Programming Learning Circle

Jul 13, 2022 · Backend Development

Comprehensive Scrapy Tutorial: Architecture, XPath Basics, Installation, Project Setup, and Advanced Features

This article provides a detailed walkthrough of Scrapy, covering its event‑driven architecture, component interactions, XPath parsing fundamentals, installation steps, project creation, sample spider code, item pipelines, middleware customization, and essential configuration settings for effective web crawling in Python.

Middleware · Scrapy · Spider

0 likes · 12 min read

Python Crawling & Data Mining

Jul 9, 2022 · Big Data

Is Web Crawling Legal? Key Risks and Compliance Tips for Data Collectors

This article examines the legal risks of using web crawlers in China, covering anti‑unfair competition law, copyright, criminal and cybersecurity regulations, and offers practical compliance recommendations to avoid lawsuits and regulatory penalties.

anti-unfair competition · copyright · cybersecurity

0 likes · 9 min read

IT Services Circle

Jul 5, 2022 · Backend Development

Optimizing feapder Spider with Gevent: Reducing CPU Usage and Thread Count

This article demonstrates how adding two gevent monkey‑patch lines to a feapder spider reduces CPU usage from 121% to 99% while changing the effective thread count from 36 to 12, and discusses the underlying principle, performance trade‑offs, and future directions for coroutine support.

CPU optimization · Python · feapder

0 likes · 6 min read

IT Services Circle

Feb 25, 2022 · Backend Development

Detecting and Handling Gzip Bombs in Web Crawling with Python Requests

This article explains how to identify gzip‑compressed responses that may be gzip bombs, how to inspect HTTP headers and raw response data using Python's requests library, and provides command‑line and code examples for measuring compressed and uncompressed sizes without triggering decompression.

compression · gzip · requests

0 likes · 5 min read
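
One way to compare compressed and declared uncompressed sizes without inflating the data is to read the gzip trailer. This is a sketch under stated assumptions, not necessarily the approach the article takes. The gzip format (RFC 1952) stores the uncompressed length modulo 2^32 in the last four bytes of a single-member stream. The helper name `gzip_declared_sizes` is hypothetical, and the payload is generated locally rather than fetched.

```python
import gzip
import struct

def gzip_declared_sizes(raw: bytes) -> tuple[int, int]:
    """Return (compressed_size, declared_uncompressed_size) without inflating."""
    # A gzip member has a 10-byte header and an 8-byte trailer at minimum.
    if len(raw) < 18 or raw[:2] != b"\x1f\x8b":
        raise ValueError("not a gzip stream")
    # ISIZE: little-endian uncompressed length modulo 2**32.
    (isize,) = struct.unpack("<I", raw[-4:])
    return len(raw), isize

# Local stand-in for a suspicious response body: 1 MB of zero bytes.
payload = gzip.compress(b"\0" * 1_000_000)
compressed, declared = gzip_declared_sizes(payload)
print(compressed, declared, round(declared / compressed))
```

The trailer is supplied by the server and can be forged, so a large ratio is a warning sign rather than proof. With `requests`, the undecoded bytes would typically come from a `stream=True` response via `response.raw`, which is left out here to keep the sketch offline.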

Architecture Digest

Feb 19, 2022 · Information Security

Case Study: Illegal Web Crawling and Criminal Conviction in China

This article recounts how a corporate web‑crawling tool designed to automate housing‑loan data collection overloaded a municipal residence‑permit system, triggered a large‑scale denial‑of‑service attack, and led to the CTO and programmer being prosecuted for damaging a computer information system.

Data Scraping · computer crime · cyberlaw

0 likes · 8 min read

Java High-Performance Architecture

Feb 18, 2022 · Information Security

When Web Crawlers Cross the Line: A Legal Case Study on Unauthorized Data Scraping

This article recounts how a Chinese fintech company's automated web‑crawler, built to query a municipal residence‑permit system, overloaded the server, triggered police action, led to criminal charges for the CTO and programmer, and offers lessons on the legal risks of large‑scale data scraping.

Cloud Computing · Data Scraping · computer crime

0 likes · 9 min read

21CTO

Feb 5, 2022 · Information Security

When Web Crawlers Turn Criminal: A Real‑World Data Scraping Case Study

This article recounts how a fintech company's automated web‑scraping tool overloaded a municipal residence‑permit system, leading to massive data leakage, legal prosecution of its CTO and programmer, and highlights the severe legal risks of unchecked crawling practices.

Data Scraping · computer crime · legal case

0 likes · 9 min read

Python Crawling & Data Mining

Oct 27, 2021 · Fundamentals

Decode the Mystery: How Python’s encode() Differs from the encoding Parameter

This article answers a fan’s question by clarifying the distinction between Python’s encode() function and the encoding parameter, explaining how encode() defaults to UTF‑8, how explicit encoding strings work, and when each is used in string handling and web crawling.

String Encoding · Unicode · encode

0 likes · 4 min read
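
The distinction this card describes can be illustrated with a short sketch. The sample string is made up, and GBK is chosen only as an example of a non-default codec.

```python
text = "爬虫 crawler"

# str.encode() with no argument uses UTF-8.
default_bytes = text.encode()

# Passing the codec name explicitly gives the same bytes.
explicit_bytes = text.encode("utf-8")

# A different codec produces a different byte sequence for the same text.
gbk_bytes = text.encode("gbk")

# Decoding must use the codec that produced the bytes.
round_trip = gbk_bytes.decode("gbk")

print(len(default_bytes), len(gbk_bytes), round_trip == text)
```

The `encoding` parameter on APIs such as `open()` plays the same role: it names the codec used when text is converted to or from bytes on the caller's behalf.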

Java Backend Technology

Sep 21, 2021 · Backend Development

How to Crawl and Download Thousands of Sogou Images Using Java

This guide explains how to scrape thousands of images from Sogou by analyzing the request URL, extracting image URLs from JSON responses, and implementing a multithreaded Java downloader with custom HTTP utilities and pipelines to store the pictures locally.

Image Download · Java · Sogou

0 likes · 18 min read

Selected Java Interview Questions

Sep 5, 2021 · Backend Development

Crawling and Downloading Thousands of Images from Sogou Using Java

This tutorial explains how to crawl thousands of images from Sogou using Java, detailing the request URL analysis, parameter extraction, multithreaded downloading logic, and providing complete source code for the image processor, pipeline, and HTTP utility classes.

Backend Development · HTTP · Image Download

0 likes · 17 min read

Python Programming Learning Circle

Aug 20, 2021 · Backend Development

Python Crawler for Scraping Baidu Baike Articles

This article presents a complete Python web crawler example that extracts Baidu Baike entries, detailing the implementation of URL management, page downloading, HTML parsing with BeautifulSoup, data collection, and output generation, along with sample code and usage instructions.

Baike · Scraping · beautifulsoup

0 likes · 9 min read

Programmer DD

Jul 31, 2021 · Backend Development

Build a Spring Boot Web Crawler with WebMagic, MyBatis, and MySQL

This tutorial demonstrates how to combine Spring Boot, WebMagic, and MyBatis to crawl Zhihu pages, configure Maven dependencies, set up MySQL data sources, define entity and mapper classes, and schedule the crawler to run periodically, providing a complete Java web‑crawling scaffold.

Java · MyBatis · Scheduler

0 likes · 14 min read

Python Crawling & Data Mining

Jun 18, 2021 · Backend Development

How to Connect Python to Elasticsearch for Powerful Search and Data Ingestion

This guide walks through installing the Python Elasticsearch client, building a reusable class with CRUD methods, importing data from MongoDB, writing a simple Baidu Baike crawler, and scaling the workflow with Celery and Flask for a complete search‑engine pipeline.

Elasticsearch · Flask · Python

0 likes · 9 min read

MaGe Linux Operations

Jun 1, 2021 · Backend Development

How to Run Multiple Scrapy Spiders Efficiently: Cmdline, CrawlerProcess, and CrawlerRunner

This guide demonstrates how to write a Scrapy spider, run it via the command line, use CrawlerProcess and CrawlerRunner for single and multiple spider execution, and explains the observed middleware behavior to help you choose the most reliable method.

CrawlerProcess · CrawlerRunner · Multiple Spiders

0 likes · 3 min read

Python Programming Learning Circle

May 27, 2021 · Backend Development

Running Scrapy Spiders via Command Line, CrawlerProcess, and CrawlerRunner

This guide explains how to execute Scrapy spiders from the command line, within Python scripts using CrawlerProcess or CrawlerRunner, and how to manage multiple spiders efficiently, highlighting configuration steps, execution methods, and practical observations about middleware behavior.

Backend Development · CrawlerProcess · CrawlerRunner

0 likes · 3 min read

Huawei Cloud Developer Alliance

Mar 16, 2021 · Backend Development

Why Python Dominates Web Crawling: A Beginner’s Guide on Huawei Cloud

This article explains why Python has become a favorite language for developers, introduces the fundamentals of web crawlers, details how they work using Python libraries, and highlights practical uses and advantages, especially when running on Huawei Cloud services.

beginner tutorial · data extraction · web crawling

0 likes · 8 min read

Python Crawling & Data Mining

Mar 10, 2021 · Fundamentals

How to Schedule Python Web Crawlers: 3 Simple Methods Explained

This article demonstrates three practical ways to schedule Python web‑crawling tasks—using an infinite while loop, the Timer module, and the sched module—providing code snippets, usage tips, and considerations for handling multiple runs and resource constraints.

Python · Scheduling · Timer

0 likes · 6 min read
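
Of the three approaches this card lists, the `sched` one can be sketched as follows. The `crawl` function is a hypothetical placeholder for the real job, and the interval and run count are arbitrary values chosen so the example finishes quickly.

```python
import sched
import time

def crawl(run_log):
    # Placeholder for the real crawling job (hypothetical).
    run_log.append(time.monotonic())

def run_periodically(interval, runs):
    scheduler = sched.scheduler(time.monotonic, time.sleep)
    run_log = []
    # Queue every run up front, spaced `interval` seconds apart.
    for i in range(runs):
        scheduler.enter(interval * i, 1, crawl, argument=(run_log,))
    scheduler.run()  # blocks until all queued events have fired
    return run_log

log = run_periodically(interval=0.05, runs=3)
print(len(log))
```

Queuing a fixed number of runs keeps the sketch finite. A long-running crawler would more likely re-schedule itself from inside the job or hand the work to a dedicated scheduler.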

Python Crawling & Data Mining

Jan 30, 2021 · Backend Development

Deploying Python Scrapy Crawlers with Scrapyd and Gerapy: A Step‑by‑Step Guide

This tutorial walks you through installing dependencies, running a Scrapy spider, configuring Scrapyd, packaging the project, and using Gerapy’s visual interface to manage and deploy a Python web crawler for Qiushibaike jokes.

Gerapy · Python · Scrapyd

0 likes · 10 min read

Tencent Cloud Developer

Jan 21, 2021 · Big Data

A Beginner's Guide to Using Scrapy for Web Crawling

This beginner‑friendly guide walks readers through installing Scrapy, creating a project and spider, running and debugging crawlers, implementing parsing with CSS/XPath, and overcoming common hurdles such as JavaScript rendering, user‑agent spoofing, and proxy rotation via configurable middlewares, enabling quick start of web‑crawling projects.

Middleware · Python · Scrapy

0 likes · 13 min read

Huawei Cloud Developer Alliance

Jan 12, 2021 · Big Data

When Web Crawlers Cross the Legal Line: Big Data Insights & Risk Guidance

This article explains how web crawling technology works, distinguishes it from search‑engine bots, analyzes recent criminal cases involving crawlers with big‑data visualizations, and offers practical legal advice for developers and data professionals to avoid liability.

Legal Analysis · Software Compliance · data privacy

0 likes · 8 min read

Huawei Cloud Developer Alliance

Jan 7, 2021 · Big Data

When Web Crawlers Cross the Legal Line: Data‑Driven Case Analysis

This article examines the rise of web crawler technology in big‑data contexts, clarifies the distinction between legitimate data collection and illegal intrusion, presents statistical analysis of recent court cases involving crawlers, and offers practical legal guidelines for developers and data professionals to avoid criminal liability.

Legal Analysis · Software Compliance · data privacy

0 likes · 8 min read

FunTester

Dec 15, 2020 · Backend Development

Run All Scrapy Spiders Together and Fix Video Download Errors

This guide shows how to create a custom Scrapy command to launch every spider at once, separate each spider's settings for better modularity, and resolve video download problems by adjusting request headers and handling file saving correctly.

Custom Command · Python · Redis

0 likes · 5 min read

Python Crawling & Data Mining

Nov 13, 2020 · Backend Development

Master Scrapy Requests: Download Pages and Trigger Callbacks Efficiently

This tutorial explains how to use Scrapy's Request objects to feed article detail URLs into the crawler, configure callbacks for parsing, handle relative URLs with urljoin, and yield requests so Scrapy can download pages, completing the core data extraction workflow.

Python · Scrapy · Web Scraping

0 likes · 5 min read
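
The relative-URL handling this card mentions can be tried out with the standard library alone. The sketch below uses `urllib.parse.urljoin`; the page URL and hrefs are invented for illustration.

```python
from urllib.parse import urljoin

# Hypothetical listing page and the hrefs extracted from it.
page_url = "https://example.com/articles/page/2/"
hrefs = ["/post/42/", "post/43/", "https://example.com/post/44/"]

# Root-relative, path-relative, and absolute links all resolve against the page URL.
detail_urls = [urljoin(page_url, href) for href in hrefs]

for url in detail_urls:
    print(url)
```

Inside a Scrapy spider, the resolved URL would then be handed to the framework with something like `yield scrapy.Request(response.urljoin(href), callback=self.parse_detail)`, where `parse_detail` is a hypothetical callback name.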

Top Architect

Oct 31, 2020 · Big Data

Building a Zhihu User Data Crawler and Large‑Scale Analysis with SpringBoot, SeimiCrawler, RabbitMQ, ElasticSearch, and Kibana

This article describes how to build a Java‑based crawler to collect millions of Zhihu user profiles, handle anti‑crawling measures with rotating user‑agents and a proxy pool, deduplicate data using a Bloom filter, import the results into ElasticSearch, and analyze the dataset with Kibana and ECharts visualizations.

Big Data · Elasticsearch · Java

0 likes · 15 min read
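
To make the Bloom-filter deduplication step concrete, here is a small Python sketch. The article itself is Java-based, so this is an illustration of the idea rather than its code. The class, its parameters, and the sample user IDs are all hypothetical.

```python
import hashlib

class BloomFilter:
    """Probabilistic set: no false negatives, a small chance of false positives."""

    def __init__(self, size_bits=1 << 20, num_hashes=5):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8 + 1)

    def _positions(self, item):
        # Derive several bit positions by salting one hash function with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

seen = BloomFilter()
new_ids = []
for user_id in ["alice", "bob", "alice", "carol", "bob"]:
    if user_id not in seen:  # skip profiles that were already queued
        seen.add(user_id)
        new_ids.append(user_id)

print(new_ids)
```

The trade-off is memory for certainty: the filter never misses an item it has seen, but it may occasionally report an unseen item as seen, so a crawler using it can skip a small fraction of new profiles.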

ITPUB

Oct 23, 2020 · Fundamentals

How General Search Engines Work: From Crawlers to Ranking

This article provides a comprehensive overview of general search engines, covering their classification, core workflow, key modules such as web crawlers, content processing, storage, user query handling, ranking strategies like TF‑IDF and PageRank, as well as anti‑cheat measures and user intent understanding.

Information Retrieval · PageRank · Search Engine

0 likes · 16 min read
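
As an illustration of the PageRank strategy named in this card, the following is a minimal power-iteration sketch. The four-page link graph is invented, the damping factor 0.85 is the conventional choice rather than something the card states, and the function assumes every page has at least one outgoing link.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Iteratively distribute each page's rank across its outgoing links."""
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page starts each round with the "random jump" share.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, outlinks in links.items():
            share = damping * rank[page] / len(outlinks)
            for target in outlinks:
                new_rank[target] += share
        rank = new_rank
    return rank

# Hypothetical graph: page -> pages it links to.
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(links)

for page, score in sorted(ranks.items(), key=lambda kv: -kv[1]):
    print(page, round(score, 3))
```

Page `c` collects links from three of the four pages and ends up ranked highest, while `d`, which nothing links to, keeps only the baseline share.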

Python Crawling & Data Mining

Oct 20, 2020 · Backend Development

Master Web Scraping with XPath: A Step‑by‑Step Scrapy Tutorial

This tutorial shows how to apply XPath expressions within the Scrapy framework to extract titles, publication dates, tags, content, likes, favorites, and comments from a sample website, providing practical code snippets and tips for reliable web data collection.

Python · Scrapy · Web Scraping

0 likes · 5 min read

Python Crawling & Data Mining

Jul 22, 2020 · Backend Development

Build a Simple Python Web Crawler to Search and Download Files from Pansou

This tutorial walks through building a simple Python web crawler that searches the Pansou site, handles AJAX‑loaded JSON data, extracts file links, and enables interactive downloading, illustrating core backend techniques such as HTTP GET requests, JSON parsing, and user‑driven pagination.

Automation · File Download · Python

0 likes · 4 min read

FunTester

Jul 1, 2020 · Operations

Curated List of Performance Testing, Bug Cases, and Web Crawling Articles

This collection provides a curated set of links to articles covering performance testing strategies, notable bug analyses, and practical web crawling implementations, offering valuable insights for software testers and engineers seeking to improve testing practices and data extraction techniques.

Bug Analysis · backend · software testing

0 likes · 6 min read

Python Programming Learning Circle

Jun 20, 2020 · Information Security

Bypassing Implicit Style‑CSS Anti‑Scraping: Analysis and Restoration of Obfuscated Content

This article explains how many Chinese websites use hidden CSS ::before content to hide characters, shows how to locate the relevant network request, decode the span class mappings from obfuscated JavaScript, and restore the original text for successful web scraping.

JavaScript · Obfuscation · anti-scraping

0 likes · 10 min read

Python Programming Learning Circle

Jun 3, 2020 · Information Security

Anti‑Crawling Techniques: Server‑Side and Client‑Side Detection Strategies

The article examines why web content needs protection, explains common server‑side header checks, describes client‑side JavaScript fingerprinting and headless‑browser detection methods, and outlines practical anti‑crawling measures such as CAPTCHAs and robots.txt, highlighting the ongoing cat‑and‑mouse game between crawlers and defenders.

HTTP header inspection · anti‑crawling · captcha

0 likes · 12 min read

Python Crawling & Data Mining

May 29, 2020 · Big Data

How to Connect Python to Elasticsearch for Efficient Data Crawling and Search

This guide explains how to install the Elasticsearch Python client, build a wrapper class for index management and CRUD operations, import data from MongoDB, use a Celery‑based crawler to harvest Baidu Baike content, and expose search functionality through Flask or other Python web frameworks.

Elasticsearch · Flask · MongoDB

0 likes · 7 min read

Python Programming Learning Circle

Apr 17, 2020 · Artificial Intelligence

Building a Celebrity Face Recognition System with Baidu API and Python

This article details a step‑by‑step tutorial for creating a celebrity face‑matching application that crawls star information, stores images and metadata in a MySQL database, and uses Baidu's facial recognition API to compare uploaded photos, outputting similarity scores and matched celebrity details.

AI · Baidu API · face recognition

0 likes · 11 min read

Java Architecture Diary

Mar 25, 2020 · Backend Development

Mastering mica-http v1.1.7: Advanced Crawling Techniques and Proxy Management

This article continues the mica-http guide, showcasing how version v1.1.7 introduces enhanced proxy handling, retry mechanisms, page crawling, model integration, and result processing, while providing documentation links, example projects, and open‑source tool recommendations for building lightweight Java crawlers.

Backend Development · Java · mica-http

0 likes · 3 min read

Java Architecture Diary

Jan 22, 2020 · Backend Development

Unlock Advanced Crawling with mica-http v1.1.7: Proxies, Retries, and Models

This guide continues the mica‑http tutorial, detailing the new v1.1.7 release, proxy and retry mechanisms, page crawling steps, model usage, result handling, and provides documentation links and open‑source tool recommendations for building lightweight backend crawlers.

Backend Development · HTTP Client · mica-http

0 likes · 3 min read

Python Programming Learning Circle

Jan 2, 2020 · Backend Development

How to Crawl Responsibly: Avoid Legal Risks and Server Overload

This guide outlines responsible web‑crawling practices, covering robots.txt compliance, legal pitfalls such as unauthorized personal data and copyrighted content, recommended request intervals, and relevant Chinese data‑security regulations, helping developers avoid server overloads and potential lawsuits.

Backend Development · Data Ethics · Scrapy

0 likes · 4 min read

MaGe Linux Operations

Dec 27, 2019 · Backend Development

Master Scrapy: Build Powerful Python Web Crawlers Step‑by‑Step

This guide introduces the Scrapy framework, explains its architecture—including engine, scheduler, downloader, spiders, pipelines, and middlewares—covers installation, project setup, item definition, spider coding, pipeline handling, pagination, and provides practical code examples for extracting data from Douban books.

Item Pipeline · Middleware · Python

0 likes · 18 min read

Python Programming Learning Circle

Dec 26, 2019 · Backend Development

Master Web Crawling with Python: From Requests to XPath Extraction

This guide walks you through the fundamentals of building a web crawler in Python, covering how to fetch pages with the Requests library, extract data using regular expressions and XPath, and provides practical code examples for each step.

XPath · data extraction · regex

0 likes · 9 min read

21CTO

Dec 9, 2019 · Big Data

China’s Big Data Crackdown: Legal Risks Every Developer Should Know

The article examines the sweeping regulatory crackdown on China’s big‑data and financial‑risk companies, detailing the dissolution of major crawler firms, new legal restrictions on data collection, and practical guidance on what data‑scraping activities are illegal and how to protect personal information.

Big Data · data privacy · financial technology

0 likes · 11 min read

Programmer DD

Dec 7, 2019 · Backend Development

Why Choose Java Over Python for Web Crawling? A Practical Guide

The article shares the author's journey from manual data collection to mastering Java web crawlers, explains why Java is preferred over Python, outlines the five-step crawling workflow, covers essential Java basics, HTTP fundamentals, and provides code examples for URL queuing, time parsing, and timestamp conversion.

Backend Development · HTTP · Java

0 likes · 12 min read

21CTO

Nov 16, 2019 · Fundamentals

From Early Crawlers to ByteDance: A History of Web Scraping

This article traces the evolution of web crawlers—from early Perl scripts to modern ByteDance agents—explaining their role in search engines, business models, anti‑crawling measures, and the impact on content creation and competition.

Data Scraping · Search Engine · content aggregation

0 likes · 6 min read

Java Backend Technology

Oct 31, 2019 · Operations

When Does Programming Cross the Legal Line? A Developer's Risk Guide

This article explains how common programming activities such as web crawling, developing gambling or adult sites, P2P platforms, and game cheats can violate Chinese laws, outlines the legal criteria for each case, and offers practical advice for developers to protect themselves from criminal liability.

data privacy · legal compliance · programmer risk

0 likes · 18 min read

FunTester

Oct 19, 2019 · Backend Development

Building a Fast Historical‑Today Crawler with Java and MySQL

This article presents an open‑source Java crawler that fetches "today in history" events from a public API, details three practical challenges (GET request length limits, ambiguous JSON value types, and month string construction), and provides a full code example and a GitHub repository link for reference.

GitHub · HTTP · data extraction

0 likes · 5 min read

Python Crawling & Data Mining

Oct 11, 2019 · Backend Development

How to Build a Python Web Crawler to Map 2019 Chinese National Day Travel Hotspots

This article walks through the complete process of designing, implementing, and visualizing a Python web crawler that extracts tourism hotspot data from ticketing sites for China's 2019 National Day holiday, covering requirement analysis, URL and element inspection, data collection, cleaning, and geographic heat‑map presentation using Pyecharts.

Pyecharts · Python · Tourism

0 likes · 11 min read

FunTester

Sep 15, 2019 · Backend Development

How to Build a Java HttpClient Spider for Scraping Movie Details and Download Links

This article explains how to update and use a Java HttpClient‑based spider that removes duplicate links, handles legacy page formats, extracts movie metadata and download URLs (magnet, ed2k, Baidu Pan), and stores the results in a MySQL database, with complete source code examples.

HttpClient · Java · Scraping

0 likes · 12 min read

21CTO

Aug 16, 2019 · Backend Development

Master Scrapy: Build, Deploy, and Scale a Python Web Crawler Platform

This guide walks through designing a full‑featured web‑crawler platform, covering rule maintenance, job scheduling, async and real‑time crawling with Scrapy, project setup, item pipelines, settings, local execution, custom parameters, server deployment via Scrapyd, API usage, and fast real‑time crawling with Requests, BeautifulSoup, Flask, and multithreading.

Flask · Python · Scrapy

0 likes · 16 min read

Architecture Digest

Aug 15, 2019 · Backend Development

Design and Implementation of a Scrapy‑Based Web Crawling Platform

This article explains how to design a flexible web‑crawling platform using Scrapy, covering rule maintenance, job scheduling, asynchronous and real‑time crawlers, project setup, code structure, settings, local execution, deployment with scrapyd, API usage, and examples of Flask‑based real‑time services.

Flask · Python · Scrapy

0 likes · 16 min read

360 Tech Engineering

May 31, 2019 · Information Security

Dynamic Web Crawling Techniques for Vulnerability Scanning with Pyppeteer

This article details the practical implementation of a dynamic web crawler for vulnerability scanning, covering Chrome headless setup, browser initialization, JavaScript hook injection for DOM events, navigation locking, form handling, link collection, deduplication, and task scheduling using pyppeteer.

Dynamic Analysis · browser automation · javascript hooking

0 likes · 30 min read

58 Tech

May 8, 2019 · Information Security

Overview of Web Crawling, Anti‑Crawling Techniques, and 58 Anti‑Crawling System

This article introduces the fundamentals of web crawlers, typical crawling methods, and a comprehensive set of anti‑crawling strategies—including IP control, browser and device simulation, CAPTCHA cracking, and traffic analysis—while detailing the architecture and capabilities of the 58 anti‑crawling platform.

Traffic analysis · anti‑crawling · bot detection

0 likes · 17 min read

JavaEdge

Mar 21, 2019 · Backend Development

Master Web Crawling with Scrapy: From Tech Choices to Powerful Regex Extraction

This guide walks through selecting Scrapy over Requests + BeautifulSoup, explains web page types, outlines crawler use‑cases, details regular‑expression syntax and non‑greedy matching, demonstrates practical regex patterns with images, compares depth‑first and breadth‑first crawling, and covers URL deduplication and string‑encoding pitfalls in Python.

Python · Scrapy · regex

0 likes · 11 min read

Python Crawling & Data Mining

Feb 18, 2019 · Backend Development

How to Build Your First Scrapy Project on Windows: Step‑by‑Step Guide

This article walks you through setting up a Windows virtual environment, installing Scrapy, creating a new Scrapy project, exploring its directory structure, and opening it in PyCharm, providing clear commands and screenshots for each step.

Python · Scrapy · project setup

0 likes · 6 min read

Python Crawling & Data Mining

Feb 6, 2019 · Backend Development

Master Scrapy: Build Powerful Python Web Crawlers in Minutes

This article introduces the Scrapy framework, explains its architecture and five core components, guides you through creating a Scrapy project, configuring spiders, pipelines, and middlewares, and demonstrates how to run the crawler to efficiently collect and process web data using Python.

Backend Development · Python · Scrapy

0 likes · 7 min read

MaGe Linux Operations

Jan 13, 2019 · Backend Development

How to Build a High‑Performance Novel Site Crawler with MongoDB

This article walks through building a novel‑site crawler using MongoDB, detailing how to extract category links, manage URL states across multiple processes, and handle deduplication, while sharing screenshots of the framework, database logic, and final results.

MongoDB · multithreading · web crawling

0 likes · 3 min read

Python Crawling & Data Mining

Jan 13, 2019 · Backend Development

How to Fix Common Scrapy Installation Errors on Windows

This guide walks you through step‑by‑step solutions for typical Scrapy installation problems on Windows, covering missing libxml2/lxml wheels, Visual C++ requirements, and Twisted wheel compatibility, so you can get the framework up and running smoothly.

Python · Scrapy · web crawling

0 likes · 7 min read

MaGe Linux Operations

Nov 19, 2018 · Backend Development

How to Crawl Complete Qidian Novels with Scrapy on Ubuntu

This tutorial explains how to use Scrapy on Ubuntu to create a project, define items, set up pipelines and settings, write a spider, and scrape completed novels from Qidian, while noting the VIP access limitation.

Qidian · Scrapy · Ubuntu

0 likes · 3 min read

Python Crawling & Data Mining

Nov 13, 2018 · Fundamentals

Mastering Breadth-First Search for Web Crawling: Concepts and Code

This article explains the breadth‑first search (BFS) strategy for web crawling, contrasts it with depth‑first search, describes its layer‑by‑layer queue implementation, and walks through a complete Python code example, highlighting why both algorithms are essential interview topics.

Breadth-First Search · Python · web crawling

0 likes · 5 min read

UC Tech Team

Nov 5, 2018 · Artificial Intelligence

News Page Identification Using Machine Learning: Feature Engineering, Model Selection, and Evaluation

To accurately distinguish news pages from other web page types, this study formulates the task as a binary classification problem, extracts 19 engineered features from HTML, evaluates logistic regression and SVM models with cross‑validation, and achieves over 90% precision, recall, and F1‑score using logistic regression trained with Newton's method.

binary classification · feature engineering · logistic regression

0 likes · 13 min read

Python Crawling & Data Mining

Sep 23, 2018 · Fundamentals

Master Python Regex: Using $ and ? with Non‑Greedy Patterns

This tutorial explains Python regular expression special characters like $ and ?, demonstrates greedy versus non‑greedy matching with practical examples, and shows how to capture substrings correctly for web‑crawling tasks.

Python · non-greedy · regex

0 likes · 6 min read

21CTO

Sep 7, 2018 · Backend Development

Why Scaling Web Crawlers Is Harder Than You Think: Lessons from 100 Billion Pages

This article outlines the major challenges of large‑scale e‑commerce product data extraction (ever‑changing site formats, scalable architecture, performance throughput, anti‑bot defenses, and data quality) and shares the hard‑won lessons Scrapinghub gained after crawling over 100 billion product pages.

Scale · Scrapy · data extraction

0 likes · 15 min read

MaGe Linux Operations

Aug 3, 2018 · Backend Development

How to Build a High‑Performance Magnet Link Crawler with Python and DHT

This article explains the structure of magnet links, the role of DHT in BitTorrent, and provides a step‑by‑step guide to creating a Python‑based magnet link crawler that stores results in Redis and converts them to torrent files using aria2.

DHT · Redis · bitTorrent

0 likes · 8 min read

Qunar Tech Salon

Jul 25, 2018 · Information Security

Understanding Web Crawlers: Definitions, Types, Traffic, and Harm

This article introduces web crawlers, classifies them by technology and intent, presents statistics on crawler traffic across industries and regions, and analyzes the various harms they cause, laying the groundwork for future discussions on anti‑crawling strategies.

Traffic analysis · anti‑crawling · crawler classification

0 likes · 10 min read

MaGe Linux Operations

Jun 21, 2018 · Fundamentals

Getting Started with Web Crawlers: Inspect Elements and Simple Python Requests Demo

This tutorial introduces web crawlers, explains how to inspect and temporarily modify page HTML using browser developer tools, and provides a hands‑on Python example that fetches a page with the requests library and prints its source code.

HTML · inspect element · requests

0 likes · 5 min read

MaGe Linux Operations

Jun 2, 2018 · Backend Development

How to Build a High-Performance Proxy IP Pool for Web Crawlers with Python

Learn how to design and implement a robust proxy IP pool for distributed web crawlers using Python, Flask, and SSDB, covering proxy acquisition, quality testing, storage, API services, scheduling, installation, and usage examples to ensure fast and stable crawling.

API · Flask · SSDB

0 likes · 8 min read

MaGe Linux Operations

May 16, 2018 · Backend Development

Essential Python Libraries for Web Crawling and Web Development

This guide outlines the core steps of a web request, then presents a comprehensive catalog of Python libraries for crawling, parsing, text processing, automation, concurrency, cloud execution, and popular web frameworks, helping developers choose the right tools for backend projects.

frameworks · libraries · web crawling

0 likes · 10 min read

MaGe Linux Operations

Jan 5, 2018 · Backend Development

How to Build a High‑Speed Sina Weibo Scrapy Spider that Crawls 13 Million Posts Daily

This article explains how to create a Python‑based Scrapy spider that logs into Sina Weibo using cookies, crawls user profiles, posts, followers and followees from the WAP site at speeds exceeding 13 million records per day, and stores the data in MongoDB.

MongoDB · Python · Scrapy

0 likes · 6 min read

MaGe Linux Operations

Dec 5, 2017 · Information Security

How to Defend Your Website Against Web Crawlers: Techniques & Tools

This article explores why web content needs protection, explains common server‑side and client‑side anti‑crawling methods—including User‑Agent checks, token cookies, headless‑browser detection, fingerprinting, captchas, and robots.txt—and offers practical guidance for raising the cost of unauthorized scraping.

Browser Fingerprinting · Headless Browser · anti‑crawling

0 likes · 12 min read

MaGe Linux Operations

Nov 20, 2017 · Backend Development

Mastering Web Crawlers: Core Principles, Architecture, and Modern Challenges

This article explains how web crawlers work—from initial URL seeding and request handling to flow control, content extraction, and handling dynamic pages—while covering essential modules, HTTP details, common obstacles like JavaScript rendering, anti‑scraping measures, and strategies for large‑scale, distributed crawling.

HTTP · data extraction · distributed systems

0 likes · 14 min read

MaGe Linux Operations

Aug 10, 2017 · Backend Development

Explore the Ultimate Python Library Collection for Web Crawling and Data Processing

This comprehensive guide lists essential Python libraries for network operations, asynchronous programming, web crawling frameworks, HTML/XML parsing, text handling, data conversion, slug creation, office document manipulation, PDF processing, markdown rendering, YAML handling, CSS utilities, feed parsing, SQL tools, HTTP clients, microformats, executable analysis, PSD handling, natural language processing, browser automation, headless tools, multiprocessing, queues, cloud execution, email handling, URL manipulation, web content extraction, video downloading, wiki archiving, WebSocket communication, DNS queries, computer vision, proxy services, and miscellaneous utilities.

Network · Python · data processing

0 likes · 17 min read

21CTO

Jun 24, 2017 · Information Security

Why 95% of Web Traffic Is Bots: Inside the Crawling Arms Race

The article explores the hidden, high‑traffic world of web crawlers and anti‑crawling measures, revealing why most online requests are bots, how companies decide to crawl or block, the technical and organizational challenges involved, and what the future may hold for this perpetual cat‑and‑mouse game.

Industry · anti‑crawling · backend

0 likes · 22 min read

Qunar Tech Salon

Jun 22, 2017 · Information Security

The Dark Side of Web Crawling and Anti‑Crawling: Industry Realities and Technical Challenges

This article explores the often hidden and contentious world of web crawling and anti‑crawling, detailing industry motivations, the massive proportion of bot traffic, the technical arms race between scrapers and defenders, and the broader impact on developers, companies, and security practices.

JavaScript · Python · anti‑crawling

0 likes · 21 min read

Baidu Intelligent Testing

Jun 20, 2017 · Big Data

Design and Challenges of Web Crawlers and Link Scheduling for Knowledge Graph Construction

The article explains how web crawlers (spiders) collect data for knowledge graphs, covering core tasks, major challenges, crawler features, new‑link expansion, storage design, link‑selection scheduling strategies, and the role of large‑scale data mining and machine learning in optimizing crawl efficiency.

Big Data · Knowledge Graph · Spider

0 likes · 17 min read

MaGe Linux Operations

Jun 3, 2017 · Information Security

The Dark Side of Web Crawling: Industry Secrets, Technical Battles, and Future Trends

This article explores the hidden, often unglamorous world of web crawling and anti‑crawling, detailing why companies need these technologies, the massive traffic they generate, the technical arms race between crawlers and defenders, and the evolving strategies and challenges that shape the industry today.

anti‑crawling · e-commerce · information security

0 likes · 21 min read

Ctrip Technology

May 22, 2017 · Information Security

The Dark Side of Web Crawling and Anti‑Crawling: Industry Realities and Technical Strategies

This article examines the hidden, often unglamorous world of web crawling and anti‑crawling, revealing why companies deploy aggressive scraping and defensive measures, the technical arms race between crawlers and defenders, the impact on engineers' careers, and future trends in this contested space.

Data Scraping · anti‑crawling · information security

0 likes · 21 min read

MaGe Linux Operations

Mar 28, 2017 · Backend Development

Master Scrapy: Build a Complete DMOZ Crawler in Four Simple Steps

This tutorial walks you through creating a Scrapy project, defining items, writing a spider, and exporting data to crawl the DMOZ website, covering command‑line setup, XPath extraction, handling encoding errors, and using pipelines for storage.

Item Pipeline · Python · Scrapy

0 likes · 11 min read

MaGe Linux Operations

Mar 28, 2017 · Backend Development

Master Scrapy: Step-by-Step Guide to Install the Powerful Python Web Crawler

This article walks you through the complete installation process for Scrapy, the Python-based web crawling framework, covering prerequisite Python setup, required dependencies such as lxml, setuptools, zope.interface, Twisted, pyOpenSSL, and pywin32, and finally verifying the installation, preparing you for large‑scale data extraction tasks.

Installation · Python · Scrapy

0 likes · 4 min read

21CTO

Nov 20, 2016 · Backend Development

Mastering Web Crawlers: Strategies, Tools, and Practical Code Samples

This article explores the fundamentals and advanced techniques of building web crawlers, covering crawler types, essential features, RSS/ATOM harvesting, custom scraping methods, PHP header manipulation, regex extraction, and concurrency, providing actionable code examples for backend developers.

Backend Development · RSS · Scraping

0 likes · 9 min read
