Master Web Crawling: Focused, General, Incremental & Deep Techniques in Python

This article introduces four web crawling strategies (focused, general-purpose, incremental, and deep-web crawlers), explains the principles behind each, presents practical Python examples for extracting image, e-commerce, and movie data, and discusses deduplication methods and form-filling techniques.

Python Crawling & Data Mining

Mar 21, 2021

Introduction

Web crawlers are a general-purpose way to collect data automatically. This article introduces the main types of crawlers: focused, general-purpose, incremental, and deep-web.

1. Types of Crawlers

A focused crawler targets pages on a specific topic. A general-purpose crawler is a core component of a search engine's indexing system: it downloads web pages in bulk to build a local mirror of them.

Incremental crawling means automatically fetching newly added or changed data on a site.

Web pages can be divided into surface web and deep web.

Surface Web: static pages indexed by traditional search engines.

Deep Web: pages hidden behind forms, accessible only after submitting keywords.

2. Focused Crawling Techniques

Focused crawlers evaluate the importance of links and page content to decide what to crawl next. Link-evaluation strategies include HITS, which computes an Authority weight and a Hub weight for each page and uses them to prioritize links.
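To make the Hub and Authority weights concrete, here is a minimal sketch of the HITS iteration on an in-memory link graph. The function name `hits`, the toy graph `links`, and the fixed iteration count are assumptions for illustration; a real crawler would build the graph from the pages it has fetched.

```python
import math


def hits(graph, iterations=50):
    """Return (hub, authority) score dicts for a directed link graph.

    graph maps each page to the list of pages it links to.
    """
    pages = set(graph) | {t for targets in graph.values() for t in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority of a page: sum of the hub scores of pages linking to it.
        auth = {p: 0.0 for p in pages}
        for src, targets in graph.items():
            for dst in targets:
                auth[dst] += hub[src]
        norm = math.sqrt(sum(v * v for v in auth.values())) or 1.0
        auth = {p: v / norm for p, v in auth.items()}
        # Hub of a page: sum of the authority scores of pages it links to.
        hub = {p: sum(auth[d] for d in graph.get(p, ())) for p in pages}
        norm = math.sqrt(sum(v * v for v in hub.values())) or 1.0
        hub = {p: v / norm for p, v in hub.items()}
    return hub, auth


# Hypothetical toy graph: two pages link to "c", and "c" links to "d".
links = {
    "a": ["c"],
    "b": ["c"],
    "c": ["d"],
    "d": [],
}
hub_scores, auth_scores = hits(links)
best_authority = max(auth_scores, key=auth_scores.get)
```

In this toy graph, `best_authority` is `"c"`, because it is the only page with two in-links. A focused crawler could then fetch links pointing at high-authority pages first.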

Content-evaluation strategies score a page by its similarity to the target topic. Examples are the Fish-Search algorithm and Shark-Search, an improved version based on the vector space model.
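The vector space model behind this kind of content scoring can be sketched as a cosine similarity between term-frequency vectors. This is not Fish-Search or Shark-Search itself, only the similarity measure; the helper names and the sample texts are assumptions for illustration.

```python
import math
import re
from collections import Counter


def term_vector(text):
    """Turn text into a bag-of-words term-frequency vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical topic description and page texts.
topic = term_vector("python web crawler scrapy")
on_topic = term_vector("Build a web crawler in Python with Scrapy")
off_topic = term_vector("Ten easy pasta recipes for dinner")

score_on = cosine_similarity(topic, on_topic)
score_off = cosine_similarity(topic, off_topic)
```

Here `score_on` is clearly higher than `score_off`, which is zero because the second page shares no terms with the topic. A crawler could follow links only from pages whose score exceeds a chosen threshold.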

Example: a simple image‑focused crawler.

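import re
import urllib.parse
import urllib.request

keyname = ""  # search keyword (left empty in the original)
key = urllib.parse.quote(keyname)  # URL-encode the keyword
# Regex that captures each image URL from the JSON embedded in the result page
pat = '"pic_url":"//(.*?)"'
for i in range(0, 5):
    # The offset advances by 44 per page; part of the query string is elided in the original
    url = "https://s.taobao.com/search?q=" + key + "&..." + str(i*44)
    # ... (rest of code omitted for brevity)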

3. General‑Purpose Crawling Techniques

General crawlers follow a five‑step process: obtain initial URLs, fetch pages to discover new URLs, enqueue new URLs, dequeue and crawl them, and stop when a termination condition is met.

Typical strategies include breadth‑first and depth‑first traversal.
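The five steps and the breadth-first strategy can be sketched with a queue and a set of seen URLs. To keep the example runnable offline, the page fetcher is passed in as a function and the "site" is a hypothetical in-memory dictionary; `crawl`, `fetch_links`, `max_pages`, and `SITE` are names introduced here, not part of the original article.

```python
from collections import deque


def crawl(seed_urls, fetch_links, max_pages=100):
    """Breadth-first crawl; returns URLs in the order they were visited."""
    queue = deque(seed_urls)          # step 1: start from the initial URLs
    seen = set(seed_urls)
    visited = []
    # Step 5: stop when the queue is empty or the page budget is used up.
    while queue and len(visited) < max_pages:
        url = queue.popleft()         # step 4: dequeue (popleft gives breadth-first)
        visited.append(url)
        for link in fetch_links(url):  # step 2: fetch the page, discover new URLs
            if link not in seen:
                seen.add(link)
                queue.append(link)    # step 3: enqueue new URLs
    return visited


# Hypothetical in-memory site: each path maps to the links found on that page.
SITE = {
    "/": ["/a", "/b"],
    "/a": ["/c"],
    "/b": ["/c", "/"],
    "/c": [],
}
order = crawl(["/"], lambda url: SITE.get(url, []))
```

With this graph, `order` is `["/", "/a", "/b", "/c"]`. Replacing `queue.popleft()` with `queue.pop()` turns the same loop into a depth-first traversal.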

Example: crawling JD.com product information using Selenium.

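from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
import time

def get_good(driver):
    try:
        # Scroll down so that more of the product list is loaded
        js_code = '''
            window.scrollTo(0,5000);
        '''
        driver.execute_script(js_code)
        time.sleep(2)  # give the page time to render the new items
        # Selenium 4 removed find_elements_by_class_name; use find_elements with By
        good_list = driver.find_elements(By.CLASS_NAME, 'gl-item')
        # ... (rest of code omitted, including the except clause that closes this try block)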

4. Incremental Crawling

Incremental crawlers monitor website updates and fetch only new or changed data. Deduplication can happen at three points: before sending a request, check whether the URL has already been crawled; after parsing, check whether the content has already been seen; and before writing to storage, check whether the record already exists.
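The first two checks can be sketched with in-memory sets: one for URLs and one for content fingerprints. The class name `SeenStore`, its method names, and the sample URLs are assumptions for illustration; a production crawler would keep these sets in a persistent store such as Redis so they survive restarts.

```python
import hashlib


class SeenStore:
    """Tracks crawled URLs and content fingerprints in memory."""

    def __init__(self):
        self._urls = set()
        self._fingerprints = set()

    def is_new_url(self, url):
        """Check before the request: has this URL been crawled already?"""
        if url in self._urls:
            return False
        self._urls.add(url)
        return True

    def is_new_content(self, text):
        """Check after parsing: has identical content been seen already?"""
        fingerprint = hashlib.sha256(text.strip().encode("utf-8")).hexdigest()
        if fingerprint in self._fingerprints:
            return False
        self._fingerprints.add(fingerprint)
        return True


store = SeenStore()
first_url = store.is_new_url("http://example.com/movie/1")
repeat_url = store.is_new_url("http://example.com/movie/1")
first_body = store.is_new_content("Movie A, comedy")
# Same content reached through a different URL is still a duplicate.
repeat_body = store.is_new_content("Movie A, comedy\n")
```

The URL check is cheap and avoids the request entirely, while the fingerprint check also catches pages whose content is served under several URLs.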

5. Deep‑Web Crawling

Deep-web crawlers must submit forms to reach hidden pages. There are two main form-filling methods: knowledge-based filling, which draws values from a keyword library, and structure-analysis-based filling, which inspects the form's structure to fill it in automatically.
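As a rough sketch of how the two methods can combine, the example below parses a form's structure with the standard library and fills its fields from a small keyword library. The `FormFieldParser` class, the `KEYWORDS` dictionary, and the sample `PAGE` are all hypothetical; real forms also involve selects, checkboxes, and POST bodies, which are not handled here.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode


class FormFieldParser(HTMLParser):
    """Collects the form action and the named input fields of a page."""

    def __init__(self):
        super().__init__()
        self.action = None
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self.action = attrs.get("action")
        elif tag == "input" and attrs.get("name"):
            if attrs.get("type", "text") != "submit":
                # Keep any default value the form already provides.
                self.fields[attrs["name"]] = attrs.get("value") or ""


# Hypothetical keyword library: field name -> value to submit.
KEYWORDS = {"q": "python", "title": "crawler"}


def build_query(html, keywords):
    """Return (action, query string) with fields filled from the keyword library."""
    parser = FormFieldParser()
    parser.feed(html)
    filled = {name: keywords.get(name, default)
              for name, default in parser.fields.items()}
    return parser.action, urlencode(filled)


PAGE = """
<form action="/search" method="get">
  <input type="text" name="q">
  <input type="hidden" name="lang" value="en">
  <input type="submit" value="Go">
</form>
"""
action, query = build_query(PAGE, KEYWORDS)
```

For this sample form, `action` is `"/search"` and `query` is `"q=python&lang=en"`: the text field comes from the keyword library and the hidden field keeps its default value.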

Code Example: Scrapy Incremental Crawler

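import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from redis import Redis
# IncrementproItem is defined in the project's items.py; this relative import
# assumes the standard Scrapy project layout (spiders/ next to items.py).
from ..items import IncrementproItem

class MovieSpider(CrawlSpider):
    name = 'movie'
    start_urls = ['http://www.4567tv.tv/frim/index7-11.html']
    # Follow every paginated index page and pass it to parse_item
    rules = (Rule(LinkExtractor(allow=r'/frim/index7-\d+\.html'), callback='parse_item', follow=True),)

    # The Redis set 'urls' records every detail URL that has already been scheduled
    conn = Redis(host='127.0.0.1', port=6379)

    def parse_item(self, response):
        li_list = response.xpath('//li[@class="p1 m1"]')
        for li in li_list:
            href = li.xpath('./a/@href').extract_first()
            if href is None:
                continue  # skip list items without a link
            detail_url = 'http://www.4567tv.tv' + href
            # sadd returns 1 if the URL was newly added, 0 if it was already in the set
            ex = self.conn.sadd('urls', detail_url)
            if ex == 1:
                yield scrapy.Request(url=detail_url, callback=self.parse_detail)

    def parse_detail(self, response):
        item = IncrementproItem()
        item['name'] = response.xpath('//dt[@class="name"]/text()').extract_first()
        item['kind'] = ''.join(response.xpath('//div[@class="ct-c"]/dl/dt[4]//text()').extract())
        yield item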

Together, these sections outline the four crawler types and illustrate them with practical Python examples.
