Master Web Crawling: Focused, General, Incremental & Deep Techniques in Python
This article introduces various web crawling strategies—including focused crawlers, general-purpose crawlers, incremental crawlers, and deep‑web crawlers—explains their underlying principles, presents practical Python code examples for image, e‑commerce and movie data extraction, and discusses deduplication methods and form‑filling techniques.
Introduction
Web crawlers are a general-purpose way to collect data automatically. This article introduces the main types of crawler and the techniques behind each.
1. Types of Crawlers
Focused crawlers fetch only pages relevant to a specific topic. General-purpose crawlers are a core component of search-engine indexing systems: they download pages from across the web to build a local mirror that the engine can index.
Incremental crawling means automatically fetching newly added or changed data on a site.
Web pages can be divided into surface web and deep web.
Surface Web: static pages indexed by traditional search engines.
Deep Web: pages hidden behind forms, accessible only after submitting keywords.
2. Focused Crawling Techniques
Focused crawlers evaluate the importance of both links and page content to decide what to fetch next. Link-evaluation strategies include HITS, which computes an Authority weight and a Hub weight for each page and uses them to prioritize links.
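As a rough illustration of how HITS scores could rank a crawl frontier, the sketch below runs the iterative update on a tiny link graph. The graph, the function name `hits`, and the iteration count are assumptions made for this example; a real crawler would build the graph from the pages it has fetched.

```python
def hits(graph, iterations=20):
    """Return (authority, hub) score dicts for a directed link graph.

    graph maps each page to the list of pages it links to.
    """
    nodes = set(graph) | {t for targets in graph.values() for t in targets}
    auth = {n: 1.0 for n in nodes}
    hub = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority of a page: sum of the hub scores of pages linking to it.
        new_auth = {n: 0.0 for n in nodes}
        for src, targets in graph.items():
            for dst in targets:
                new_auth[dst] += hub[src]
        # Hub of a page: sum of the authority scores of pages it links to.
        new_hub = {n: sum(new_auth[d] for d in graph.get(n, ())) for n in nodes}
        # Normalize both vectors so the scores stay bounded.
        a_norm = sum(v * v for v in new_auth.values()) ** 0.5 or 1.0
        h_norm = sum(v * v for v in new_hub.values()) ** 0.5 or 1.0
        auth = {n: v / a_norm for n, v in new_auth.items()}
        hub = {n: v / h_norm for n, v in new_hub.items()}
    return auth, hub


# Hypothetical link graph: pages "a" and "b" both link to "c", and "c" links to "d".
links = {"a": ["c"], "b": ["c"], "c": ["d"], "d": []}
authority, hub = hits(links)

# Visit the pages with the highest authority first.
frontier = sorted(authority, key=authority.get, reverse=True)
print(frontier[0])
```

Here the page with the most incoming links from good hubs ends up at the front of the frontier, which is the prioritization effect the text describes.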
Content‑evaluation strategies use similarity calculations, such as the Fish‑Search algorithm and its improved Shark‑Search version based on vector space models.
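The vector space model behind this kind of content scoring can be sketched with a plain term-frequency cosine similarity. This is only the similarity building block, not the full Fish-Search or Shark-Search algorithm; the tokenizer, the function names, and the sample texts are assumptions for illustration.

```python
import math
import re
from collections import Counter


def term_vector(text):
    """Turn text into a bag-of-words term-frequency vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


# Hypothetical topic description and page texts.
topic = term_vector("python web crawler")
pages = {
    "on_topic": "A Python crawler downloads web pages",
    "off_topic": "Recipes for baking sourdough bread",
}
scores = {name: cosine_similarity(topic, term_vector(text)) for name, text in pages.items()}
print(scores)
```

A focused crawler could follow links only from pages whose score exceeds some threshold, so that off-topic pages are not expanded.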
Example: a simple image‑focused crawler.
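import re
import urllib.request  # used by the omitted code to fetch pages
from urllib.parse import quote

keyname = ""  # search keyword (left empty in the source)
key = quote(keyname)  # URL-encode the keyword
for i in range(0, 5):
    # The page offset grows by 44 on each iteration: 0, 44, 88, ...
    url = "https://s.taobao.com/search?q=" + key + "&..." + str(i*44)
    # Regex that captures the image URLs embedded in the page data
    pat = '"pic_url":"//(.*?)"'
    # ... (rest of code omitted for brevity)
3. General‑Purpose Crawling Techniques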
General crawlers follow a five‑step process: obtain initial URLs, fetch pages to discover new URLs, enqueue new URLs, dequeue and crawl them, and stop when a termination condition is met.
Typical strategies include breadth‑first and depth‑first traversal.
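The five steps above can be sketched as a queue-driven loop. To keep the example runnable without network access, page fetching is injected as a `fetch_links` callable and the "site" is a hypothetical in-memory dictionary; both names, and the `max_pages` termination condition, are assumptions for illustration.

```python
from collections import deque


def crawl(seed_urls, fetch_links, max_pages=100):
    """Breadth-first crawl; returns URLs in the order they were visited."""
    queue = deque(seed_urls)          # step 1: start from the initial URLs
    seen = set(seed_urls)             # URLs already queued, to avoid repeats
    visited = []
    while queue and len(visited) < max_pages:   # step 5: termination condition
        url = queue.popleft()         # step 4: dequeue the next URL to crawl
        visited.append(url)
        for link in fetch_links(url):  # step 2: fetch the page, discover URLs
            if link not in seen:
                seen.add(link)
                queue.append(link)    # step 3: enqueue newly found URLs
    return visited


# Hypothetical site: each path maps to the links found on that page.
SITE = {"/": ["/a", "/b"], "/a": ["/c"], "/b": ["/c", "/"], "/c": []}
order = crawl(["/"], lambda url: SITE.get(url, []))
print(order)
```

Taking URLs from the front of the queue gives breadth-first order; taking them from the back with `queue.pop()` instead would turn the queue into a stack and give a depth-first style of traversal.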
Example: crawling JD.com product information using Selenium.
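import time

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys

def get_good(driver):
    try:
        # Scroll down so that items further down the page are loaded
        js_code = '''
            window.scrollTo(0,5000);
        '''
        driver.execute_script(js_code)
        time.sleep(2)  # give the page time to load after scrolling
        # Selenium 4 removed find_elements_by_class_name; use find_elements with By
        good_list = driver.find_elements(By.CLASS_NAME, 'gl-item')
        # ... (rest of code omitted)
4. Incremental Crawling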
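Incremental crawlers monitor a website for updates and fetch only new or changed data. Deduplication can happen at three points: checking whether a URL has already been crawled before sending the request, checking whether the content has already been seen after parsing the page, and checking whether the record already exists before inserting it into storage.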
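The three deduplication points can be sketched with in-memory sets and a dictionary. The class and method names here are hypothetical, and a real crawler would normally keep this state in a persistent store (the Scrapy example in this article uses a Redis set) so that it survives between runs.

```python
import hashlib


class Deduplicator:
    """Hypothetical helper with one check per deduplication point."""

    def __init__(self):
        self.seen_urls = set()
        self.seen_fingerprints = set()
        self.store = {}

    def should_request(self, url):
        """Point 1: skip URLs that were already requested."""
        if url in self.seen_urls:
            return False
        self.seen_urls.add(url)
        return True

    def is_new_content(self, text):
        """Point 2: skip parsed content whose hash was seen before."""
        fingerprint = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if fingerprint in self.seen_fingerprints:
            return False
        self.seen_fingerprints.add(fingerprint)
        return True

    def insert(self, key, record):
        """Point 3: refuse to store a record whose key already exists."""
        if key in self.store:
            return False
        self.store[key] = record
        return True


dedup = Deduplicator()
first = dedup.should_request("http://example.com/movie/1")
second = dedup.should_request("http://example.com/movie/1")
print(first, second)
```

Checking URLs is the cheapest option because it avoids the request entirely, but it misses pages whose content changes under the same URL; the content fingerprint covers that case at the cost of downloading the page first.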
5. Deep‑Web Crawling
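Deep-Web crawlers must submit forms to reach hidden pages. There are two main form-filling approaches: knowledge-based filling, which draws candidate values from a keyword library, and structure-analysis-based filling, which inspects the structure of the form to decide automatically what to submit.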
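A toy version of form filling might parse the form to find its input fields and then take candidate values for each field from a small keyword library. The sample form, the field names, the keyword library, and the function names below are all made up for illustration; real forms usually need extra handling for hidden fields, select options, and POST submissions.

```python
from html.parser import HTMLParser
from itertools import product
from urllib.parse import urlencode


class FormFieldParser(HTMLParser):
    """Collect the names of fillable fields found in a piece of HTML."""

    def __init__(self):
        super().__init__()
        self.fields = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in ("input", "select", "textarea") and attrs.get("name"):
            # Buttons are not fields to fill in.
            if attrs.get("type") not in ("submit", "button"):
                self.fields.append(attrs["name"])


def candidate_queries(form_html, keyword_library):
    """Build one query string per combination of known keyword values."""
    parser = FormFieldParser()
    parser.feed(form_html)
    # Only fields that the keyword library knows about can be filled.
    known = [name for name in parser.fields if name in keyword_library]
    if not known:
        return []
    combos = product(*(keyword_library[name] for name in known))
    return [urlencode(dict(zip(known, combo))) for combo in combos]


# Hypothetical search form and keyword library.
FORM_HTML = (
    '<form action="/search">'
    '<input type="text" name="title">'
    '<input type="text" name="year">'
    '<input type="submit" value="Go">'
    "</form>"
)
KEYWORD_LIBRARY = {"title": ["python", "crawler"], "year": ["2023"]}

queries = candidate_queries(FORM_HTML, KEYWORD_LIBRARY)
print(queries)
```

Each query string would then be appended to the form's action URL, and the result pages crawled like any other page.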
Code Example: Scrapy Incremental Crawler
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from redis import Redis
# IncrementproItem is assumed to be defined in the Scrapy project's items.py;
# its import is not shown in the source.

class MovieSpider(CrawlSpider):
    name = 'movie'
    start_urls = ['http://www.4567tv.tv/frim/index7-11.html']
    # Follow every paginated list page and pass it to parse_item
    rules = (Rule(LinkExtractor(allow=r'/frim/index7-\d+\.html'), callback='parse_item', follow=True),)
    # Redis connection; the 'urls' set records detail pages already scheduled
    conn = Redis(host='127.0.0.1', port=6379)

    def parse_item(self, response):
        li_list = response.xpath('//li[@class="p1 m1"]')
        for li in li_list:
            detail_url = 'http://www.4567tv.tv' + li.xpath('./a/@href').extract_first()
            # sadd returns 1 if the URL was newly added, 0 if it was already in the set
            ex = self.conn.sadd('urls', detail_url)
            if ex == 1:
                yield scrapy.Request(url=detail_url, callback=self.parse_detail)

    def parse_detail(self, response):
        item = IncrementproItem()
        item['name'] = response.xpath('//dt[@class="name"]/text()').extract_first()
        item['kind'] = ''.join(response.xpath('//div[@class="ct-c"]/dl/dt[4]//text()').extract())
        yield item
Together, these sections give an overview of web crawling techniques and practical Python implementations.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact us and we will review it promptly.
Python Crawling & Data Mining
