
Design and Implementation of a Java Web Crawler Framework Inspired by Scrapy

This article explains how to design and build a lightweight Java web crawler framework, covering crawler fundamentals, anti‑scraping challenges, core components such as URL manager, scheduler, downloader, parser and pipeline, and provides concrete code examples and architectural diagrams.

Architecture Digest

Inspired by Python's Scrapy, the author presents a Java‑based web crawler framework called Elves, with source code available on GitHub. The article first defines what a web crawler (spider) is, discusses the need for polite crawling, and outlines common anti‑crawling techniques such as rate limiting, header validation, dynamic page generation, IP restrictions, cookies, and CAPTCHAs.

It then describes the essential considerations for building a crawler framework, including URL management, page downloading, scheduling, parsing, and data processing. The design aims to abstract repetitive tasks, provide extensibility, and support features like multi‑threaded downloading, XPath/CSS selectors, and customizable pipelines.

The framework’s basic characteristics are highlighted: easy customization, multi‑threaded download, and support for XPath and CSS parsing. An architecture diagram shows the flow: Engine → Scheduler → Downloader → Spider → Pipeline.
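As a rough sketch of that flow (the `MiniEngine` class and its functional parameters are illustrative stand-ins, not classes from Elves itself), the engine can be pictured as a loop that drains the scheduler, hands each URL to the downloader, passes the response body to the spider for parsing, and forwards extracted items to the pipeline:

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.function.Consumer;
import java.util.function.Function;

// Illustrative only: a single-threaded engine loop mirroring
// Engine -> Scheduler -> Downloader -> Spider -> Pipeline.
public class MiniEngine {
    private final Queue<String> scheduler = new ArrayDeque<>();   // pending URLs
    private final Function<String, String> downloader;            // URL -> page body
    private final Function<String, String> spider;                // body -> extracted item
    private final Consumer<String> pipeline;                      // item sink

    public MiniEngine(Function<String, String> downloader,
                      Function<String, String> spider,
                      Consumer<String> pipeline) {
        this.downloader = downloader;
        this.spider = spider;
        this.pipeline = pipeline;
    }

    public void addRequest(String url) { scheduler.add(url); }

    // Run until the scheduler is drained.
    public void start() {
        while (!scheduler.isEmpty()) {
            String url = scheduler.poll();          // Scheduler
            String body = downloader.apply(url);    // Downloader
            String item = spider.apply(body);       // Spider (parse)
            pipeline.accept(item);                  // Pipeline
        }
    }
}
```

A real engine adds concurrency and error handling, but the component boundaries stay the same, which is what makes each piece independently replaceable.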

Several code snippets illustrate the implementation. A simple crawler without a framework is shown:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.MalformedURLException;
import java.net.URL;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class Reptile {
    public static void main(String[] args) {
        String url = "";  // target URL (omitted in the original)
        StringBuilder html = new StringBuilder();
        // Fetch the raw page; try-with-resources closes the stream.
        try (BufferedReader br = new BufferedReader(
                new InputStreamReader(new URL(url).openStream()))) {
            String line;
            while ((line = br.readLine()) != null) {
                html.append(line);
            }
            // Parse the HTML with Jsoup and print the matching elements.
            Document doc = Jsoup.parse(html.toString());
            Elements elements = doc.getElementsByClass("XX");  // class name from the original
            for (Element element : elements) {
                System.out.println(element.text());
            }
        } catch (MalformedURLException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

Key framework components are then detailed:

URL Manager: maintains a FIFO queue of requests, encapsulating each URL into a Request object.
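A minimal sketch of such a manager (the `Request` fields and the dedup set are assumptions for illustration; the real Elves classes may differ) pairs a FIFO queue with a seen-set so each URL is enqueued at most once:

```java
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Queue;
import java.util.Set;

// Illustrative URL manager: FIFO queue plus a dedup set.
public class UrlManager {
    // A request wraps the raw URL; real frameworks also carry
    // headers, retry counts, callbacks, etc.
    public static class Request {
        public final String url;
        public Request(String url) { this.url = url; }
    }

    private final Queue<Request> pending = new ArrayDeque<>();
    private final Set<String> seen = new HashSet<>();

    // Enqueue only URLs we have not seen before.
    public boolean add(String url) {
        if (!seen.add(url)) return false;  // duplicate, skip
        pending.add(new Request(url));
        return true;
    }

    public boolean hasNext() { return !pending.isEmpty(); }
    public Request next() { return pending.poll(); }
}
```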

Downloader: abstracts the HTTP client (e.g., HttpClient or OkHttp) to fetch page content.
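One way to keep the client swappable is a small interface with interchangeable implementations. This is a sketch, not Elves' actual API; `JdkDownloader` uses JDK 11's `java.net.http.HttpClient`, and the User-Agent string is illustrative:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Illustrative abstraction: the engine depends only on this interface,
// so the underlying HTTP client (JDK, OkHttp, Apache) can be swapped.
interface Downloader {
    String download(String url);
}

// One possible implementation on top of JDK 11+ java.net.http.
class JdkDownloader implements Downloader {
    private final HttpClient client = HttpClient.newHttpClient();

    @Override
    public String download(String url) {
        try {
            HttpRequest request = HttpRequest.newBuilder(URI.create(url))
                    .header("User-Agent", "elves-demo")  // identify the crawler politely
                    .build();
            return client.send(request, HttpResponse.BodyHandlers.ofString()).body();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("download interrupted", e);
        }
    }
}
```

Because callers see only the interface, a test can substitute a stub downloader and exercise the rest of the pipeline without touching the network.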

Scheduler: routes requests and responses between components. An example implementation uses blocking queues:

import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class Scheduler {
    private static final Logger log = LoggerFactory.getLogger(Scheduler.class);

    private final BlockingQueue<Request> pending = new LinkedBlockingQueue<>();
    private final BlockingQueue<Response> result = new LinkedBlockingQueue<>();

    public void addRequest(Request request) { try { pending.put(request); } catch (InterruptedException e) { log.error("Add request error", e); } }
    public void addResponse(Response response) { try { result.put(response); } catch (InterruptedException e) { log.error("Add response error", e); } }
    public boolean hasRequest() { return pending.size() > 0; }
    public Request nextRequest() { try { return pending.take(); } catch (InterruptedException e) { log.error("Get request error", e); return null; } }
    public boolean hasResponse() { return result.size() > 0; }
    public Response nextResponse() { try { return result.take(); } catch (InterruptedException e) { log.error("Get response error", e); return null; } }
    public void addRequests(List<Request> requests) { requests.forEach(this::addRequest); }
}
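Note that a `hasRequest()` check followed by `nextRequest()` is not atomic, so with several download workers it is safer to let the blocking `take()` do the waiting, or to poll with a timeout. A self-contained sketch of multi-threaded workers draining a shared queue (the `crawl` helper and the 100 ms idle timeout are illustrative choices, not part of Elves):

```java
import java.util.Collection;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

// Illustrative worker pool: several threads drain one shared request
// queue; poll-with-timeout lets idle workers exit once the queue is empty.
public class WorkerPoolDemo {
    public static Collection<String> crawl(List<String> urls, int threads)
            throws InterruptedException {
        BlockingQueue<String> pending = new LinkedBlockingQueue<>(urls);
        Collection<String> results = new ConcurrentLinkedQueue<>();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int i = 0; i < threads; i++) {
            pool.submit(() -> {
                try {
                    String url;
                    // A worker stops when the queue stays empty for 100 ms.
                    while ((url = pending.poll(100, TimeUnit.MILLISECONDS)) != null) {
                        results.add("downloaded:" + url);  // stand-in for a real HTTP fetch
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.SECONDS);
        return results;
    }
}
```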

An event‑driven design using the observer pattern is also shown:

public enum ElvesEvent { GLOBAL_STARTED, SPIDER_STARTED }

public class EventManager {
    private static final Map<ElvesEvent, List<Consumer<Config>>> elvesEventConsumerMap = new HashMap<>();

    public static void registerEvent(ElvesEvent elvesEvent, Consumer<Config> consumer) { /* ... */ }
    public static void fireEvent(ElvesEvent elvesEvent, Config config) { /* ... */ }
}
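The elided method bodies might look like the following. This is a guess at one plausible observer-pattern implementation, not Elves' actual source, and `Config` is stubbed here only to keep the sketch self-contained:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

// Self-contained sketch of the observer pattern behind an event manager.
public class EventDemo {
    enum ElvesEvent { GLOBAL_STARTED, SPIDER_STARTED }
    static class Config { }  // stub standing in for the framework's Config

    private static final Map<ElvesEvent, List<Consumer<Config>>> listeners = new HashMap<>();

    // Subscribe a consumer to an event type.
    public static void registerEvent(ElvesEvent event, Consumer<Config> consumer) {
        listeners.computeIfAbsent(event, k -> new ArrayList<>()).add(consumer);
    }

    // Notify every consumer registered for the event.
    public static void fireEvent(ElvesEvent event, Config config) {
        listeners.getOrDefault(event, List.of()).forEach(c -> c.accept(config));
    }
}
```

Firing `SPIDER_STARTED` then notifies only the consumers registered for that event, which lets lifecycle hooks be added without touching the engine loop.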

Finally, a concrete spider example ( DoubanSpider ) demonstrates crawling movie titles from Douban, adding a pipeline to log results, and handling pagination by generating new requests.

public class DoubanSpider extends Spider {
    public DoubanSpider(String name) {
        super(name);
        this.startUrls("https://movie.douban.com/tag/爱情", /* ... */);
    }

    @Override
    public void onStart(Config config) {
        this.addPipeline((Pipeline<List<String>>) (item, request) -> log.info("Save to file: {}", item));
    }

    public Result parse(Response response) { /* extract titles, handle next page */ }
}

public static void main(String[] args) {
    DoubanSpider spider = new DoubanSpider("豆瓣电影");
    Elves.me(spider, Config.me()).start();
}
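The elided `parse` typically extracts titles and, when more results exist, emits a request for the next page. Tag pages of this kind usually paginate with an offset query parameter; a hypothetical helper for building the next-page URL (the `start` parameter name and the fixed page-size step are assumptions, not taken from Elves) might look like:

```java
// Hypothetical pagination helper: bump (or add) a "start" offset
// query parameter to build the next page's request URL.
public class Pagination {
    public static String nextPage(String url, int pageSize) {
        int i = url.indexOf("start=");
        if (i < 0) {
            // First page: append the offset parameter.
            return url + (url.contains("?") ? "&" : "?") + "start=" + pageSize;
        }
        // Locate the current numeric offset and advance it by one page.
        int valueStart = i + "start=".length();
        int valueEnd = valueStart;
        while (valueEnd < url.length() && Character.isDigit(url.charAt(valueEnd))) valueEnd++;
        int offset = Integer.parseInt(url.substring(valueStart, valueEnd));
        return url.substring(0, valueStart) + (offset + pageSize) + url.substring(valueEnd);
    }
}
```

In a spider, `parse` would wrap the returned URL in a new Request and hand it back to the scheduler along with the extracted items.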

The article concludes that while the presented framework covers core functionality, further enhancements such as distributed crawling, fault tolerance, and dynamic page handling are possible, and invites contributions on the GitHub repository.

Tags: backend, Java, architecture, framework, web scraping, Scrapy, web crawler
Written by Architecture Digest

Focusing on Java backend development, covering application architecture from top-tier internet companies (high availability, high performance, high stability), big data, machine learning, Java architecture, and other popular fields.