Java Web Crawler Framework with JD Book Data Extraction Example
This article introduces a modular Java web crawler framework built with Maven, explains its package structure (db, model, util, parse, main), details the data flow from URL fetching to HTML parsing using HttpClient and Jsoup, and provides a complete example that extracts book information from JD.com and stores it in MySQL.
Java Web Crawler Framework
Writing a web crawler requires a clear logical order; this article explains a frequently used sequence and presents a framework that can be extended for both simple and complex crawlers.
Framework Packages
The Maven‑based project follows the Spring MVC style and contains five packages:
db: database utilities, including MyDataSource for driver registration and connection configuration, and MYSQLControl for insert, update, and table-creation operations.
model: POJOs that encapsulate the data to be scraped, e.g., book ID, name, and price.
util: HTTP client helpers that fetch HTML or JSON from a given URL.
parse: parsers that process the fetched content; Jsoup handles HTML, while fastjson or regular expressions can handle JSON.
main: the entry point that orchestrates URL fetching, parsing, and database insertion.
Crawler Logic Flow
The main method passes a URL to the util layer, which returns the raw HTML; the parse layer extracts the required data into a collection; the main method then uses the db layer to persist the data into MySQL.
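This three-layer handoff can be sketched without any network or database dependency. The class and method names below are illustrative stand-ins for the framework's util, parse, and db layers (they are not part of the article's code), and fetching is stubbed with a fixed HTML string so the sketch runs standalone:

```java
import java.util.ArrayList;
import java.util.List;

// Minimal sketch of the three-layer flow: fetch -> parse -> persist.
// Names are illustrative; the real framework uses HttpClient, Jsoup, and MySQL.
public class FlowSketch {
    // util layer: would normally fetch the page over HTTP; stubbed here
    static String fetch(String url) {
        return "<ul><li data-sku=\"1001\">Book A</li></ul>";
    }

    // parse layer: extracts data-sku values with a plain string scan
    // (the real framework uses Jsoup selectors instead)
    static List<String> parse(String html) {
        List<String> ids = new ArrayList<>();
        String marker = "data-sku=\"";
        int i = html.indexOf(marker);
        while (i >= 0) {
            int start = i + marker.length();
            int end = html.indexOf('"', start);
            ids.add(html.substring(start, end));
            i = html.indexOf(marker, end);
        }
        return ids;
    }

    // db layer: stubbed as console output instead of a MySQL insert
    static void save(List<String> ids) {
        System.out.println("saved " + ids.size() + " records: " + ids);
    }

    public static void main(String[] args) {
        String html = fetch("http://example.com");
        save(parse(html));
    }
}
```

The point of the separation is that each layer can be swapped or tested in isolation, exactly as the packages above suggest.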
Example: JD.com Book Information Crawler
The example demonstrates how to scrape book ID, name, and price from JD.com. First, define the data model:
package model;

public class JdModel {
    private String bookID;
    private String bookName;
    private String bookPrice;

    public String getBookID() { return bookID; }
    public void setBookID(String bookID) { this.bookID = bookID; }
    public String getBookName() { return bookName; }
    public void setBookName(String bookName) { this.bookName = bookName; }
    public String getBookPrice() { return bookPrice; }
    public void setBookPrice(String bookPrice) { this.bookPrice = bookPrice; }
}

The main class launches the crawl:
package main;

import java.util.List;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.http.client.HttpClient;
import org.apache.http.impl.client.DefaultHttpClient;

import db.MYSQLControl;
import model.JdModel;
import util.URLFecter;

public class JdongMain {
    static final Log logger = LogFactory.getLog(JdongMain.class);

    public static void main(String[] args) throws Exception {
        HttpClient client = new DefaultHttpClient();
        String url = "http://search.jd.com/Search?keyword=Python&enc=utf-8&book=y&wq=Python&pvid=33xo9lni.p4a1qb";
        // Fetch the search-result page and parse it into model objects
        List<JdModel> bookdatas = URLFecter.URLParser(client, url);
        for (JdModel jd : bookdatas) {
            logger.info("bookID:" + jd.getBookID() + "\tbookPrice:" + jd.getBookPrice() + "\tbookName:" + jd.getBookName());
        }
        // Persist the results to MySQL
        MYSQLControl.executeInsert(bookdatas);
    }
}

The util layer fetches the raw HTML:
package util;

import java.io.IOException;

import org.apache.http.HttpResponse;
import org.apache.http.HttpStatus;
import org.apache.http.HttpVersion;
import org.apache.http.client.HttpClient;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.message.BasicHttpResponse;

public abstract class HTTPUtils {
    public static HttpResponse getRawHtml(HttpClient client, String personalUrl) {
        HttpGet getMethod = new HttpGet(personalUrl);
        // Default to a synthetic 200 response so callers always receive a non-null result
        HttpResponse response = new BasicHttpResponse(HttpVersion.HTTP_1_1, HttpStatus.SC_OK, "OK");
        try {
            response = client.execute(getMethod);
        } catch (IOException e) {
            e.printStackTrace();
        }
        return response;
    }
}

URLFecter parses the response and delegates to the parser:
package util;

import java.util.ArrayList;
import java.util.List;

import org.apache.http.HttpResponse;
import org.apache.http.client.HttpClient;
import org.apache.http.util.EntityUtils;

import model.JdModel;
import parse.JdParse;

public class URLFecter {
    public static List<JdModel> URLParser(HttpClient client, String url) throws Exception {
        List<JdModel> jingdongData = new ArrayList<JdModel>();
        HttpResponse response = HTTPUtils.getRawHtml(client, url);
        int statusCode = response.getStatusLine().getStatusCode();
        if (statusCode == 200) {
            String entity = EntityUtils.toString(response.getEntity(), "utf-8");
            jingdongData = JdParse.getData(entity);
        }
        // Release the underlying connection in either case
        EntityUtils.consume(response.getEntity());
        return jingdongData;
    }
}

The parser uses Jsoup to extract book details:
package parse;

import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import model.JdModel;

public class JdParse {
    public static List<JdModel> getData(String html) throws Exception {
        List<JdModel> data = new ArrayList<JdModel>();
        Document doc = Jsoup.parse(html);
        // Each search result is an <li class="gl-item"> inside <ul class="gl-warp clearfix">
        Elements elements = doc.select("ul[class=gl-warp clearfix]").select("li[class=gl-item]");
        for (Element ele : elements) {
            String bookID = ele.attr("data-sku");
            String bookPrice = ele.select("div[class=p-price]").select("strong").select("i").text();
            String bookName = ele.select("div[class=p-name]").select("em").text();
            JdModel jdModel = new JdModel();
            jdModel.setBookID(bookID);
            jdModel.setBookName(bookName);
            jdModel.setBookPrice(bookPrice);
            data.add(jdModel);
        }
        return data;
    }
}

The db layer provides a DataSource and a batch insert operation:
package db;

import javax.sql.DataSource;

import org.apache.commons.dbcp2.BasicDataSource;

public class MyDataSource {
    public static DataSource getDataSource(String connectURI) {
        BasicDataSource ds = new BasicDataSource();
        ds.setDriverClassName("com.mysql.jdbc.Driver");
        ds.setUsername("root");
        ds.setPassword("112233");
        ds.setUrl(connectURI);
        return ds;
    }
}

package db;
import java.sql.SQLException;
import java.util.List;

import javax.sql.DataSource;

import org.apache.commons.dbutils.QueryRunner;

import model.JdModel;

public class MYSQLControl {
    static DataSource ds = MyDataSource.getDataSource("jdbc:mysql://127.0.0.1:3306/moviedata");
    static QueryRunner qr = new QueryRunner(ds);

    public static void executeInsert(List<JdModel> jingdongdata) throws SQLException {
        // One row of bind values per book: (bookID, bookName, bookPrice)
        Object[][] params = new Object[jingdongdata.size()][3];
        for (int i = 0; i < params.length; i++) {
            params[i][0] = jingdongdata.get(i).getBookID();
            params[i][1] = jingdongdata.get(i).getBookName();
            params[i][2] = jingdongdata.get(i).getBookPrice();
        }
        qr.batch("insert into jingdongbook (bookID, bookName, bookPrice) values (?,?,?)", params);
        System.out.println("Database write complete. Rows inserted: " + jingdongdata.size());
    }
}

Running the program logs each book record to the console and inserts the data into the MySQL table, as shown in the screenshots of the crawler output and the database contents.
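The parameter mapping inside executeInsert can be verified without a database. The sketch below mirrors that loop, with String[] triples (id, name, price) standing in for JdModel so it runs standalone; that substitution is ours, not the article's:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Standalone sketch of the Object[][] construction that QueryRunner.batch
// binds to the three ?-placeholders of the insert statement.
public class BatchParamsSketch {
    static Object[][] toParams(List<String[]> books) {
        // One row of bind values per book
        Object[][] params = new Object[books.size()][3];
        for (int i = 0; i < books.size(); i++) {
            params[i][0] = books.get(i)[0]; // bookID
            params[i][1] = books.get(i)[1]; // bookName
            params[i][2] = books.get(i)[2]; // bookPrice
        }
        return params;
    }

    public static void main(String[] args) {
        List<String[]> books = new ArrayList<>();
        books.add(new String[]{"1001", "Python Crash Course", "59.00"});
        System.out.println(Arrays.deepToString(toParams(books)));
        // → [[1001, Python Crash Course, 59.00]]
    }
}
```

Each inner Object[] becomes one parameterized execution in the JDBC batch, which is why its length must match the number of placeholders in the SQL.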
Overall, the article provides a complete, extensible Java web crawler framework, demonstrates the end‑to‑end process of fetching, parsing, and persisting data, and can serve as a foundation for more advanced crawling tasks.
Java Captain
Focused on Java technologies: SSM, the Spring ecosystem, microservices, MySQL, MyCat, clustering, distributed systems, middleware, Linux, networking, multithreading; occasionally covers DevOps tools like Jenkins, Nexus, Docker, ELK; shares practical tech insights and is dedicated to full‑stack Java development.