Java Web Crawler Framework with JD Book Data Extraction Example
This article introduces a modular Java web crawler framework built with Maven, explains its package structure (db, model, util, parse, main), details the data flow from URL fetching to HTML parsing using HttpClient and Jsoup, and provides a complete example that extracts book information from JD.com and stores it in MySQL.
Java Web Crawler Framework
Writing a web crawler requires a clear logical order; this article explains a frequently used sequence and presents a framework that can be extended for both simple and complex crawlers.
Framework Packages
The Maven‑based project follows the Spring MVC style and contains five packages:
db: database utilities, including MyDataSource for driver registration and connection configuration, and MYSQLControl for insert, update, and table-creation operations.
model: POJOs that encapsulate the data to be scraped, e.g., book ID, name, and price.
util: HTTP client helpers that fetch HTML or JSON from a given URL.
parse: parsers that process the fetched content; Jsoup handles HTML, while fastjson or regular expressions can handle JSON.
main: the entry point that orchestrates URL fetching, parsing, and database insertion.
Crawler Logic Flow
The main method passes a URL to the util layer, which returns the raw HTML; the parse layer extracts the required data into a collection; the main method then uses the db layer to persist the data into MySQL.
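This three-layer handoff can be sketched without any network or database dependency. The class and method names below are illustrative stand-ins for the framework's util, parse, and db layers (they are not part of the article's code), and fetching is stubbed with a fixed HTML string so the sketch runs standalone:

```java
import java.util.ArrayList;
import java.util.List;

// Minimal sketch of the three-layer flow: fetch -> parse -> persist.
// Names are illustrative; the real framework uses HttpClient, Jsoup, and MySQL.
public class FlowSketch {
    // util layer: would normally fetch the page over HTTP; stubbed here
    static String fetch(String url) {
        return "<ul><li data-sku=\"1001\">Book A</li></ul>";
    }

    // parse layer: extracts data-sku values with a plain string scan
    // (the real framework uses Jsoup selectors instead)
    static List<String> parse(String html) {
        List<String> ids = new ArrayList<>();
        String marker = "data-sku=\"";
        int i = html.indexOf(marker);
        while (i >= 0) {
            int start = i + marker.length();
            int end = html.indexOf('"', start);
            ids.add(html.substring(start, end));
            i = html.indexOf(marker, end);
        }
        return ids;
    }

    // db layer: stubbed as console output instead of a MySQL insert
    static void save(List<String> ids) {
        System.out.println("saved " + ids.size() + " records: " + ids);
    }

    public static void main(String[] args) {
        String html = fetch("http://example.com");
        save(parse(html));
    }
}
```

The point of the separation is that each layer can be swapped or tested in isolation, exactly as the packages above suggest.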
Example: JD.com Book Information Crawler
The example demonstrates how to scrape book ID, name, and price from JD.com. First, define the data model:
package model;

public class JdModel {
    private String bookID;
    private String bookName;
    private String bookPrice;

    public String getBookID() { return bookID; }
    public void setBookID(String bookID) { this.bookID = bookID; }
    public String getBookName() { return bookName; }
    public void setBookName(String bookName) { this.bookName = bookName; }
    public String getBookPrice() { return bookPrice; }
    public void setBookPrice(String bookPrice) { this.bookPrice = bookPrice; }
}

The main class launches the crawl:
package main;

import java.util.List;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.http.client.HttpClient;
import org.apache.http.impl.client.DefaultHttpClient;

import db.MYSQLControl;
import model.JdModel;
import util.URLFecter;

public class JdongMain {
    static final Log logger = LogFactory.getLog(JdongMain.class);

    public static void main(String[] args) throws Exception {
        HttpClient client = new DefaultHttpClient();
        String url = "http://search.jd.com/Search?keyword=Python&enc=utf-8&book=y&wq=Python&pvid=33xo9lni.p4a1qb";
        // Fetch the search-result page and parse it into model objects
        List<JdModel> bookdatas = URLFecter.URLParser(client, url);
        for (JdModel jd : bookdatas) {
            logger.info("bookID:" + jd.getBookID() + "\tbookPrice:" + jd.getBookPrice() + "\tbookName:" + jd.getBookName());
        }
        // Persist the results to MySQL
        MYSQLControl.executeInsert(bookdatas);
    }
}

The util layer fetches the raw HTML:
package util;

import java.io.IOException;

import org.apache.http.HttpResponse;
import org.apache.http.HttpStatus;
import org.apache.http.HttpVersion;
import org.apache.http.client.HttpClient;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.message.BasicHttpResponse;

public abstract class HTTPUtils {
    public static HttpResponse getRawHtml(HttpClient client, String personalUrl) {
        HttpGet getMethod = new HttpGet(personalUrl);
        // Default to a synthetic 200 response so callers always receive a non-null result
        HttpResponse response = new BasicHttpResponse(HttpVersion.HTTP_1_1, HttpStatus.SC_OK, "OK");
        try {
            response = client.execute(getMethod);
        } catch (IOException e) {
            e.printStackTrace();
        }
        return response;
    }
}

URLFecter parses the response and delegates to the parser:
package util;

import java.util.ArrayList;
import java.util.List;

import org.apache.http.HttpResponse;
import org.apache.http.client.HttpClient;
import org.apache.http.util.EntityUtils;

import model.JdModel;
import parse.JdParse;

public class URLFecter {
    public static List<JdModel> URLParser(HttpClient client, String url) throws Exception {
        List<JdModel> jingdongData = new ArrayList<JdModel>();
        HttpResponse response = HTTPUtils.getRawHtml(client, url);
        int statusCode = response.getStatusLine().getStatusCode();
        if (statusCode == 200) {
            String entity = EntityUtils.toString(response.getEntity(), "utf-8");
            jingdongData = JdParse.getData(entity);
        }
        // Release the underlying connection in either case
        EntityUtils.consume(response.getEntity());
        return jingdongData;
    }
}

The parser uses Jsoup to extract book details:
package parse;

import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import model.JdModel;

public class JdParse {
    public static List<JdModel> getData(String html) throws Exception {
        List<JdModel> data = new ArrayList<JdModel>();
        Document doc = Jsoup.parse(html);
        // Each search result is an <li class="gl-item"> inside <ul class="gl-warp clearfix">
        Elements elements = doc.select("ul[class=gl-warp clearfix]").select("li[class=gl-item]");
        for (Element ele : elements) {
            String bookID = ele.attr("data-sku");
            String bookPrice = ele.select("div[class=p-price]").select("strong").select("i").text();
            String bookName = ele.select("div[class=p-name]").select("em").text();
            JdModel jdModel = new JdModel();
            jdModel.setBookID(bookID);
            jdModel.setBookName(bookName);
            jdModel.setBookPrice(bookPrice);
            data.add(jdModel);
        }
        return data;
    }
}

The db layer provides a DataSource and a batch insert operation:
package db;

import javax.sql.DataSource;

import org.apache.commons.dbcp2.BasicDataSource;

public class MyDataSource {
    public static DataSource getDataSource(String connectURI) {
        BasicDataSource ds = new BasicDataSource();
        ds.setDriverClassName("com.mysql.jdbc.Driver");
        ds.setUsername("root");
        ds.setPassword("112233");
        ds.setUrl(connectURI);
        return ds;
    }
}

package db;
import java.sql.SQLException;
import java.util.List;

import javax.sql.DataSource;

import org.apache.commons.dbutils.QueryRunner;

import model.JdModel;

public class MYSQLControl {
    static DataSource ds = MyDataSource.getDataSource("jdbc:mysql://127.0.0.1:3306/moviedata");
    static QueryRunner qr = new QueryRunner(ds);

    public static void executeInsert(List<JdModel> jingdongdata) throws SQLException {
        // One row of bind values per book: (bookID, bookName, bookPrice)
        Object[][] params = new Object[jingdongdata.size()][3];
        for (int i = 0; i < params.length; i++) {
            params[i][0] = jingdongdata.get(i).getBookID();
            params[i][1] = jingdongdata.get(i).getBookName();
            params[i][2] = jingdongdata.get(i).getBookPrice();
        }
        qr.batch("insert into jingdongbook (bookID, bookName, bookPrice) values (?,?,?)", params);
        System.out.println("Database write complete. Rows inserted: " + jingdongdata.size());
    }
}

Running the program logs each book record to the console and inserts the data into the MySQL table, as shown in the screenshots of the crawler output and the database contents.
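The parameter mapping inside executeInsert can be verified without a database. The sketch below mirrors that loop, with String[] triples (id, name, price) standing in for JdModel so it runs standalone; that substitution is ours, not the article's:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Standalone sketch of the Object[][] construction that QueryRunner.batch
// binds to the three ?-placeholders of the insert statement.
public class BatchParamsSketch {
    static Object[][] toParams(List<String[]> books) {
        // One row of bind values per book
        Object[][] params = new Object[books.size()][3];
        for (int i = 0; i < books.size(); i++) {
            params[i][0] = books.get(i)[0]; // bookID
            params[i][1] = books.get(i)[1]; // bookName
            params[i][2] = books.get(i)[2]; // bookPrice
        }
        return params;
    }

    public static void main(String[] args) {
        List<String[]> books = new ArrayList<>();
        books.add(new String[]{"1001", "Python Crash Course", "59.00"});
        System.out.println(Arrays.deepToString(toParams(books)));
        // → [[1001, Python Crash Course, 59.00]]
    }
}
```

Each inner Object[] becomes one parameterized execution in the JDBC batch, which is why its length must match the number of placeholders in the SQL.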
Overall, the article provides a complete, extensible Java web crawler framework, demonstrates the end‑to‑end process of fetching, parsing, and persisting data, and can serve as a foundation for more advanced crawling tasks.
Java Captain
Focused on Java technologies: SSM, the Spring ecosystem, microservices, MySQL, MyCat, clustering, distributed systems, middleware, Linux, networking, multithreading; occasionally covers DevOps tools like Jenkins, Nexus, Docker, ELK; shares practical tech insights and is dedicated to full‑stack Java development.