How to Build a Java/Groovy Web Crawler with Regex and MySQL Storage

This article demonstrates a Java‑based web crawler written in Groovy that uses regular‑expression parsing to retrieve paginated company data from a government portal, constructs SQL insert statements, and stores the results in MySQL, with full source code and structural screenshots.

FunTester

Purpose

This script crawls a government portal, extracts company registration information page by page, and stores the data in a MySQL table.

Technical Stack

Groovy script extending a custom Java HTTP library (FanLibrary)

JSON request payload built with net.sf.json.JSONObject

Regular‑expression helper class Regex (source available on Gitee)

MySQL execution via MySqlTest.sendWork

Crawler Workflow

1. List‑page request

The getPage(int page) method sends a POST request to http://www.***.gov.cn/eportal/ui?pageId=307900 with the following parameters:

params.put("filter_LIKE_QYMC", EMPTY);
params.put("filter_LIKE_YYZZZCH", EMPTY);
params.put("filter_LIKE_ZSBH", EMPTY);
params.put("filter_LIKE_XXDZ", EMPTY);
params.put("currentPage", page);
params.put("pageSize", 15);
params.put("OrderByField", EMPTY);
params.put("OrderByDesc", EMPTY);

The content field of the response JSON contains an HTML table. The regular expression <td s.*?浏览 (浏览 is the "Browse" link text) extracts each row that holds a link to a detail page. For each extracted row, the script:

Finds the href attribute with Regex.getRegex(..., "href=\".*?\"")

Removes the amp; left over from the HTML entity &amp;, so the URL contains plain & separators

Calls getInfo with the cleaned URL

Sleeps 3 seconds to avoid hammering the server
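The row and href extraction described above can be sketched with the standard java.util.regex package. This is a minimal illustration, not the article's code: the class ListPageLinks and the method detailUrls are hypothetical names, the Regex helper from the original script is replaced by plain Pattern/Matcher calls, and the DOTALL flag is an assumption so that a row spanning several lines still matches.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the original script.
final class ListPageLinks {
    // Row pattern quoted in the article: "<td s" up to the first link text.
    // \u6d4f\u89c8 is the link text from the article's pattern ("Browse"),
    // written as Unicode escapes so the source compiles under any file encoding.
    // DOTALL is an assumption: it lets ".*?" cross line breaks inside a row.
    private static final Pattern ROW =
            Pattern.compile("<td s.*?\u6d4f\u89c8", Pattern.DOTALL);

    // Same idea as href=\".*?\" in the article, with a capture group for the value.
    private static final Pattern HREF = Pattern.compile("href=\"(.*?)\"");

    // Returns the cleaned relative URL of every detail link found in the HTML.
    static List<String> detailUrls(String contentHtml) {
        List<String> urls = new ArrayList<>();
        Matcher row = ROW.matcher(contentHtml);
        while (row.find()) {
            Matcher href = HREF.matcher(row.group());
            if (href.find()) {
                // Drop "amp;" so that "&amp;" becomes a plain "&".
                urls.add(href.group(1).replace("amp;", ""));
            }
        }
        return urls;
    }
}
```

A caller would pass the content field of the list-page response to detailUrls, then request each returned URL in turn, pausing between requests (for example with Thread.sleep(3000)) to match the 3-second delay mentioned above.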

2. Detail‑page parsing

The getInfo(String url) method builds the absolute URL by prefixing http://www.***.gov.cn, performs a GET request, and extracts the table cells that have the CSS class label using the pattern <td class=\"label\".*?\n.*?\n.*?\n.*?\n.*?\n.*?. The resulting list contains ten fields in the following order:

Company name

Address

Registered capital

Registration number (sid)

Company type

Legal representative (man)

Business scope (paper)

Rating level

Supervising authority (gov)

Registration period (time)

Each field is cleaned by removing HTML tags and whitespace, then split on the Chinese colon so that only the value after the label remains. The registration period string is further split on ~ to produce the start and end dates.
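The cleaning steps above might look like the following sketch. The class DetailFieldCleaner and its methods are hypothetical names, and two details are assumptions rather than facts from the article: the "Chinese colon" is taken to be the full-width colon U+FF1A, and a period without a ~ is treated as having an empty end date.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of the original script.
final class DetailFieldCleaner {
    private static final Pattern TAGS = Pattern.compile("<[^>]*>");
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    // Assumption: the label/value separator is the full-width colon U+FF1A.
    private static final String COLON = "\uFF1A";

    // Strips tags and whitespace, then keeps the text after the first colon.
    static String value(String rawCell) {
        String text = TAGS.matcher(rawCell).replaceAll("");
        text = WHITESPACE.matcher(text).replaceAll("");
        int idx = text.indexOf(COLON);
        return idx < 0 ? text : text.substring(idx + 1);
    }

    // Splits a "start~end" period into a two-element array {start, end}.
    // If there is no "~", the end date is returned as an empty string.
    static String[] period(String value) {
        String[] parts = value.split("~", 2);
        String end = parts.length > 1 ? parts[1] : "";
        return new String[] {parts[0], end};
    }
}
```

Note that removing all whitespace also removes spaces inside a value, which is usually harmless for Chinese text but would join the words of an English company name.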

3. SQL generation and storage

The extracted values are substituted into an INSERT statement template with %s placeholders (plain string formatting, not a parameterised query):

INSERT INTO company(name,adress,money,sid,type,man,paper,level,gov,start,end)
VALUES ("%s","%s","%s","%s","%s","%s","%s","%s","%s","%s","%s");

The final SQL string is printed with output(sql) and executed on the target MySQL instance via MySqlTest.sendWork(sql). Any exception during parsing or insertion is caught and logged.
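As a rough sketch of this step, the template can be filled with String.format. The class CompanyInsertSql and its methods are hypothetical names, and the escaping of backslashes and double quotes is an addition that the article does not describe; without it, a value containing a double quote would break the generated statement.

```java
// Hypothetical helper, not part of the original script.
final class CompanyInsertSql {
    // Template from the article; the column name "adress" is kept as it
    // appears there, since it must match the existing table definition.
    private static final String TEMPLATE =
            "INSERT INTO company(name,adress,money,sid,type,man,paper,level,gov,start,end) "
            + "VALUES (\"%s\",\"%s\",\"%s\",\"%s\",\"%s\",\"%s\",\"%s\",\"%s\",\"%s\",\"%s\",\"%s\");";

    // Assumption: escape backslashes and double quotes so a value cannot
    // terminate the quoted SQL string early.
    static String escape(String value) {
        return value.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    // Builds the INSERT statement from exactly eleven values, in column order.
    static String build(String... fields) {
        if (fields.length != 11) {
            throw new IllegalArgumentException("expected 11 fields, got " + fields.length);
        }
        Object[] escaped = new Object[fields.length];
        for (int i = 0; i < fields.length; i++) {
            escaped[i] = escape(fields[i]);
        }
        return String.format(TEMPLATE, escaped);
    }
}
```

The resulting string is what would be handed to MySqlTest.sendWork in the original script. Where a JDBC connection is available directly, a PreparedStatement with ? placeholders is the more common way to avoid quoting problems altogether.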

Page Structure (reference)

[Figure: First page structure]

[Figure: Second page detail structure]