How to Build a Java/Groovy Web Crawler with Regex and MySQL Storage
This article walks through a web crawler written in Groovy on top of a custom Java HTTP library. It uses regular‑expression parsing to retrieve paginated company data from a government portal, constructs SQL insert statements, and stores the results in MySQL, with full source code and structural screenshots.
Purpose
This script crawls a government portal, extracts company registration information page by page, and stores the data in a MySQL table.
Technical Stack
Groovy script extending a custom Java HTTP library (FanLibrary)
JSON request payload built with net.sf.json.JSONObject
Regular‑expression helper class Regex (source available on Gitee)
MySQL execution via MySqlTest.sendWork
Crawler Workflow
1. List‑page request
The getPage(int page) method sends a POST request to http://www.***.gov.cn/eportal/ui?pageId=307900 with the following parameters:
params.put("filter_LIKE_QYMC", EMPTY);
params.put("filter_LIKE_YYZZZCH", EMPTY);
params.put("filter_LIKE_ZSBH", EMPTY);
params.put("filter_LIKE_XXDZ", EMPTY);
params.put("currentPage", page);
params.put("pageSize", 15);
params.put("OrderByField", EMPTY);
params.put("OrderByDesc", EMPTY);
The content field of the JSON response contains an HTML table. The regular expression <td s.*?浏览 extracts each row that holds a link to a detail page (浏览 means "view"). For each extracted row the script:
Finds the href attribute with Regex.getRegex(..., "href=\".*?\"")
Removes the entity residue amp; so that each &amp; in the URL becomes a plain &
Calls getInfo with the cleaned URL
Sleeps 3 seconds to avoid hammering the server
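The row-handling steps above can be sketched in plain Java as follows. This is a minimal illustration using java.util.regex, not the article's Regex helper, whose exact behaviour is not shown here. The class name ListPageLinks, the method extractDetailUrls, and the sample URL are hypothetical. The link text 浏览 is written as Unicode escapes so the source compiles under any file encoding.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ListPageLinks {
    // Row pattern from the article: lazy match from "<td s" up to the
    // two-character link text (written here as \u escapes).
    private static final Pattern ROW = Pattern.compile("<td s.*?\u6d4f\u89c8");
    // href pattern from the article.
    private static final Pattern HREF = Pattern.compile("href=\".*?\"");

    /** Returns the cleaned relative detail-page URLs found in the table HTML. */
    public static List<String> extractDetailUrls(String tableHtml) {
        List<String> urls = new ArrayList<>();
        Matcher rows = ROW.matcher(tableHtml);
        while (rows.find()) {
            Matcher href = HREF.matcher(rows.group());
            if (href.find()) {
                String raw = href.group();
                // Drop the leading href=" (6 characters) and the closing quote,
                // then remove "amp;" so that "&amp;" becomes "&".
                String url = raw.substring(6, raw.length() - 1).replace("amp;", "");
                urls.add(url);
            }
        }
        return urls;
    }

    public static void main(String[] args) {
        // Hypothetical sample row; the real portal markup may differ.
        String html = "<td style=\"x\"><a href=\"/eportal/ui?pageId=1&amp;id=42\">"
                + "\u6d4f\u89c8</a></td>";
        System.out.println(extractDetailUrls(html));
    }
}
```

In a crawler loop, each returned URL would be passed to getInfo, followed by the 3-second pause described above.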
2. Detail‑page parsing
The getInfo(String url) method builds the absolute URL by prefixing http://www.***.gov.cn, performs a GET request, and extracts the table cells that have the CSS class label using the pattern <td class=\"label\".*?\n.*?\n.*?\n.*?\n.*?\n.*?. Because . does not match a newline, each match covers the label cell and the lines that follow it, up to the fifth newline. The resulting list contains ten fields in the order:
Company name
Address
Registered capital
Registration number (sid)
Company type
Legal representative (man)
Business scope (paper)
Rating level
Supervising authority (gov)
Registration period (time)
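To show how the multi-line pattern behaves, here is a small Java sketch using java.util.regex. The class LabelBlocks and method extractBlocks are hypothetical names, and the sketch assumes the page uses \n line endings; with \r\n endings the pattern would need adjusting, since . does not match \r in Java either.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LabelBlocks {
    // The article's pattern: a label cell plus everything up to the fifth
    // following newline. The trailing lazy ".*?" matches the empty string.
    private static final Pattern BLOCK = Pattern.compile(
            "<td class=\"label\".*?\n.*?\n.*?\n.*?\n.*?\n.*?");

    /** Returns one raw HTML block per label cell found in the page. */
    public static List<String> extractBlocks(String html) {
        List<String> blocks = new ArrayList<>();
        Matcher m = BLOCK.matcher(html);
        while (m.find()) {
            blocks.add(m.group());
        }
        return blocks;
    }
}
```

Each block still contains tags and line breaks, so it needs cleaning before the value can be read out.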
Each field is cleaned by removing HTML tags and whitespace, then split on the Chinese (full-width) colon ： to obtain the value. The registration period string is further split on ~ to produce the start and end dates.
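The cleaning step might look like the following Java sketch. FieldCleaner, value, and period are hypothetical names, and the tag-stripping regex is an assumption rather than the article's own code. The sketch accepts either the full-width colon (U+FF1A, written as an escape) or the ASCII colon as the separator.

```java
public class FieldCleaner {
    /** Strips tags and all whitespace, then returns the text after the first colon. */
    public static String value(String block) {
        String text = block.replaceAll("<[^>]*>", "").replaceAll("\\s+", "");
        // Split once on the full-width colon (U+FF1A) or the ASCII colon.
        String[] parts = text.split("[\uFF1A:]", 2);
        return parts.length == 2 ? parts[1] : text;
    }

    /** Splits a period such as "2015-01-01~2020-01-01" into start and end. */
    public static String[] period(String value) {
        String[] parts = value.split("~", 2);
        return new String[] { parts[0], parts.length == 2 ? parts[1] : "" };
    }
}
```

Note that removing all whitespace also removes spaces inside a value, which is harmless for Chinese text but would join the words of a Latin-script company name.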
3. SQL generation and storage
The extracted values are interpolated into an INSERT template with %s placeholders. This is string formatting, not a parameterised query:
INSERT INTO company(name,adress,money,sid,type,man,paper,level,gov,start,end)
VALUES ("%s","%s","%s","%s","%s","%s","%s","%s","%s","%s","%s");
The final SQL string is printed with output(sql) and executed on the target MySQL instance via MySqlTest.sendWork(sql). Any exception during parsing or insertion is caught and logged.
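As an alternative to formatting values directly into the SQL text, the same insert could be written with a JDBC PreparedStatement, which lets the driver handle quoting. This is a sketch of a different technique, not what the article's script does. CompanyInsert and insert are hypothetical names, and the caller is assumed to supply an open java.sql.Connection to the target database.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.List;

public class CompanyInsert {
    // Same table and column names as the article's template (including the
    // column spelled "adress"), with ? placeholders instead of "%s".
    public static final String SQL =
            "INSERT INTO company(name,adress,money,sid,type,man,paper,level,gov,start,end) "
          + "VALUES (?,?,?,?,?,?,?,?,?,?,?)";

    /** Binds the eleven values in column order and runs the insert. */
    public static int insert(Connection conn, List<String> values) throws SQLException {
        if (values.size() != 11) {
            throw new IllegalArgumentException("expected 11 values, got " + values.size());
        }
        try (PreparedStatement ps = conn.prepareStatement(SQL)) {
            for (int i = 0; i < values.size(); i++) {
                ps.setString(i + 1, values.get(i)); // JDBC parameters are 1-based
            }
            return ps.executeUpdate();
        }
    }
}
```

With this approach a company name containing a double quote no longer breaks the statement, which the string-formatted version would not tolerate without extra escaping.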
Page Structure (reference)
