Scraping HTML Tables with Java Regex and Generating SQL Inserts
The article walks through a Java solution for extracting multilingual data from an HTML table using regular expressions, handling spacing and encoding issues, splitting fields, and constructing INSERT statements to populate a country_code database table.
Problem Overview
A web API returns an HTML page that contains a table of country‑code pairs. The response is a plain string, making traditional DOM parsing with XPath cumbersome. The goal is to extract each row and generate MySQL INSERT statements.
Challenges
Table cells contain multilingual text (e.g., Chinese, Arabic, Cyrillic).
Whitespace between the country name and its code is inconsistent.
Incorrect character encoding can produce garbled output.
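Garbled output usually means the response bytes were decoded with the wrong charset. As a minimal sketch, assuming the page is served as UTF-8 (an assumption; the actual charset should be taken from the response's Content-Type header), the following decodes the same bytes once correctly and once incorrectly. The class and method names are hypothetical and are not part of the original code.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class CharsetDecodeSketch {
    // Decode raw response bytes with an explicitly chosen charset
    // instead of relying on the platform default.
    static String decode(byte[] body, Charset charset) {
        return new String(body, charset);
    }

    public static void main(String[] args) {
        // Two Chinese characters followed by " zh", written as Unicode escapes
        // so the source file itself stays ASCII. Hypothetical sample data.
        byte[] body = "\u4e2d\u6587 zh".getBytes(StandardCharsets.UTF_8);

        // Matching charset: the original text is recovered.
        System.out.println(decode(body, StandardCharsets.UTF_8));

        // Mismatched charset: every byte becomes its own character, so the
        // multi-byte sequences turn into unreadable text.
        System.out.println(decode(body, StandardCharsets.ISO_8859_1));
    }
}
```

Passing the charset explicitly keeps the result independent of the machine's default encoding.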
Solution Approach
The implementation uses regular expressions to locate each table row, removes remaining HTML tags, splits the cleaned line on the first space (limiting the array size to two elements), and formats the values into INSERT statements.
Key Steps
Fetch the HTML content as a string.
Apply a regex that matches a complete <tr>…</tr> block (a single row usually spans several lines of the response) and collect all matches.
For each match, strip all tags with replaceAll("<.+?>", "") and remove line‑break characters.
Split the resulting text on the first space (split(" ", 2)) to separate the country name and its ISO code.
Generate a MySQL statement using String.format and output it.
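The steps above can be sketched end to end on an inline HTML fragment. This is an illustration under stated assumptions, not the article's implementation: the class name RowScrapeSketch, the sample markup, and the use of Pattern.DOTALL with a simpler <tr>…</tr> pattern are substitutions made here so the example runs without any HTTP helpers.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RowScrapeSketch {
    // DOTALL lets '.' match line breaks, so one reluctant pattern can span
    // a row that is spread over several lines.
    private static final Pattern ROW = Pattern.compile("<tr.*?</tr>", Pattern.DOTALL);

    static List<String> toInserts(String html) {
        List<String> statements = new ArrayList<>();
        Matcher matcher = ROW.matcher(html);
        while (matcher.find()) {
            String cleaned = matcher.group()
                    .replaceAll("<[^>]+>", " ") // swap each tag for a space so cells stay separated
                    .replace('\u00a0', ' ')     // treat non-breaking spaces as ordinary spaces
                    .replaceAll("\\s+", " ")    // collapse runs of whitespace, including line breaks
                    .trim();
            // Split on the first space only: name first, code second.
            String[] parts = cleaned.split(" ", 2);
            if (parts.length < 2) {
                continue; // row without two fields, nothing to insert
            }
            statements.add(String.format(
                    "INSERT country_code (country,code) VALUES (\"%s\",\"%s\");",
                    parts[0], parts[1]));
        }
        return statements;
    }

    public static void main(String[] args) {
        // Hypothetical sample markup, not copied from the real page.
        String html = "<table>\n"
                + "<tr>\n<td><p>German</p></td>\n<td><p>de</p></td>\n</tr>\n"
                + "<tr><td>Greek</td><td>\u00a0el</td></tr>\n"
                + "</table>";
        for (String statement : toInserts(html)) {
            System.out.println(statement);
        }
    }
}
```

Normalizing whitespace before the split is what makes the inconsistent spacing between name and code harmless in this sketch. Because the split happens at the first space, a name that itself contains a space would be cut after its first word here, so such rows would need a different separator.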
Core Java Code
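import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public static void main(String[] args) {
    // getHttpGet, getHttpResponse and testOver are helpers from the author's own
    // framework; LINE (apparently a line-break pattern) and SPACE_1 (apparently a
    // special space character) are its constants. Their definitions are not shown.
    String url = "https://docs.oracle.com/cd/E13214_01/wli/docs92/xref/xqisocodes.html";
    HttpGet httpGet = getHttpGet(url);
    JSONObject httpResponse = getHttpResponse(httpGet);
    String content = httpResponse.getString("content");
    // Regular expression that captures a full table row. By default '.' does not
    // match line breaks, so each line of the row is matched separately and joined with LINE.
    List<String> rows = regexAll(content,
            "<tr.+?</a>" + LINE + ".+" + LINE + ".+" + LINE + ".+" + LINE + ".+" + LINE + "</div>");
    for (String row : rows) {
        // Remove all HTML tags and line breaks
        String cleaned = row.replaceAll("<.+?>", "").replaceAll(LINE, "");
        // Split on the first space only
        String[] parts = cleaned.split(" ", 2);
        if (parts.length < 2) {
            continue; // no space found, so there is no code field to insert
        }
        String sql = "INSERT country_code (country,code) VALUES (\"%s\",\"%s\");";
        System.out.println(String.format(sql,
                parts[0].replace(SPACE_1, ""),
                parts[1].replace(SPACE_1, "")));
    }
    testOver();
}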
/**
 * Return all matching items for a given regular expression.
 * @param text Text to be searched
 * @param regex Regular expression
 * @return List of matches
 */
public static List<String> regexAll(String text, String regex) {
    List<String> result = new ArrayList<>();
    Pattern pattern = Pattern.compile(regex);
    Matcher matcher = pattern.matcher(text);
    while (matcher.find()) {
        result.add(matcher.group());
    }
    return result;
}
Sample Output
INSERT country_code (country,code) VALUES ("German","de");
INSERT country_code (country,code) VALUES ("Greek","el");
INSERT country_code (country,code) VALUES ("Greenlandic","kl");
INSERT country_code (country,code) VALUES ("Guarani","gn");
INSERT country_code (country,code) VALUES ("Hausa","ha");
INSERT country_code (country,code) VALUES ("Hebrew","he");
INSERT country_code (country,code) VALUES ("Hindi","hi");
INSERT country_code (country,code) VALUES ("Hungarian","hu");
INSERT country_code (country,code) VALUES ("Icelandic","is");
INSERT country_code (country,code) VALUES ("Indonesian","id");
INSERT country_code (country,code) VALUES ("Interlingua","ia");
INSERT country_code (country,code) VALUES ("Interlingue","ie");
INSERT country_code (country,code) VALUES ("Inuktitut","iu");
INSERT country_code (country,code) VALUES ("Inupiak","ik");
INSERT country_code (country,code) VALUES ("Irish","ga");
INSERT country_code (country,code) VALUES ("Italian","it");
INSERT country_code (country,code) VALUES ("Japanese","ja");
This approach provides a lightweight way to scrape tabular data without a full HTML parser while handling multilingual content, irregular whitespace, and encoding issues.
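One caveat when formatting values straight into double-quoted SQL literals: a value that contains a double quote or a backslash would break the generated statement. The sketch below shows one way to escape such values, assuming MySQL's default sql_mode, in which backslash is the escape character inside string literals. The class and method names are hypothetical and do not appear in the original code.

```java
class SqlLiteralSketch {
    // Escape backslashes first, then double quotes, so the backslashes
    // added for the quotes are not escaped a second time.
    static String escape(String value) {
        return value.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    // Build one INSERT statement in the same shape as the sample output.
    static String toInsert(String country, String code) {
        return String.format(
                "INSERT country_code (country,code) VALUES (\"%s\",\"%s\");",
                escape(country), escape(code));
    }

    public static void main(String[] args) {
        // Hypothetical value containing a double quote.
        System.out.println(toInsert("Some \"quoted\" name", "xx"));
    }
}
```

For statements that are executed directly rather than written to a script, a parameterized java.sql.PreparedStatement avoids manual escaping altogether.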
