Scraping HTML Tables with Java Regex and Generating SQL Inserts

The article walks through a Java solution for extracting multilingual data from an HTML table using regular expressions, handling spacing and encoding issues, splitting fields, and constructing INSERT statements to populate a country_code database table.

FunTester

Problem Overview

A web API returns an HTML page that contains a table of country‑code pairs. The response is a plain string, making traditional DOM parsing with XPath cumbersome. The goal is to extract each row and generate MySQL INSERT statements.

Challenges

Table cells contain multilingual text (e.g., Chinese, Arabic, Cyrillic).

Whitespace between the country name and its code is inconsistent.

Incorrect character encoding can produce garbled output.
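
The encoding problem can be reproduced in isolation. The following is a minimal sketch, not code from the original article: it assumes the response body is available as raw bytes encoded in UTF-8, and the class and method names (EncodingDemo, decodeUtf8, decodeLatin1) are hypothetical. It shows that decoding the same bytes with the wrong charset garbles the non-ASCII characters while leaving ASCII intact.

```java
import java.nio.charset.StandardCharsets;

public class EncodingDemo {
    /** Decodes raw response bytes as UTF-8 (assumed here to be the page's real encoding). */
    static String decodeUtf8(byte[] body) {
        return new String(body, StandardCharsets.UTF_8);
    }

    /** Decodes the same bytes as ISO-8859-1, simulating a wrong charset guess. */
    static String decodeLatin1(byte[] body) {
        return new String(body, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // Two Chinese characters followed by an ASCII code, written with Unicode escapes.
        byte[] body = "\u4e2d\u6587 zh".getBytes(StandardCharsets.UTF_8);
        System.out.println(decodeUtf8(body));   // round-trips to the original text
        System.out.println(decodeLatin1(body)); // non-ASCII characters come out garbled
    }
}
```

If the HTTP client decodes the body itself, the equivalent fix is usually to tell it which charset to use rather than to re-decode the resulting string.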

Solution Approach

The implementation uses regular expressions to locate each table row, removes remaining HTML tags, splits the cleaned line on the first space (limiting the array size to two elements), and formats the values into INSERT statements.

Key Steps

Fetch the HTML content as a string.

Apply a regex that matches one complete table row, which spans several lines of the source, and collect all matches.

For each match, strip all tags with replaceAll("<.+?>", "") and remove line‑break characters.

Split the resulting text on the first space (split(" ", 2)) to separate the country name and its ISO code.

Generate a MySQL statement using String.format and output it.
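
The steps above can be sketched end to end with only the standard library. This is an illustration under stated assumptions, not the author's implementation: the HTML is a made-up inline sample instead of a fetched page, the row pattern is a simplified <tr>…</tr> match rather than the page-specific regex used in the article, and the names TableToSql and toInserts are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TableToSql {
    // (?s) lets '.' match line breaks so one pattern can span a multi-line row.
    private static final Pattern ROW = Pattern.compile("(?s)<tr.*?</tr>");

    /** Converts every table row that yields a name/code pair into an INSERT statement. */
    static List<String> toInserts(String html) {
        List<String> statements = new ArrayList<>();
        Matcher rows = ROW.matcher(html);
        while (rows.find()) {
            // Strip tags, drop line breaks, and trim both ends.
            String cleaned = rows.group()
                    .replaceAll("<.+?>", "")
                    .replaceAll("[\\r\\n]", "")
                    .trim();
            // Split on the first space only; any extra spaces stay in parts[1].
            String[] parts = cleaned.split(" ", 2);
            if (parts.length < 2) {
                continue; // no name/code pair in this row
            }
            statements.add(String.format(
                    "INSERT country_code (country,code) VALUES (\"%s\",\"%s\");",
                    parts[0].trim(), parts[1].trim()));
        }
        return statements;
    }

    public static void main(String[] args) {
        // Made-up sample with uneven spacing before each code.
        String html = "<table>\n"
                + "<tr>\n<td><p>German</p></td>\n<td><p>   de</p></td>\n</tr>\n"
                + "<tr>\n<td><p>Greek</p></td>\n<td><p> el</p></td>\n</tr>\n"
                + "</table>";
        toInserts(html).forEach(System.out::println);
    }
}
```

One limitation of splitting on the first space: a name that itself contains a space would be cut after its first word. If the data contains such names, splitting on the last run of whitespace instead is one possible adjustment.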

Core Java Code

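// Standard-library imports used by the code below. HttpGet and JSONObject are third-party
// types; getHttpGet, getHttpResponse, testOver, LINE and SPACE_1 are helpers and constants
// that this article does not define (presumably part of the author's test framework).
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public static void main(String[] args) {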
    String url = "https://docs.oracle.com/cd/E13214_01/wli/docs92/xref/xqisocodes.html";
    HttpGet httpGet = getHttpGet(url);
    JSONObject httpResponse = getHttpResponse(httpGet);
    String content = httpResponse.getString("content");

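    // Regular expression that captures a full table row. By default '.' does not match
    // line breaks, so each line of the row is matched separately and joined with LINE.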
    List<String> rows = regexAll(content,
        "<tr.+?</a>" + LINE + ".+" + LINE + ".+" + LINE + ".+" + LINE + ".+" + LINE + "</div>");

    for (String row : rows) {
        // Remove all HTML tags and line breaks
        String cleaned = row.replaceAll("<.+?>", "").replaceAll(LINE, "");
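        // Split on the first space only
        String[] parts = cleaned.split(" ", 2);
        if (parts.length < 2) {
            continue; // no space found, so there is no name/code pair to insert
        }
        String sql = "INSERT country_code (country,code) VALUES (\"%s\",\"%s\");";
        // Remove any leftover SPACE_1 padding from both fields before formatting
        System.out.println(String.format(sql,
            parts[0].replace(SPACE_1, ""),
            parts[1].replace(SPACE_1, "")));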
    }
    testOver();
}

/**
 * Return all matching items for a given regular expression.
 * @param text  Text to be searched
 * @param regex Regular expression
 * @return List of matches
 */
public static List<String> regexAll(String text, String regex) {
    List<String> result = new ArrayList<>();
    Pattern pattern = Pattern.compile(regex);
    Matcher matcher = pattern.matcher(text);
    while (matcher.find()) {
        result.add(matcher.group());
    }
    return result;
}
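
In java.util.regex, `.` does not match line terminators unless Pattern.DOTALL is set, which is presumably why the row pattern in main chains several `.+` segments with LINE. As a hedged alternative, the sketch below adds a flags parameter so a single lazy pattern can span a multi-line row. The three-argument overload and the class name RegexFlagsDemo are hypothetical and not part of the original helper, and the sample HTML is made up.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexFlagsDemo {
    /** Hypothetical overload of regexAll that accepts Pattern flags such as Pattern.DOTALL. */
    static List<String> regexAll(String text, String regex, int flags) {
        List<String> result = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex, flags).matcher(text);
        while (matcher.find()) {
            result.add(matcher.group());
        }
        return result;
    }

    public static void main(String[] args) {
        String html = "<tr>\n<td>Hindi</td>\n<td>hi</td>\n</tr>";
        // Without DOTALL, '.' stops at each line break, so the row is not matched.
        System.out.println(regexAll(html, "<tr.*?</tr>", 0).size());              // prints 0
        // With DOTALL, the same pattern spans the whole multi-line row.
        System.out.println(regexAll(html, "<tr.*?</tr>", Pattern.DOTALL).size()); // prints 1
    }
}
```

With DOTALL the pattern no longer depends on how many lines a row occupies, at the cost of relying on the lazy quantifier to stop at the first closing tag.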

Sample Output

INSERT country_code (country,code) VALUES ("German","de");
INSERT country_code (country,code) VALUES ("Greek","el");
INSERT country_code (country,code) VALUES ("Greenlandic","kl");
INSERT country_code (country,code) VALUES ("Guarani","gn");
INSERT country_code (country,code) VALUES ("Hausa","ha");
INSERT country_code (country,code) VALUES ("Hebrew","he");
INSERT country_code (country,code) VALUES ("Hindi","hi");
INSERT country_code (country,code) VALUES ("Hungarian","hu");
INSERT country_code (country,code) VALUES ("Icelandic","is");
INSERT country_code (country,code) VALUES ("Indonesian","id");
INSERT country_code (country,code) VALUES ("Interlingua","ia");
INSERT country_code (country,code) VALUES ("Interlingue","ie");
INSERT country_code (country,code) VALUES ("Inuktitut","iu");
INSERT country_code (country,code) VALUES ("Inupiak","ik");
INSERT country_code (country,code) VALUES ("Irish","ga");
INSERT country_code (country,code) VALUES ("Italian","it");
INSERT country_code (country,code) VALUES ("Japanese","ja");
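
The statements above place each value directly between double quotes, which works for these names but would produce invalid SQL for a value that contains a quote character. As a hedged sketch that is not part of the original code, a small helper can emit single-quoted literals with basic escaping. The names SqlLiteral, quote and insert are hypothetical, the second sample value is only there to show the escaping, and the backslash handling assumes MySQL's default SQL mode, in which backslash acts as an escape character.

```java
public class SqlLiteral {
    /** Wraps a value in single quotes, escaping backslashes and single quotes. */
    static String quote(String value) {
        String escaped = value
                .replace("\\", "\\\\") // backslash -> two backslashes
                .replace("'", "''");   // single quote -> two single quotes
        return "'" + escaped + "'";
    }

    /** Builds one INSERT statement for the country_code table. */
    static String insert(String country, String code) {
        return String.format("INSERT country_code (country,code) VALUES (%s,%s);",
                quote(country), quote(code));
    }

    public static void main(String[] args) {
        System.out.println(insert("Irish", "ga"));
        System.out.println(insert("O'odham", "ood")); // illustrative value containing a quote
    }
}
```

When the rows are loaded from Java directly instead of through a generated script, a JDBC PreparedStatement avoids manual escaping altogether.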

This approach provides a lightweight way to scrape tabular data without a full HTML parser while handling multilingual content, irregular whitespace, and encoding issues.
