Java/Groovy Web Crawler for Collecting Ebook Metadata and Storing It in MySQL

The article presents a Java/Groovy script that crawls a website offering curated e‑books, extracts book details such as title, author, and download links using HTTP requests and regular expressions, and then inserts the collected information into a MySQL database.

FunTester

While searching for e‑books, the author found a site offering many curated e‑books and decided to write a crawler to collect their titles and download links. The author also notes that the site's canvas background effects consume a lot of Chrome's resources.

The following Java/Groovy code is shared as a rough example of how to perform the crawling, parse the HTML with regular expressions, and store the results in a MySQL table.

package com.fun
import com.fun.db.mysql.MySqlTest
import com.fun.frame.httpclient.FanLibrary
import com.fun.utils.Regex
import org.slf4j.Logger
import org.slf4j.LoggerFactory

class T extends FanLibrary {
    static Logger logger = LoggerFactory.getLogger(T.class)
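    // Walks book ids 1 to 1000, scraping one page per id.
    public static void main(String[] args) {
        //test(322)
        def list = 1..1000 as List
        list.each { x ->
            try {
                test(x)
            } catch (Exception e) {
                // Log the failing id and the exception, then continue with the next id.
                logger.error(x.toString())
                output(e)
            }
            // Progress marker for the id just processed.
            logger.warn(x.toString())
            // Pause between requests.
            sleep(2000)
        }
        testOver()
    }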
    // **** stands in for the website address, which the author has masked.
    // Fetches one book page by id, extracts its fields, and inserts a row into MySQL.
    static def test(int id) {
        def get = getHttpGet("https://****/books/" + id + ".html")
        def response = getHttpResponse(get)
        def string = response.getString("content")
        // Skip ids with no page: the site answers "the file you requested does not exist" or "page not found".
        if (string.contains("您需求的文件不存在") || string.contains("页面未找到")) return
        output(string)
        // Raw matches for title, synopsis (内容简介), and author (作者); get(0) fails if nothing matched.
        def all = Regex.regexAll(string, "class=\"bookpic\"> <img title=\".*?\"").get(0)
        def all2 = Regex.regexAll(string, "content=\"内容简介.*?\"").get(0)
        def all3 = Regex.regexAll(string, "title=\"作者:.*?\"").get(0)
        // Download links are optional: ctfile first, then Baidu Pan.
        def all40 = Regex.regexAll(string, "https://*******\\.cc/go\\.html\\?url=https{0,1}://.*?\\.ctfile\\.com/.*?\"")
        def all4 = all40.size() == 0 ? "" : all40.get(0)
        def all50 = Regex.regexAll(string, "https://******\\.cc/go\\.html\\?url=https{0,1}://pan\\.baidu\\.com/.*?\"")
        def all5 = all50.size() == 0 ? "" : all50.get(0)
        output(all, all2, all3, all4, all5)
        // Keep the attribute value between the quotes; author and synopsis retain their label prefixes.
        def name = all.substring(all.lastIndexOf("=") + 2, all.length() - 1)
        def author = all3.substring(all3.lastIndexOf("=") + 2, all3.length() - 1)
        def intro = all2.substring(all2.lastIndexOf("=") + 2, all2.length() - 1)
        // For the links, keep the text after the last "=" and drop the closing quote.
        def url1 = all4 == "" ? "" : all4.substring(all4.lastIndexOf("=") + 1, all4.length() - 1)
        def url2 = all5 == "" ? "" : all5.substring(all5.lastIndexOf("=") + 1, all5.length() - 1)
        output(name, author, intro, url1, url2)
        // The values are formatted straight into the SQL text with no escaping.
        def sql = String.format("INSERT INTO books (name,author,intro,urlc,urlb,bookid) VALUES (\"%s\",\"%s\",\"%s\",\"%s\",\"%s\",%d)", name, author, intro, url1, url2, id)
        MySqlTest.sendWork(sql)
    }
}
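
The substring arithmetic in the script depends on where the last "=" falls, so a title or download URL that itself contains "=" would be cut short. One possible alternative is to let the regular expression capture the value directly. The sketch below uses plain java.util.regex rather than the author's FanLibrary and Regex helpers. The class name BookPageParser, the helper first, the sample HTML, and the hosts example.invalid and u1.ctfile.com are all made up for illustration, and the patterns assume only the markup implied by the script's own expressions.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class BookPageParser {
    // Hypothetical patterns modeled on the expressions in the script above.
    // Group 1 is the value we want, so no index arithmetic is needed afterwards.
    public static final Pattern TITLE =
            Pattern.compile("class=\"bookpic\"> <img title=\"(.*?)\"");
    public static final Pattern CTFILE_URL =
            Pattern.compile("go\\.html\\?url=(https?://[^\"]*?\\.ctfile\\.com/[^\"]*)\"");

    /** Returns capture group 1 of the first match, or empty if the pattern is absent. */
    public static Optional<String> first(Pattern pattern, String html) {
        Matcher m = pattern.matcher(html);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        // Invented sample markup; the real page layout is an assumption.
        String html = "<div class=\"bookpic\"> <img title=\"A=B Guide\" src=\"x.jpg\">"
                + "<a href=\"https://example.invalid/go.html?url=https://u1.ctfile.com/fs/1?p=2\">dl</a>";

        System.out.println(first(TITLE, html).orElse(""));      // A=B Guide
        System.out.println(first(CTFILE_URL, html).orElse("")); // https://u1.ctfile.com/fs/1?p=2
    }
}
```

Returning an Optional also makes the "field missing" case explicit, so a page without a download link yields an empty value instead of an exception. The same helper could be pointed at the author and synopsis attributes with patterns of the same shape.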

The author expresses satisfaction with the script’s performance and shows a screenshot of the resulting database entries.
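
The rows behind those entries are built in the script by formatting scraped text directly into the INSERT statement. A parameterized statement is a common way to avoid quoting problems when page text contains backslashes or other special characters. The following is a minimal sketch using standard JDBC, not the author's MySqlTest helper. The class name BookWriter and the method insertBook are hypothetical; the table and column names are taken from the script's SQL.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class BookWriter {
    // Same table and columns as the script's INSERT; values are bound, not concatenated.
    public static final String INSERT_SQL =
            "INSERT INTO books (name, author, intro, urlc, urlb, bookid) VALUES (?, ?, ?, ?, ?, ?)";

    /** Inserts one book row and returns the affected row count. The caller owns the Connection. */
    public static int insertBook(Connection conn, String name, String author, String intro,
                                 String ctfileUrl, String baiduUrl, int bookId) throws SQLException {
        try (PreparedStatement ps = conn.prepareStatement(INSERT_SQL)) {
            ps.setString(1, name);
            ps.setString(2, author);
            ps.setString(3, intro);
            ps.setString(4, ctfileUrl);
            ps.setString(5, baiduUrl);
            ps.setInt(6, bookId);
            return ps.executeUpdate();
        }
    }
}
```

To use it, a caller would obtain a Connection from a MySQL JDBC driver (for example through DriverManager.getConnection with its own JDBC URL and credentials) and call insertBook once per scraped page.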

Readers can reply to the author's public account with the keyword “电子书” (“e‑book”) to receive the website address and a CSV file containing the collected data.
