
Java/Groovy Web Crawler for Collecting Ebook Metadata and Storing It in MySQL

This article presents a Groovy script, built on the author's Java helper classes, that crawls a website offering curated e-books. It fetches each book page over HTTP, extracts the title, author, synopsis, and download links with regular expressions, and inserts the results into a MySQL database.

FunTester

Oct 17, 2019

While searching for e-books, the author found a site offering many curated titles and decided to write a crawler to collect each book's title and download links. One side note from the author: the canvas background effects on the site consume a lot of Chrome's resources.

The following Groovy code is shared as a rough example. It requests book pages by numeric ID, pulls the fields out of the HTML with regular expressions, and stores each result in a MySQL table.

package com.fun

import com.fun.db.mysql.MySqlTest
import com.fun.frame.httpclient.FanLibrary
import com.fun.utils.Regex
import org.slf4j.Logger
import org.slf4j.LoggerFactory

class T extends FanLibrary {
    static Logger logger = LoggerFactory.getLogger(T.class)

    public static void main(String[] args) {
        //test(322)
        // Walk book IDs 1..1000; a failure on one page must not stop the run.
        (1..1000).each { x ->
            try {
                test(x)
            } catch (Exception e) {
                // Record the ID that failed, then keep going.
                logger.error(x.toString())
                output(e)
            }
            // Progress marker: the ID that was just processed.
            logger.warn(x.toString())
            // Wait two seconds between requests to go easy on the site.
            sleep(2000)
        }
        testOver()
    }

    // **** stands in for the website address, which is withheld here
    static def test(int id) {
        def get = getHttpGet("https://****/books/" + id + ".html")
        def response = getHttpResponse(get)
        def string = response.getString("content")
        // Skip IDs whose page says "the file you requested does not exist" or "page not found".
        if (string.contains("您需求的文件不存在") || string.contains("页面未找到")) return
        output(string)
        // Raw matches, still wrapped in the surrounding attribute text.
        // 内容简介 means "synopsis" and 作者 means "author".
        // get(0) throws when a required field is missing; main() catches it and logs the ID.
        def all = Regex.regexAll(string, "class=\"bookpic\"> <img title=\".*?\"").get(0)
        def all2 = Regex.regexAll(string, "content=\"内容简介.*?\"").get(0)
        def all3 = Regex.regexAll(string, "title=\"作者：.*?\"").get(0)
        // Download links are optional, so an empty match list becomes an empty string.
        def all40 = Regex.regexAll(string, "https://*******\\.cc/go\\.html\\?url=https?://.*?\\.ctfile\\.com/.*?\"")
        def all4 = all40.size() == 0 ? "" : all40.get(0)
        def all50 = Regex.regexAll(string, "https://******\\.cc/go\\.html\\?url=https?://pan\\.baidu\\.com/.*?\"")
        def all5 = all50.size() == 0 ? "" : all50.get(0)
        output(all, all2, all3, all4, all5)
        // Keep what sits between the last =" and the closing quote, so an "=" inside the value is harmless.
        def name = all.substring(all.lastIndexOf("=\"") + 2, all.length() - 1)
        def author = all3.substring(all3.lastIndexOf("=\"") + 2, all3.length() - 1)
        def intro = all2.substring(all2.lastIndexOf("=\"") + 2, all2.length() - 1)
        // Cut after "url=" rather than the last "=", which would truncate targets that carry a query string.
        def url1 = all4 == "" ? "" : all4.substring(all4.indexOf("url=") + 4, all4.length() - 1)
        def url2 = all5 == "" ? "" : all5.substring(all5.indexOf("url=") + 4, all5.length() - 1)
        output(name, author, intro, url1, url2)
        // Values are spliced straight into the SQL text, so a double quote or backslash in any field breaks the statement.
        def sql = String.format("INSERT INTO books (name,author,intro,urlc,urlb,bookid) VALUES (\"%s\",\"%s\",\"%s\",\"%s\",\"%s\",%d)", name, author, intro, url1, url2, id)
        MySqlTest.sendWork(sql)
    }
}
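
The script matches each attribute whole and then trims the match with substring. As an alternative, a capture group can return the value directly. The sketch below uses only java.util.regex and is not the author's Regex helper. The class name BookPageParser, the sample HTML, and the example.invalid host are made up for illustration, and the CTFILE pattern is a simplification that keys on the url= parameter instead of the withheld redirect host. The escape \u4f5c\u8005\uff1a spells the same "作者：" label the script searches for.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class BookPageParser {
    // Same anchor text as the script, with a capture group around the value.
    static final Pattern TITLE = Pattern.compile("class=\"bookpic\"> <img title=\"(.*?)\"");
    // The escaped characters are the site's "Author:" label, kept ASCII-only here.
    static final Pattern AUTHOR = Pattern.compile("title=\"\u4f5c\u8005\uff1a(.*?)\"");
    // Assumption: match on the url= parameter only; the real redirect host is not known.
    static final Pattern CTFILE = Pattern.compile("url=(https?://[^\"]*?\\.ctfile\\.com/[^\"]*)\"");

    // Returns the first captured value, or an empty Optional when the page lacks the field.
    static Optional<String> first(Pattern pattern, String html) {
        Matcher m = pattern.matcher(html);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        // Hypothetical page fragment shaped like the markup the script expects.
        String html = "<div class=\"bookpic\"> <img title=\"Sample Book\" src=\"x.jpg\"></div>"
                + "<a title=\"\u4f5c\u8005\uff1aJane Doe\">Jane Doe</a>"
                + "<a href=\"https://example.invalid/go.html?url=https://abc.ctfile.com/fs/1?k=v\">download</a>";
        System.out.println(first(TITLE, html).orElse(""));
        System.out.println(first(AUTHOR, html).orElse(""));
        System.out.println(first(CTFILE, html).orElse(""));
    }
}
```

Returning an Optional makes the "field not on this page" case explicit, so optional download links and required fields can share one helper. The author pattern here also drops the label prefix, which the original substring logic keeps.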

The author expresses satisfaction with the script’s performance and shows a screenshot of the resulting database entries.
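
The script builds its INSERT statement with String.format, so field values become part of the SQL text. A minimal sketch of a parameterized alternative with plain JDBC follows. It assumes a java.sql.Connection to the same books table is available. The original goes through the author's MySqlTest.sendWork helper, whose internals are not shown, so the BookWriter class and its insert method are hypothetical names.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

final class BookWriter {
    // Same columns as the script's INSERT, with placeholders instead of quoted values.
    static final String INSERT_SQL =
            "INSERT INTO books (name, author, intro, urlc, urlb, bookid) VALUES (?, ?, ?, ?, ?, ?)";

    // Binds one book's fields and runs the insert; returns the number of rows written.
    static int insert(Connection conn, String name, String author, String intro,
                      String urlc, String urlb, int bookId) throws SQLException {
        try (PreparedStatement ps = conn.prepareStatement(INSERT_SQL)) {
            ps.setString(1, name);
            ps.setString(2, author);
            ps.setString(3, intro);
            ps.setString(4, urlc);
            ps.setString(5, urlb);
            ps.setInt(6, bookId);
            return ps.executeUpdate();
        }
    }
}
```

Because the driver sends the values separately from the statement text, a quote character in a title or synopsis is stored as data and does not end the string literal early.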

Readers can reply with the keyword “电子书” to the public account to receive the website address and a CSV file containing the collected data.

