Java/Groovy Web Crawler for Collecting Ebook Metadata and Storing It in MySQL
The article presents a Java/Groovy script that crawls a website offering curated e‑books, extracts book details such as title, author, and download links using HTTP requests and regular expressions, and then inserts the collected information into a MySQL database.
While searching for e‑books, the author discovered a site providing many curated electronic books and decided to write a crawler to gather their titles and download links, noting that the canvas background effects on the site heavily tax Chrome resources.
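The following Java/Groovy code is shared as a rough example of how to perform the crawling, parse the HTML with regular expressions, and store the results in a MySQL table. It depends on the author's own helper classes under com.fun (FanLibrary, Regex, MySqlTest), which are not shown here.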
package com.fun

import com.fun.db.mysql.MySqlTest
import com.fun.frame.httpclient.FanLibrary
import com.fun.utils.Regex
import org.slf4j.Logger
import org.slf4j.LoggerFactory

class T extends FanLibrary {

    static Logger logger = LoggerFactory.getLogger(T.class)

    public static void main(String[] args) {
        //test(322)
        // Walk book ids 1..1000, pausing 2 seconds after each request
        def list = 1..1000 as List
        list.each { x ->
            try {
                test(x)
            } catch (Exception e) {
                // A page missing an expected field lands here; log the id and move on
                logger.error(x.toString())
                output(e)
            }
            logger.warn(x.toString())
            sleep(2000)
        }
        testOver()
    }

    // **** stands for the website address
    static def test(int id) {
        def get = getHttpGet("https://****/books/" + id + ".html")
        def response = getHttpResponse(get)
        def string = response.getString("content")
        // Skip ids without a page: "the file you requested does not exist" / "page not found"
        if (string.contains("您需求的文件不存在") || string.contains("页面未找到")) return
        output(string)

        // Required fields: cover image title, synopsis ("内容简介") and author ("作者")
        def all = Regex.regexAll(string, "class=\"bookpic\"> <img title=\".*?\"").get(0)
        def all2 = Regex.regexAll(string, "content=\"内容简介.*?\"").get(0)
        def all3 = Regex.regexAll(string, "title=\"作者:.*?\"").get(0)
        // Optional download links (ctfile.com and pan.baidu.com) behind the site's go.html redirect
        def all40 = Regex.regexAll(string, "https://*******\\.cc/go\\.html\\?url=https{0,1}://.*?\\.ctfile\\.com/.*?\"")
        def all4 = all40.size() == 0 ? "" : all40.get(0)
        def all50 = Regex.regexAll(string, "https://******\\.cc/go\\.html\\?url=https{0,1}://pan\\.baidu\\.com/.*?\"")
        def all5 = all50.size() == 0 ? "" : all50.get(0)
        output(all, all2, all3, all4, all5)

        // Keep what follows the last '=' and drop the quotes; assumes the value itself contains no '='
        def name = all.substring(all.lastIndexOf("=") + 2, all.length() - 1)
        def author = all3.substring(all3.lastIndexOf("=") + 2, all3.length() - 1)
        def intro = all2.substring(all2.lastIndexOf("=") + 2, all2.length() - 1)
        def url1 = all4 == "" ? "" : all4.substring(all4.lastIndexOf("=") + 1, all4.length() - 1)
        def url2 = all5 == "" ? "" : all5.substring(all5.lastIndexOf("=") + 1, all5.length() - 1)
        output(name, author, intro, url1, url2)

        // Values are interpolated unescaped, so a double quote in any field breaks the statement
        def sql = String.format("INSERT INTO books (name,author,intro,urlc,urlb,bookid) VALUES (\"%s\",\"%s\",\"%s\",\"%s\",\"%s\",%d)", name, author, intro, url1, url2, id)
        MySqlTest.sendWork(sql)
    }
}

The author expresses satisfaction with the script’s performance and shows a screenshot of the resulting database entries.
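The script recovers each value by slicing the matched text at its last '=' character, which only works while the value itself contains no '='. As a minimal sketch of an alternative, the plain-Java class below wraps the wanted value in a capture group instead. BookFieldExtractor, firstGroup, the sample HTML, and the books.example host are hypothetical names used for illustration; the redirect host is left out of the URL pattern because the original redacts it.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class BookFieldExtractor {
    // Same anchor text as the script's title pattern, with the value in group 1.
    static final Pattern TITLE =
            Pattern.compile("class=\"bookpic\"> <img title=\"(.*?)\"");

    // Captures the target of the go.html redirect; the site host is not matched (assumption).
    static final Pattern CTFILE_URL =
            Pattern.compile("go\\.html\\?url=(https?://[^\"]*?\\.ctfile\\.com/[^\"]*)\"");

    // Returns group 1 of the first match, or empty when the page has no such field.
    static Optional<String> firstGroup(Pattern pattern, String html) {
        Matcher m = pattern.matcher(html);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        // Hypothetical page fragment; both values contain '=' on purpose.
        String html = "<div class=\"bookpic\"> <img title=\"A=B Primer\" src=\"c.jpg\"></div>"
                + "<a href=\"https://books.example/go.html?url=https://u1.ctfile.com/fs/1-2?p=3\">dl</a>";
        System.out.println(firstGroup(TITLE, html).orElse(""));
        System.out.println(firstGroup(CTFILE_URL, html).orElse(""));
    }
}
```

Run against the sample fragment, this prints "A=B Primer" and "https://u1.ctfile.com/fs/1-2?p=3", whereas slicing at the last '=' would keep only the text after it. Returning an Optional also lets a missing download link be handled without the separate size() check.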
Readers can reply with the keyword “电子书” ("e-book") to the public account to receive the website address and a CSV file containing the collected data.
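The script assembles its INSERT statement with String.format, so a double quote inside a title or synopsis would break the SQL. One possible way around that, sketched here in plain Java with the standard JDBC API rather than the author's MySqlTest helper, is a parameterized statement. BookWriter and insertBook are hypothetical names, and the caller is assumed to supply an open java.sql.Connection to the database that holds the books table.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

final class BookWriter {
    // Same table and column list as the script's INSERT, with placeholders for the values.
    static final String INSERT_SQL =
            "INSERT INTO books (name,author,intro,urlc,urlb,bookid) VALUES (?,?,?,?,?,?)";

    // Binds each field as a parameter, so quotes inside the text need no escaping.
    // The Connection is assumed to come from the caller (for example via DriverManager).
    static int insertBook(Connection conn, String name, String author, String intro,
                          String ctfileUrl, String baiduUrl, int bookId) throws SQLException {
        try (PreparedStatement ps = conn.prepareStatement(INSERT_SQL)) {
            ps.setString(1, name);
            ps.setString(2, author);
            ps.setString(3, intro);
            ps.setString(4, ctfileUrl);
            ps.setString(5, baiduUrl);
            ps.setInt(6, bookId);
            return ps.executeUpdate(); // number of rows inserted
        }
    }
}
```

The try-with-resources block closes the statement even when the insert fails, which matters in a loop that runs for a thousand pages.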
