How to Build a Java HttpClient Spider for Scraping Movie Details and Download Links
This article explains how to update and use a Java HttpClient‑based spider that removes duplicate links, handles legacy page formats, extracts movie metadata and download URLs (magnet, ed2k, Baidu Pan), and stores the results in a MySQL database, with complete source code examples.
The update focuses on fixing two main issues: old‑page download links that appear as Thunder or FTP URLs, and duplicate entries caused by recommendation lists on each page. Two new helper methods were added to handle these cases.
Spider Entry Point
The spider(int pa) method obtains a list of page URLs via getPage(pa), removes a predefined set of URLs (the abc array), and then de‑duplicates the remaining URLs using a HashSet. Each unique URL is processed in a loop where getMovieInfo(p) is called, followed by a random sleep to avoid throttling.
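The filtering and de-duplication step can be sketched roughly as follows. This is a minimal illustration, not the article's code: the class name DedupSketch, the method uniqueUrls, and the example.invalid URL standing in for the abc array are all hypothetical. It also swaps the HashSet described above for a LinkedHashSet so that pages are visited in their original order.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class DedupSketch {
    // Hypothetical stand-in for the article's "abc" array of URLs to skip.
    public static final List<String> EXCLUDED = Arrays.asList(
            "http://example.invalid/ys/20160101/1.htm");

    /** Drops duplicates (keeping first-seen order), then removes the excluded URLs. */
    public static List<String> uniqueUrls(List<String> pageUrls) {
        Set<String> unique = new LinkedHashSet<>(pageUrls);
        unique.removeAll(EXCLUDED);
        return new ArrayList<>(unique);
    }

    public static void main(String[] args) {
        List<String> raw = Arrays.asList(
                "http://example.invalid/ys/20160102/2.htm",
                "http://example.invalid/ys/20160101/1.htm",
                "http://example.invalid/ys/20160102/2.htm");
        // Each remaining URL would be handed to getMovieInfo in the real spider.
        for (String url : uniqueUrls(raw)) {
            System.out.println(url);
        }
    }
}
```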
Fetching Page Lists
The overloaded getPage methods accept either an int page number or a raw HTML string. For the numeric version, the URL is built as http://www.***.net/ys/index_ + page + .htm, with a special case for page 1. The method performs an HTTP GET using a pre‑configured HttpGet, parses the JSON response to obtain the content field, converts it to UTF‑8, and extracts all detail page links with the regular expression "http://www.***.net/ys/\d+/\d+.htm".
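A rough sketch of the URL construction and link extraction is shown below. The site host is elided in the article, so the base URL is passed in as a parameter and example.invalid is only a placeholder. The page-1 path (index.htm) is an assumption, since the article only says page 1 is a special case. The sketch also escapes the dot before htm, which the pattern quoted above leaves unescaped.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PageListSketch {
    /** Builds the list-page URL; the page-1 form is an assumed special case. */
    public static String listUrl(String base, int page) {
        return page == 1
                ? base + "/ys/index.htm"
                : base + "/ys/index_" + page + ".htm";
    }

    /** Collects every detail-page link of the form base/ys/<digits>/<digits>.htm. */
    public static List<String> detailLinks(String base, String html) {
        Pattern p = Pattern.compile(Pattern.quote(base) + "/ys/\\d+/\\d+\\.htm");
        Matcher m = p.matcher(html);
        List<String> links = new ArrayList<>();
        while (m.find()) {
            links.add(m.group());
        }
        return links;
    }

    public static void main(String[] args) {
        String base = "http://example.invalid"; // placeholder host
        String html = "<a href=\"" + base + "/ys/20160101/123.htm\">A</a>";
        System.out.println(listUrl(base, 2));
        System.out.println(detailLinks(base, html));
    }
}
```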
Extracting Movie Information
The core getMovieInfo(String url) method sends an HTTP request to the detail page, checks for a “content not found” message, and then converts the response to a UTF‑8 string. It locates the block that starts at the first ◎ marker and ends just before <hr, then extracts fields such as name, translated name, year, language, release date, score, length, and director using the helper getInfo with specific start strings (e.g., "片 名 ").
If the marker is absent, the method falls back to extracting the title from the <title> tag and other fields using alternative patterns (e.g., "片长: ", "上映日期: ", "导演: ", "语言: ").
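The two-path logic above (metadata block first, title tag as fallback) might look something like the sketch below. The class and method names are hypothetical, and returning null or an empty string when nothing is found is an assumption rather than the article's documented behavior. The ◎ character is written as the escape \u25CE so the source file stays encoding-neutral.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class InfoBlockSketch {
    private static final char MARKER = '\u25CE'; // the "◎" field marker

    /** Returns the text from the first marker up to the next "<hr", or null if no marker exists. */
    public static String metadataBlock(String html) {
        int start = html.indexOf(MARKER);
        if (start < 0) {
            return null; // caller should switch to the fallback patterns
        }
        int end = html.indexOf("<hr", start);
        return end < 0 ? html.substring(start) : html.substring(start, end);
    }

    /** Fallback for pages without the marker: take the text of the <title> tag. */
    public static String titleFallback(String html) {
        Matcher m = Pattern
                .compile("<title>(.*?)</title>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL)
                .matcher(html);
        return m.find() ? m.group(1).trim() : "";
    }

    public static void main(String[] args) {
        String legacy = "<html><title> Some Movie </title><body>no marker</body></html>";
        String block = metadataBlock(legacy);
        System.out.println(block == null ? titleFallback(legacy) : block);
    }
}
```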
Collecting Download Links
Three regular expressions are used to gather different download URLs from the page content:
Magnet links: magnet:.+?>
ED2K/FTP/Thunder links: ed2k:.+?>, ftp://.+?>, thunder://.+?>
Baidu Pan links: http(s)*://pan.baidu.com/.+?</td>
If no ED2K links are found, the method attempts to match FTP or Thunder URLs as fallbacks.
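As an illustration of the link collection and the ED2K fallback, here is a small sketch. It deliberately uses a different pattern style from the article: instead of a lazy match that ends at > or </td>, each pattern stops at the first quote, angle bracket, or whitespace, so the trailing delimiter never ends up in the result. That choice assumes links contain no raw spaces, which may not hold for every page. All class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LinkSketch {
    // Characters that end a link in this sketch: quotes, angle brackets, whitespace.
    private static final String BODY = "[^\"'<>\\s]+";
    private static final Pattern MAGNET = Pattern.compile("magnet:\\?" + BODY);
    private static final Pattern ED2K = Pattern.compile("ed2k://" + BODY);
    private static final Pattern LEGACY = Pattern.compile("(?:ftp|thunder)://" + BODY);
    private static final Pattern PAN = Pattern.compile("https?://pan\\.baidu\\.com/" + BODY);

    private static List<String> findAll(Pattern p, String html) {
        List<String> found = new ArrayList<>();
        Matcher m = p.matcher(html);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static List<String> magnets(String html) {
        return findAll(MAGNET, html);
    }

    public static List<String> panLinks(String html) {
        return findAll(PAN, html);
    }

    /** Prefers ED2K links; falls back to FTP or Thunder links on legacy pages. */
    public static List<String> ed2kOrLegacy(String html) {
        List<String> links = findAll(ED2K, html);
        return links.isEmpty() ? findAll(LEGACY, html) : links;
    }

    public static void main(String[] args) {
        String html = "<a href=\"magnet:?xt=urn:btih:abc\">m</a><td>ftp://host/file.mkv</td>";
        System.out.println(magnets(html));
        System.out.println(ed2kOrLegacy(html));
    }
}
```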
Storing Results in MySQL
The extracted fields and link collections are formatted into an INSERT statement:
INSERT INTO movie (name,tname,year,language,date,score,length,author,magnet,ed2k,pan) VALUES("%s","%s","%s","%s","%s","%s","%s","%s","%s","%s","%s");
The placeholders are replaced with the actual values, and the link lists are converted to strings with quotation marks stripped. The final SQL command is sent to the database via MySqlTest.sendWork(sql). Debug output prints the lengths of the link strings and the full SQL for verification.
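Formatting scraped text directly into SQL breaks as soon as a title contains a double quote, which is presumably why the article strips quotation marks. One alternative, shown here only as a sketch and not as what MySqlTest.sendWork does, is a JDBC PreparedStatement with bound parameters. The class InsertSketch and its methods are hypothetical, and the caller is assumed to supply an open java.sql.Connection to the MySQL database.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class InsertSketch {
    // Column order taken from the INSERT statement quoted in the article.
    public static final List<String> COLUMNS = Arrays.asList(
            "name", "tname", "year", "language", "date", "score",
            "length", "author", "magnet", "ed2k", "pan");

    /** Builds the INSERT with one "?" placeholder per column. */
    public static String insertSql() {
        String cols = String.join(",", COLUMNS);
        String marks = String.join(",", Collections.nCopies(COLUMNS.size(), "?"));
        return "INSERT INTO movie (" + cols + ") VALUES (" + marks + ")";
    }

    /** Binds the values in column order and runs the insert on the given connection. */
    public static void insert(Connection conn, List<String> values) throws SQLException {
        if (values.size() != COLUMNS.size()) {
            throw new IllegalArgumentException("expected " + COLUMNS.size() + " values");
        }
        try (PreparedStatement ps = conn.prepareStatement(insertSql())) {
            for (int i = 0; i < values.size(); i++) {
                ps.setString(i + 1, values.get(i)); // JDBC parameters are 1-based
            }
            ps.executeUpdate();
        }
    }
}
```

With bound parameters the driver handles quoting, so link strings can be stored unmodified.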
Utility Method
The getInfo(String text, String start) helper uses a regular expression to locate the first occurrence of start followed by any characters up to a '<' character, then trims the start marker and the trailing '<' to return the clean value.
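A helper with this behavior could be written along the following lines. This is a sketch under assumptions: it uses a capture group instead of trimming the match after the fact, quotes the start string so that any regex metacharacters in it are treated literally, and returns an empty string when the field is missing, which the article does not specify.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GetInfoSketch {
    /**
     * Returns the text between the first occurrence of {@code start}
     * and the next '<', trimmed; empty string if {@code start} is absent.
     */
    public static String getInfo(String text, String start) {
        Matcher m = Pattern
                .compile(Pattern.quote(start) + "(.*?)<", Pattern.DOTALL)
                .matcher(text);
        return m.find() ? m.group(1).trim() : "";
    }

    public static void main(String[] args) {
        String block = "\u25CEyear 2016<br />\u25CEscore 8.1/10<br />";
        System.out.println(getInfo(block, "year "));
        System.out.println(getInfo(block, "score "));
    }
}
```

In the spider, a call such as getInfo(block, "片 名 ") would play the same role for the title field.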
Running the Spider
A simple main method sets the default charset to GB2312, invokes spider(1) ten times, and calls testOver() to finalize the run. The example demonstrates how to batch‑process a range of list pages and store the harvested movie data.
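A driver loop of this kind, combined with the random pause mentioned for the spider method, might be sketched as below. The names DriverSketch, crawl, and randomPauseMillis are hypothetical, the page handler is passed in as an IntConsumer so the sketch stays independent of the real spider method, and the pause bounds are arbitrary example values.

```java
import java.util.concurrent.ThreadLocalRandom;
import java.util.function.IntConsumer;

public class DriverSketch {
    /** Picks a pause in [min, max) milliseconds; returns min when the range is empty. */
    public static long randomPauseMillis(long min, long max) {
        return min >= max ? min : ThreadLocalRandom.current().nextLong(min, max);
    }

    /** Calls the handler for each page in [firstPage, lastPage], pausing between pages. */
    public static void crawl(int firstPage, int lastPage, IntConsumer spider,
                             long minPause, long maxPause) {
        for (int page = firstPage; page <= lastPage; page++) {
            spider.accept(page);
            try {
                Thread.sleep(randomPauseMillis(minPause, maxPause));
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt and stop early
                return;
            }
        }
    }

    public static void main(String[] args) {
        // Stand-in handler; the real spider(int) would fetch and store each page.
        crawl(1, 3, page -> System.out.println("processing list page " + page), 100, 300);
    }
}
```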
Overall, the article provides a complete, reproducible example of a Java‑based web crawler that handles duplicate removal, legacy link formats, character‑encoding issues, and data persistence, making it a useful reference for developers building similar scraping tools.