How to Efficiently Scrape Novel Rankings with Python: De‑duplication and Speed Tips
This guide explains how to extract novel titles and links from a structured ranking website, remove duplicate entries using a set, handle HTML tags, and improve crawling speed with multithreading or the Scrapy framework, all while keeping the code modular and reusable.
1. Goal
Starting from the ranking page (http://www.qu.la/paihangbang/), collect the name of every listed novel together with its link on the site.
2. Observe the Webpage Structure
Each category is wrapped in a consistent HTML block, making it easy for a crawler to locate and collect all novel links, then store them in a list.
3. Small Trick for List De‑duplication
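The same novel can appear in several categories, so a naive crawl would download it more than once, which wastes time and bandwidth on large jobs. A single line of code solves this: unique_list = list(set(original_list)) A set cannot hold duplicate elements, so converting the list to a set and back drops the repeats. Note that a set does not preserve the original order of the list.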
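When the ranking order should be kept, an order-preserving variant is a small step further. The sketch below is only an illustration: the helper name dedupe_keep_order and the sample links are made up for this example and are not taken from the original code.

```python
def dedupe_keep_order(items):
    """Return a new list without duplicates, keeping first-seen order."""
    # dict keys are unique and remember insertion order (Python 3.7+).
    return list(dict.fromkeys(items))


# Hypothetical links: the same novel shows up in two ranking categories.
original_list = ["/book/1/", "/book/2/", "/book/1/", "/book/3/"]

unique_set = set(original_list)                  # unique, but unordered
unique_list = dedupe_keep_order(original_list)   # unique, original order kept
print(unique_list)
```

The plain set is enough when order does not matter; the dict-based version costs the same single pass and keeps the list in page order.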
4. Code Implementation
Modular, functional programming is recommended: write each independent feature as a separate function for simplicity and reusability.
1. Web Scraping Header
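A minimal sketch of such a request helper follows. It assumes the standard library's urllib rather than any particular third-party client, and the User-Agent value and the helper names build_request and decode_body are placeholders chosen for this example. Only get_html is a name that appears elsewhere in this article.

```python
import urllib.request

# Assumed header value: any browser-like User-Agent string will do.
HEADERS = {"User-Agent": "Mozilla/5.0 (compatible; novel-ranking-crawler)"}


def build_request(url):
    """Attach the browser-like headers to a request for `url`."""
    return urllib.request.Request(url, headers=HEADERS)


def decode_body(raw, charset=None):
    """Decode response bytes, falling back to UTF-8 with replacement."""
    try:
        return raw.decode(charset or "utf-8")
    except (LookupError, UnicodeDecodeError):
        return raw.decode("utf-8", errors="replace")


def get_html(url, timeout=10):
    """Return the page text, or an empty string if the request fails."""
    try:
        with urllib.request.urlopen(build_request(url), timeout=timeout) as resp:
            charset = resp.headers.get_content_charset()
            return decode_body(resp.read(), charset)
    except OSError:  # URLError and socket timeouts are OSError subclasses
        return ""
```

Returning an empty string on failure keeps the calling code simple, at the price of hiding the reason for the failure; logging the exception would be a reasonable addition.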
2. Get Ranking Novels and Their Links
Iterate over every ranking category on the page. For each novel found, write its name and link to a file in order and append the link to a list, then return that list of novel URLs.
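This step could be sketched as follows, using the standard library's HTML parser. The markup in sample is an invented stand-in for the real ranking page, whose exact tags and class names are not given here, and LinkCollector and get_ranking_links are hypothetical names. The sketch returns (title, URL) pairs and leaves writing them to a file to the caller.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect (text, href) pairs for every <a href=...> in a document."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # href of the <a> currently open, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append(("".join(self._text).strip(), self._href))
            self._href = None


def get_ranking_links(html, base_url):
    """Return unique (title, absolute_url) pairs in page order."""
    parser = LinkCollector()
    parser.feed(html)
    seen, result = set(), []
    for title, href in parser.links:
        url = urljoin(base_url, href)
        if title and url not in seen:   # skip empty anchors and repeats
            seen.add(url)
            result.append((title, url))
    return result


# Hypothetical markup standing in for two ranking categories.
sample = """
<div><ul><li><a href="/book/1/">Novel A</a></li>
<li><a href="/book/2/">Novel B</a></li></ul></div>
<div><ul><li><a href="/book/1/">Novel A</a></li></ul></div>
"""
ranking = get_ranking_links(sample, "http://www.qu.la/paihangbang/")
```

On the real page the parser would likely need to be restricted to the ranking blocks, since this version collects every link, including navigation.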
3. Get All Chapter Links of a Single Novel
Obtain each chapter's URL and create a novel file.
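A rough sketch of this step is shown below. It assumes that chapter links are relative hrefs ending in ".html" on the novel's index page, which is a guess about the site rather than something confirmed here. The function names and the sample markup are likewise hypothetical.

```python
import re
import tempfile
from pathlib import Path
from urllib.parse import urljoin

# Assumption: chapter links are hrefs ending in ".html".
CHAPTER_HREF = re.compile(r'<a\s[^>]*href="([^"]+\.html)"', re.IGNORECASE)


def get_chapter_links(index_html, novel_url):
    """Return absolute chapter URLs in page order, without duplicates."""
    hrefs = CHAPTER_HREF.findall(index_html)
    return list(dict.fromkeys(urljoin(novel_url, h) for h in hrefs))


def create_novel_file(title, out_dir="."):
    """Create an empty '<title>.txt' file and return its path."""
    path = Path(out_dir) / f"{title}.txt"
    path.write_text("", encoding="utf-8")
    return path


# Hypothetical index page in which the first chapter is listed twice.
sample_index = """
<dd><a href="1001.html">Chapter 1</a></dd>
<dd><a href="1002.html">Chapter 2</a></dd>
<dd><a href="1001.html">Chapter 1</a></dd>
"""
chapters = get_chapter_links(sample_index, "http://www.qu.la/book/1/")

with tempfile.TemporaryDirectory() as tmp:
    novel_path = create_novel_file("Novel A", tmp)
    created = novel_path.exists()
```

A regular expression is used here only for brevity; a proper HTML parser is more robust when the markup varies.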
4. Get Single Page Content and Save Locally
Downloaded pages often still contain formatting tags such as <br/>. A simple replacement removes them: html = get_html(url).replace('<br/>', '\n') This turns each <br/> into a newline, so paragraph breaks survive in the saved text.
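A slightly more tolerant version of the same idea is sketched below. It also accepts the <br> and <br /> spellings, strips any remaining tags, and appends the cleaned text to the novel file. The function names and the sample page, including its id="content" wrapper, are assumptions made for this example.

```python
import html as html_lib
import re
import tempfile
from pathlib import Path


def clean_chapter_html(page):
    """Turn <br> variants into newlines, then drop the remaining tags."""
    text = re.sub(r"<br\s*/?>", "\n", page, flags=re.IGNORECASE)
    text = re.sub(r"<[^>]+>", "", text)        # remove any other tags
    return html_lib.unescape(text).strip()     # decode entities like &amp;


def append_chapter(path, text):
    """Append one chapter to the novel file, followed by a blank line."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(text + "\n\n")


# Hypothetical chapter markup.
page = '<div id="content">First line.<br/>Second line.<br />Tom &amp; Jerry</div>'
cleaned = clean_chapter_html(page)

with tempfile.TemporaryDirectory() as tmp:
    novel_file = Path(tmp) / "novel.txt"
    append_chapter(novel_file, cleaned)
    saved = novel_file.read_text(encoding="utf-8")
```

Opening the file in append mode lets each chapter be written as soon as it is downloaded, so a crash partway through does not lose earlier chapters.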
5. Drawbacks
The crawler works smoothly because the target site lacks anti‑scraping measures and has a clean, well‑structured layout. However, crawling an entire novel (≈1000 pages) takes about 8.5 minutes, and all 60 novels in the ranking would require roughly 8.5 hours with a single‑threaded approach.
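To boost speed, consider writing a multithreaded module or, better yet, using the Scrapy framework. Because almost all of the time is spent waiting on the network, fetching pages concurrently can speed up crawling by dozens or even hundreds of times, although the real gain depends on the target site and on how many simultaneous requests it tolerates.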
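The multithreaded option can be sketched with the standard library's thread pool. In this illustration fetch_all is a hypothetical helper and fake_fetch stands in for a real download function, so that the example runs without network access.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fetch_all(urls, fetch, max_workers=8):
    """Call fetch(url) for every URL concurrently.

    Results are returned in the same order as `urls`.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch, urls))


def fake_fetch(url):
    """Stand-in for a real downloader: sleep briefly, then return text."""
    time.sleep(0.05)   # simulated network delay
    return f"content of {url}"


# Hypothetical chapter URLs.
urls = [f"http://www.qu.la/book/1/{n}.html" for n in range(1001, 1011)]
pages = fetch_all(urls, fake_fetch, max_workers=5)
```

Threads help here because each worker spends most of its time waiting for a response. Keeping max_workers modest avoids overloading the target site.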
6. Main Function
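One way to wire the steps together is sketched below. To keep the example self-contained, the individual steps are passed in as callables and replaced by small stubs. The name crawl_ranking, its parameters, and the stub data are all hypothetical.

```python
def crawl_ranking(ranking_url, fetch, parse_ranking, download_novel):
    """Fetch the ranking page, de-duplicate the novels, download each one.

    fetch(url) -> str, parse_ranking(html) -> iterable of (title, url),
    download_novel(title, url) -> None. Returns the number of novels handled.
    """
    page = fetch(ranking_url)
    novels = list(dict.fromkeys(parse_ranking(page)))   # drop repeats, keep order
    for title, url in novels:
        download_novel(title, url)
    return len(novels)


# Stubs standing in for the real network and parsing steps.
downloaded = []


def stub_fetch(url):
    return "<html>ranking page</html>"


def stub_parse(page):
    return [("Novel A", "/book/1/"), ("Novel B", "/book/2/"), ("Novel A", "/book/1/")]


def stub_download(title, url):
    downloaded.append((title, url))


count = crawl_ranking("http://www.qu.la/paihangbang/", stub_fetch, stub_parse, stub_download)
```

Passing the steps in as arguments keeps the main function short and makes it possible to test the control flow without touching the network.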
7. Output Results
MaGe Linux Operations
