Scrape Unlimited Novel Site and Auto‑Download Texts with Python
This tutorial explains how to use Python's requests, lxml, and fake_useragent libraries to crawl the free novel website "无限小说网", extract each novel's download link, and automatically download the corresponding text files.
1. Introduction
With the rise of online reading, many people prefer web novels, but most require payment. This tutorial shows how to crawl the free novel site "无限小说网" and directly download the text files.
2. Project Goal
Obtain the download link for a given novel and download the corresponding .txt file.
3. Preparation
Tools: PyCharm. Required libraries: requests, lxml, fake_useragent. Target URL pattern: https://www.555x.org/html/wuxiaxianxia/list_29_{page}.html where {page} is the page number.
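The URL pattern above can be expanded into concrete listing-page URLs with a small helper. This is a minimal sketch; the names `BASE_URL` and `build_page_urls` are illustrative and do not come from the original project.

```python
# URL pattern taken from the tutorial; {} is replaced by the page number.
BASE_URL = "https://www.555x.org/html/wuxiaxianxia/list_29_{}.html"


def build_page_urls(start: int, end: int) -> list[str]:
    """Return the listing-page URLs for pages start..end, inclusive."""
    if start < 1 or end < start:
        raise ValueError("expected 1 <= start <= end")
    return [BASE_URL.format(page) for page in range(start, end + 1)]


if __name__ == "__main__":
    for url in build_page_urls(1, 3):
        print(url)
```

Validating the range up front keeps a mistyped page number from silently producing an empty crawl.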
4. Implementation
4.1 Define spider class
import requests
from lxml import etree
from fake_useragent import UserAgent
import time


class xiaoshuo(object):
    def __init__(self):
        # Listing-page URL pattern from section 3; {} is the page number.
        self.url = "https://www.555x.org/html/wuxiaxianxia/list_29_{}.html"

    def main(self):
        pass


if __name__ == '__main__':
    spider = xiaoshuo()
    spider.main()
4.2 Random User‑Agent
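One way to build the randomized headers this step relies on is a standalone helper. The sketch below assumes `fake_useragent` is installed and falls back to a short hard-coded list if it is not; the helper name `random_headers` and the fallback strings are hypothetical, not part of the original project.

```python
import random

# Hypothetical fallback strings, used only when fake_useragent cannot be used.
FALLBACK_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0",
]


def random_headers() -> dict:
    """Return request headers carrying a randomly chosen User-Agent."""
    try:
        from fake_useragent import UserAgent
        agent = UserAgent().random  # .random is a property, not a method
    except Exception:
        # Library missing or its data unavailable: use the static list.
        agent = random.choice(FALLBACK_AGENTS)
    return {"User-Agent": agent}


if __name__ == "__main__":
    print(random_headers())
```

In a long-running spider it is cheaper to create the `UserAgent` object once and reuse it; the helper rebuilds it on each call only to stay self-contained.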
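ua = UserAgent()  # create once; each access to ua.random yields a User-Agent string
for i in range(1, 50):
    # Pick a fresh User-Agent on every iteration.
    self.headers = {'User-Agent': ua.random}
4.3 Request page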
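The request method in this step decodes every response body as UTF-8. If a page were ever served in another encoding such as GBK (an assumption, not something the source states), that call would raise `UnicodeDecodeError`. A hedged sketch of a more forgiving decoder, with the hypothetical name `decode_body`:

```python
def decode_body(raw: bytes, encodings=("utf-8", "gb18030")) -> str:
    """Decode raw bytes by trying each encoding in order.

    This is a heuristic: some short GBK byte sequences are also valid
    UTF-8, so a declared charset is more reliable when one is available.
    """
    for encoding in encodings:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    # Last resort: keep going with replacement characters instead of crashing.
    return raw.decode(encodings[0], errors="replace")


if __name__ == "__main__":
    print(decode_body("中文".encode("utf-8")))
    print(decode_body("中文".encode("gbk")))
```

Inside the spider this would replace `res.content.decode("utf-8")` with `decode_body(res.content)`.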
def get_page(self, url):
    res = requests.get(url=url, headers=self.headers)
    html = res.content.decode("utf-8")
    return html
4.4 Parse first‑level page with XPath
Use browser developer tools to locate the second‑level page URL, then the third‑level download button URL.
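Once an XPath has been found in the developer tools, extracting links with lxml follows the pattern sketched below. The HTML sample, the class name `booklist`, and the helper `extract_links` are all made up for illustration; the real site's markup has to be inspected to find the actual expressions.

```python
from lxml import etree

# Hypothetical markup standing in for a listing page.
SAMPLE_HTML = """
<html><body>
  <div class="booklist">
    <ul>
      <li><a href="/txt/1001.html">Novel A</a></li>
      <li><a href="/txt/1002.html">Novel B</a></li>
    </ul>
  </div>
</body></html>
"""


def extract_links(html: str, xpath: str) -> list[str]:
    """Parse HTML and return the stripped string results of an XPath query."""
    tree = etree.HTML(html)
    return [href.strip() for href in tree.xpath(xpath)]


if __name__ == "__main__":
    links = extract_links(SAMPLE_HTML, '//div[@class="booklist"]//li/a/@href')
    print(links)  # ['/txt/1001.html', '/txt/1002.html']
```

The same helper can be reused at each level: first for the second-level page links, then for the download button on each of those pages.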
4.5 Parse third‑level page and extract title and download link
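# `three` holds the nodes selected from the third-level page
# (its definition is not shown in this excerpt).
for rd in three:
    # b: novel title
    b = rd.xpath('..//div[@class="shutou"]//b/text()')[0].strip()
    # tress: download link
    tress = rd.xpath('..//div[@class="shuji"]//ul/li/a/@href')[0].strip()
    # print(tress)
4.6 Save result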
read = '''《%s》 download link: %s ''' % (b, tress)
print(read)
4.7 Execute workflow
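The two calls in this step would normally run once per listing page. A minimal sketch of that loop, assuming the fetch and parse steps are passed in as callables; the function name `crawl` and the `delay` parameter are illustrative and not part of the original code.

```python
import time
from typing import Callable

# URL pattern taken from the tutorial; {} is replaced by the page number.
URL_TEMPLATE = "https://www.555x.org/html/wuxiaxianxia/list_29_{}.html"


def crawl(start: int, end: int,
          fetch: Callable[[str], str],
          parse: Callable[[str], None],
          delay: float = 1.0) -> int:
    """Fetch and parse pages start..end inclusive; return the page count."""
    pages_done = 0
    for page in range(start, end + 1):
        html = fetch(URL_TEMPLATE.format(page))
        parse(html)
        pages_done += 1
        if page < end:
            time.sleep(delay)  # pause between pages to limit server load
    return pages_done
```

With the spider class from this tutorial, a call might look like `crawl(1, 3, spider.get_page, spider.parse_page)`.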
html = self.get_page(url)
self.parse_page(html)
5. Result
Run the script, input start and end pages, and the console displays each novel’s title and download URL. Clicking the link downloads the text file, which can be opened locally.
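The code in this tutorial prints the download links but stops short of saving the files. A hedged sketch of that last step follows, assuming each link is an absolute URL that returns the text file directly; the helpers `safe_filename` and `download_txt` and the `novels` directory are hypothetical.

```python
import re
from pathlib import Path

import requests


def safe_filename(title: str) -> str:
    """Turn a novel title into a file name without path-unsafe characters."""
    cleaned = re.sub(r'[\\/:*?"<>|\s]+', "_", title).strip("_")
    return (cleaned or "untitled") + ".txt"


def download_txt(title: str, url: str, directory: str = "novels",
                 timeout: float = 10.0) -> Path:
    """Stream the file at url into directory and return the saved path."""
    target = Path(directory) / safe_filename(title)
    target.parent.mkdir(parents=True, exist_ok=True)
    with requests.get(url, stream=True, timeout=timeout) as res:
        res.raise_for_status()  # fail loudly on 4xx/5xx responses
        with open(target, "wb") as fh:
            for chunk in res.iter_content(chunk_size=8192):
                fh.write(chunk)
    return target
```

If the extracted link turns out to be relative, it would first need to be joined with the site's base URL, for example with `urllib.parse.urljoin`.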
6. Conclusion
Avoid excessive crawling to reduce server load.
The project demonstrates how to use Python’s requests, lxml and fake_useragent to fetch novel download links.
Hands‑on practice helps deepen understanding of web scraping.
Python Crawling & Data Mining
