Automate Daily Report Downloads with a Python Web Scraper – Full Code Explained
This article walks through building a Python web scraper that automatically logs in, fetches daily free reports from a registration‑required site, saves each as a PDF with appropriate filenames, and includes complete source code with detailed explanations for handling retries, cookies, and file management.
In a Python community, the author was asked to create a script that automatically collects and downloads the free reports posted each day on a website that requires phone‑number registration, saving each report as a PDF named after its title.
1. Introduction
The task initially looked as though it would require reverse engineering, but the site turned out to be straightforward, so the scraper came together quickly.
2. Implementation
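The core code uses requests for HTTP calls, parsel.Selector for HTML parsing, and the Windows registry (through winreg) to locate the desktop folder. Every request sends the same headers and cookies, which appear only as placeholders in the listing. Downloads are retried up to 10 times, and exceptions are printed to the console with the failing line number. The script iterates over the first two listing pages, extracts each report's title and download link, and saves the files to a folder named “今日研报” (“today's research reports”) on the desktop, skipping files it has already downloaded.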
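The retry behaviour can be looked at in isolation before reading the full listing. The following is a minimal sketch, not the author's code: fetch_with_retries, its parameters, and the injected fetch and sleep callables are hypothetical names introduced here so the pattern can run without network access. It assumes the same limits as the script (up to 10 attempts, a 5-second pause after an exception).

```python
import time
from typing import Callable, Optional


def fetch_with_retries(
    fetch: Callable[[], bytes],
    max_attempts: int = 10,
    delay_seconds: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[bytes]:
    """Call fetch() until it succeeds or max_attempts is reached.

    Hypothetical helper: `fetch` stands in for the real HTTP call and is
    expected to raise an exception when the download fails.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch()
        except Exception as exc:
            print(f"attempt {attempt}/{max_attempts} failed: {exc!r}")
            # Pause before the next attempt, but not after the last one.
            if attempt < max_attempts:
                sleep(delay_seconds)
    # Every attempt failed; the caller decides how to report the skip.
    return None
```

Passing the sleep function as a parameter is a design choice for this sketch only: it lets the loop be exercised quickly with a no-op sleep, while real use keeps the default time.sleep.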
# -*- coding:utf-8 -*-
"""
Development environment: Python 3.86
Script name: 2023-07-07 报告厅baogaoting
Created: 2023-07-07
"""
import datetime
import os, sys, time, traceback
import pathlib
from parsel import Selector
import requests

# Placeholders from the original article: requests expects dicts here.
headers = {"xxx"}
cookies = {"xxx"}


def _down_file(dow_url, file_name):
    """
    :param dow_url: download link
    :param file_name: absolute path of the output file, including the extension
    :return:
    """
    re_重试次数 = 0  # retry counter
    while True:
        try:
            response = requests.get(dow_url, headers=headers, cookies=cookies, timeout=10)
            if response.status_code == 200:
                break
            elif response.status_code == 404:
                # The file does not exist, so retrying will not help.
                break
            else:
                print(response.status_code)
                re_重试次数 += 1
        except Exception as e:
            adress = sys.exc_info()[-1]
            line_error = traceback.extract_tb(adress, limit=1)[-1][1]
            print(
                "===================\n"
                f"【Exception reason】: {e}\n"
                f"【Exception type】: {type(e)}\n"
                f"【Exception line】: {line_error}\n"
                "==================="
            )
            time.sleep(5)
            re_重试次数 += 1
        if re_重试次数 >= 10:
            response = ""
            break
    # A requests Response is truthy only for status codes below 400,
    # so a 404 response and the empty string both take the else branch.
    if response:
        with open(file_name, "wb+") as f:
            f.write(response.content)
    else:
        print(f"【Skipped】: download failed {file_name}")


import winreg  # Windows-only standard-library module


def get_desktop():
    # Read the current user's desktop path from the registry.
    key = winreg.OpenKey(winreg.HKEY_CURRENT_USER, r'Software\Microsoft\Windows\CurrentVersion\Explorer\Shell Folders')
    return winreg.QueryValueEx(key, "Desktop")[0]


if __name__ == '__main__':
    zm_path = get_desktop()
    x_date = time.strftime("%Y-%m-%d")
    # Names of files that already exist, used for the duplicate check below.
    list_file_path = list(pathlib.Path(x_date).rglob("*"))
    dict_file_path = {i.name: True for i in list_file_path}
    url = "https://www.baogaoting.com/space/30909237"
    for page in range(1, 3):
        params = {"page": page, "size": "15"}
        response = requests.get(url, headers=headers, cookies=cookies, params=params)
        # ... (omitted for brevity)
        # The lines below use title, h3, date and href_download,
        # which are set by the omitted parsing code.
        print(f"【{title}】: popularity {h3}, uploaded today, preparing to download {href_download}")
        if href_download:
            if not os.path.exists(f"{zm_path}//今日研报"):
                os.makedirs(f"{zm_path}//今日研报", exist_ok=True)
            title = title + pathlib.Path(href_download).suffix
            # Strip characters that are not allowed in Windows file names.
            for k in ["<", ">", "|", '"', "*", '\\', ":", "/", "?", "\n", "\r", "\t", "!", "☆"]:
                date = date.replace(k, '').strip()
            if dict_file_path.get(f"{date}_{title}"):
                print("【Status】: already downloaded, skipping automatically")
                continue
            else:
                _down_file(href_download, f"{zm_path}//今日研报//{title}")
        else:
            print(f"【Status】: link {href_download} {title} is invalid, skipping download")
        time.sleep(0.5)
    input("【Status】: finished, press Enter to exit")
3. Conclusion
The script successfully automates the daily download of research reports, eliminating the need to manually locate and save each file.
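Re-running the script every day only stays cheap if files that already exist are skipped. The following is a minimal sketch of that check, assuming the target folder itself is treated as the record of what has been downloaded. safe_filename and needs_download are hypothetical helpers, not functions from the script; the character list mirrors the one in the listing.

```python
import pathlib

# Characters removed before text is used in a file name
# (same set as the list in the article's listing).
ILLEGAL_CHARS = ["<", ">", "|", '"', "*", "\\", ":", "/", "?", "\n", "\r", "\t", "!", "☆"]


def safe_filename(title: str, suffix: str) -> str:
    """Hypothetical helper: strip illegal characters, then append the suffix."""
    for ch in ILLEGAL_CHARS:
        title = title.replace(ch, "")
    return title.strip() + suffix


def needs_download(folder: pathlib.Path, filename: str) -> bool:
    """Hypothetical helper: True when folder/filename does not exist yet."""
    folder.mkdir(parents=True, exist_ok=True)
    return not (folder / filename).exists()
```

With helpers like these, a download would run only when needs_download(folder, name) is true, and the file would be written to folder / name, so the name that is checked and the name that is saved are always the same.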
Python Crawling & Data Mining