Why Does My Python Scraper Save Only the Last Page? Fix It Now
This article walks through a common Python web‑scraping issue where data from multiple pages is overwritten, explains the root cause, and provides a complete, runnable script using requests, parsel, and openpyxl to correctly collect and store each page's information.
Preface
In a Python community chat a member asked why their crawler only saved data from the last page when scraping multiple pages. The problem was reproduced and the original code was shared.
Implementation
The following script demonstrates a working solution. It fetches each page, parses the food items, and writes the results to an Excel file.
# encoding: UTF-8
# create time: 2024/05/30/0030 16:26:03
import time
from urllib.parse import urljoin

import openpyxl
import parsel
import requests

BASE_URL = 'https://www.xiachufang.com'
HEADERS = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36 Edg/125.0.0.0'}


def get_page(page: int):
    """Send request and get page data.

    :param page: page number
    :return: the response on success, otherwise None
    """
    url = f"https://www.xiachufang.com/category/40071/?page={page}"
    try:
        # Without a timeout, requests waits indefinitely and never raises a timeout error.
        response = requests.get(url, headers=HEADERS, timeout=10)
    except requests.RequestException as e:
        print("Request failed:", e)
        return None
    if response.status_code != 200:
        print("Request failed, status code: " + str(response.status_code))
        return None
    return response


def parse_page(response):
    """Parse one page and return a list of [dish name, link] rows."""
    rows = []
    html = parsel.Selector(response.text)
    for food in html.css('div.info p.name'):
        # Slice offsets are kept from the original script; strip() removes leftover whitespace.
        name = food.css('a::text').get(default='')[16:-14].strip()
        link = urljoin(BASE_URL, food.css('a::attr(href)').get(default=''))
        rows.append([name, link])
    return rows


def save1(rows):
    """Write every collected row to data.txt, one row per line."""
    with open('data.txt', 'w', encoding='utf-8') as f:
        for row in rows:
            f.write(str(row) + '\n')


def save_data(rows):
    """Write every collected row to a single Excel workbook."""
    wb = openpyxl.Workbook()
    ws = wb.active
    ws.append(['ID', 'Dish Name', 'Link'])
    for row in rows:
        ws.append(row)
    wb.save('xiachufang_breakfast.xlsx')


def main() -> None:
    total_pages = 3
    all_rows = []  # rows from every page accumulate here
    for current_page in range(1, total_pages + 1):
        print("Current page:\t" + str(current_page))
        response = get_page(current_page)
        if response is not None:
            for name, link in parse_page(response):
                # Number rows across all pages instead of restarting on each page.
                all_rows.append([len(all_rows) + 1, name, link])
        time.sleep(2)
    # Save once, after the loop, so no page overwrites another.
    save1(all_rows)
    save_data(all_rows)


if __name__ == '__main__':
    main()

The original script saved only the last page because it saved inside the page loop: every call reopened data.txt in 'w' mode and created a new workbook, so each page replaced the output of the one before it. The revised version collects the rows from all pages in a single list, writes them to the text file and the Excel workbook once after the loop, and prints progress for each page.
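The overwrite behaviour can be reproduced without any network access. The sketch below is illustrative only: `save_page` and the `pages` data are hypothetical stand-ins for the scraper's save function and its parsed results, not part of the original script. It compares saving inside the loop with collecting first and saving once.

```python
import os
import tempfile


def save_page(path, rows):
    """Write rows to path in 'w' mode, which truncates the file first."""
    with open(path, 'w', encoding='utf-8') as f:
        for row in rows:
            f.write(row + '\n')


# Hypothetical stand-in for the rows parsed from three pages.
pages = [['p1-a', 'p1-b'], ['p2-a', 'p2-b'], ['p3-a', 'p3-b']]

with tempfile.TemporaryDirectory() as tmp:
    # Pattern 1: save inside the loop. Each call replaces the previous page.
    overwritten_path = os.path.join(tmp, 'overwritten.txt')
    for rows in pages:
        save_page(overwritten_path, rows)
    with open(overwritten_path, encoding='utf-8') as f:
        overwritten = f.read().splitlines()

    # Pattern 2: collect every page first, then save once after the loop.
    collected_path = os.path.join(tmp, 'collected.txt')
    all_rows = []
    for rows in pages:
        all_rows.extend(rows)
    save_page(collected_path, all_rows)
    with open(collected_path, encoding='utf-8') as f:
        collected = f.read().splitlines()

print(overwritten)     # ['p3-a', 'p3-b']
print(len(collected))  # 6
```

Only the second pattern keeps all six rows; the first leaves just the rows from the final page, which matches the symptom described in the question.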
Conclusion
The article demonstrates how to diagnose a missing‑page problem in a Python web scraper, presents a complete, functional script, and highlights best practices such as proper file handling and incremental data storage.
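Incremental storage is an alternative to collecting everything in memory: each page is appended to the output file as soon as it is parsed. The sketch below is one possible way to do that, assuming CSV output from the standard library instead of the Excel workbook used in the script; `append_rows`, `read_rows`, and the example.com links are hypothetical names and data chosen for illustration.

```python
import csv
import os
import tempfile


def append_rows(path, rows, header):
    """Append rows to a CSV file, writing the header only for a new or empty file."""
    is_new = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, 'a', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        if is_new:
            writer.writerow(header)
        writer.writerows(rows)


def read_rows(path):
    """Read the whole CSV file back as a list of rows."""
    with open(path, newline='', encoding='utf-8') as f:
        return list(csv.reader(f))


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'dishes.csv')
    header = ['ID', 'Dish Name', 'Link']
    # One call per page; the second call adds to the file instead of replacing it.
    append_rows(path, [[1, 'dish-a', 'https://example.com/a']], header)
    append_rows(path, [[2, 'dish-b', 'https://example.com/b']], header)
    stored = read_rows(path)

print(stored)
```

The trade-off is that append mode also keeps rows from earlier runs, so the output file has to be deleted or renamed before re-running the scraper to avoid duplicates.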
Python Crawling & Data Mining