How to Fix Common Python Web Scraping URL Errors and Build a Recipe Crawler
This article walks through a real-world Python web‑scraping issue where an extra slash in the URL caused failures, demonstrates debugging with try/except, shows a complete recipe‑crawling script using requests, lxml, and UserAgent, and explains proper URL joining with urllib.parse.urljoin.
Introduction
In a recent Python community discussion a user asked about a web‑scraping problem. The author shares the issue and a step‑by‑step solution to help readers learn how to troubleshoot similar errors.
Problem Description
The target page's URL was built with an extra /, so the request did not return the expected page. The XPath selectors then matched nothing, and indexing the empty result raised an index-out-of-range error.
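To make the failure concrete, here is a minimal sketch of how a stray slash can arise from plain string concatenation, and why the problem surfaces later as an index error. The recipe path /recipe/123/ is a made-up placeholder, and the empty list stands in for an XPath query that matched nothing.

```python
# Hypothetical values: the href is a placeholder, not a real recipe path.
base = "https://www.xiachufang.com/"   # base already ends with a slash
href = "/recipe/123/"                  # href taken from the page starts with one

bad_url = base + href
print(bad_url)  # https://www.xiachufang.com//recipe/123/

# If the response lacks the expected node, xpath() returns an empty list,
# and indexing that list with [0] raises IndexError.
matches = []  # stands in for an xpath(...) call that matched nothing
try:
    first = matches[0]
except IndexError as exc:
    error_message = str(exc)
    print("IndexError:", error_message)
```

The doubled slash is easy to miss in a log line, which is why the first visible symptom is often the IndexError rather than the bad URL itself.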
Solution Process
The fix is to remove the stray slash from the URL string and add a short sleep between requests to avoid being blocked. Below is the complete Python script that crawls recipes from https://www.xiachufang.com, extracts the step number, dish name, and ingredients, and saves the data to a text file.
import time

import requests
from fake_useragent import UserAgent
from lxml import etree


class kitchen(object):
    u = 0  # running count of recipes saved

    def __init__(self):
        # The {} placeholder is filled with the page number in main(). The
        # "?page=" query parameter is an assumption about the site's pagination;
        # the original URL had no placeholder, so every page fetched the same URL.
        self.url = "https://www.xiachufang.com/category/40076/?page={}"
        # The original called UserAgent(verify_ssl=False), which only older
        # fake_useragent releases accept.
        ua = UserAgent()
        self.headers = {'User-Agent': ua.random}

    def get_page(self, url):
        res = requests.get(url=url, headers=self.headers)
        html = res.content.decode("utf-8")
        time.sleep(2)  # pause between requests to avoid being blocked
        return html

    def parse_page(self, html):
        parse_html = etree.HTML(html)
        image_src_list = parse_html.xpath('//li/div/a/@href')
        for i in image_src_list:
            try:
                # The hrefs are expected to start with "/", so the base must not end with one.
                url = "https://www.xiachufang.com" + i
                html1 = self.get_page(url)
                parse_html1 = etree.HTML(html1)
                # [0] raises IndexError when the heading is missing from the page.
                num = parse_html1.xpath('.//h2[@id="steps"]/text()')[0].strip()
                # Collected as in the original, but not written to the file.
                name = parse_html1.xpath('.//li[@class="container"]/p/text()')
                ingredients = parse_html1.xpath('.//td//a/text()')
                self.u += 1
                # As in the original, the steps heading text (num) fills the dish-name field.
                food_info = """
Recipe No. %s
Dish name : %s
Ingredients : %s
Link : %s
=============================================================
""" % (str(self.u), num, ingredients, url)
                with open('下厨房菜谱.txt', 'a', encoding='utf-8') as f:
                    f.write(food_info)
                print(food_info)
            except Exception as exc:
                # Report the failing URL and the error instead of hiding them.
                print('XPath returned no content for %s (%r)' % (i, exc))

    def main(self):
        startPage = int(input("Start page: "))
        endPage = int(input("End page: "))
        for page in range(startPage, endPage + 1):
            url = self.url.format(page)
            html = self.get_page(url)
            self.parse_page(html)
            time.sleep(2.4)
            print("==================== Page %s crawled successfully ====================" % page)


if __name__ == '__main__':
    imageSpider = kitchen()
    imageSpider.main()

The script appends each recipe's information to a .txt file.
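Indexing an XPath result with [0] is what raises the index-out-of-range error when a page lacks the expected node. One way to make that case explicit is a small helper that returns a default instead of raising. The sketch below is illustrative only: first_or_default is a hypothetical name, not part of lxml, and plain lists stand in for xpath() results so it runs without network access.

```python
def first_or_default(results, default=None):
    """Return the first item of a result list, or `default` when it is empty."""
    return results[0] if results else default


# Plain lists stand in for the lists that xpath() returns for node/text queries.
title = first_or_default(["  Braised pork  "], default="").strip()
missing = first_or_default([], default="")

print(repr(title))    # 'Braised pork'
print(repr(missing))  # ''
```

With a helper like this, the crawler can skip or log a recipe whose heading is missing, and reserve try/except for genuine network or decoding failures.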
Using urljoin for Robust URL Construction
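When building URLs, it is safer to use urljoin from urllib.parse than to concatenate strings. urljoin resolves a link against a base URL, so a leading slash on the link or a trailing slash on the base does not produce a doubled slash.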
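Applied to the crawler's situation, the idea can be sketched as follows. The category URL comes from the script; the recipe paths are made-up placeholders. Note that when the base URL has its own path, a link without a leading slash is resolved relative to that path rather than to the site root.

```python
from urllib.parse import urljoin

page_url = "https://www.xiachufang.com/category/40076/"

# Hypothetical hrefs in the three forms a page might contain.
hrefs = [
    "/recipe/123/",                            # root-relative
    "recipe/123/",                             # relative to the current path
    "https://www.xiachufang.com/recipe/123/",  # already absolute
]

resolved = [urljoin(page_url, href) for href in hrefs]
for url in resolved:
    print(url)
# https://www.xiachufang.com/recipe/123/
# https://www.xiachufang.com/category/40076/recipe/123/
# https://www.xiachufang.com/recipe/123/
```

Passing the URL of the page the link was found on as the base mirrors how a browser would resolve the same href.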
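from urllib.parse import urljoin

# With a site-root base, the child path resolves to the same URL
# with or without a leading slash.
source_url = 'https://www.baidu.com/'
child_url1 = '/robots.txt'
child_url2 = 'robots.txt'
final_url1 = urljoin(source_url, child_url1)
final_url2 = urljoin(source_url, child_url2)
print(final_url1)  # https://www.baidu.com/robots.txt
print(final_url2)  # https://www.baidu.com/robots.txt

Conclusion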
The article identifies a common URL‑construction mistake in Python web scraping, provides a full working crawler, and recommends using urljoin for reliable URL handling, helping readers avoid similar pitfalls.
Python Crawling & Data Mining
