How to Fix Common URL Construction Errors in Python Web Scraping
This article walks through a typical Python web‑scraping issue caused by incorrect URL concatenation and XPath mismatches, explains the debugging process, and provides complete, ready‑to‑run code along with a practical urljoin example to prevent similar errors.
Introduction
Hello, I’m PiPi. A follower asked a Python web‑scraping question in a chat group, so I’m sharing the problem and solution here.
Problem
The target page structure had changed, so the XPath selectors matched nothing. Indexing the resulting empty list with [0] then raised "list index out of range" errors. Wrapping the code in a try block avoided the crash but still returned no data.
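The failure is easy to reproduce without touching the site, because these XPath queries return an ordinary list and an empty list has no element 0. The sketch below uses plain lists as stand-ins for XPath results; first_or_default is a hypothetical helper name, not part of lxml.

```python
def first_or_default(results, default=None):
    """Return the first item of an XPath-style result list, stripped, or default."""
    if not results:
        return default
    return results[0].strip()


matched = ["  Steps  "]   # what a matching selector might return
missed = []               # what a selector returns after the page layout changes

print(first_or_default(matched))
print(first_or_default(missed, "N/A"))

# Direct indexing is what produced the original error
try:
    missed[0]
except IndexError as exc:
    print("IndexError:", exc)
```

Checking for an empty result explicitly makes it visible which selector stopped matching, which a blanket try block hides.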
Further inspection revealed that the constructed URL contained an extra “/”, which broke the request.
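To make the stray slash concrete, here is a small sketch. It assumes, as the fix suggests, that the scraped href values begin with "/". The recipe path is a made-up placeholder and has_double_slash is a hypothetical helper.

```python
from urllib.parse import urlsplit

href = "/recipe/123456/"  # placeholder path, not a real recipe

# Base ends with "/" and href starts with "/": the slashes double up
bad_url = "https://www.xiachufang.com/" + href
# Base without the trailing slash gives a clean URL
good_url = "https://www.xiachufang.com" + href


def has_double_slash(url):
    """Report whether the path part of url contains an empty segment ("//")."""
    return "//" in urlsplit(url).path


print(bad_url, has_double_slash(bad_url))
print(good_url, has_double_slash(good_url))
```

Printing the final URL before each request, or running a check like this, is a quick way to spot the problem during debugging.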
Solution
Removing the stray slash and adding a short sleep between requests fixes the issue. Below is the complete script that crawls recipes from the site and saves them to a text file.
import time

import requests
from fake_useragent import UserAgent
from lxml import etree


class kitchen(object):
    u = 0  # running count of recipes saved

    def __init__(self):
        # The {} placeholder is filled with the page number in main().
        # Assumption: the category listing paginates via ?page=N.
        self.url = "https://www.xiachufang.com/category/40076/?page={}"
        # verify_ssl is omitted here; newer fake_useragent releases may not accept it
        ua = UserAgent()
        # One random User-Agent is enough for the whole run
        self.headers = {'User-Agent': ua.random}

    def get_page(self, url):
        res = requests.get(url=url, headers=self.headers)
        html = res.content.decode("utf-8")
        time.sleep(2)  # pause between requests
        return html

    def parse_page(self, html):
        parse_html = etree.HTML(html)
        # Relative links to the individual recipe pages
        image_src_list = parse_html.xpath('//li/div/a/@href')
        for i in image_src_list:
            try:
                # No trailing slash on the base: a doubled "/" was the original bug
                url = "https://www.xiachufang.com" + i
                html1 = self.get_page(url)
                parse_html1 = etree.HTML(html1)
                # [0] raises IndexError when the selector matches nothing
                num = parse_html1.xpath('.//h2[@id="steps"]/text()')[0].strip()
                name = parse_html1.xpath('.//li[@class="container"]/p/text()')
                ingredients = parse_html1.xpath('.//td//a/text()')
                self.u += 1
                food_info = """
No. %s
Dish name: %s
Ingredients: %s
Link: %s
""" % (str(self.u), num, ingredients, url)
                # The with-block closes the file even if the write fails
                with open('下厨房菜谱.txt', 'a', encoding='utf-8') as f:
                    f.write(food_info)
                print(food_info)
            except Exception as exc:
                # Report the URL and the error instead of failing silently
                print('XPath returned no content for %s (%r)' % (url, exc))

    def main(self):
        startPage = int(input("Start page: "))
        endPage = int(input("End page: "))
        for page in range(startPage, endPage + 1):
            url = self.url.format(page)
            html = self.get_page(url)
            self.parse_page(html)
            time.sleep(2.4)
            print("==================== Page %s crawled successfully ====================" % page)


if __name__ == '__main__':
    imageSpider = kitchen()
    imageSpider.main()
Result
The script appends each recipe’s details (sequence number, dish name, ingredients, and link) to 下厨房菜谱.txt and prints the same text to the console.
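As a rough sketch of the file-writing step on its own, the record formatting and the append can be split into two small functions. The names format_recipe and append_recipe, the demo file name, and the sample data are all hypothetical and only illustrate the pattern.

```python
import os
import tempfile


def format_recipe(index, name, ingredients, url):
    """Build one text record; ingredients is a list of strings."""
    return "No. %d\nName: %s\nIngredients: %s\nLink: %s\n\n" % (
        index, name, ", ".join(ingredients), url)


def append_recipe(path, record):
    """Append a record; the with-block closes the file even if write() fails."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(record)


# Demo with made-up data written to a temporary directory
demo_path = os.path.join(tempfile.mkdtemp(), "recipes_demo.txt")
record = format_recipe(1, "Example dish", ["egg", "tomato"],
                       "https://example.com/recipe/1/")
append_recipe(demo_path, record)

with open(demo_path, encoding="utf-8") as f:
    print(f.read())
```

Joining the ingredient list into a string keeps the saved text readable instead of writing out a Python list literal.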
URL‑joining tip
When constructing URLs, use urljoin to combine a base URL with a relative path safely. This avoids missing or duplicate slashes.
from urllib.parse import urljoin
source_url = 'https://www.baidu.com/'
child_url1 = '/robots.txt'
child_url2 = 'robots.txt'
final_url1 = urljoin(source_url, child_url1)
final_url2 = urljoin(source_url, child_url2)
print(final_url1)
print(final_url2)
urljoin fills in missing parts from the base URL; if the second argument is an absolute URL, it takes precedence. Both calls above print https://www.baidu.com/robots.txt.
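A few more cases are worth trying before relying on urljoin in a crawler. The sketch below uses the category URL from the script as the base; the recipe path, the "page2" segment, and the absolutize helper are illustrative assumptions rather than anything taken from the site.

```python
from urllib.parse import urljoin

base = "https://www.xiachufang.com/category/40076/"

# Root-relative href: replaces the whole path of the base
root_relative = urljoin(base, "/recipe/123456/")

# Plain relative href: resolved inside the base "directory"
relative = urljoin(base, "page2")

# Base WITHOUT a trailing slash: its last segment is replaced, not extended
no_slash = urljoin("https://www.xiachufang.com/category/40076", "page2")

# Absolute URL as second argument: the base is ignored
absolute = urljoin(base, "https://example.com/robots.txt")


def absolutize(base_url, hrefs):
    """Resolve every scraped href against base_url."""
    return [urljoin(base_url, h) for h in hrefs]


for result in (root_relative, relative, no_slash, absolute):
    print(result)
print(absolutize(base, ["/recipe/1/", "/recipe/2/"]))
```

The third case is the one that tends to surprise people: whether the base ends with a slash changes the result, so it helps to pass the URL of the page the links were scraped from.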
Conclusion
This article traced a common web‑scraping error, incorrect URL concatenation, to its cause, walked through the fix, presented a full Python crawler, and showed how urljoin makes URL construction more reliable.
