Mastering XPath Selectors in Scrapy: Extract Precise Data from Web Pages
This tutorial shows how to use Scrapy's XPath selectors to locate and extract titles, dates, comment counts, and body text from web pages. It covers both writing expressions by hand and copying them from the browser's developer tools, then shows how to use those expressions in a Scrapy spider.
Earlier articles covered how to start a Scrapy project and shared some crawling tips; refer to them if you missed them.
This article focuses on using XPath selectors in Scrapy to extract target information such as the title, date, theme, comment count, and body text from HTML pages, using the Berlu Online site as an example.
Open the target website and pick any article to inspect.
Write the basic Scrapy spider code, making sure start_urls points to that specific article's URL.
Open the browser's developer tools (F12 or right-click → Inspect) to view the page's HTML.
Click the element-selection icon, then hover over the page to locate the tag you want, such as the <h1> title.
Copy the XPath of the selected element via right-click → Copy → Copy XPath; for example, //*[@id="post-113659"]/div[1]/h1.
Insert the copied XPath into the Scrapy spider and run main.py in debug mode to verify that the selector returns the expected content; a hand-written XPath and the copied one should yield the same data.
To extract only the text inside the <h1> tag, append /text() to the XPath expression.
In summary, XPath expressions are not unique: different expressions can retrieve the same data as long as they are valid XPath. In Scrapy, text() is commonly appended to an expression to extract a node's text.
Python Crawling & Data Mining
