Bypass SVG Anti‑Scraping and Extract Data with Selenium and requests‑html
This article explains how to scrape data protected by SVG background‑image anti‑scraping by using Selenium to retrieve the SVG URL, parsing the SVG with requests‑html to map background offsets to characters, replacing SVG nodes with text, and finally extracting structured information such as phone numbers and reviews.
The article demonstrates a practical method for extracting data from websites that use SVG background‑image anti‑scraping techniques.
SVG Anti‑Scraping Example
A simple practice site (http://www.porters.vip/confusion/food.html) is used to illustrate the approach.
The page displays text using SVG images with CSS background‑position offsets.
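To make the mechanism concrete, here is a minimal sketch using made-up values. The grid, the coordinates, and the helper names `first_greater` and `visible_char` are illustrative assumptions, not taken from the practice site. It treats the SVG as a sprite sheet: each row of characters has a y coordinate, each character has an x coordinate, and an element's background-position offset selects the first row and column whose coordinate exceeds that offset.

```python
# Toy sprite sheet: all values here are hypothetical, for illustration only.
ROW_Y = [38, 83, 120]            # y coordinate of each text row in the SVG
COL_X = [14, 28, 42, 56]         # x coordinate of each character in a row
ROWS = ["1542", "6697", "8308"]  # characters drawn in each row


def first_greater(sorted_values, offset):
    """Return the index of the first value strictly greater than offset."""
    for index, value in enumerate(sorted_values):
        if value > offset:
            return index
    raise ValueError(f"offset {offset} lies beyond the sprite sheet")


def visible_char(x_offset, y_offset):
    """Map a (positive) background-position offset pair to one character."""
    row = first_greater(ROW_Y, y_offset)
    col = first_greater(COL_X, x_offset)
    return ROWS[row][col]


print(visible_char(8, 15))    # first row, first column
print(visible_char(36, 97))   # third row, third column
```

The explicit scan in `first_greater` performs the same lookup that a right-bisection on a sorted list would, which is why a binary search is a natural fit for this mapping.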
To extract the data, Selenium is first used to open the page and locate the element that contains the SVG background image.
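from selenium import webdriver

# Note: the find_element_by_* helpers used in this article were removed in
# Selenium 4.3; on newer versions use browser.find_element(By.CSS_SELECTOR, ...)
# together with `from selenium.webdriver.common.by import By`.
browser = webdriver.Chrome()
url = 'http://www.porters.vip/confusion/food.html#'
browser.get(url)

The SVG URL is obtained from the CSS property:

d_tag = browser.find_element_by_css_selector('d[class^="vhk"]')
background_image_url = d_tag.value_of_css_property("background-image")
# The value has the form url("..."); strip the leading url(" and the trailing ")
svg_url = background_image_url[5:-2]
print(svg_url)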
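The slice `[5:-2]` assumes the browser reports the value exactly as `url("...")`. As a hedged alternative, a regular expression tolerates both quoted and unquoted forms. The helper name `extract_css_url` and the sample URLs below are hypothetical.

```python
import re

# Matches url(...), with optional single or double quotes around the address.
_CSS_URL = re.compile(r"""url\(\s*(['"]?)(.*?)\1\s*\)""")


def extract_css_url(css_value):
    """Return the address inside a CSS url(...) value, or None if absent."""
    match = _CSS_URL.search(css_value)
    return match.group(2) if match else None


print(extract_css_url('url("http://example.com/font/food.svg")'))
print(extract_css_url("url(http://example.com/font/food.svg)"))
print(extract_css_url("none"))
```

Returning `None` for values such as `none` lets the caller detect elements that carry no background image instead of silently producing a mangled string.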
The requests-html library (installable via pip install requests-html) is then used to download and parse the SVG file.
from requests_html import HTMLSession
session = HTMLSession()
r = session.get(svg_url)
xs = []
ys = []
data = []
for text_tag in r.html.xpath(r"//text"):
    # The x positions are read from the first row only; this assumes every row uses the same ones
    if not xs:
        xs.extend(map(int, text_tag.xpath('.//@x')[0].split()))
    ys.append(int(text_tag.xpath('.//@y')[0]))
    data.append(list(text_tag.xpath('.//text()')[0]))
print(xs)
print(ys)
print(data)

The background-position of each obfuscated d element can be read with Selenium, and its numeric offsets are mapped to characters using the xs, ys and data lists collected above.
import re

d_tags = browser.find_elements_by_css_selector('.more d[class^="vhk"]')
for d_tag in d_tags:
    position = d_tag.value_of_css_property("background-position")
    # For a value such as "-8px -15px", \d+ drops the minus signs and the px units
    x, y = map(int, re.findall(r"\d+", position))
    print(position, x, y)

Using bisect to locate the correct character:

from bisect import bisect

# Row index comes from the y offset, column index from the x offset
data[bisect(ys, 15)][bisect(xs, 8)]

A helper function replaces each SVG node with its corresponding text character via JavaScript.
def parseAndReplaceSvgNode(d_tags):
    for d_tag in d_tags:
        position = d_tag.value_of_css_property("background-position")
        x, y = map(int, re.findall(r"\d+", position))
        num = data[bisect(ys, y)][bisect(xs, x)]
        # Pass the character as a script argument rather than formatting it into the JavaScript source
        browser.execute_script("""
            var element = arguments[0];
            element.parentNode.replaceChild(document.createTextNode(arguments[1]), element);
        """, d_tag, num)

# This selector only covers the nodes inside .more; widen it to 'd[class^="vhk"]'
# if other fields on the page are obfuscated as well
d_tags = browser.find_elements_by_css_selector('.more d[class^="vhk"]')
parseAndReplaceSvgNode(d_tags)

After replacement, the page's textual information can be extracted directly:
title = browser.find_element_by_class_name("title").text
comment = browser.find_element_by_class_name("comments").text
avgPrice = browser.find_element_by_class_name('avgPriceTitle').text
comment_score_tags = browser.find_elements_by_css_selector('.comment_score .item')
taste = comment_score_tags[0].text
environment = comment_score_tags[1].text
service = comment_score_tags[2].text
address = browser.find_element_by_css_selector('.address .address_detail').text
characteristic = browser.find_element_by_css_selector('.characteristic .info-name').text
phone = browser.find_element_by_class_name("more").text
print(title, comment, avgPrice, taste, environment, service, address, characteristic, phone)

The script outputs a complete record, for example:

柳州螺蛳粉 100条评论 人均:12 口味:8.7 环境:7.4 服务:7.6 中山大道浦西路28号商铺 特色:脆爽酸笋,热辣红油,香葱萝卜,吃完还想吃 电话:400-51771

Once the SVG nodes have been replaced by plain text, the values hidden behind the SVG background images can be read with ordinary element lookups.
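The decoding steps can also be tried without a browser. The following is a minimal sketch under stated assumptions: the SVG snippet, its coordinates and characters, and the offset strings are invented for illustration; the standard-library `xml.etree.ElementTree` stands in for requests-html; and the helper names `parse_svg_grid` and `decode_positions` are hypothetical.

```python
import re
import xml.etree.ElementTree as ET
from bisect import bisect

# Hypothetical SVG sprite sheet; the real file is downloaded from svg_url.
SAMPLE_SVG = """<svg xmlns="http://www.w3.org/2000/svg">
  <text x="14 28 42 56" y="38">1542</text>
  <text x="14 28 42 56" y="83">6697</text>
  <text x="14 28 42 56" y="120">8308</text>
</svg>"""


def parse_svg_grid(svg_source):
    """Collect column x positions, row y positions and the characters per row."""
    root = ET.fromstring(svg_source)
    xs, ys, data = [], [], []
    for text_tag in root.iter("{http://www.w3.org/2000/svg}text"):
        if not xs:  # assumption: every row uses the same x positions
            xs.extend(map(int, text_tag.get("x").split()))
        ys.append(int(text_tag.get("y")))
        data.append(list(text_tag.text))
    return xs, ys, data


def decode_positions(positions, xs, ys, data):
    """Turn background-position strings such as '-8px -15px' into text."""
    chars = []
    for position in positions:
        x, y = map(int, re.findall(r"\d+", position))
        chars.append(data[bisect(ys, y)][bisect(xs, x)])
    return "".join(chars)


xs, ys, data = parse_svg_grid(SAMPLE_SVG)
print(decode_positions(["-8px -15px", "-22px -59px", "-36px -97px"], xs, ys, data))
```

Keeping the parsing and the lookup in separate functions makes it possible to test the mapping against a saved copy of the SVG before running it inside a Selenium session.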
Python Crawling & Data Mining
