Crack Font-Based Anti‑Scraping: A Step‑by‑Step Python Guide
This article explains how font‑based anti‑scraping works, shows how to locate and download custom font files, decode their glyph mappings into a dictionary, and use the mappings to extract real data from a recruitment site with Python, Scrapy and MySQL.
Target Website
The URL of the target site is base64‑encoded for safety; decode it to obtain the actual address.
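The encoded address itself is not reproduced in this text, so the sketch below only illustrates the decoding step. The placeholder URL and the helper name decode_target are assumptions made for illustration.

```python
import base64

def decode_target(encoded):
    """Decode a base64 string into the plain-text address it hides."""
    return base64.b64decode(encoded).decode('utf-8')

# Placeholder only: this is NOT the article's target site.
placeholder = base64.b64encode(b'https://example.com/').decode('ascii')
print(decode_target(placeholder))
```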
Font Anti‑Scraping
Font anti‑scraping hides data by rendering it with a custom font: the page source holds substitute character codes, and only the site's font turns them into the characters a visitor sees.
Principle: Custom fonts replace certain characters on the page. Unless the font's mapping is decoded, the text extracted from the page source is meaningless.
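To make the principle concrete, here is a minimal sketch using a made-up HTML fragment and a made-up two-entry mapping (both hypothetical, not taken from the target site). Resolving the entities naively yields Private Use Area characters, which mean nothing without the site's font, while the mapping recovers the digits.

```python
import html
import re
import unicodedata

# Hypothetical page fragment: the digits are written as numeric character
# references that only the site's custom font can render meaningfully.
page_fragment = '<span class="salary">&#xe8a3;&#xf1b2;&#xe8a3;/day</span>'

# Hypothetical mapping from obfuscated code point to the real character.
glyph_map = {0xE8A3: '1', 0xF1B2: '5'}

def naive_decode(text):
    """Resolve HTML entities without knowing anything about the font."""
    return html.unescape(text)

def font_decode(text, mapping):
    """Replace each hexadecimal character reference using the mapping."""
    def substitute(match):
        code_point = int(match.group(1), 16)
        # References missing from the mapping are left untouched.
        return mapping.get(code_point, match.group(0))
    return re.sub(r'&#x([0-9a-fA-F]+);?', substitute, text)

naive = naive_decode(page_fragment)
decoded = font_decode(page_fragment, glyph_map)

# 'Co' is the Unicode general category for private-use characters.
private_use = [c for c in naive if unicodedata.category(c) == 'Co']
print(len(private_use))
print(decoded)
```

The naive result contains three private-use characters where the digits should be; the mapped result reads as ordinary text.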
In the page's CSS, the custom font is declared via @font-face:

@font-face {
    font-family: "Name";
    src: url('font_file_url'),
         url('font_file_url') format('type');
}

Typical font files are .ttf, .eot, and .woff, with .woff being the most common.
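When the font is declared in a stylesheet, its URLs can be pulled out of the @font-face rule with a regular expression. The sketch below assumes the stylesheet text has already been fetched; the CSS string, the paths in it, and the helper name find_font_urls are hypothetical.

```python
import re

# Hypothetical stylesheet text; the URLs are placeholders.
css = """
@font-face {
    font-family: "myFont";
    src: url('/fonts/file.eot'),
         url("/fonts/file.woff") format("woff");
}
"""

def find_font_urls(css_text):
    """Return every url(...) target found inside @font-face blocks."""
    urls = []
    # re.S lets '.' span the line breaks inside each block.
    for block in re.findall(r'@font-face\s*\{(.*?)\}', css_text, flags=re.S):
        urls.extend(re.findall(r'url\(\s*[\'"]?([^\'")]+)[\'"]?\s*\)', block))
    return urls

font_urls = find_font_urls(css)
print(font_urls)
```

A regular expression is enough for simple rules like this one; a stylesheet with nested or unusual syntax may need a real CSS parser.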
Solving Font Anti‑Scraping
Two common approaches:
Manually extract the mapping between glyph codes and characters and store it in a dictionary.
Download the font file, convert it to XML, decode the mapping, and build the dictionary programmatically.
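The second approach can be sketched without any network access. The example below parses a small, hypothetical excerpt of a font's XML dump with the standard library and assumes, as the rest of this article does, that each glyph name of the form uniXXXX carries the real character's code point in hexadecimal. The sample XML and the helper name build_mapping are illustrative only.

```python
import xml.etree.ElementTree as ET

# Hypothetical excerpt of a font's XML dump; real dumps contain many
# more tables and entries.
sample_xml = """<ttFont>
  <cmap>
    <cmap_format_4 platformID="3" platEncID="1" language="0">
      <map code="0x78" name="x"/>
      <map code="0xe0a1" name="uni30"/>
      <map code="0xf2c4" name="uni4EBA"/>
    </cmap_format_4>
  </cmap>
</ttFont>"""

def build_mapping(xml_text):
    """Map each obfuscated code (as written in the XML) to its real character."""
    mapping = {}
    root = ET.fromstring(xml_text)
    for entry in root.iter('map'):
        name = entry.get('name', '')
        if not name.startswith('uni'):
            continue  # ordinary glyph names carry no hidden code point
        try:
            real_char = chr(int(name[3:], 16))
        except ValueError:
            continue  # the suffix was not plain hexadecimal
        mapping[entry.get('code')] = real_char
    return mapping

mapping = build_mapping(sample_xml)
print(mapping)
```

Using an XML parser instead of regular expressions keeps each code and glyph name paired by construction, so the two lists cannot drift out of step.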
Practical Demonstration
Custom Font File Discovery
Open the target site in the browser, open Developer Tools, and go to the Network tab. Filter by Font resources; the custom font appears as a request starting with file. Copy its URL and download the file.
If the file cannot be opened, rename its extension to .woff and try again.
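Rather than guessing the extension, the downloaded bytes can be checked against the well-known signatures at the start of font files. This is a minimal sketch; the helper name sniff_font_format is hypothetical and only a few common formats are covered.

```python
def sniff_font_format(data):
    """Guess the font container format from the first four bytes."""
    signature = data[:4]
    if signature == b'wOFF':
        return 'woff'
    if signature == b'wOF2':
        return 'woff2'
    if signature == b'OTTO':
        return 'otf'   # OpenType with CFF outlines
    if signature in (b'\x00\x01\x00\x00', b'true'):
        return 'ttf'   # TrueType outlines
    return 'unknown'   # e.g. an HTML error page saved by mistake

print(sniff_font_format(b'wOFF' + b'\x00' * 8))
print(sniff_font_format(b'<!DOCTYPE html>'))
```

An 'unknown' result usually means the request returned something other than a font, which is worth checking before renaming the file.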
Font Mapping Extraction
Convert the downloaded .woff file to XML using TTFont:
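import random
import requests
from fontTools.ttLib import TTFont

# The article does not show the request headers; this User-Agent is a placeholder.
headers = {'User-Agent': 'Mozilla/5.0'}

def get_fontfile():
    # The font endpoint takes a random float as a query parameter
    rand = round(random.uniform(0, 1), 17)
    url = f'https://www.xxxxxx.com/interns/iconfonts/file?rand={rand}'
    response = requests.get(url, headers=headers).content
    with open('file.woff', 'wb') as f:
        f.write(response)
    # Load the font and dump all of its tables to XML
    font = TTFont('file.woff')
    font.saveXML('file.xml')

Parse the XML to build a mapping dictionary: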
import re

def get_dict():
    # Read the XML dump produced by get_fontfile()
    with open('file.xml', encoding='utf-8') as f:
        xml = f.read()
    # Keys are the obfuscated codes; values are the hex suffixes of the "uni..." glyph names
    keys = re.findall(r'<map code="(0x.*?)" name="uni.*?"/>', xml)
    values = re.findall(r'<map code="0x.*?" name="uni(.*?)"/>', xml)
    # Each suffix is the real character's code point in hex, e.g. "30" -> "0"
    values = [chr(int(v, 16)) for v in values]
    return dict(zip(keys, values))

Data Retrieval
Use the mapping dictionary to replace encoded glyphs in the page source and extract data with XPath:
import parsel
import requests

def get_data(mapping, url):
    # Rewrite entity prefixes so that "&#x...." becomes "0x....", matching the dictionary keys
    response = requests.get(url, headers=headers).text.replace('&#', '0')
    for key in mapping:
        response = response.replace(key, mapping[key])
    sel = parsel.Selector(response)
    items = sel.xpath('//*[@id="__layout"]/div/div[2]/div[2]/div[1]/div[1]/div[1]/div')
    for i in items:
        # default='' keeps the string concatenations below from failing on missing nodes
        data = {
            'workname': i.xpath('./div[1]/div[1]/p[1]/a/text()').get(),
            'link': i.xpath('./div[1]/div[1]/p[1]/a/@href').get(),
            'salary': i.xpath('./div[1]/div[1]/p[1]/span/text()').get(),
            'place': i.xpath('./div[1]/div[1]/p[2]/span[1]/text()').get(),
            'work_time': i.xpath('./div[1]/div[1]/p[2]/span[3]/text()').get(default='') + i.xpath('./div[1]/div[1]/p[2]/span[5]/text()').get(default=''),
            'company_name': i.xpath('./div[1]/div[2]/p[1]/a/text()').get(),
            'field_scale': i.xpath('./div[1]/div[2]/p[2]/span[1]/text()').get(default='') + i.xpath('./div[1]/div[2]/p[2]/span[3]/text()').get(default=''),
            'advantage': ','.join(i.xpath('./div[2]/div[1]/span/text()').getall()),
            'welfare': ','.join(i.xpath('./div[2]/div[2]/span/text()').getall())
        }
        saving_data(list(data.values()))

Saving to Database
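import pymysql

def saving_data(data):
    # host, user, passwd and port are connection settings that the article does not show
    db = pymysql.connect(host=host, user=user, password=passwd, port=port, db='recruit')
    cursor = db.cursor()
    sql = 'INSERT INTO recruit_data(work_name, link, salary, place, work_time, company_name, field_scale, advantage, welfare) VALUES (%s,%s,%s,%s,%s,%s,%s,%s,%s)'
    try:
        cursor.execute(sql, data)
        db.commit()
    except Exception:
        # Undo the failed insert instead of leaving a half-finished transaction
        db.rollback()
    finally:
        db.close()

Running the Program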
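if __name__ == '__main__':
    # create_db() is not shown in this article
    create_db()
    get_fontfile()
    # Build the mapping once and reuse it for every page
    mapping = get_dict()
    for i in range(1, 3):
        url = f'https://www.xxxxxx.com/interns?page={i}&type=intern&salary=-0&city=%E5%85%A8%E5%9B%BD'
        get_data(mapping, url)

Result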
Python Crawling & Data Mining