Crack Font-Based Anti‑Scraping: A Step‑by‑Step Python Guide

This article explains how font‑based anti‑scraping works, shows how to locate and download custom font files, decode their glyph mappings into a dictionary, and use the mappings to extract real data from a recruitment site using Python with requests, parsel (Scrapy's selector library) and MySQL.

Python Crawling & Data Mining

Target Website

The URL of the target site is base64‑encoded for safety; decode it to obtain the actual address.
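
As a small illustration of the decoding step, the sketch below round-trips a placeholder address through Python's standard base64 module. The encoded string for the real target is not reproduced in this text, so https://www.example.com is a stand-in and decode_url is a hypothetical helper name.

```python
import base64

def decode_url(encoded):
    """Decode a base64-encoded URL string back to plain text."""
    return base64.b64decode(encoded).decode('utf-8')

# Placeholder address (an assumption for illustration, not the real target).
encoded = base64.b64encode(b'https://www.example.com').decode('ascii')
decoded = decode_url(encoded)

print(encoded)
print(decoded)
```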

Font Anti‑Scraping

Font anti‑scraping is a technique that hides data by rendering it with a custom font; without the correct decoding, the displayed characters are unreadable.

Principle: the site substitutes certain characters in the page source with codes that only its custom font renders correctly. Unless that font's code-to-character mapping is recovered, the scraped text is meaningless.
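
To make the effect concrete, here is a minimal sketch of what a scraper sees before any decoding. It assumes, as is common for this technique, that the site points its glyphs at code points in the Unicode private-use area; the HTML fragment and the codes in it are invented for illustration.

```python
import html
import unicodedata

# Hypothetical fragment of page source: the digits are replaced by numeric
# character references that only the custom font can render.
raw = '<span>&#xe8a3;&#xf0b2;-&#xe8a3;&#xe5c1;/day</span>'

# Resolve the references to actual characters, as an HTML parser would.
text = html.unescape(raw)

def is_private_use(ch):
    """True if the character is in a Unicode private-use area (category Co)."""
    return unicodedata.category(ch) == 'Co'

# Code points that carry no standard meaning without the site's font.
hidden = [hex(ord(ch)) for ch in text if is_private_use(ch)]
print(hidden)
```

Characters in this category have no agreed meaning, which is why the extracted text cannot be read until the font's own mapping is applied.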

The custom font is declared in the page's CSS with an @font-face rule:

@font-face {
    font-family: "Name";
    src: url('font_file_url'),
         url('font_file_url') format('type');
}

Typical font files are .ttf, .eot, and .woff, with .woff being the most common.
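
Because the font location is spelled out in the @font-face rule, it can also be pulled from stylesheet text with a regular expression. The following is a rough sketch rather than a full CSS parser; the stylesheet, the font name, the rand value and the helper name find_font_urls are all assumptions made for the example.

```python
import re

# Hypothetical stylesheet text; the font name and rand value are placeholders.
css = '''
@font-face {
    font-family: "myFont";
    src: url('/interns/iconfonts/file?rand=0.123') format('woff');
}
'''

def find_font_urls(css_text):
    """Return every url(...) target that appears inside an @font-face rule."""
    urls = []
    # Grab the body of each @font-face block first, then the URLs inside it.
    for block in re.findall(r'@font-face\s*\{(.*?)\}', css_text, flags=re.S):
        urls.extend(re.findall(r'url\(\s*[\'"]?(.*?)[\'"]?\s*\)', block))
    return urls

font_urls = find_font_urls(css)
print(font_urls)
```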

Solving Font Anti‑Scraping

Two common approaches:

Manually extract the mapping between glyph codes and characters and store it in a dictionary.

Download the font file, convert it to XML, decode the mapping, and build the dictionary programmatically.
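
The first approach can be sketched in a few lines. The code points and digits below are invented; in practice each pair is found by comparing the rendered page with the codes in its source. MANUAL_MAP and decode_text are hypothetical names, and the sample assumes the page uses standard numeric character references.

```python
import html

# Hand-built mapping (approach 1): obfuscated code point -> real character.
# These pairs are made up for illustration.
MANUAL_MAP = {
    0xE8A3: '1',
    0xF0B2: '5',
    0xE5C1: '0',
}

def decode_text(raw, mapping):
    """Resolve character references, then swap the obfuscated code points."""
    # str.translate accepts a dict keyed by integer code points.
    return html.unescape(raw).translate(mapping)

decoded = decode_text('&#xe8a3;&#xf0b2;&#xe5c1;-&#xf0b2;&#xe5c1;&#xe5c1;/day', MANUAL_MAP)
print(decoded)
```

A hand-built table like this is quick for a handful of digits but has to be redone whenever the site changes its font, which is the main argument for the programmatic approach.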

Practical Demonstration

Custom Font File Discovery

Open the target site in the browser, open Developer Tools, and switch to the Network tab. Filter by Font; the custom font shows up as a request whose name starts with file. Copy its URL and download the file.

If the file cannot be opened, rename its extension to .woff and try again.
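
Renaming only helps when the downloaded bytes really are a font. One way to check is to look at the first four bytes, which identify the container format. The sketch below covers the common signatures; sniff_font_format and sniff_font_file are hypothetical helper names.

```python
# Magic numbers found at the start of common web font containers.
FONT_SIGNATURES = {
    b'wOFF': 'woff',
    b'wOF2': 'woff2',
    b'OTTO': 'otf',
    b'\x00\x01\x00\x00': 'ttf',
    b'true': 'ttf',
}

def sniff_font_format(data):
    """Guess the font container from its first four bytes, or return None."""
    return FONT_SIGNATURES.get(data[:4])

def sniff_font_file(path):
    """Read only the signature of a file on disk and classify it."""
    with open(path, 'rb') as f:
        return sniff_font_format(f.read(4))

print(sniff_font_format(b'wOFF' + b'\x00' * 40))
print(sniff_font_format(b'<html><body>blocked</body></html>'))
```

A result of None usually means the server returned something other than a font, for example an error page, and no extension change will make it open.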

Font Mapping Extraction

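Convert the downloaded .woff file to XML with TTFont from the fontTools library:

import random
import requests
from fontTools.ttLib import TTFont

headers = {'User-Agent': 'Mozilla/5.0'}  # assumed placeholder; use real browser headers

def get_fontfile():
    # The font endpoint takes a random float as a query parameter.
    rand = round(random.uniform(0, 1), 17)
    url = f'https://www.xxxxxx.com/interns/iconfonts/file?rand={rand}'
    content = requests.get(url, headers=headers).content
    with open('file.woff', 'wb') as f:
        f.write(content)
    font = TTFont('file.woff')
    font.saveXML('file.xml')  # dump the font tables, including cmap, as XML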

Parse the XML to build a mapping dictionary:

import re

def get_dict():
    with open('file.xml', encoding='utf-8') as f:
        xml = f.read()
    # Each cmap entry looks like <map code="0xe8a3" name="uni30"/>:
    # "code" is the obfuscated code point, "name" encodes the real character.
    pairs = re.findall(r'<map code="(0x[0-9a-fA-F]+)" name="uni([0-9a-fA-F]+)"/>', xml)
    # The hex suffix of the glyph name (e.g. "30" or "4E2D") is the real code point.
    return {code: chr(int(name, 16)) for code, name in pairs}
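
As an alternative to regular expressions, the same mapping can be read with the standard library's XML parser. This is a sketch under assumptions: SAMPLE_XML is an abridged, made-up excerpt in the shape of the XML that saveXML writes, and build_mapping is a hypothetical name.

```python
import xml.etree.ElementTree as ET

# Abridged, hypothetical excerpt of the XML written by TTFont.saveXML().
SAMPLE_XML = '''
<ttFont>
  <cmap>
    <cmap_format_4 platformID="3" platEncID="1" language="0">
      <map code="0xe8a3" name="uni30"/>
      <map code="0xf0b2" name="uni4E2D"/>
      <map code="0x78" name="x"/>
    </cmap_format_4>
  </cmap>
</ttFont>
'''

def build_mapping(xml_text):
    """Map each obfuscated code (e.g. '0xe8a3') to the character its glyph names."""
    mapping = {}
    root = ET.fromstring(xml_text)
    for entry in root.iter('map'):
        code, name = entry.get('code'), entry.get('name')
        # Only glyph names of the form uniXXXX carry a recoverable code point.
        if code and name and name.startswith('uni'):
            try:
                mapping[code] = chr(int(name[3:], 16))
            except ValueError:
                continue  # the suffix was not a valid hexadecimal code point
    return mapping

mapping = build_mapping(SAMPLE_XML)
print(mapping)
```

For a file on disk, ET.parse('file.xml').getroot() can take the place of ET.fromstring. Entries whose glyph name is not of the uniXXXX form, such as the plain x above, are skipped.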

Data Retrieval

Use the mapping dictionary to replace encoded glyphs in the page source and extract data with XPath:

import parsel
import requests

def get_data(mapping, url):
    # Turn each '&#x....' reference into '0x....' so it matches the mapping keys.
    page = requests.get(url, headers=headers).text.replace('&#', '0')
    for key in mapping:
        page = page.replace(key, mapping[key])
    sel = parsel.Selector(page)
    items = sel.xpath('//*[@id="__layout"]/div/div[2]/div[2]/div[1]/div[1]/div[1]/div')
    for i in items:
        # default='' keeps the string concatenations from failing on a missing node.
        data = {
            'workname': i.xpath('./div[1]/div[1]/p[1]/a/text()').get(),
            'link': i.xpath('./div[1]/div[1]/p[1]/a/@href').get(),
            'salary': i.xpath('./div[1]/div[1]/p[1]/span/text()').get(),
            'place': i.xpath('./div[1]/div[1]/p[2]/span[1]/text()').get(),
            'work_time': i.xpath('./div[1]/div[1]/p[2]/span[3]/text()').get(default='') + i.xpath('./div[1]/div[1]/p[2]/span[5]/text()').get(default=''),
            'company_name': i.xpath('./div[1]/div[2]/p[1]/a/text()').get(),
            'field_scale': i.xpath('./div[1]/div[2]/p[2]/span[1]/text()').get(default='') + i.xpath('./div[1]/div[2]/p[2]/span[3]/text()').get(default=''),
            'advantage': ','.join(i.xpath('./div[2]/div[1]/span/text()').getall()),
            'welfare': ','.join(i.xpath('./div[2]/div[2]/span/text()').getall())
        }
        # Dict order matches the column order of the INSERT statement.
        saving_data(list(data.values()))
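
The substitution at the top of get_data can be tried in isolation without any network access. In this sketch the mapping and the HTML fragment are invented, and decode_page is a hypothetical name; the sample writes its references without a trailing semicolon, which is what the replace('&#', '0') step appears to rely on.

```python
def decode_page(source, mapping):
    """Apply the two-step string substitution to raw page source."""
    # '&#xe8a3' becomes '0xe8a3', the same form as the keys from the font XML.
    text = source.replace('&#', '0')
    for code, char in mapping.items():
        text = text.replace(code, char)
    return text

# Made-up mapping in the shape produced from the font XML.
sample_mapping = {'0xe8a3': '1', '0xf0b2': '5', '0xe5c1': '0'}
sample = '<span>&#xe8a3&#xf0b2&#xe5c1/day</span>'

decoded = decode_page(sample, sample_mapping)
print(decoded)
```

If a site does terminate its references with a semicolon, this approach leaves a stray ";" after each decoded character, which would then have to be stripped separately.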

Saving to Database

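import pymysql

def saving_data(data):
    # host, user, passwd and port are assumed to be defined elsewhere in the script.
    db = pymysql.connect(host=host, user=user, password=passwd, port=port, database='recruit')
    sql = 'INSERT INTO recruit_data(work_name, link, salary, place, work_time, company_name, field_scale, advantage, welfare) VALUES (%s,%s,%s,%s,%s,%s,%s,%s,%s)'
    try:
        with db.cursor() as cursor:
            cursor.execute(sql, data)
        db.commit()
    except pymysql.MySQLError:
        # Undo the failed insert instead of silently swallowing every exception.
        db.rollback()
    finally:
        db.close()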
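
The main program calls a create_db function that the article does not show. One possible shape for it is sketched below. The column names are taken from the INSERT statement above, while the column types, the id column, the character set and the default connection values are all assumptions.

```python
CREATE_DATABASE_SQL = 'CREATE DATABASE IF NOT EXISTS recruit DEFAULT CHARACTER SET utf8mb4'

# Column names follow the INSERT statement in saving_data;
# the VARCHAR(255) type is an assumption.
COLUMNS = ['work_name', 'link', 'salary', 'place', 'work_time',
           'company_name', 'field_scale', 'advantage', 'welfare']

CREATE_TABLE_SQL = (
    'CREATE TABLE IF NOT EXISTS recruit_data ('
    'id INT AUTO_INCREMENT PRIMARY KEY, '
    + ', '.join(f'{col} VARCHAR(255)' for col in COLUMNS)
    + ')'
)

def create_db(host='localhost', user='root', passwd='', port=3306):
    """Create the recruit database and recruit_data table if they are missing."""
    import pymysql  # third-party; imported here so the SQL can be inspected without it
    conn = pymysql.connect(host=host, user=user, password=passwd, port=port)
    try:
        with conn.cursor() as cursor:
            cursor.execute(CREATE_DATABASE_SQL)
            cursor.execute('USE recruit')
            cursor.execute(CREATE_TABLE_SQL)
        conn.commit()
    finally:
        conn.close()

print(CREATE_TABLE_SQL)
```

Long advantage or welfare strings may not fit in 255 characters, so a TEXT column could be a better choice for those two fields.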

Running the Program

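if __name__ == '__main__':
    create_db()  # not shown in this article; expected to set up the recruit database
    get_fontfile()
    mapping = get_dict()  # build the glyph mapping once and reuse it for every page
    for i in range(1, 3):  # pages 1 and 2
        url = f'https://www.xxxxxx.com/interns?page={i}&type=intern&salary=-0&city=%E5%85%A8%E5%9B%BD'
        get_data(mapping, url)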
