Cracking Dazhong Dianping’s CSS Encryption: A Step‑by‑Step Web Scraping Guide

This article walks through the challenges of scraping Dazhong Dianping, explains how the site hides numeric data with custom CSS fonts, and provides a complete Python workflow—including HTTP requests, font extraction, glyph rendering, and OCR—to decode and retrieve the protected information.

21CTO

Web scraping often faces two anti‑scraping strategies: identity verification that blocks bots at the gateway, and embedded mechanisms that make data extraction difficult. The author attempts to scrape data from Dazhong Dianping, a site known for sophisticated anti‑scraping techniques.

1. Basic Crawling

Simple requests can retrieve titles and menus. Example code:

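#!/usr/bin/env python
# -*- coding: UTF-8 -*-
import requests
from lxml import etree

# Browser-like headers so the request resembles a normal visit.
header = {
    "Accept": "application/json, text/javascript",
    "Accept-Language": "zh-CN,zh;q=0.9,en;q=0.8",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36",
    "Cookie": "cy=1; cye=shanghai; ..."
}
url = 'http://www.dianping.com/beijing/ch10/g34060o2'
# A timeout keeps the script from hanging if the site stalls the connection.
response = requests.get(url, headers=header, timeout=10)
response.raise_for_status()  # stop early on 4xx/5xx responses
data = etree.HTML(response.text)
# xpath() returns a list; this selects the title of the first shop in the listing.
title = data.xpath('//*[@id="shop-all-list"]/ul/li[1]/div[2]/div[1]/a/@title')
print(title)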

The result shows that ordinary crawling works, but the site employs CSS encryption to hide key numbers.

2. CSS Encryption

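Numeric fields are rendered as glyphs from a custom font. In the HTML source each digit appears as a private-use character entity of the form &#xf784; (this one corresponds to the glyph uniF784 examined below) rather than as a plain number. The digit shapes themselves are stored in a .woff font file referenced by a CSS rule such as: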

.shopNum{font-family:'PingFangSC-Regular-shopNum';}
@font-face{font-family:'PingFangSC-Regular-reviewTag';src:url("//s3plus.meituan.net/v1/mss_73a511b8f91f43d0bdae92584ea6330b/font/bc2c52b3.woff");}

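Note that the two rules above name different families (shopNum and reviewTag). The stylesheet apparently declares several such fonts, so pick the @font-face rule whose font-family matches the class of the element being decoded. The .woff file it points to holds the glyph outline for every encoded entity, and that outline is what lets us recover the actual digit.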
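The lookup described here can be sketched with two regular expressions. This is a minimal illustration, not the site's actual stylesheet handling: the function name find_font_url, the SAMPLE_CSS string, and the example.com URL are all made up for the example, and a real stylesheet may need a proper CSS parser.

```python
import re
from typing import Optional

# Hand-written sample in the same shape as the rules quoted above.
# The URL is a placeholder, not a real font location.
SAMPLE_CSS = (
    ".shopNum{font-family:'PingFangSC-Regular-shopNum';}"
    "@font-face{font-family:'PingFangSC-Regular-shopNum';"
    'src:url("//example.com/font/abcd1234.woff");}'
)

FAMILY_RE = r"font-family\s*:\s*['\"]?([^'\";}]+)"


def find_font_url(css: str, class_name: str) -> Optional[str]:
    """Return the .woff URL of the font used by `class_name`, or None."""
    # Step 1: which font-family does the class rule name?
    rule = re.search(r"\." + re.escape(class_name) + r"\s*\{[^}]*?" + FAMILY_RE, css)
    if rule is None:
        return None
    family = rule.group(1).strip()

    # Step 2: find the @font-face block that declares the same family.
    for block in re.findall(r"@font-face\s*\{([^}]*)\}", css):
        declared = re.search(FAMILY_RE, block)
        if declared is None or declared.group(1).strip() != family:
            continue
        url = re.search(r"url\(\s*['\"]?([^'\")]+\.woff)['\"]?\s*\)", block)
        if url:
            link = url.group(1)
            # Protocol-relative URLs ("//host/...") need a scheme before download.
            return "https:" + link if link.startswith("//") else link
    return None


print(find_font_url(SAMPLE_CSS, "shopNum"))
```

Matching on the family name rather than taking the first .woff URL matters because a class and a font-face rule that sit next to each other in the stylesheet do not necessarily belong together.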

3. Processing the WOFF File

Download the font and convert it to XML using fontTools:

from fontTools.ttLib import TTFont
font = TTFont('e765.woff')
font.saveXML('e765.xml')

Open the generated XML. The cmap table maps each code point to a glyph name (e.g., 0xf784 to uniF784), and the glyf table lists that glyph's contours as x/y coordinate points.
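Reading the dump by hand does not scale, so the same lookup can be done with the standard library's XML parser. The sketch below assumes the usual layout of a fontTools XML dump (TTGlyph elements containing contour and pt elements). SAMPLE_TTX is a hand-written toy fragment describing a rectangle, not real font data, and both function names are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Toy fragment in the shape of a fontTools XML dump (not real font data).
SAMPLE_TTX = """<ttFont>
  <glyf>
    <TTGlyph name="uniF784" xMin="0" yMin="0" xMax="100" yMax="200">
      <contour>
        <pt x="0" y="0" on="1"/>
        <pt x="100" y="0" on="1"/>
        <pt x="100" y="200" on="1"/>
        <pt x="0" y="200" on="1"/>
      </contour>
      <instructions/>
    </TTGlyph>
  </glyf>
</ttFont>"""


def glyph_name_for_entity(entity: str) -> str:
    """Turn an entity such as '&#xf784;' into the glyph name 'uniF784'."""
    match = re.fullmatch(r"&#x([0-9a-fA-F]+);", entity)
    if match is None:
        raise ValueError("not a hexadecimal character entity: %r" % entity)
    return "uni" + match.group(1).upper()


def read_contours(xml_text: str, glyph_name: str):
    """Return the glyph's contours as lists of (x, y) integer tuples."""
    root = ET.fromstring(xml_text)
    glyph = root.find(".//TTGlyph[@name='%s']" % glyph_name)
    if glyph is None:
        raise KeyError(glyph_name)
    return [
        [(int(pt.get("x")), int(pt.get("y"))) for pt in contour.findall("pt")]
        for contour in glyph.findall("contour")
    ]


name = glyph_name_for_entity("&#xf784;")
print(name, read_contours(SAMPLE_TTX, name))
```

For a real dump, read the saved XML file into a string first, or adapt read_contours to take the root returned by ET.parse.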

4. Rendering Glyphs

Use matplotlib to plot the glyph coordinates and save the image:

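from fontTools.ttLib import TTFont
import matplotlib.pyplot as plt
from matplotlib.path import Path
from matplotlib.patches import PathPatch

font = TTFont('f0d5.woff')
glyph = font['glyf']['uniF0D5']
coords = list(glyph.coordinates)
# One closed sub-path per contour (points joined by straight lines), so the
# holes in digits such as 0, 6, 8 and 9 stay unfilled.
verts, codes, start = [], [], 0
for end in glyph.endPtsOfContours:
    contour = coords[start:end + 1]
    verts += contour + [contour[0]]
    codes += [Path.MOVETO] + [Path.LINETO] * (len(contour) - 1) + [Path.CLOSEPOLY]
    start = end + 1

fig, ax = plt.subplots()
ax.add_patch(PathPatch(Path(verts, codes), facecolor='k'))
ax.axis('equal')  # fit the view to the glyph without distorting its shape
ax.axis('off')
fig.savefig('uniF0D5.png')
plt.show()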

The resulting image displays the hidden digit.

5. OCR Decoding

Apply OCR to the rendered image to obtain the numeric value:

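from PIL import Image
import pytesseract

# pytesseract is a wrapper: the Tesseract binary must be installed separately.
glyph_img = Image.open('uniF0D5.png')
# --psm 6 treats the image as a single uniform block of text, --oem 3 selects
# the default engine, and the whitelist restricts the output to digits.
result = pytesseract.image_to_string(glyph_img, lang='eng', config='--psm 6 --oem 3 -c tessedit_char_whitelist=0123456789').strip()
print(result)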

With OCR the digit is recognized, eliminating the need for manual lookup.

6. Complete Workflow

The full automated process is:

Fetch the page, extract encoded entities and the associated .woff URL.

Download the font, convert to XML, and retrieve the glyph coordinates for each entity.

Render each glyph to an image using matplotlib.

Run OCR on the images to obtain the actual numbers.
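Once the four steps have produced a glyph-to-digit table, the last piece is substituting the digits back into the scraped text. The sketch below shows one way to do that. The function decode_digits and the mapping values are hypothetical, chosen only to illustrate the substitution. It handles both the raw entity form found in the page source and the private-use characters that remain after an HTML parser has resolved the entities.

```python
import re

# Matches either a 4-digit hex entity or a Private Use Area character.
ENCODED = re.compile(r"&#x([0-9a-fA-F]{4});|([\ue000-\uf8ff])")


def decode_digits(text: str, glyph_to_digit: dict) -> str:
    """Replace encoded glyph references with the digits recovered by OCR."""

    def swap(match):
        # Group 1 is the hex part of an entity; group 2 is a parsed character.
        code = match.group(1) or "%x" % ord(match.group(2))
        # Leave anything without a known mapping untouched.
        return glyph_to_digit.get("uni" + code.upper(), match.group(0))

    return ENCODED.sub(swap, text)


# Illustrative mapping only; real values come from the OCR step.
mapping = {"uniF784": "1", "uniE2A1": "7"}
print(decode_digits("&#xf784;&#xe2a1;0 reviews", mapping))
```

Leaving unmapped references in place, rather than dropping them, makes it easy to spot glyphs that the OCR step has not covered yet.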

For large‑scale scraping, caching the mapping between glyph coordinates and digits in a database can greatly speed up subsequent runs.
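A minimal sketch of such a cache, assuming SQLite as the database and a SHA-1 hash of the contour points as the key. The class name GlyphCache, the table layout, and the recognize callback (standing in for the render-plus-OCR step) are assumptions made for illustration, not part of the original workflow.

```python
import hashlib
import sqlite3


def glyph_key(contours) -> str:
    """Hash the contour points so identical outlines share one cache entry."""
    return hashlib.sha1(repr(contours).encode("utf-8")).hexdigest()


class GlyphCache:
    def __init__(self, path=":memory:"):
        # Pass a file path instead of ":memory:" to persist between runs.
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS glyphs (key TEXT PRIMARY KEY, digit TEXT NOT NULL)"
        )

    def lookup(self, contours, recognize):
        """Return the cached digit, calling recognize(contours) only on a miss."""
        key = glyph_key(contours)
        row = self.db.execute("SELECT digit FROM glyphs WHERE key = ?", (key,)).fetchone()
        if row is not None:
            return row[0]
        digit = recognize(contours)
        self.db.execute("INSERT INTO glyphs (key, digit) VALUES (?, ?)", (key, digit))
        self.db.commit()
        return digit


# Usage with a stand-in recognizer that counts how often it is called.
calls = []

def fake_ocr(contours):
    calls.append(contours)
    return "7"

cache = GlyphCache()
square = [[(0, 0), (100, 0), (100, 200), (0, 200)]]
print(cache.lookup(square, fake_ocr), cache.lookup(square, fake_ocr), len(calls))
```

Keying on the outline rather than on the glyph name is the point of the design: if the site rotates font files and renames glyphs while reusing the same shapes, the cached digits still apply. The key depends on the exact representation passed in, so contours should always be supplied in the same form (here, lists of integer tuples).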


By following these steps, the hidden numeric data on Dazhong Dianping—and similar sites that use CSS font obfuscation—can be reliably extracted.
