Cracking Dazhong Dianping’s CSS Encryption: A Step‑by‑Step Web Scraping Guide
This article walks through the challenges of scraping Dazhong Dianping, explains how the site hides numeric data with custom CSS fonts, and provides a complete Python workflow—including HTTP requests, font extraction, glyph rendering, and OCR—to decode and retrieve the protected information.
Web scraping often faces two anti‑scraping strategies: identity verification that blocks bots at the gateway, and embedded mechanisms that make data extraction difficult. The author attempts to scrape data from Dazhong Dianping, a site known for sophisticated anti‑scraping techniques.
1. Basic Crawling
Simple requests can retrieve titles and menus. Example code:
#!/usr/bin/env python
# -*- coding: UTF-8 -*-
import requests
from lxml import etree
header = {
    "Accept": "application/json, text/javascript",
    "Accept-Language": "zh-CN,zh;q=0.9,en;q=0.8",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36",
    "Cookie": "cy=1; cye=shanghai; ..."
}
url = 'http://www.dianping.com/beijing/ch10/g34060o2'
response = requests.get(url, headers=header)
data = etree.HTML(response.text)
title = data.xpath('//*[@id="shop-all-list"]/ul/li[1]/div[2]/div[1]/a/@title')
print(title)
The result shows that ordinary crawling works for plain text such as titles, but the site employs CSS encryption to hide key numbers.
2. CSS Encryption
Numeric fields are rendered as glyphs using a custom font. In the HTML they appear as character entities instead of literal digits. The glyph shapes that draw the actual digits are stored in a .woff font file referenced by CSS rules such as:
.shopNum{font-family:'PingFangSC-Regular-shopNum';}
@font-face{font-family:'PingFangSC-Regular-reviewTag';src:url("//s3plus.meituan.net/v1/mss_73a511b8f91f43d0bdae92584ea6330b/font/bc2c52b3.woff");}
By locating the .woff file, we obtain the mapping between each encoded entity and the actual digit.
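Locating the font can be automated once the stylesheet text has been downloaded. The following is a minimal sketch, not the author's code: the function name `find_woff_urls` and the regular expression are illustrative, and prefixing protocol-relative URLs with `https` is an assumption.

```python
import re

# Capture the font-family name and the first .woff URL inside each @font-face rule.
FONT_FACE_RE = re.compile(
    r"@font-face\s*\{[^}]*?font-family\s*:\s*['\"]?([^'\";]+)['\"]?\s*;"
    r"[^}]*?url\(\s*['\"]?([^'\")]+\.woff)['\"]?\s*\)",
    re.IGNORECASE,
)


def find_woff_urls(css_text, scheme="https"):
    """Return {font_family: woff_url} for every @font-face rule in css_text."""
    fonts = {}
    for family, url in FONT_FACE_RE.findall(css_text):
        # Protocol-relative URLs ("//host/path") need a scheme before download.
        # Using https here is an assumption.
        if url.startswith("//"):
            url = f"{scheme}:{url}"
        fonts[family.strip()] = url
    return fonts


sample_css = (
    ".shopNum{font-family:'PingFangSC-Regular-shopNum';}"
    "@font-face{font-family:'PingFangSC-Regular-reviewTag';"
    'src:url("//s3plus.meituan.net/v1/mss_73a511b8f91f43d0bdae92584ea6330b/font/bc2c52b3.woff");}'
)

for family, url in find_woff_urls(sample_css).items():
    print(family, url)
```

The returned URL can then be fetched with requests and saved to disk for the next step.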
3. Processing the WOFF File
Download the font and convert it to XML using fontTools:
from fontTools.ttLib import TTFont
font = TTFont('e765.woff')
font.saveXML('e765.xml')
Open the generated XML to find the glyph name (e.g., uniF784) and its coordinates.
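Glyph names of the form uniF784 follow the common "uni" plus four-hex-digit code point convention, so the name can usually be derived from the entity found in the page. The sketch below assumes the page uses hexadecimal character references such as the hypothetical `&#xf784;`. Both helper names are illustrative. A font is free to name glyphs differently, so the font's own cmap table (available through fontTools' `TTFont.getBestCmap()`) remains the authoritative mapping.

```python
import re

# Hexadecimal numeric character reference, e.g. "&#xf784;" (assumed entity format).
ENTITY_RE = re.compile(r"&#x([0-9a-fA-F]+);")


def entity_to_glyph_name(entity):
    """Convert an entity such as '&#xf784;' into a glyph name such as 'uniF784'."""
    match = ENTITY_RE.fullmatch(entity.strip())
    if match is None:
        raise ValueError(f"not a hexadecimal character reference: {entity!r}")
    return "uni" + match.group(1).upper().zfill(4)


def char_to_glyph_name(ch):
    """Same conversion for an already-decoded character (lxml decodes entities)."""
    return "uni%04X" % ord(ch)


print(entity_to_glyph_name("&#xf784;"))
print(char_to_glyph_name("\uf0d5"))
```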
4. Rendering Glyphs
Use matplotlib to plot the glyph coordinates and save the image:
from fontTools.ttLib import TTFont
import matplotlib.pyplot as plt
font = TTFont('f0d5.woff')
coords = font['glyf']['uniF0D5'].coordinates
x = [pt[0] for pt in coords]
y = [pt[1] for pt in coords]
plt.fill(x, y, color='k')
plt.axis('off')
plt.savefig('uniF0D5.png')
plt.show()
The resulting image displays the hidden digit.
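One caveat: `coordinates` is a flat list covering every contour of the glyph. Digits with enclosed areas (0, 6, 8, 9) have more than one contour, and filling all points as a single polygon joins them. fontTools records the index of the last point of each contour in the glyph's `endPtsOfContours` attribute. Below is a minimal sketch of splitting the points per contour. The helper name `split_contours` and the sample points are illustrative, and every point is treated as a polygon vertex, which only approximates curved segments.

```python
def split_contours(coords, end_pts):
    """Split a flat point list into one point list per contour.

    coords  -- sequence of (x, y) pairs, e.g. glyph.coordinates
    end_pts -- index of the last point of each contour,
               e.g. glyph.endPtsOfContours
    """
    contours = []
    start = 0
    for end in end_pts:
        contours.append(list(coords[start:end + 1]))
        start = end + 1
    return contours


# Made-up sample: an outer square followed by an inner square (a "hole").
points = [(0, 0), (0, 10), (10, 10), (10, 0),
          (3, 3), (7, 3), (7, 7), (3, 7)]
ends = [3, 7]

for contour in split_contours(points, ends):
    print(contour)
```

How the separate contours are then drawn (for example as a compound path so that inner contours stay holes) is left open here.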
5. OCR Decoding
Apply OCR to the rendered image to obtain the numeric value:
try:
    from PIL import Image
except ImportError:
    # Fallback for legacy standalone PIL installations
    import Image
import pytesseract
captcha = Image.open('uniF0D5.png')
# Restrict recognition to digits via the character whitelist
result = pytesseract.image_to_string(captcha, lang='eng', config='--psm 6 --oem 3 -c tessedit_char_whitelist=0123456789').strip()
print(result)
With OCR the digit is recognized, eliminating the need for manual lookup.
6. Complete Workflow
The full automated process is:
Fetch the page, extract encoded entities and the associated .woff URL.
Download the font, convert to XML, and retrieve the glyph coordinates for each entity.
Render each glyph to an image using matplotlib.
Run OCR on the images to obtain the actual numbers.
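Once the steps above have produced a glyph-to-digit mapping, the last piece is substituting it back into the scraped text. The following is a minimal sketch of that substitution. The function name `decode_text` is illustrative, and the mapping values are invented for the example, not real OCR results for these glyphs.

```python
def decode_text(text, glyph_digits):
    """Replace every obfuscated character in text with its decoded digit.

    glyph_digits maps glyph names such as 'uniF784' to digit strings.
    Characters without an entry (ordinary text) are kept unchanged.
    """
    decoded = []
    for ch in text:
        glyph_name = "uni%04X" % ord(ch)
        decoded.append(glyph_digits.get(glyph_name, ch))
    return "".join(decoded)


# Hypothetical mapping, as it might come out of the OCR step.
mapping = {"uniF784": "1", "uniF0D5": "7"}

# Text as lxml would return it: entities already decoded to characters.
scraped = "\uf784\uf0d5.5"
print(decode_text(scraped, mapping))
```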
For large‑scale scraping, caching the mapping between glyph coordinates and digits in a database can greatly speed up subsequent runs.
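A minimal sketch of such a cache follows, assuming SQLite as the database and a SHA-1 hash of the coordinate list as the lookup key. The class, table, and column names are illustrative. Keying on coordinates instead of glyph names means a glyph already seen under another name in an earlier font file is still recognized, provided its outline is identical.

```python
import hashlib
import sqlite3


def glyph_key(coords):
    """Build a stable key from a glyph's coordinate list."""
    raw = ";".join(f"{x},{y}" for x, y in coords)
    return hashlib.sha1(raw.encode("ascii")).hexdigest()


class GlyphCache:
    """Tiny SQLite-backed cache: coordinate hash -> decoded digit."""

    def __init__(self, path=":memory:"):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS glyphs "
            "(coord_hash TEXT PRIMARY KEY, digit TEXT NOT NULL)"
        )

    def get(self, coords):
        """Return the cached digit, or None if this outline is unknown."""
        row = self.conn.execute(
            "SELECT digit FROM glyphs WHERE coord_hash = ?",
            (glyph_key(coords),),
        ).fetchone()
        return row[0] if row else None

    def put(self, coords, digit):
        """Store the digit recognized for this outline."""
        self.conn.execute(
            "INSERT OR REPLACE INTO glyphs (coord_hash, digit) VALUES (?, ?)",
            (glyph_key(coords), digit),
        )
        self.conn.commit()


cache = GlyphCache()
outline = [(0, 0), (0, 10), (10, 10), (10, 0)]  # made-up coordinates
if cache.get(outline) is None:
    cache.put(outline, "7")  # in practice: render the glyph and run OCR here
print(cache.get(outline))
```

Passing a file path instead of `":memory:"` keeps the cache between runs.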
By following these steps, the hidden numeric data on Dazhong Dianping—and similar sites that use CSS font obfuscation—can be reliably extracted.
21CTO
21CTO (21CTO.com) offers developers community, training, and services, making it your go‑to learning and service platform.
