Bypassing Implicit Style‑CSS Anti‑Scraping: Analysis and Restoration of Obfuscated Content
This article explains how many Chinese websites hide characters behind CSS ::before content, and shows how to locate the relevant network request, decode the span class mappings from obfuscated JavaScript, and restore the original text so the page can be scraped.
Many major internet products (e.g., automotive news sites, novel sites) use this CSS-based anti-scraping technique, which keeps certain characters out of the HTML and renders them through ::before pseudo-elements instead. The walkthrough starts by opening the target URL https://implicit-style-css_0.crawler-lab.com and observing the rendered page.
The browser's Network panel shows that only two resources are loaded: an HTML document and a JavaScript file. The HTML response contains a div.rdtext with p elements, but the displayed text differs from the source: in the source, several characters are replaced by empty span tags with class attributes such as context_kw0, context_kw21, and so on. The browser displays the sentence as:
<code><span>夜幕团队 NightTeam 于 2019 年 9 月 9 日正式成立,团队由爬虫领域中实力强劲的多名开发者组成:...</span></code>
The response contains the same sentence, with span placeholders where some characters should be:
<code><p>夜幕团队 NightTeam 于 <span>2019</span> 年 <span>9</span> 月 <span>9</span> 日正式成立<span class="context_kw0"></span>团队由爬虫领域中实力强劲<span class="context_kw1"></span>多<span class="context_kw21"></span>开发者组成:...</code>
Each placeholder corresponds to a character defined by a CSS rule such as:
<code>.context_kw0::before { content: ","; }</code>
<code>.context_kw21::before { content: "名"; }</code>
The missing characters can therefore be recovered by mapping each class name to its content value. The class names follow the pattern context_kw + number, so the mapping can be represented as a dictionary keyed by that number, e.g.:
<code>{0: ",", 1: "的", 21: "名"}</code>
Searching the JavaScript file shows where these mappings are generated. The code is heavily obfuscated: it uses an array, _0xa12e, whose AES-decrypted contents are converted to a UTF-8 string and processed to produce the words array:
<code>var secWords = decrypted[_0xea12('0x16')](CryptoJS['enc']['Utf8'])[_0xea12('0x17')](',');
var words = new Array(secWords[_0xea12('0x18')]);</code>
The script then iterates over words and injects CSS rules whose selector is built as '.context_kw' + i + _0xea12('0x2c'), each with the corresponding content value. To run the script in a Node environment, the references to window and document must be commented out, for example:
<code>try {
if (top[_0xea12('0x10')][_0xea12('0x11')][_0xea12('0x12')] != window[_0xea12('0x11')]['href']) {
top['window'][_0xea12('0x11')]['href'] = window[_0xea12('0x11')][_0xea12('0x12')];
}
}</code>
After adjusting the script this way, the extracted characters can be substituted back into the HTML, fully restoring the original page content.
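The substitution step can be sketched in Python as follows. This is a minimal illustration, not the article's own code: the mapping is the three-entry example dictionary shown earlier (a real page would need the full mapping produced by the adjusted script), and the names MAPPING, PLACEHOLDER, and restore_placeholders are hypothetical. It also assumes the placeholders appear exactly as empty span tags with a single context_kw class, as in the sample response.

```python
import re
from typing import Dict

# Example mapping from the article; the full mapping would come from
# running the adjusted script (assumption: keys are the numeric suffixes).
MAPPING: Dict[int, str] = {0: ",", 1: "的", 21: "名"}

# Matches an empty placeholder such as <span class="context_kw21"></span>.
PLACEHOLDER = re.compile(r'<span class="context_kw(\d+)"></span>')


def restore_placeholders(html: str, mapping: Dict[int, str]) -> str:
    """Replace each context_kw placeholder with its mapped character."""

    def substitute(match: "re.Match[str]") -> str:
        index = int(match.group(1))
        # Unknown placeholders are left untouched so gaps stay visible.
        return mapping.get(index, match.group(0))

    return PLACEHOLDER.sub(substitute, html)


# Fragment of the sample response shown earlier.
sample = (
    '日正式成立<span class="context_kw0"></span>团队由爬虫领域中实力强劲'
    '<span class="context_kw1"></span>多<span class="context_kw21"></span>开发者组成'
)
restored = restore_placeholders(sample, MAPPING)
print(restored)
```

Leaving unmapped placeholders in place, rather than dropping them, makes an incomplete mapping easy to spot in the output.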
The article also includes a simple demonstration of the ::before and ::after pseudo-elements:
<code><style>
q::before { content: "«"; color: blue; }
q::after { content: "»"; color: red; }
</style>
<q>大家好,我是咸鱼</q>,<q>我是 NightTeam 的一员</q>
</code>
When rendered, each q element is shown with a blue « before it and a red » after it, illustrating how CSS can inject characters that are not present in the HTML.
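To make the effect concrete, the sketch below approximates in Python what the browser does with the demonstration above: it reads the content values of the ::before and ::after rules and wraps each q element's text with them. The helper names (RULE, parse_generated_content, render_q_elements) are hypothetical, colours are ignored, and the regular expression only handles simple rules with a double-quoted content string; it is not a general CSS parser.

```python
import re

STYLE = """
q::before { content: "«"; color: blue; }
q::after { content: "»"; color: red; }
"""
HTML = "<q>大家好,我是咸鱼</q>,<q>我是 NightTeam 的一员</q>"

# Captures: selector, pseudo-element name, double-quoted content value.
RULE = re.compile(r'([\w.#-]+)::(before|after)\s*\{[^}]*?content:\s*"([^"]*)"')


def parse_generated_content(css: str) -> dict:
    """Return {(selector, 'before' | 'after'): content} for simple rules."""
    return {(sel, pseudo): text for sel, pseudo, text in RULE.findall(css)}


def render_q_elements(html: str, rules: dict) -> str:
    """Approximate the visible text of <q> elements, ignoring colours."""
    before = rules.get(("q", "before"), "")
    after = rules.get(("q", "after"), "")
    return re.sub(r"<q>(.*?)</q>", lambda m: before + m.group(1) + after, html)


rules = parse_generated_content(STYLE)
visible = render_q_elements(HTML, rules)
print(visible)
```

The same parser also accepts class-based rules such as .context_kw21::before { content: "名"; }, which is the form used by the anti-scraping technique described in this article.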
In summary, the article introduces the implicit Style‑CSS anti‑scraping technique, walks through a concrete example of locating the hidden characters, decoding the obfuscated JavaScript, and restoring the original text, providing a practical guide for dealing with this kind of anti‑scraping measure.
Python Programming Learning Circle
A global community of Chinese Python developers offering technical articles, columns, original video tutorials, and problem sets. Topics include web full‑stack development, web scraping, data analysis, natural language processing, image processing, machine learning, automated testing, DevOps automation, and big data.