Designing Anti‑Scraping Techniques Using Custom Base64 Encoding
This article explains how to hide real intent behind visible actions: text obfuscation and a custom Base64‑like encoding are used to defeat standard web scrapers. It covers the underlying principles, the decoding challenge, and a Python implementation of a flexible Custom64 encoder.
"Openly repair the plank roads, secretly cross at Chencang" (明修栈道，暗度陈仓) is a Chinese idiom for making an obvious move while concealing the real one. Anti‑scraping relies on the same principle: data that a visitor sees in the browser (e.g., a comment count) reaches the crawler in a form it cannot read.
The article introduces common anti‑scraping methods such as text obfuscation, dynamic rendering, and code obfuscation. In its example, a comment count displayed to visitors as "3803" reaches bots as unreadable box characters.
Readers are challenged to decode the string GBHDHJGOGDGJGOHD==, which looks like Base64 but yields nonsensical output under a standard decoder, prompting investigation of the JavaScript that generates the signing key for request parameters.
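A quick check with Python's standard base64 module illustrates the dead end. This is a minimal sketch, not part of the original article; stripping the trailing == before decoding is an assumption made here, since the 16 characters before it already form four complete 4‑character groups.

```python
import base64

challenge = "GBHDHJGOGDGJGOHD=="

# Assumption: drop the trailing "=" signs and decode only the 16-character body.
body = challenge.rstrip("=")
raw = base64.b64decode(body)
print(raw)  # 12 bytes of binary data, not readable text

# Try to interpret the result as text; standard Base64 leads nowhere.
try:
    text = raw.decode("utf-8")
except UnicodeDecodeError:
    text = None
print(text)
```

The decoded bytes are not valid UTF‑8, so a scraper that assumes standard Base64 has nothing readable to work with.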
To understand the obstacle, the article reviews the Base64 encoding process defined in RFC 4648, outlining the five steps: converting characters to ASCII, grouping into 24‑bit blocks, splitting into 6‑bit groups, converting to decimal, and mapping to the Base64 alphabet.
An example encodes the word "async" to "YXN5bmM=", demonstrating the reversible nature of standard Base64.
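The five steps can be sketched by hand in Python. The function name base64_encode and the constant BASE64_ALPHABET are illustrative names chosen here, not taken from the article; the result is compared against the standard library to confirm the walk-through.

```python
import base64
import string

# The standard Base64 alphabet from RFC 4648: A-Z, a-z, 0-9, "+", "/".
BASE64_ALPHABET = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"


def base64_encode(text: str) -> str:
    """Encode ASCII text step by step (illustrative, not optimized)."""
    data = text.encode("ascii")
    # Step 1: write each character's ASCII code as 8 binary digits.
    bits = "".join(f"{byte:08b}" for byte in data)
    # Step 2: the stream is processed in 24-bit blocks (3 bytes -> 4 characters);
    # zero-fill the tail so the last 6-bit group is complete.
    bits += "0" * (-len(bits) % 6)
    # Steps 3 and 4: split into 6-bit groups and read each as a decimal index.
    indexes = [int(bits[i:i + 6], 2) for i in range(0, len(bits), 6)]
    # Step 5: map every index to a character of the alphabet.
    encoded = "".join(BASE64_ALPHABET[i] for i in indexes)
    # Pad with "=" so the output length is a multiple of 4.
    return encoded + "=" * (-len(encoded) % 4)


result = base64_encode("async")
print(result)  # YXN5bmM=
print(result == base64.b64encode(b"async").decode("ascii"))  # True
```

Because every step is a fixed, public rule, any standard decoder can run the steps in reverse and recover "async".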
To keep standard tools from reversing the output, a custom encoder named Custom64 is introduced. It changes the bit‑group size through a configurable threshold (e.g., 5‑ or 4‑bit groups instead of 6), so the encoder produces strings that look like Base64 but cannot be decoded correctly with standard tools.
The Python implementation of Custom64 is shown, highlighting the use of a dictionary for the encoding table, encode() and decode() methods, and the ability to switch between standard and custom modes via a parameter.
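The book's listing is not reproduced here. The following is a minimal sketch of what such a class might look like, assuming that the threshold is simply the number of bits per group and that the standard Base64 alphabet is kept as the table. The attribute names, the zero-fill of the last group, and the omission of "=" padding are all assumptions of this sketch.

```python
import string

ALPHABET = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"


class Custom64:
    """Hypothetical Custom64 sketch; threshold=6 mimics standard Base64 grouping."""

    # Dictionaries serve as the encoding and decoding tables.
    ENCODE_TABLE = dict(enumerate(ALPHABET))
    DECODE_TABLE = {char: index for index, char in ENCODE_TABLE.items()}

    def __init__(self, threshold: int = 6):
        if not 1 <= threshold <= 6:
            raise ValueError("threshold must be between 1 and 6 bits")
        self.threshold = threshold

    def encode(self, text: str) -> str:
        t = self.threshold
        bits = "".join(f"{byte:08b}" for byte in text.encode("utf-8"))
        bits += "0" * (-len(bits) % t)  # zero-fill the final group
        # Unpadded output: no "=" is appended in this sketch.
        return "".join(self.ENCODE_TABLE[int(bits[i:i + t], 2)]
                       for i in range(0, len(bits), t))

    def decode(self, encoded: str) -> str:
        t = self.threshold
        groups = []
        for char in encoded.rstrip("="):  # tolerate trailing "=" signs
            index = self.DECODE_TABLE[char]
            if index >= 2 ** t:
                raise ValueError(f"{char!r} is not valid for {t}-bit groups")
            groups.append(f"{index:0{t}b}")
        bits = "".join(groups)
        bits = bits[:len(bits) - len(bits) % 8]  # discard the zero-fill bits
        data = bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))
        return data.decode("utf-8")


custom = Custom64(threshold=4)
print(custom.encode("asyncins"))             # GBHDHJGOGDGJGOHD
print(custom.decode("GBHDHJGOGDGJGOHD=="))   # asyncins
print(Custom64(threshold=6).encode("async")) # YXN5bmM
```

With 4‑bit groups the sketch reproduces the 16 characters of the article's challenge string; the trailing == in that string comes from the original implementation, whose padding rule is not described here. Only someone who knows the threshold can run the matching decode().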
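When the string "asyncins" is encoded with Custom64 (threshold ≠ 6), the output is GBHDHJGOGDGJGOHD==. The 16 characters before the padding are consistent with 4‑bit groups: "a" is 0x61, and its halves 6 and 1 map to G and B in the Base64 alphabet. A scraper that applies ordinary Base64 decoding gets garbage, and the attacker wastes time on a dead end.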
The article notes that similar customizations can be applied to other algorithms such as MD5, AES, or SHA‑256, and emphasizes the importance of understanding both encoding and decoding mechanisms for robust anti‑scraping design.
The piece is drawn from the book "Python3 Anti‑Scraping Principles and Practical Bypass".
Sohu Tech Products