How to Bypass Anti‑Scraping Defenses and Extract Hidden Prices with Selenium and OCR

This article demonstrates step‑by‑step how to overcome a website’s anti‑scraping defenses using Selenium with stealth options, retrieve CSS‑based price images, reconstruct the digits, and apply Tesseract OCR to accurately extract numeric data, providing complete Python code snippets throughout.

Python Crawling & Data Mining

Nov 26, 2021

In this tutorial the author discovers a hotel booking site that displays price digits using CSS background images (each digit 8×17 pixels). Direct requests return heavily obfuscated JavaScript, so Selenium is used with stealth settings to appear as a regular browser.

Stealth Selenium Setup

from selenium import webdriver
from selenium.webdriver import ChromeOptions

option = ChromeOptions()
# Hide the usual automation hints that Chrome exposes under WebDriver control.
option.add_experimental_option('excludeSwitches', ['enable-automation'])
option.add_experimental_option('useAutomationExtension', False)
option.add_argument('user-agent=Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36')
option.add_argument("--disable-blink-features=AutomationControlled")
# Create the browser once, with the options applied.
browser = webdriver.Chrome(options=option)

# Inject stealth.min.js (expected in the working directory) before any page script runs.
with open('stealth.min.js') as f:
    js = f.read()
browser.execute_cdp_cmd('Page.addScriptToEvaluateOnNewDocument', {'source': js})
url = 'http://hotels.huazhu.com/inthotel/detail/9005308'
browser.get(url)

The page loads, but price information may not appear immediately; refreshing a few times can help.
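
Refreshing by hand can be replaced with a small retry loop. The sketch below is one possible structure, not the author's code: `retry_until` is a hypothetical helper that runs an action, checks a condition, and tries again after a short pause. Stand-in callables are used so the example runs without a browser.

```python
import time


def retry_until(action, check, attempts=5, delay=0.0):
    """Run action() until check() returns True.

    Returns the number of tries used, or None if every attempt failed.
    Hypothetical helper: `action` might be browser.refresh and `check`
    a function that looks for price elements on the page.
    """
    for attempt in range(1, attempts + 1):
        action()
        if check():
            return attempt
        time.sleep(delay)
    return None


# Stand-in for a page whose prices only show up on the third load.
state = {"loads": 0}


def fake_refresh():
    state["loads"] += 1


def prices_visible():
    return state["loads"] >= 3


tries = retry_until(fake_refresh, prices_visible)
print(tries)
```

With Selenium, `action` could be `browser.refresh` and `check` a function that tests whether any price elements are present; the exact selector to test for depends on the page and is left as an assumption here.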

Reveal All Prices

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

wait = WebDriverWait(browser, 10)

table = wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, '#Pdetail_part2 table')))
# Reading this property has a side effect: it scrolls the table into view.
table.location_once_scrolled_into_view

more_click = wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, '#Pdetail_part2 a[class="viewallprice"]')))
more_click.click()

After clicking, all price rows become visible.

Parse CSS to Obtain Digit Images

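img_url = None
# find_element(s)_by_* was removed in Selenium 4.3; the By-based calls work in both 3 and 4.
for tr in table.find_elements(By.CSS_SELECTOR, "table tr[class^='room first']"):
    name = tr.find_element(By.TAG_NAME, "h3").text
    print(name)
    price = tr.find_element(By.CSS_SELECTOR, "div>a[class^='totalprice']")
    for var in price.find_elements(By.TAG_NAME, "var"):
        if img_url is None:
            # Strip the leading 'url("' and trailing '")' from the computed value.
            img_url = var.value_of_css_property("background-image")[5:-2]
            print(img_url)
        position = var.value_of_css_property("background-position")
        # Offsets look like "-8px -17px"; drop "px" and the sign to get sprite coordinates.
        w, h = (abs(int(float(v[:-2]))) for v in position.split())
        print(w, h)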

The CSS provides a single sprite image (background‑image) and the background‑position for each digit.
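
Fixed-position slices such as `[5:-2]` work only while the computed values have exactly the expected shape. As a hedged alternative, the sketch below parses both properties more defensively. The function names and sample values are made up for illustration, and the computed style is assumed to use `px` units as in the output above.

```python
import re


def parse_background_image(value):
    """Extract the URL from a computed value such as 'url("https://host/sprite.png")'."""
    match = re.search(r'url\((["\']?)(.*?)\1\)', value)
    return match.group(2) if match else None


def parse_background_position(value):
    """Turn '-8px -17px' into (8, 17): the digit's top-left corner inside the sprite."""
    x, y = (abs(int(float(part[:-2]))) for part in value.split())
    return x, y


# Example values are invented for illustration.
print(parse_background_image('url("https://example.com/sprite.png")'))
print(parse_background_position('-16px 0px'))
```

The regular expression accepts quoted and unquoted URLs, and the position parser also handles a `0px` offset, which has no leading minus sign.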

Download the Sprite Image

import requests
from io import BytesIO
from PIL import Image

def download_img(img_url):
    # Reuse the browser session's cookies so the request looks like part of the same visit.
    cookies = {o['name']: o['value'] for o in browser.get_cookies()}
    headers = {
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
        "Accept-Encoding": "gzip, deflate",
        "Accept-Language": "zh-CN,zh;q=0.9",
        "Cache-Control": "max-age=0",
        "Connection": "keep-alive",
        "Host": "hotels.huazhu.com",
        "Upgrade-Insecure-Requests": "1",
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36"
    }
    # Try up to 10 times; the for/else returns None only if no attempt succeeded.
    for _ in range(10):
        try:
            r = requests.get(img_url, headers=headers, cookies=cookies, timeout=10)
        except requests.RequestException:
            continue
        if r.status_code == 200:
            break
    else:
        return None
    img = Image.open(BytesIO(r.content))
    return img

img = download_img(img_url)
img  # in a notebook, this displays the downloaded sprite
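
The download function hard-codes its `Host` and `User-Agent` headers. If the sprite were served from a different host, or the site compared the user agent with the one the browser sent, those fixed values could cause the request to fail. The sketch below shows one way to derive them instead. `build_request_context` is a hypothetical helper, and the sample cookie and URL are invented.

```python
from urllib.parse import urlparse


def build_request_context(selenium_cookies, img_url, user_agent):
    """Build (cookies, headers) for fetching img_url outside the browser."""
    # Selenium returns a list of dicts; requests wants a plain name -> value mapping.
    cookies = {c["name"]: c["value"] for c in selenium_cookies}
    headers = {
        "Host": urlparse(img_url).netloc,  # follow the image URL, not a fixed host
        "User-Agent": user_agent,          # ideally the same UA the browser sent
        "Accept": "image/avif,image/webp,image/apng,*/*;q=0.8",
    }
    return cookies, headers


# Made-up inputs in the shape returned by browser.get_cookies().
sample_cookies = [{"name": "sid", "value": "abc", "domain": "example.com"}]
cookies, headers = build_request_context(
    sample_cookies,
    "https://img.example.com/sprites/price.png",
    "Mozilla/5.0 (test)",
)
print(cookies)
print(headers["Host"])
```

In a live session the user agent could be read with `browser.execute_script("return navigator.userAgent")`. Note that `requests` fills in `Host` by itself when the header is omitted, so leaving it out is also an option.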

Crop Digits and Assemble the Price Image

img_url = None
for tr in table.find_elements(By.CSS_SELECTOR, "table tr[class^='room first']"):
    name = tr.find_element(By.TAG_NAME, "h3").text
    print(name)
    price = tr.find_element(By.CSS_SELECTOR, "div>a[class^='totalprice']")
    var_el_s = price.find_elements(By.TAG_NAME, "var")
    n = len(var_el_s)
    # White canvas: one 10-pixel-wide slot per digit, 17 pixels high.
    target = Image.new('RGB', (10 * n, 17), color=(255, 255, 255))
    for i, var in enumerate(var_el_s):
        if img_url is None:
            # Download the sprite once, on the first digit encountered.
            img_url = var.value_of_css_property("background-image")[5:-2]
            img = download_img(img_url)
        position = var.value_of_css_property("background-position")
        # Offsets look like "-8px -17px"; drop "px" and the sign to get sprite coordinates.
        w, h = (abs(int(float(v[:-2]))) for v in position.split())
        # Cut out the 8x17 digit and paste it into its slot, using it as its own mask.
        r = img.crop((w, h, w+8, h+17))
        target.paste(r, (10*i, 0), r)
    display(target)  # display() is available in Jupyter/IPython

The assembled image shows the full price as a single picture.
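
The geometry behind the crop-and-paste step can be checked without any image library. The sketch below computes, for a list of sprite offsets, the canvas size and the crop box and paste position of each digit. It uses the 8×17 digit size and 10-pixel spacing from the code above. `layout_digits` is a hypothetical helper and the offsets are invented.

```python
def layout_digits(offsets, digit_w=8, digit_h=17, stride=10):
    """Return the canvas size plus (crop_box, paste_xy) pairs for each digit.

    offsets: list of (x, y) top-left corners inside the sprite, in display order.
    """
    canvas = (stride * len(offsets), digit_h)
    steps = []
    for i, (x, y) in enumerate(offsets):
        crop_box = (x, y, x + digit_w, y + digit_h)  # left, upper, right, lower
        paste_xy = (stride * i, 0)                   # leaves a 2-pixel gap between digits
        steps.append((crop_box, paste_xy))
    return canvas, steps


# Made-up offsets for a four-digit price.
canvas, steps = layout_digits([(8, 0), (8, 0), (64, 0), (24, 0)])
print(canvas)
for crop_box, paste_xy in steps:
    print(crop_box, paste_xy)
```

Each `crop_box` has the (left, upper, right, lower) form that Pillow's `Image.crop` expects, and each `paste_xy` is the top-left corner passed to `Image.paste`.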

Image Binarization for Better OCR

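def image_binarization(im, threshold=250):
    # Convert to grayscale, then map every value below the threshold to black (0)
    # and everything else to white (1) through a 256-entry lookup table.
    Lim = im.convert("L")
    lut = [0 if i < threshold else 1 for i in range(256)]
    return Lim.point(lut, "1")

image_binarization(target)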
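
To see what the lookup table does, the same thresholding can be applied to a small grid of gray values in plain Python. This is only an illustration of the mapping, not a replacement for the Pillow code. The patch values are invented.

```python
def binarize(gray_rows, threshold=250):
    """Map each 0-255 gray value to 0 (ink) or 1 (background) through a lookup table."""
    lut = [0 if value < threshold else 1 for value in range(256)]
    return [[lut[pixel] for pixel in row] for row in gray_rows]


# A made-up 3x4 patch: 255 is white background, lower values are anti-aliased ink.
patch = [
    [255, 120, 30, 255],
    [255, 249, 10, 255],
    [252, 255, 200, 251],
]
for row in binarize(patch):
    print(row)
```

With the high threshold of 250, even light gray edge pixels such as 249 or 200 count as ink, which thickens the strokes slightly and leaves only near-white pixels as background.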

Install and Use Tesseract OCR

Install the Python wrapper:

pip install pytesseract

Then download and install the Tesseract‑OCR engine from the official repository, add its bin directory to the system PATH, and verify the installation with:

tesseract -v
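
Before calling the OCR from Python, it can help to confirm that the executable is actually reachable. The sketch below uses only the standard library. `find_tesseract` is a hypothetical helper name.

```python
import shutil


def find_tesseract():
    """Return the full path of the tesseract executable, or None if it is not on PATH."""
    return shutil.which("tesseract")


path = find_tesseract()
if path is None:
    print("tesseract not found on PATH")
else:
    print("tesseract found at", path)
```

If the engine is installed but not on PATH, pytesseract can be pointed at it directly by assigning the full executable path to `pytesseract.pytesseract.tesseract_cmd`.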

Recognize the Price

import pytesseract

text = pytesseract.image_to_string(image_binarization(target)).strip()
print(text)

For this example, the OCR returns the numeric price correctly (e.g., 1183).
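
OCR output is still free text, so it is worth validating before treating it as a number. The sketch below accepts a result only if it consists purely of digits once whitespace is removed. `parse_price` is a hypothetical helper and the sample strings are invented.

```python
import re


def parse_price(ocr_text):
    """Return the price as an int, or None when the OCR text is not purely digits."""
    cleaned = re.sub(r"\s+", "", ocr_text)  # drop spaces and newlines
    if not re.fullmatch(r"[0-9]+", cleaned):
        return None                          # leave doubtful reads for manual review
    return int(cleaned)


# Made-up OCR outputs for illustration.
for raw in ["1183\n", "1 183", "11B3", ""]:
    print(repr(raw), "->", parse_price(raw))
```

Recognition itself can also be constrained through the `config` argument of `pytesseract.image_to_string`, for example `config="--psm 7 -c tessedit_char_whitelist=0123456789"` to treat the image as one text line and restrict output to digits. Whether the whitelist is honored depends on the Tesseract version and engine mode, so this should be tested on the installed version.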

Batch Extraction and Recognition

# Reuses img_url and img from the previous step, so the sprite is downloaded only once.
for tr in table.find_elements(By.CSS_SELECTOR, "table tr[class^='room first']"):
    name = tr.find_element(By.TAG_NAME, "h3").text
    price = tr.find_element(By.CSS_SELECTOR, "div>a[class^='totalprice']")
    var_el_s = price.find_elements(By.TAG_NAME, "var")
    n = len(var_el_s)
    target = Image.new('RGB', (10 * n, 17), color=(255, 255, 255))
    for i, var in enumerate(var_el_s):
        if img_url is None:
            img_url = var.value_of_css_property("background-image")[5:-2]
            img = download_img(img_url)
        position = var.value_of_css_property("background-position")
        # Offsets look like "-8px -17px"; drop "px" and the sign to get sprite coordinates.
        w, h = (abs(int(float(v[:-2]))) for v in position.split())
        r = img.crop((w, h, w+8, h+17))
        target.paste(r, (10*i, 0), r)
    display(target)  # display() is available in Jupyter/IPython
    # Binarize the assembled price image and run OCR on it.
    text = pytesseract.image_to_string(image_binarization(target)).strip()
    print(name, text)

The script prints each room type together with its recognized price. In this run every price is recognized correctly, which suggests that OCR works well on these clean, binarized digits.
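
Printing is fine for a quick check, but the results are easier to reuse once collected. As a hedged sketch, the snippet below serializes (room name, price text) pairs to CSV and flags any price that is not purely numeric. `rows_to_csv`, the column names, and the sample rows are all invented for illustration.

```python
import csv
import io


def rows_to_csv(rows):
    """Serialize (room_name, price_text) pairs to CSV text, flagging non-numeric prices."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["room", "price", "needs_review"])
    for name, text in rows:
        ok = text.isascii() and text.isdigit()
        writer.writerow([name, text, "" if ok else "yes"])
    return buffer.getvalue()


# Made-up rows in the shape printed by the loop above.
sample = [("Deluxe King Room", "1183"), ("Twin Room", "9S8")]
print(rows_to_csv(sample))
```

In the batch loop, each `(name, text)` pair could be appended to a list instead of printed, and the list passed to this function at the end.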

Overall, the article shows how to combine Selenium stealth techniques, CSS parsing, image cropping, and Tesseract OCR to extract numeric data that is deliberately hidden behind sprite images.

Copyright: This article is originally authored by CSDN blogger “小小明‑代码实体” and is shared under CC 4.0 BY‑SA.
