How to Preprocess Captcha Images with OpenCV for Python Scraping

This tutorial explains how to collect captcha images and preprocess them with OpenCV (grayscale conversion, median blur, binarization, contour detection, and character segmentation), with core Python code and sample results for building a reliable Python web-scraping pipeline.

Python Crawling & Data Mining

Introduction

The previous article introduced the project requirements and overall design for a Python captcha labeling and recognition system. This article continues by detailing data collection, image preprocessing, and character segmentation steps.

Data Collection

Captcha images are downloaded in batches from the provided URL. An initial set of about 20 images is manually renamed for labeling. The download script (image_download.py) is omitted for brevity.
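
Since image_download.py is not shown, the following is only a minimal sketch of a batch download under stated assumptions: the endpoint returns a fresh captcha on every request, the images are PNG, and the names `fetch_bytes`, `download_captchas`, and the `captcha_NNNN.png` naming scheme are hypothetical rather than taken from the original script.

```python
import os
import urllib.request


def fetch_bytes(url, timeout=10):
    """Fetch raw bytes from a URL using only the standard library."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read()


def download_captchas(url, count, dst_dir, fetch=fetch_bytes):
    """Download `count` captcha images from `url` into `dst_dir`.

    Assumes the endpoint serves a new image on each request.
    Returns the list of saved file paths.
    """
    os.makedirs(dst_dir, exist_ok=True)
    saved = []
    for i in range(count):
        # Hypothetical naming scheme; files are renamed by hand when labeling.
        path = os.path.join(dst_dir, f"captcha_{i:04d}.png")
        with open(path, "wb") as f:
            f.write(fetch(url))
        saved.append(path)
    return saved
```

Passing `fetch` as a parameter keeps the loop testable without network access; a real run would simply call `download_captchas(url, 20, "images")` with the captcha URL.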

Preprocessing

The preprocessing pipeline implemented with OpenCV follows this flow:

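Original image -> grayscale image -> median filter -> binarization -> contour detection and drawing (applicable only in some cases) -> character cutting and padding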

Key steps:

Grayscale conversion removes color information, reduces data size, and simplifies subsequent filtering.

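Median blur suppresses noise by replacing each pixel with the median of its neighborhood, which removes isolated speckles while preserving edges better than averaging would.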

Binarization maps every pixel to either 0 or 255, which gives contour detection a clean foreground/background split.

Contour detection extracts a bounding rectangle for each character; it is only applicable in some cases, because tightly spaced or touching characters do not produce separate contours.

Character segmentation and padding cuts each character using the detected rectangles and resizes them to a uniform width and height.
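
To illustrate why a median filter suits speckle noise, here is a small pure-Python sketch comparing the median and the mean of one 3x3 neighborhood. The pixel values are made up for illustration; `cv2.medianBlur` applies the same median idea at every pixel of the image.

```python
from statistics import mean, median

# A 3x3 neighborhood from a dark region with one bright "salt" noise pixel.
window = [
    [10, 12, 11],
    [13, 255, 12],
    [11, 10, 12],
]

values = [v for row in window for v in row]

mean_value = mean(values)      # pulled upward by the single outlier
median_value = median(values)  # stays at a typical background value

print(f"mean={mean_value:.1f}, median={median_value}")
```

The mean is dragged far above the surrounding values by the single noisy pixel, whereas the median stays at a typical background level, so the speckle disappears instead of being smeared into its neighbors.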

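import cv2

# IS_SAVE_FILE and DST_IMG_DIR are module-level settings defined elsewhere in the project.
def pre_process_image(img, file_name):
    # Crop a 2-pixel border from each edge
    img = img[2:-2, 2:-2]
    # Convert to grayscale
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    # Remove noise with a 3x3 median filter
    blur = cv2.medianBlur(gray, 3)
    # Use the mean gray level as the binarization threshold
    temp = gray.mean().item()
    # Binarize: pixels above the threshold become 255, the rest 0
    ret, threshold = cv2.threshold(blur, temp, 255, cv2.THRESH_BINARY)
    if IS_SAVE_FILE:
        cv2.imwrite(DST_IMG_DIR + file_name + "_threshold.png", threshold)
    return threshold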
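
The thresholding step can be hard to picture, so the sketch below reproduces the effect of `cv2.THRESH_BINARY` with a mean-based threshold using NumPy only. The helper name `binarize_by_mean` and the sample pixel values are illustrative assumptions, and unlike the function above it thresholds the same array it takes the mean from, skipping the blur.

```python
import numpy as np


def binarize_by_mean(gray):
    """Mimic cv2.threshold(gray, mean, 255, cv2.THRESH_BINARY) with NumPy."""
    thresh = gray.mean()
    # THRESH_BINARY: pixels strictly above the threshold become 255, others 0.
    return np.where(gray > thresh, 255, 0).astype(np.uint8)


# Made-up 3x3 grayscale patch: bright background with a few dark stroke pixels.
gray = np.array([[200, 210, 30],
                 [220, 40, 215],
                 [35, 205, 225]], dtype=np.uint8)

binary = binarize_by_mean(gray)
print(binary)
```

Using the image mean as the threshold adapts to each captcha's overall brightness, although it may work less well when the foreground and background take up very unequal areas.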

Sample preprocessing results: [images not included in this copy]

Character Segmentation

def split_image(file_path):
    file_name = get_file_name(file_path)
    img = read_image(file_path)

    # Preprocess the captcha
    threshold = pre_process_image(img, file_name)

    # Find the list of contour boundaries
    contours = find_counters(threshold)

    # Filter down to the suitable contour rectangles
    rect_list, result_rect = get_filter_rect(contours, img, file_name)

    # Cut out the rectangle images
    return split_rect_img(file_path, threshold, rect_list, result_rect)
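
The body of `get_filter_rect` is not shown, so the following is only a hypothetical sketch of the kind of filtering it might perform, not the article's implementation. It assumes rectangles arrive as `(x, y, w, h)` tuples, the format returned by `cv2.boundingRect`; the function name `filter_and_sort_rects` and the `min_area` and `max_ratio` defaults are illustrative.

```python
def filter_and_sort_rects(rects, img_w, img_h, min_area=20, max_ratio=0.8):
    """Keep plausible character boxes and order them left to right.

    rects: iterable of (x, y, w, h) tuples.
    min_area and max_ratio are illustrative values to tune per dataset.
    """
    kept = []
    for x, y, w, h in rects:
        if w * h < min_area:
            continue  # tiny inner contour or leftover noise
        if w > img_w * max_ratio and h > img_h * max_ratio:
            continue  # outer contour spanning almost the whole image
        kept.append((x, y, w, h))
    # Sort by the x coordinate so boxes follow the reading order.
    return sorted(kept, key=lambda r: r[0])


# Made-up rectangles for a 96x36 captcha: one outer frame, three characters, one speck.
rects = [(0, 0, 96, 36), (52, 6, 14, 22), (8, 5, 15, 24), (30, 7, 13, 21), (33, 12, 3, 3)]
boxes = filter_and_sort_rects(rects, img_w=96, img_h=36)
print(boxes)
```

Sorting by `x` matters because contour detection makes no promise about the order of its results, while the labels are read left to right.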

Challenges encountered:

The number of contours varies from image to image: some yield many, some only one or two, and a few none. The code discards oversized outer contours and tiny inner ones, then sorts the remaining rectangles by their coordinates to obtain all character boxes.

After segmentation, the character images differ in size, so each one must be padded to a fixed width and height; the padding parameters have to be tuned to the specific dataset.
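
As a rough illustration of the padding step, the sketch below centers a single-channel character image on a fixed-size canvas with NumPy. The function name `pad_to_size`, the 32x32 target, and the fill value are assumptions; the article does not state the sizes it uses.

```python
import numpy as np


def pad_to_size(char_img, width=32, height=32, fill=0):
    """Center a single-channel character image on a fixed-size canvas.

    Images larger than the canvas are cropped from the top-left corner;
    resizing them instead would be a reasonable alternative.
    """
    char_img = char_img[:height, :width]
    h, w = char_img.shape
    canvas = np.full((height, width), fill, dtype=char_img.dtype)
    top = (height - h) // 2
    left = (width - w) // 2
    canvas[top:top + h, left:left + w] = char_img
    return canvas


# Made-up 20x12 character crop filled with foreground pixels.
char = np.full((20, 12), 255, dtype=np.uint8)
padded = pad_to_size(char)
print(padded.shape)
```

The `fill` value should match the background of the binarized image (0 or 255, depending on the captcha) so that the padding is indistinguishable from the original background.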

Conclusion

The article demonstrated the complete workflow for collecting captcha images, preprocessing them with OpenCV, and segmenting characters using Python. The next article will introduce a high‑efficiency, reusable tool for generic captcha data labeling.
