How to Preprocess Captcha Images with OpenCV for Python Scraping
This tutorial explains how to collect captcha images and preprocess them with OpenCV (grayscale conversion, median blur, binarization, contour detection, and character segmentation). It includes the core Python code and sample results, as one stage of a captcha-recognition pipeline for Python web scraping.
Introduction
The previous article introduced the project requirements and overall design for a Python captcha labeling and recognition system. This article continues by detailing data collection, image preprocessing, and character segmentation steps.
Data Collection
Captcha images are downloaded in batches from the provided URL. An initial set of about 20 images is manually renamed for labeling. The download script (image_download.py) is omitted for brevity.
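Since image_download.py is not shown, the following is only a minimal sketch of what a batch download could look like, using the standard library. It assumes the captcha endpoint returns a fresh image on every request and that the images are PNG files; the names fetch_bytes and download_batch are hypothetical and are not taken from the original script.

```python
import os
import time
import urllib.request


def fetch_bytes(url, timeout=10):
    """Download one resource and return its raw bytes."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read()


def download_batch(url, count, out_dir, fetch=fetch_bytes, delay=0.0):
    """Request `url` `count` times, saving each response as a numbered file."""
    os.makedirs(out_dir, exist_ok=True)
    saved = []
    for i in range(count):
        data = fetch(url)
        # Assumption: the endpoint serves PNG data; change the extension if not.
        path = os.path.join(out_dir, f"captcha_{i:04d}.png")
        with open(path, "wb") as f:
            f.write(data)
        saved.append(path)
        if delay:
            time.sleep(delay)  # pause between requests to go easy on the server
    return saved
```

A call such as download_batch("https://example.com/captcha", 20, "captcha_raw", delay=0.5) would save 20 numbered files, where the URL is a placeholder for the real endpoint. The fetch argument is injectable so the loop can be exercised without network access.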
Preprocessing
The preprocessing pipeline implemented with OpenCV follows this flow:
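Original image -> grayscale -> median blur -> binarization -> contour detection and drawing (only applicable in some cases) -> character cutting and padding
Key steps: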
Grayscale conversion removes color information, reduces data size, and simplifies subsequent filtering.
Median blur removes speckle noise by replacing each pixel with the median of its neighborhood, which preserves edges better than averaging.
Binarization maps every pixel to either 0 or 255 (the code below uses the mean gray level as the threshold), which makes contour detection straightforward.
Contour detection extracts a bounding rectangle for each character. It only helps in some cases: tightly spaced or touching characters merge into a single contour and cannot be separated this way.
Character segmentation and padding crop each character using the detected rectangles, then pad the crops to a uniform width and height.
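import cv2

# IS_SAVE_FILE and DST_IMG_DIR are module-level settings not shown in this article.
def pre_process_image(img, file_name):
    # Trim a 2-pixel border from every edge
    img = img[2:-2, 2:-2]
    # Convert to grayscale
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    # Remove noise with a 3x3 median filter
    blur = cv2.medianBlur(gray, 3)
    # Use the mean gray level as the threshold
    temp = gray.mean().item()
    # Binarize
    ret, threshold = cv2.threshold(blur, temp, 255, cv2.THRESH_BINARY)
    if IS_SAVE_FILE:
        cv2.imwrite(DST_IMG_DIR + file_name + "_threshold.png", threshold)
    return threshold
[Figure: sample preprocessing results]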
Character Segmentation
# The helper functions called here are not shown in this article.
def split_image(file_path):
    file_name = get_file_name(file_path)
    img = read_image(file_path)
    # Preprocess the captcha
    threshold = pre_process_image(img, file_name)
    # Find the list of contours
    contours = find_counters(threshold)
    # Keep only the rectangles that plausibly contain characters
    rect_list, result_rect = get_filter_rect(contours, img, file_name)
    # Cut the image into per-character pieces
    return split_rect_img(file_path, threshold, rect_list, result_rect)
Challenges encountered:
The number of contours varies: some images yield many, some only one or two, and a few none. The code filters out large outer contours and small inner ones, then sorts the remaining rectangles by coordinate to obtain the character boxes.
After segmentation, character images have inconsistent sizes; padding to a fixed width and height is required, with parameters tuned to the specific dataset.
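The two challenges above can be illustrated with a short sketch. It is not the author's get_filter_rect or split_rect_img, which the article does not show. The function names filter_rects and crop_and_pad, the area limits, and the 20x30 target size are assumptions chosen for the synthetic example, and would need tuning per dataset. Rectangles are (x, y, w, h) tuples as returned by cv2.boundingRect.

```python
import numpy as np

# Illustrative values only; tune them to the dataset.
MIN_AREA = 20          # drop specks and small inner contours
MAX_AREA_RATIO = 0.5   # drop boxes that cover most of the image
CHAR_W, CHAR_H = 20, 30


def filter_rects(rects, img_w, img_h):
    """Keep plausible character boxes (x, y, w, h), sorted left to right."""
    img_area = img_w * img_h
    kept = [r for r in rects
            if MIN_AREA <= r[2] * r[3] <= MAX_AREA_RATIO * img_area]
    return sorted(kept, key=lambda r: r[0])


def crop_and_pad(binary, rect, fill=255):
    """Crop one box and centre it on a CHAR_H x CHAR_W canvas."""
    x, y, w, h = rect
    # Clip oversized crops so they always fit the canvas.
    crop = binary[y:y + h, x:x + w][:CHAR_H, :CHAR_W]
    ch, cw = crop.shape
    canvas = np.full((CHAR_H, CHAR_W), fill, dtype=binary.dtype)
    top, left = (CHAR_H - ch) // 2, (CHAR_W - cw) // 2
    canvas[top:top + ch, left:left + cw] = crop
    return canvas


# Synthetic example: white background with two dark "characters".
img = np.full((40, 80), 255, dtype=np.uint8)
img[10:30, 10:22] = 0
img[12:28, 40:50] = 0
# Candidate boxes: the whole image, both characters (unsorted), and a speck.
rects = [(0, 0, 80, 40), (40, 12, 10, 16), (10, 10, 12, 20), (70, 5, 2, 2)]
kept = filter_rects(rects, 80, 40)
chars = [crop_and_pad(img, r) for r in kept]
```

This sketch pads with the background color and clips anything larger than the canvas, which loses pixels from oversized crops. Scaling with cv2.resize before padding is an alternative when character sizes vary widely.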
Conclusion
The article demonstrated the complete workflow for collecting captcha images, preprocessing them with OpenCV, and segmenting characters using Python. The next article will introduce a high‑efficiency, reusable tool for generic captcha data labeling.
Python Crawling & Data Mining