Using AI and RPA to Solve Slider Captcha: A Practical Implementation with YOLOv8 and PyAutoGUI
This article demonstrates how to combine AI‑based object detection (YOLOv8) with robotic process automation (pyautogui) to automatically locate a slider captcha's pieces and perform the drag‑and‑release, covering data preparation, model training, screen capture, coordinate extraction, mouse simulation, and robustness improvements.
The author explains why traditional slider‑captcha solving methods (HTML element inspection, OpenCV binarization, RGB pixel comparison) are fragile and proposes an AI‑driven solution that mimics a human user’s mouse actions.
First, a feasibility test is performed using pyautogui to move the mouse from a manually measured start coordinate (260, 940) to the target position (363, 940), confirming that simple mouse simulation works.
Next, a YOLOv8 model is trained on over 100 captcha screenshots annotated with labelImg. The best model weights (best.pt) are obtained after 100 epochs, achieving high detection confidence for objects such as the start block, target block, operate block, and refresh button.
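The article does not show the training configuration; the sketch below is an assumed dataset YAML in the format ultralytics YOLOv8 expects, using the five class names that appear later in the detection code (the file name captcha.yaml and the folder paths are hypothetical):

```yaml
# captcha.yaml — hypothetical dataset config for ultralytics YOLOv8
path: datasets/captcha    # dataset root (assumed layout)
train: images/train       # the ~100 annotated screenshots
val: images/val
names:
  0: start
  1: target
  2: fill
  3: operate
  4: refresh
```

Training would then be a single call along the lines of YOLO('yolov8n.pt').train(data='captcha.yaml', epochs=100), after which ultralytics saves best.pt under its runs/ output directory.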
# First install the library: pip install pyautogui
import pyautogui
# Set the starting position (measured manually)
start_x = 260
start_y = 940
# Set the drag distance
drag_distance = 363 - 260
# Move to the starting position
pyautogui.moveTo(start_x, start_y, duration=1)
# Press the left mouse button
pyautogui.mouseDown()
# Drag the mouse horizontally
pyautogui.moveRel(drag_distance, 0, duration=1)
# Release the left mouse button
pyautogui.mouseUp()

For full automation, the screen is captured with pyautogui.screenshot(), optionally cropped to the captcha region, and fed to the YOLOv8 model:
import pyautogui
from ultralytics import YOLO
# Capture the entire screen
screenshot = pyautogui.screenshot()
# Crop the captcha region
x1, y1, x2, y2 = 140, 110, 570, 510
cropped = screenshot.crop((x1, y1, x2, y2))
# Run detection
model = YOLO('best.pt')
results = model.predict(source=cropped, save=True)

The detection results provide bounding boxes with class indices and confidence scores. By extracting the centers of the start and target blocks, the required drag distance is computed, and the mouse is moved to the operate block before performing the drag:
import numpy as np

# Detection results: each row is [x1, y1, x2, y2, confidence, class]
b_datas = results[0].boxes.data
points = {}
names = {0: 'start', 1: 'target', 2: 'fill', 3: 'operate', 4: 'refresh'}
for box in b_datas:
    if box[4] > 0.65:  # keep detections above the confidence threshold
        name = names[int(box[5])]
        points[name] = np.array(box[:4], np.int32)
# Fall back to the operate block if the start block was not detected
start_box = points["start"] if "start" in points else points["operate"]
target_box = points["target"]
operate_box = points["operate"]
# Horizontal centers of the start and target blocks give the drag distance
centerx_start = (start_box[0] + start_box[2]) // 2
centerx_target = (target_box[0] + target_box[2]) // 2
drag_distance = centerx_target - centerx_start
# Center of the operate block, offset back to full-screen coordinates
centerx_op = (operate_box[0] + operate_box[2]) / 2 + x1
centery_op = (operate_box[1] + operate_box[3]) / 2 + y1
pyautogui.moveTo(centerx_op, centery_op, duration=1)
pyautogui.mouseDown()
pyautogui.moveRel(drag_distance, 0, duration=1)
pyautogui.mouseUp()

To improve robustness, the script checks for missing elements and falls back to using the operate block as the start point when the start block is not detected, raising an exception if essential objects are absent.
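That fallback-and-raise logic can be factored into a small helper. This is an illustrative sketch, not the article's code; the function name resolve_start is an assumption:

```python
def resolve_start(points):
    """Return the box to drag from, falling back to the operate block.

    Raises ValueError when essential captcha elements are absent,
    mirroring the robustness check described above.
    """
    if "target" not in points or "operate" not in points:
        raise ValueError("essential captcha elements not detected")
    # Prefer the detected start block; otherwise use the operate block
    return points.get("start", points["operate"])

# Example: only target and operate were detected, so the helper falls back
points = {"target": (300, 120, 340, 160), "operate": (60, 200, 100, 240)}
print(resolve_start(points))  # → (60, 200, 100, 240)
```

Keeping the check in one place means the drag code can assume valid boxes and simply refresh the captcha when the exception is raised.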
The article concludes that while the AI + RPA pipeline can reliably solve slider captchas, further work is needed for commercial‑grade stability, such as automatic window detection, dynamic region selection, and handling diverse captcha designs.
Rare Earth Juejin Tech Community