Why I Dropped WPS for Python: Handling 30 000 Images and 300 000 Records with Batch Pre‑processing

Faced with 30,000 report images containing 300,000 rows of tabular data, the author explains why WPS failed at scale, analyzes the OCR error sources, and shares a Python script that batch‑crops images to boost recognition accuracy before exporting everything to Excel.

Python Crawling & Data Mining

Business Scenario

The project required extracting structured table data from over 30,000 business report images, which together contain about 300,000 data rows, and consolidating everything into a single Excel worksheet.

Why Traditional WPS Workflow Collapsed

Most people assume WPS can convert images to tables, but in a real batch setting it breaks down:

Batch importing images into WPS crashes; the application cannot handle the volume.

Workarounds such as converting images to PDF and then to tables achieve only about 95% accuracy, leaving a 5% error rate that, across 300,000 rows, translates to roughly 15,000 wrong records.

The low-code tool WorkBuddy cannot invoke WPS's high-precision OCR, and a custom recognizer performed poorly.

When images are processed one at a time, WPS OCR is reported to be 100% accurate; the bottleneck is the lack of an automated, stable batch pipeline.
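To make the scale of the problem concrete, here is a small back-of-the-envelope sketch. It assumes accuracy is measured per row, which the article does not state explicitly, and `expected_bad_rows` is a hypothetical helper name used only for illustration.

```python
def expected_bad_rows(total_rows: int, accuracy: float) -> int:
    """Estimate how many rows are wrong at a given per-row accuracy (assumed metric)."""
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must be between 0 and 1")
    return round(total_rows * (1.0 - accuracy))


TOTAL_ROWS = 300_000  # row count quoted in the article

# Compare the 95% workaround with the higher accuracy levels a batch pipeline would aim for
for acc in (0.95, 0.99, 0.999):
    bad = expected_bad_rows(TOTAL_ROWS, acc)
    print(f"accuracy {acc:.1%}: about {bad:,} rows to correct by hand")
```

Even at 99% the manual clean-up would still run to thousands of rows, which is why per-image accuracy alone does not settle the question.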

Root Cause of OCR Errors

Manual testing showed that dense header text and crowded fields at the top of the images generate the majority of recognition mistakes. Those header sections are already archived and do not need OCR, so removing them should improve overall accuracy.
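The idea of dropping the header band can be expressed as a crop box computed from each image's own size. This is a minimal sketch, not the author's code: `header_free_box` and `header_px` are hypothetical names, and the 120-pixel header height is simply the `crop_top` value used in the script later in this article. The returned tuple follows the (left, upper, right, lower) order that Pillow's `Image.crop` expects.

```python
def header_free_box(width, height, header_px):
    """Return a (left, upper, right, lower) box that keeps everything below the header.

    width, height: image size in pixels
    header_px: height of the top band to discard (assumed constant per image set)
    """
    if width <= 0 or height <= 0:
        raise ValueError("image dimensions must be positive")
    if not 0 <= header_px < height:
        raise ValueError("header_px must be smaller than the image height")
    return (0, header_px, width, height)


# Example with an arbitrary image size; the 120 px header height is an assumption
print(header_free_box(2480, 3508, 120))
```

Deriving the right and bottom edges from the image size avoids hard-coding dimensions that may not match every scan.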

Python Batch Image Cropping Solution

The author provides a self‑contained Python script that crops each image to discard the interfering top region, preparing the files for high‑precision OCR.

# Batch image cropping preprocessing code
import os
from PIL import Image

# Configure paths
input_path = "./img_raw"   # Original image folder
output_path = "./img_cut"   # Cropped image folder
os.makedirs(output_path, exist_ok=True)

# Cropping parameters (left, top, right, bottom) in pixels; adjust per image set
crop_left = 0
crop_top = 120
crop_right = 9000
crop_bottom = 5000

def crop_image(img_path, save_path):
    """Crop one image and save it; return True on success, False on failure."""
    try:
        with Image.open(img_path) as img:
            # Clamp the box to the image size so the crop never pads past the edges
            right = min(crop_right, img.width)
            bottom = min(crop_bottom, img.height)
            new_img = img.crop((crop_left, crop_top, right, bottom))
            new_img.save(save_path)
        return True
    except Exception as e:
        print("Cropping failed:", img_path, e)
        return False

# Process all image files in the input directory (extension match is case-insensitive)
failed = 0
for file in sorted(os.listdir(input_path)):
    if file.lower().endswith((".jpg", ".png", ".jpeg")):
        if crop_image(os.path.join(input_path, file), os.path.join(output_path, file)):
            print("Processed:", file)
        else:
            failed += 1

print(f"Cropping finished with {failed} failure(s); ready for OCR stage")

After running the script, every image is trimmed to remove the noisy header, which should reduce OCR mis-recognition in the region that caused most of the errors during manual testing.
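With 30,000 files, it is worth confirming that no image was skipped before moving on to OCR. The following is a hedged sketch of such a check, not part of the author's script: `list_images` and `missing_outputs` are hypothetical helpers, and the folder names are the ones used in the script above.

```python
import os

IMAGE_EXTS = (".jpg", ".jpeg", ".png")  # same extensions the cropping script filters on


def list_images(folder):
    """Return the set of image file names in a folder (case-insensitive match)."""
    return {name for name in os.listdir(folder) if name.lower().endswith(IMAGE_EXTS)}


def missing_outputs(input_dir, output_dir):
    """Return sorted names of images in input_dir that have no file in output_dir."""
    return sorted(list_images(input_dir) - list_images(output_dir))


if __name__ == "__main__":
    raw, cut = "./img_raw", "./img_cut"  # folder names taken from the script above
    if os.path.isdir(raw) and os.path.isdir(cut):
        missing = missing_outputs(raw, cut)
        print(f"{len(missing)} image(s) still need cropping")
        for name in missing:
            print("  missing:", name)
```

Because the check only compares file names, it detects skipped or failed files but says nothing about whether each crop removed the right region; spot-checking a few outputs by eye is still advisable.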

Key Takeaway

In large‑scale data extraction, preprocessing the images is often more critical than the OCR engine itself; invisible interference can corrupt tens of thousands of records. The next article will demonstrate how to run batch OCR on the cleaned images and push accuracy beyond 99%.

Example of original report image
Cropped image showing removed header