
Using pytesseract and Pillow for OCR: Installation, Configuration, and Accuracy Improvement Techniques

This guide explains how to install Tesseract OCR and the Python libraries pytesseract and Pillow, configure the engine path, perform image-to-text extraction with example code, and apply various preprocessing, detection, and post‑processing methods to significantly improve OCR accuracy.

Test Development Learning Exchange

Dec 6, 2024

This article provides a step‑by‑step tutorial for performing OCR using the Tesseract engine with Python's pytesseract and Pillow libraries.

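1. Install required libraries – Install the Tesseract OCR engine first (Windows: download and run the installer; macOS: brew install tesseract; Debian/Ubuntu Linux: sudo apt-get install tesseract-ocr), then install the Python packages with pip install pytesseract pillow. Recognizing Simplified Chinese text also requires the chi_sim language data (on Debian/Ubuntu, the tesseract-ocr-chi-sim package).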
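A quick way to confirm the installation is to ask the tesseract binary which language models it can see. The sketch below is not part of the original tutorial: it assumes the tesseract executable is on PATH and relies on the command-line flag --list-langs. The helper names parse_languages and installed_languages are illustrative.

```python
import shutil
import subprocess


def parse_languages(output: str) -> list:
    """Extract language codes from `tesseract --list-langs` output."""
    langs = []
    for line in output.splitlines():
        line = line.strip()
        # Skip blank lines and the "List of available languages ..." header.
        if not line or line.lower().startswith("list of"):
            continue
        langs.append(line)
    return langs


def installed_languages() -> list:
    """Return installed language codes, or [] if tesseract is not on PATH."""
    exe = shutil.which("tesseract")
    if exe is None:
        return []
    result = subprocess.run(
        [exe, "--list-langs"], capture_output=True, text=True, check=False
    )
    # Some Tesseract versions print the list to stderr instead of stdout.
    return parse_languages(result.stdout or result.stderr)


if __name__ == "__main__":
    langs = installed_languages()
    print("Installed languages:", langs or "tesseract not found")
    if "chi_sim" not in langs:
        print("chi_sim is missing; Simplified Chinese OCR will not work yet.")
```

If chi_sim does not appear in the list, the Chinese examples in the later steps will probably fail until the language data is installed.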

2. Set the Tesseract path (usually needed only on Windows) – If the tesseract executable is not on the system PATH, point pytesseract at it explicitly:

import pytesseract
pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'
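Hard-coding the path works on one machine but breaks on others. As a hedged alternative, the sketch below looks for tesseract on PATH first and only then falls back to the Windows default location used above. The function names find_tesseract and configure_pytesseract are illustrative, not part of pytesseract.

```python
import os
import shutil

# Default location used by the Windows installer, as in the snippet above.
WINDOWS_DEFAULT = r"C:\Program Files\Tesseract-OCR\tesseract.exe"


def find_tesseract(candidates=(WINDOWS_DEFAULT,)):
    """Return a usable tesseract path, or None if none is found."""
    on_path = shutil.which("tesseract")
    if on_path:
        return on_path
    for candidate in candidates:
        if os.path.isfile(candidate):
            return candidate
    return None


def configure_pytesseract():
    """Point pytesseract at the located executable and return the path used."""
    import pytesseract  # imported here so find_tesseract works without it

    path = find_tesseract()
    if path is None:
        raise FileNotFoundError("Tesseract executable not found")
    pytesseract.pytesseract.tesseract_cmd = path
    return path
```

Calling configure_pytesseract() once at program start keeps the rest of the script free of platform-specific paths.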

3. Read image and perform OCR – Example code loads an image and extracts text:

import pytesseract
from PIL import Image
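image_path = 'id_card.jpg'  # path to the image to recognize
image = Image.open(image_path)
# lang='chi_sim' selects the Simplified Chinese model (requires the chi_sim language data)
text = pytesseract.image_to_string(image, lang='chi_sim')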
print(text)
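Beyond the language, Tesseract's page segmentation mode (--psm) and engine mode (--oem) often affect results. The following is a minimal sketch, assuming the chi_sim and eng language data are installed. build_config and ocr_file are hypothetical helper names, and the defaults chosen here (psm 6, oem 3) are only a starting point to experiment with.

```python
def build_config(psm: int = 6, oem: int = 3) -> str:
    """Build a Tesseract config string from segmentation and engine modes."""
    if not 0 <= psm <= 13:
        raise ValueError("psm must be between 0 and 13")
    if not 0 <= oem <= 3:
        raise ValueError("oem must be between 0 and 3")
    return f"--oem {oem} --psm {psm}"


def ocr_file(image_path, lang="chi_sim+eng", psm=6):
    """Run OCR on one image file with an explicit language and layout mode."""
    # Imported lazily so build_config can be used without the OCR stack.
    import pytesseract
    from PIL import Image

    with Image.open(image_path) as image:
        # "chi_sim+eng" asks Tesseract to use both models on the same image.
        return pytesseract.image_to_string(
            image, lang=lang, config=build_config(psm)
        )
```

For example, ocr_file('id_card.jpg', psm=11) would try sparse-text mode, which may suit cards whose fields are scattered across the image.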

4. Process OCR results – Use regular expressions to extract specific fields such as name and ID number:

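import re
# The patterns match the Chinese field labels printed on the card:
# 姓名 (name) and 公民身份号码 (citizen ID number).
name_pattern = re.compile(r'姓名\s*([\u4e00-\u9fa5]+)')
# The 18th character is a check character that may be X, so \d{18} alone would miss such IDs.
id_pattern = re.compile(r'公民身份号码\s*(\d{17}[\dXx])')
name_match = name_pattern.search(text)
id_match = id_pattern.search(text)
if name_match:
    print(f"Name: {name_match.group(1)}")
else:
    print("Name not found")
if id_match:
    print(f"Citizen ID number: {id_match.group(1)}")
else:
    print("Citizen ID number not found")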
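OCR output for long digit strings often contains stray spaces or a misread character, so it can help to validate the extracted number before trusting it. The sketch below is an addition, not the original author's method: it removes spaces and then verifies the ISO 7064 MOD 11-2 check character used by 18-character resident ID numbers. The function names are illustrative, and the number in the usage note is a sample chosen so that its checksum works out.

```python
import re

# Position weights and check characters of the MOD 11-2 scheme.
WEIGHTS = (7, 9, 10, 5, 8, 4, 2, 1, 6, 3, 7, 9, 10, 5, 8, 4, 2)
CHECK_CHARS = "10X98765432"

ID_PATTERN = re.compile(r"公民身份号码\s*(\d{17}[\dXx])")


def has_valid_checksum(id_number: str) -> bool:
    """Return True if the 18th character matches the computed check character."""
    if not re.fullmatch(r"\d{17}[\dXx]", id_number):
        return False
    total = sum(int(digit) * weight for digit, weight in zip(id_number, WEIGHTS))
    return CHECK_CHARS[total % 11] == id_number[17].upper()


def extract_id(text: str):
    """Return a checksum-valid ID number found in OCR text, or None."""
    # Drop spaces, tabs, and full-width spaces that OCR may insert mid-number.
    compact = re.sub(r"[ \t\u3000]+", "", text)
    match = ID_PATTERN.search(compact)
    if match is None:
        return None
    candidate = match.group(1).upper()
    return candidate if has_valid_checksum(candidate) else None
```

With this in place, extract_id("公民身份号码 110105 19491231 002x") yields the cleaned number, while a string with a single misread digit is rejected rather than silently accepted.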

5. Improve OCR accuracy – Techniques include image preprocessing (brightness, contrast, denoising, sharpening), using appropriate language models, and block‑wise recognition. Example preprocessing code:

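import pytesseract
from PIL import Image, ImageEnhance
image = Image.open(image_path)  # image_path as defined in step 3
# A factor above 1.0 brightens the image; 1.0 leaves it unchanged
enhancer = ImageEnhance.Brightness(image)
image_enhanced = enhancer.enhance(1.5)
# Apply the same kind of boost to contrast, starting from the brightened image
enhancer = ImageEnhance.Contrast(image_enhanced)
image_enhanced = enhancer.enhance(1.5)
text = pytesseract.image_to_string(image_enhanced, lang='chi_sim')
print(text)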
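The denoising and sharpening steps mentioned above can be sketched with Pillow's built-in filters. The example also adds binarization with an Otsu threshold, a technique the original text does not name; it is included here as a common companion step. The 3x3 median filter size and the function names are assumptions to tune for real images.

```python
def otsu_threshold(histogram):
    """Return the gray level that best separates dark and light pixels.

    `histogram` is a sequence of 256 pixel counts, such as the result of
    Image.histogram() on a mode "L" image.
    """
    total = sum(histogram)
    weighted_total = sum(level * count for level, count in enumerate(histogram))
    best_level, best_variance = 0, -1.0
    dark_count = 0
    dark_sum = 0
    for level, count in enumerate(histogram):
        dark_count += count
        dark_sum += level * count
        light_count = total - dark_count
        if dark_count == 0:
            continue
        if light_count == 0:
            break
        dark_mean = dark_sum / dark_count
        light_mean = (weighted_total - dark_sum) / light_count
        # Between-class variance (up to a constant factor).
        variance = dark_count * light_count * (dark_mean - light_mean) ** 2
        if variance > best_variance:
            best_level, best_variance = level, variance
    return best_level


def preprocess_for_ocr(image):
    """Grayscale, denoise, sharpen, then binarize a PIL image."""
    from PIL import ImageFilter  # requires Pillow

    gray = image.convert("L")
    denoised = gray.filter(ImageFilter.MedianFilter(size=3))
    sharpened = denoised.filter(ImageFilter.SHARPEN)
    threshold = otsu_threshold(sharpened.histogram())
    # Pixels brighter than the threshold become white, the rest black.
    return sharpened.point(lambda p: 255 if p > threshold else 0)
```

The returned image can be passed to pytesseract.image_to_string in place of image_enhanced. Whether binarization helps depends on the source image, so it is worth comparing results with and without it.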

Additional methods cover text region detection with OpenCV, selecting high‑quality OCR engines (Tesseract, Google Cloud Vision, ABBYY FineReader), applying language models and dictionaries, block‑wise OCR, and post‑processing such as spell checking and regex extraction.
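One way to approach block-wise recognition without leaving pytesseract is image_to_data, which returns each recognized word together with its block number and confidence. The sketch below is one possible interpretation of the idea, not the original author's implementation. It groups words by block and drops low-confidence ones; the cutoff of 60 and the helper names are assumptions.

```python
from collections import defaultdict


def group_words_by_block(data, min_conf=60.0):
    """Group recognized words by block number, dropping low-confidence ones.

    `data` is the dict returned by
    pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT).
    """
    blocks = defaultdict(list)
    for text, conf, block_num in zip(data["text"], data["conf"], data["block_num"]):
        # conf may be a string or a number depending on the pytesseract
        # version; -1 marks layout entries that carry no word.
        if not text.strip() or float(conf) < min_conf:
            continue
        blocks[block_num].append(text)
    # Words are joined with spaces; for Chinese text an empty separator
    # may read more naturally.
    return {num: " ".join(words) for num, words in sorted(blocks.items())}


def ocr_by_block(image, lang="chi_sim", min_conf=60.0):
    """Run OCR on a PIL image and return {block number: text}."""
    import pytesseract  # imported lazily; requires Tesseract to be installed

    data = pytesseract.image_to_data(
        image, lang=lang, output_type=pytesseract.Output.DICT
    )
    return group_words_by_block(data, min_conf)
```

Each block's text can then be post-processed separately, for example by running the regular expressions from step 4 only on the block that contains the field label.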

Summary – By combining proper installation, configuration, preprocessing, region detection, engine selection, and post‑processing, OCR recognition accuracy can be substantially enhanced.
