Step‑by‑Step SpringBoot + Tess4j Guide to Implement OCR for PDF Images

This article walks through using Tess4j in a SpringBoot application to perform OCR on images, such as those extracted from PDF files. It compares popular OCR libraries and covers configuration details, code snippets, and tips for improving accuracy and performance.

SpringMeng

OCR library comparison

Tesseract (Google) : 100+ languages, moderate speed, simple SpringBoot integration.

PaddleOCR (Baidu) : 80+ languages, fast, excellent Chinese optimization, moderate SpringBoot integration.

EasyOCR (Jaided AI) : 80+ languages, fast, good Chinese support, simple SpringBoot integration.

TrOCR (Microsoft) : multilingual, excellent accuracy, moderate speed, complex SpringBoot integration.

Implementation

1. Description

SpringBoot combined with Tess4j (a Java wrapper for Tesseract) provides a lightweight RESTful OCR service. The service can process scanned documents, screenshots, or any image containing text.

2. Code

2.1 Add Maven dependency

<dependency>
  <groupId>net.sourceforge.tess4j</groupId>
  <artifactId>tess4j</artifactId>
  <!-- Add a <version> element here unless a parent POM or BOM already manages tess4j -->
</dependency>

2.2 Initialize Tesseract engine

Load the training data from new ClassPathResource("tess_data").getFile().getAbsolutePath(). When the application is packaged as a JAR, the resource is no longer a regular file on disk and getFile() fails; copying the resource to a temporary location (e.g., via a helper like TensorflowUtil) resolves the issue.

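On Linux, Tess4j does not bundle the native libraries, so install Tesseract and Leptonica (for example through the distribution's package manager) so that net.sourceforge.tess4j.TessAPI can load them.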

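Training data options:

tessdata_best : highest accuracy, slower.

tessdata : balanced speed and accuracy; the only one of the three sets that also ships the legacy engine models, which the combined engine mode (value 2) used below requires.

tessdata_fast : faster, slightly lower accuracy.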

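import net.sourceforge.tess4j.ITessAPI;
import net.sourceforge.tess4j.Tesseract;
import org.springframework.core.io.ClassPathResource;
import org.springframework.stereotype.Service;

@Service // assumes component scanning; omit if the bean is declared elsewhere
public class TesseractOcrModelService {
    private final Tesseract tesseract = new Tesseract();

    public TesseractOcrModelService() {
        try {
            // Works from an IDE or exploded build; getFile() fails inside a packaged JAR
            String folderPath = new ClassPathResource("tess_data").getFile().getAbsolutePath();
            tesseract.setDatapath(folderPath);
            // Combined Tesseract + LSTM mode (value 2); this is an OCR engine mode,
            // so it is set with setOcrEngineMode rather than setPageSegMode
            tesseract.setOcrEngineMode(ITessAPI.TessOcrEngineMode.OEM_TESSERACT_LSTM_COMBINED);
            tesseract.setLanguage("chi_sim"); // Simplified Chinese
        } catch (Exception e) {
            throw new RuntimeException("Failed to initialize Tesseract", e);
        }
    }

    public Tesseract getTesseract() { return tesseract; }
}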
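The packaged-JAR problem mentioned above can be avoided by streaming the training data out of the classpath instead of calling getFile(). The following is a minimal sketch using only the JDK, not the author's TensorflowUtil helper; the class name TessDataExtractor, the method extract, and the file name in the usage note are assumptions made for illustration.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.List;

/** Hypothetical helper: copies traineddata files from the classpath into a temp directory. */
public class TessDataExtractor {

    /**
     * @param loader      class loader that can see the resources
     * @param resourceDir classpath folder, e.g. "tess_data"
     * @param fileNames   files to copy, e.g. "chi_sim.traineddata"
     * @return absolute path of the temporary directory holding the copies
     */
    public static String extract(ClassLoader loader, String resourceDir, List<String> fileNames)
            throws IOException {
        Path target = Files.createTempDirectory("tessdata-");
        for (String name : fileNames) {
            String resource = resourceDir + "/" + name;
            // getResourceAsStream works for exploded directories and for entries inside a JAR
            try (InputStream in = loader.getResourceAsStream(resource)) {
                if (in == null) {
                    throw new IOException("Classpath resource not found: " + resource);
                }
                Files.copy(in, target.resolve(name), StandardCopyOption.REPLACE_EXISTING);
            }
        }
        return target.toAbsolutePath().toString();
    }
}
```

Under these assumptions, the constructor could pass the returned path to setDatapath, for example with extract(getClass().getClassLoader(), "tess_data", List.of("chi_sim.traineddata")). The file names have to be listed explicitly because a folder inside a JAR cannot be enumerated like a directory on disk.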

2.3 RESTful controller

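import java.awt.image.BufferedImage;
import javax.imageio.ImageIO;
import lombok.RequiredArgsConstructor;
import net.sourceforge.tess4j.Tesseract;
import org.springframework.web.bind.annotation.*;
import org.springframework.web.multipart.MultipartFile;

// Result is the project's own response wrapper (see the source repository).
@RestController
@RequestMapping("ocr")
@RequiredArgsConstructor
public class OcrController {
    private final TesseractOcrModelService tesseractOcrModelService;

    @PostMapping("/detection")
    public Result<String> ocrDetection(@RequestParam("file") MultipartFile file) {
        try {
            // ImageIO.read returns null when it cannot decode the upload
            BufferedImage image = ImageIO.read(file.getInputStream());
            if (image == null) {
                throw new IllegalArgumentException("Unsupported or corrupt image");
            }
            Tesseract tesseract = tesseractOcrModelService.getTesseract();
            return Result.success(tesseract.doOCR(image));
        } catch (Exception e) {
            throw new RuntimeException("OCR failed", e);
        }
    }
}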
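The controller above shares a single Tesseract instance across all request threads. A Tess4j Tesseract object keeps its native engine handle in instance fields, so concurrent doOCR calls on the same instance are generally considered unsafe. One simple option is to serialize access, sketched below with JDK classes only. The name OcrGuard and its method runExclusive are hypothetical, and a pool of instances would be an alternative when throughput matters.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.locks.ReentrantLock;

/** Hypothetical guard that lets only one OCR call run at a time. */
public class OcrGuard {
    private final ReentrantLock lock = new ReentrantLock();

    /**
     * Runs the task while holding the lock. Callable is used so that the task
     * may throw checked exceptions, such as the one declared by doOCR.
     */
    public <T> T runExclusive(Callable<T> task) throws Exception {
        lock.lock();
        try {
            return task.call();
        } finally {
            // Always release, even if the OCR call fails
            lock.unlock();
        }
    }
}
```

With such a guard held as a field, the controller call could become guard.runExclusive(() -> tesseract.doOCR(image)). Requests then queue behind one another, which trades throughput for correctness.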

Source repository

https://gitee.com/fateyifei/yf

Observations

Tess4j reliably recognizes ID numbers, phone numbers, and English words. Using the free training data, Chinese character recognition is less accurate. Higher quality can be achieved by training custom data sets or by invoking third‑party OCR APIs such as Google Cloud Vision, Microsoft Azure OCR, or Amazon Textract.
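Before retraining or switching to a paid API, a cheaper step that is often tried is preprocessing the image, since Tesseract tends to do better on larger, high-contrast text. The sketch below upscales an image and converts it to grayscale with JDK classes only. The class name OcrPreprocessor and the scale factor are assumptions for illustration, and whether this helps depends on the input images.

```java
import java.awt.Graphics2D;
import java.awt.RenderingHints;
import java.awt.image.BufferedImage;

/** Hypothetical preprocessing helper: upscales and converts to grayscale before OCR. */
public class OcrPreprocessor {

    public static BufferedImage prepare(BufferedImage source, double scale) {
        if (scale <= 0) {
            throw new IllegalArgumentException("scale must be positive");
        }
        int width = Math.max(1, (int) Math.round(source.getWidth() * scale));
        int height = Math.max(1, (int) Math.round(source.getHeight() * scale));
        // Drawing into a TYPE_BYTE_GRAY image discards the color information
        BufferedImage gray = new BufferedImage(width, height, BufferedImage.TYPE_BYTE_GRAY);
        Graphics2D g = gray.createGraphics();
        try {
            // Bicubic interpolation keeps character edges smoother when enlarging
            g.setRenderingHint(RenderingHints.KEY_INTERPOLATION,
                    RenderingHints.VALUE_INTERPOLATION_BICUBIC);
            g.drawImage(source, 0, 0, width, height, null);
        } finally {
            g.dispose();
        }
        return gray;
    }
}
```

The result of prepare(image, 2.0) could be passed to doOCR in place of the raw upload; the factor 2.0 is only a starting point to tune against real samples.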
