Step‑by‑Step SpringBoot + Tess4j Guide to Implement OCR for PDF Images
This article walks through extracting images from PDF files in a SpringBoot application and using Tess4j to perform OCR, comparing popular OCR libraries, showing configuration details, code snippets, and tips for improving accuracy and performance.
OCR library comparison
Tesseract (Google) : 100+ languages, moderate speed, simple SpringBoot integration.
PaddleOCR (Baidu) : 80+ languages, fast, excellent Chinese optimization, moderate SpringBoot integration.
EasyOCR (Jaided AI) : 80+ languages, fast, good Chinese support, simple SpringBoot integration.
TrOCR (Microsoft) : multilingual, excellent accuracy, moderate speed, complex SpringBoot integration.
Implementation
1. Description
SpringBoot combined with Tess4j (a Java wrapper for Tesseract) provides a lightweight RESTful OCR service. The service can process scanned documents, screenshots, or any image containing text.
2. Code
2.1 Add Maven dependency
<dependency>
    <groupId>net.sourceforge.tess4j</groupId>
    <artifactId>tess4j</artifactId>
    <!-- version omitted in the original; add a <version> element unless a parent POM manages it -->
</dependency>
2.2 Initialize Tesseract engine
Load training data from new ClassPathResource("tess_data").getFile().getAbsolutePath(). When the application is packaged as a JAR, the resource is no longer a regular file on disk, so getFile() fails; copying the resource to a temporary location (e.g., via a helper like TensorflowUtil) resolves the issue.
On Linux, ensure the native Tesseract libraries that net.sourceforge.tess4j.TessAPI binds to are installed.
Training data options:
tessdata_best: highest accuracy, slower.
tessdata: balanced speed and accuracy. It is the only set that also contains the legacy engine models, which the combined engine mode used below relies on.
tessdata_fast: faster, slightly lower accuracy.
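The JAR packaging caveat above can be handled by copying the training data out of the classpath before calling setDatapath. The following is a minimal sketch using only the JDK. The class name TessDataExtractor and its methods are hypothetical and are not taken from Tess4j or from the original project, which refers to its own helper (TensorflowUtil).

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper, not part of Tess4j or the original project.
public final class TessDataExtractor {
    private TessDataExtractor() {}

    /** Creates a fresh temporary directory to hold the extracted training data. */
    public static Path newTempDir() throws IOException {
        return Files.createTempDirectory("tess_data");
    }

    /**
     * Copies one training-data stream into targetDir under fileName and
     * returns targetDir, which is the directory Tesseract should use as its data path.
     */
    public static Path copyTo(Path targetDir, String fileName, InputStream data) throws IOException {
        Files.createDirectories(targetDir);
        Path target = targetDir.resolve(fileName);
        // Overwrite any stale copy left by a previous run.
        Files.copy(data, target, StandardCopyOption.REPLACE_EXISTING);
        return targetDir;
    }
}
```

In the service constructor, one possible use is to open the stream with new ClassPathResource("tess_data/chi_sim.traineddata").getInputStream(), pass it to copyTo together with a directory from newTempDir(), and hand the returned path (as a string) to setDatapath. The file name here assumes Tesseract's usual <language>.traineddata naming for the chi_sim language.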
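import net.sourceforge.tess4j.ITessAPI;
import net.sourceforge.tess4j.Tesseract;
import org.springframework.core.io.ClassPathResource;
import org.springframework.stereotype.Service;

@Service // registered as a bean so the controller can inject it
public class TesseractOcrModelService {
    private final Tesseract tesseract = new Tesseract();

    public TesseractOcrModelService() {
        try {
            // Works from an IDE or exploded directory; fails inside a packaged JAR
            String folderPath = new ClassPathResource("tess_data").getFile().getAbsolutePath();
            tesseract.setDatapath(folderPath);
            // Combined Tesseract + LSTM engine mode (value 2); this is an engine mode, not a page segmentation mode
            tesseract.setOcrEngineMode(ITessAPI.TessOcrEngineMode.OEM_TESSERACT_LSTM_COMBINED);
            tesseract.setLanguage("chi_sim"); // Simplified Chinese
        } catch (Exception e) {
            throw new RuntimeException("Failed to initialize Tesseract", e);
        }
    }

    public Tesseract getTesseract() { return tesseract; }
}
2.3 RESTful controller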
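import java.awt.image.BufferedImage;
import javax.imageio.ImageIO;
import lombok.RequiredArgsConstructor;
import net.sourceforge.tess4j.Tesseract;
import org.springframework.web.bind.annotation.*;
import org.springframework.web.multipart.MultipartFile;

@RestController
@RequestMapping("ocr")
@RequiredArgsConstructor
public class OcrController {
    private final TesseractOcrModelService tesseractOcrModelService;

    // Result is assumed to be the project's own response wrapper; it is not shown in this article
    @PostMapping("/detection")
    public Result<String> ocrDetection(@RequestParam("file") MultipartFile file) {
        try {
            BufferedImage image = ImageIO.read(file.getInputStream());
            if (image == null) { // ImageIO.read returns null when no reader supports the format
                throw new IllegalArgumentException("Unsupported or unreadable image");
            }
            Tesseract tesseract = tesseractOcrModelService.getTesseract();
            return Result.success(tesseract.doOCR(image));
        } catch (Exception e) {
            throw new RuntimeException("OCR failed: could not read or recognize the image", e);
        }
    }
}
Source repository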
https://gitee.com/fateyifei/yf
Observations
Tess4j reliably recognizes ID numbers, phone numbers, and English words. Using the free training data, Chinese character recognition is less accurate. Higher quality can be achieved by training custom data sets or by invoking third‑party OCR APIs such as Google Cloud Vision, Microsoft Azure OCR, or Amazon Textract.
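Before turning to custom training or a paid API, one commonly tried step is to preprocess the image so the text is larger and has higher contrast. The following is a minimal sketch using only the JDK: it upscales the image, converts it to grayscale, and applies a fixed threshold. The class name OcrPreprocessor, the scale factor, and the threshold value are illustrative assumptions, not settings from the original project, and whether this helps depends on the input images.

```java
import java.awt.Graphics2D;
import java.awt.RenderingHints;
import java.awt.image.BufferedImage;
import java.awt.image.WritableRaster;

// Hypothetical helper, not part of Tess4j or the original project.
public final class OcrPreprocessor {
    private OcrPreprocessor() {}

    /**
     * Returns a black-and-white copy of src, enlarged by the given integer scale.
     * Pixels darker than threshold (0-255) become black, all others white.
     */
    public static BufferedImage prepare(BufferedImage src, int scale, int threshold) {
        int w = src.getWidth() * scale;
        int h = src.getHeight() * scale;

        // Drawing into a TYPE_BYTE_GRAY image scales and converts to grayscale in one pass.
        BufferedImage gray = new BufferedImage(w, h, BufferedImage.TYPE_BYTE_GRAY);
        Graphics2D g = gray.createGraphics();
        g.setRenderingHint(RenderingHints.KEY_INTERPOLATION,
                RenderingHints.VALUE_INTERPOLATION_BICUBIC);
        g.drawImage(src, 0, 0, w, h, null);
        g.dispose();

        // Fixed-threshold binarization on the single gray band.
        WritableRaster raster = gray.getRaster();
        for (int y = 0; y < h; y++) {
            for (int x = 0; x < w; x++) {
                int level = raster.getSample(x, y, 0);
                raster.setSample(x, y, 0, level < threshold ? 0 : 255);
            }
        }
        return gray;
    }
}
```

The result can be passed to tesseract.doOCR(BufferedImage) in place of the raw upload, for example OcrPreprocessor.prepare(image, 2, 128). A fixed threshold is the simplest choice; unevenly lit scans may need an adaptive method instead.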
SpringMeng
Focused on software development, sharing source code and tutorials for various systems.
