How to Perform OCR in Spring Boot Using Tess4j
This tutorial covers OCR fundamentals, introduces Tesseract and its Java wrapper Tess4j, and shows how to download language data and integrate Tess4j into a Spring Boot 3 project with Maven. It ends with test code for Chinese, English, and mixed-language image recognition, and notes that recognition can be slow.
OCR workflow
Optical Character Recognition (OCR) converts printed or handwritten text in documents, PDFs, or images into editable and searchable digital text. The typical workflow consists of:
Image preprocessing: acquisition, binarization, denoising, rotation correction, segmentation, etc.
Text detection: locating text lines, words, or character boundaries.
Feature extraction: extracting visual features for recognition.
Character recognition: matching extracted features against known character patterns.
Post-processing: proofreading, formatting, and layout analysis to improve readability.
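The binarization step of image preprocessing can be sketched with the JDK alone. The following is a minimal illustration, assuming a fixed threshold is good enough for the input image; the class name Binarizer and the threshold value are hypothetical and are not part of Tess4j or Tesseract, which also performs its own internal thresholding.

```java
import java.awt.image.BufferedImage;

/** Hypothetical helper, not part of Tess4j: fixed-threshold binarization. */
final class Binarizer {

    private Binarizer() {}

    /**
     * Returns a black-and-white copy of the source image.
     * Pixels whose luminance is below the threshold (0-255) become black, the rest white.
     */
    static BufferedImage binarize(BufferedImage source, int threshold) {
        int width = source.getWidth();
        int height = source.getHeight();
        BufferedImage result = new BufferedImage(width, height, BufferedImage.TYPE_INT_RGB);
        for (int y = 0; y < height; y++) {
            for (int x = 0; x < width; x++) {
                int rgb = source.getRGB(x, y);
                int r = (rgb >> 16) & 0xFF;
                int g = (rgb >> 8) & 0xFF;
                int b = rgb & 0xFF;
                // Integer approximation of the common 0.299/0.587/0.114 luma weights.
                int luminance = (299 * r + 587 * g + 114 * b) / 1000;
                result.setRGB(x, y, luminance < threshold ? 0x000000 : 0xFFFFFF);
            }
        }
        return result;
    }
}
```

The returned BufferedImage could then be handed to the OCR engine instead of the original file. Whether a fixed threshold improves or harms accuracy depends on the image, so it is worth comparing results with and without this step.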
Tesseract OCR engine
Tesseract is an open-source OCR engine originally developed at HP, open-sourced in 2005, and sponsored by Google from 2006. It is one of the most widely used open-source OCR tools.
Tess4j Java wrapper
Tess4j provides a Java API that wraps the Tesseract engine, enabling Java applications to invoke OCR functions directly. Its main characteristics are:
Easy integration: simple API for Java projects.
Cross-platform: runs on Windows, macOS, and Linux; on Linux and macOS the native Tesseract libraries typically need to be installed separately.
Rich functionality: supports the major Tesseract features, including multi-language recognition and custom training data.
Active community: open-source project with regular updates.
Download language data
Language data files are required by Tesseract. They can be obtained from the official repository:
https://github.com/tesseract-ocr/tessdata
For Chinese, download chi_sim.traineddata; for English, download eng.traineddata. Place the files in a local directory, e.g. F:/HeiMaTouTiao/tessdata, and note the absolute path.
Integrate Tess4j into a Spring Boot project
Environment: JDK 17.0.7 + Spring Boot 3.0.2.
Add Maven dependency
<dependency>
    <groupId>net.sourceforge.tess4j</groupId>
    <artifactId>tess4j</artifactId>
    <version>4.1.1</version>
</dependency>
Configure application.yml
server:
  port: 11014
tess4j:
  data-path: F:/HeiMaTouTiao/tessdata
  chinese-train-data: chi_sim
  english-train-data: eng
Create configuration class
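import org.springframework.boot.context.properties.ConfigurationProperties;
import org.springframework.context.annotation.Configuration;

// Binds the tess4j.* keys from application.yml; setter-based binding needs the setters below.
@Configuration
@ConfigurationProperties(prefix = "tess4j")
public class Tess4jConfiguration {
    private String dataPath;          // directory containing the .traineddata files
    private String chineseTrainData;  // e.g. chi_sim
    private String englishTrainData;  // e.g. eng

    public String getDataPath() { return dataPath; }
    public void setDataPath(String dataPath) { this.dataPath = dataPath; }
    public String getChineseTrainData() { return chineseTrainData; }
    public void setChineseTrainData(String chineseTrainData) { this.chineseTrainData = chineseTrainData; }
    public String getEnglishTrainData() { return englishTrainData; }
    public void setEnglishTrainData(String englishTrainData) { this.englishTrainData = englishTrainData; }
}
Test image recognition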
Chinese OCR
import cn.edu.scau.config.Tess4jConfiguration;
import net.sourceforge.tess4j.ITesseract;
import net.sourceforge.tess4j.Tesseract;
import net.sourceforge.tess4j.TesseractException;
import org.junit.jupiter.api.Test;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.boot.test.context.SpringBootTest;

import java.io.File;

@SpringBootTest
public class Tess4jApplicationTests {

    @Autowired
    private Tess4jConfiguration tess4jConfiguration;

    @Test
    public void testChinese() throws TesseractException {
        long start = System.currentTimeMillis();
        ITesseract iTesseract = new Tesseract();
        // Directory that holds the .traineddata files
        iTesseract.setDatapath(tess4jConfiguration.getDataPath());
        // Language code, here chi_sim
        iTesseract.setLanguage(tess4jConfiguration.getChineseTrainData());
        File file = new File("F:/HeiMaTouTiao/tessdata/CaiXuKun-Chinese.png");
        String result = iTesseract.doOCR(file);
        long end = System.currentTimeMillis();
        System.err.println("Elapsed time: " + (end - start) + "ms");
        System.out.println(result);
    }
}
The test prints the OCR result and the elapsed time, which shows that OCR can be time-consuming.
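Because a single recognition call can be slow, one option is to run it on a worker thread with a time limit so that a caller is not blocked indefinitely. The following is a rough sketch using only the JDK; TimedOcrRunner is a hypothetical helper, and the engine parameter is a placeholder for whatever function actually performs the OCR (for example a lambda that wraps the doOCR call from the test above and converts TesseractException into an unchecked exception).

```java
import java.nio.file.Path;
import java.util.Optional;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Function;

/** Hypothetical helper: runs a slow recognition call on a worker thread with a time limit. */
final class TimedOcrRunner implements AutoCloseable {

    // A single worker thread, so the wrapped engine is never called concurrently.
    private final ExecutorService worker = Executors.newSingleThreadExecutor();
    private final Function<Path, String> engine;

    TimedOcrRunner(Function<Path, String> engine) {
        this.engine = engine;
    }

    /** Returns the recognized text, or empty if the call timed out or produced no text. */
    Optional<String> recognize(Path image, long timeoutMillis) {
        Future<String> future = worker.submit(() -> engine.apply(image));
        try {
            return Optional.ofNullable(future.get(timeoutMillis, TimeUnit.MILLISECONDS));
        } catch (TimeoutException e) {
            // Requests interruption; native OCR code may not stop immediately.
            future.cancel(true);
            return Optional.empty();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return Optional.empty();
        } catch (ExecutionException e) {
            throw new IllegalStateException("OCR failed", e.getCause());
        }
    }

    @Override
    public void close() {
        worker.shutdownNow();
    }
}
```

The single-threaded executor is a deliberately conservative choice: it serializes calls to the wrapped engine, at the cost of throughput. A suitable timeout depends on image size and language data, so the value should come from measurements like the ones printed by the tests.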
English OCR
// Add this method to Tess4jApplicationTests.
@Test
public void testEnglish() throws TesseractException {
    long start = System.currentTimeMillis();
    ITesseract iTesseract = new Tesseract();
    iTesseract.setDatapath(tess4jConfiguration.getDataPath());
    // Language code, here eng
    iTesseract.setLanguage(tess4jConfiguration.getEnglishTrainData());
    File file = new File("F:/HeiMaTouTiao/tessdata/CaiXuKun-English.png");
    String result = iTesseract.doOCR(file);
    long end = System.currentTimeMillis();
    System.err.println("Elapsed time: " + (end - start) + "ms");
    System.out.println(result);
}
Chinese-English mixed OCR
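When the image contains both Chinese and English, this example uses the Chinese traineddata (chi_sim), which can also recognize Latin characters. Tesseract also accepts several languages joined with "+", for example chi_sim+eng, provided both traineddata files are in the data path.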
// Add this method to Tess4jApplicationTests.
@Test
public void testChineseAndEnglish() throws TesseractException {
    long start = System.currentTimeMillis();
    ITesseract iTesseract = new Tesseract();
    iTesseract.setDatapath(tess4jConfiguration.getDataPath());
    // The Chinese language data is used for the mixed-language image
    iTesseract.setLanguage(tess4jConfiguration.getChineseTrainData());
    File file = new File("F:/HeiMaTouTiao/tessdata/ParagraphWithChineseAndEnglish.png");
    String result = iTesseract.doOCR(file);
    long end = System.currentTimeMillis();
    System.err.println("Elapsed time: " + (end - start) + "ms");
    System.out.println(result);
}
Important notes
Language data files must have the .traineddata suffix, and the filename without that suffix (e.g., chi_sim, eng) must match the language value configured in application.yml (chinese-train-data, english-train-data).
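A mismatch between the configured language value and the files on disk usually only surfaces when recognition is attempted, so it can help to check the data directory up front. The following is a minimal sketch using only the JDK; TessdataCheck and missingLanguages are hypothetical names, and the method assumes the language value is either a single code such as chi_sim or several codes joined with "+".

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical startup check, not part of Tess4j. */
final class TessdataCheck {

    private TessdataCheck() {}

    /**
     * Returns the language codes from a value such as "chi_sim" or "chi_sim+eng"
     * that have no matching "<code>.traineddata" file in the data directory.
     */
    static List<String> missingLanguages(Path dataDir, String languageSpec) {
        List<String> missing = new ArrayList<>();
        for (String code : languageSpec.split("\\+")) {
            String trimmed = code.trim();
            if (trimmed.isEmpty()) {
                continue; // ignore empty segments such as a trailing "+"
            }
            if (!Files.isRegularFile(dataDir.resolve(trimmed + ".traineddata"))) {
                missing.add(trimmed);
            }
        }
        return missing;
    }
}
```

With the configuration from this tutorial, such a check could be called with the data-path directory and each configured language value before any OCR call; an empty list means every required file was found.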
SpringMeng
Focused on software development, sharing source code and tutorials for various systems.