How to Elegantly Perform OCR in Spring Boot 3 Using Tess4J
This tutorial explains OCR fundamentals, introduces the open‑source Tesseract engine and its Java wrapper Tess4J, shows how to download the required traineddata files, and provides step‑by‑step Spring Boot 3 integration, configuration, and test code for Chinese, English, and mixed‑language image recognition, plus important usage notes.
1. OCR workflow
OCR (Optical Character Recognition) converts printed or handwritten text in documents, PDFs, or images into editable, searchable text. The typical workflow consists of:
Image preprocessing – acquisition, binarization, denoising, rotation correction, segmentation.
Text detection – locating lines, words or character boundaries.
Feature extraction – extracting visual features for recognition.
Character recognition – matching features against known character patterns.
Post‑processing – proofreading, formatting and layout analysis.
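The preprocessing step above can be illustrated with a small sketch that converts an image to pure black and white using a fixed luminance threshold. The class name OcrPreprocessing, the method binarize, and the fixed threshold are assumptions made for illustration; they are not part of Tess4J or of the original article. Tesseract also applies its own thresholding internally, so whether an explicit step like this helps depends on the image.

```java
import java.awt.image.BufferedImage;

/** Hypothetical helper: fixed-threshold binarization as one possible preprocessing step. */
final class OcrPreprocessing {

    private OcrPreprocessing() {}

    /**
     * Returns a black-and-white copy of the source image.
     * Pixels whose luminance is below the threshold become black, all others white.
     */
    static BufferedImage binarize(BufferedImage source, int threshold) {
        int width = source.getWidth();
        int height = source.getHeight();
        BufferedImage result = new BufferedImage(width, height, BufferedImage.TYPE_BYTE_BINARY);
        for (int y = 0; y < height; y++) {
            for (int x = 0; x < width; x++) {
                int rgb = source.getRGB(x, y);
                int r = (rgb >> 16) & 0xFF;
                int g = (rgb >> 8) & 0xFF;
                int b = rgb & 0xFF;
                // Integer approximation of the common 0.299/0.587/0.114 luma weights.
                int luminance = (299 * r + 587 * g + 114 * b) / 1000;
                result.setRGB(x, y, luminance < threshold ? 0xFF000000 : 0xFFFFFFFF);
            }
        }
        return result;
    }
}
```

The returned BufferedImage could then be passed to Tess4J's doOCR(BufferedImage) overload instead of a File. A threshold of 128 is only a starting point; scans with uneven lighting usually need an adaptive method.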
2. Tesseract OCR and Tess4J
Tesseract OCR is an open‑source engine originally developed by HP, later sponsored by Google, and now maintained as a community project on GitHub. Tess4J is a Java wrapper for Tesseract that provides a simple API, cross‑platform support and access to the major Tesseract features, including multilingual recognition and custom-trained language data.
3. Download traineddata files
Download the language data files from the official repository: https://github.com/tesseract-ocr/tessdata
For Chinese use chi_sim.traineddata; for English use eng.traineddata. The file names (without the .traineddata suffix) are the language codes required by the Java API.
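To make the relationship between file names and language codes concrete, the following sketch derives the available language codes from the contents of a tessdata directory. The class TessdataLanguages and its method names are hypothetical helpers written for this illustration, not part of Tess4J.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical helper: maps traineddata file names to language codes. */
final class TessdataLanguages {

    private static final String SUFFIX = ".traineddata";

    private TessdataLanguages() {}

    /** Derives sorted language codes from file names, ignoring files without the suffix. */
    static List<String> codesFromFileNames(List<String> fileNames) {
        return fileNames.stream()
                .filter(name -> name.endsWith(SUFFIX) && name.length() > SUFFIX.length())
                .map(name -> name.substring(0, name.length() - SUFFIX.length()))
                .sorted()
                .collect(Collectors.toList());
    }

    /** Lists the language codes available in a tessdata directory. */
    static List<String> codesInDirectory(Path tessdataDir) {
        try (Stream<Path> files = Files.list(tessdataDir)) {
            List<String> names = files
                    .map(path -> path.getFileName().toString())
                    .collect(Collectors.toList());
            return codesFromFileNames(names);
        } catch (IOException e) {
            throw new UncheckedIOException("Cannot read " + tessdataDir, e);
        }
    }
}
```

With chi_sim.traineddata and eng.traineddata in the directory, codesInDirectory would return chi_sim and eng, which are the values to pass to setLanguage.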
4. Integrate Tess4J into a Spring Boot project
4.1 Maven dependency
<dependency>
    <groupId>net.sourceforge.tess4j</groupId>
    <artifactId>tess4j</artifactId>
    <version>4.1.1</version>
</dependency>

4.2 Application properties (application.yml)
server:
  port: 11014
tess4j:
  data-path: F:/HeiMaTouTiao/tessdata
  chinese-train-data: chi_sim
  english-train-data: eng

4.3 Configuration class
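import org.springframework.boot.context.properties.ConfigurationProperties;
import org.springframework.context.annotation.Configuration;

// Binds the tess4j.* entries from application.yml to the fields below.
@Configuration
@ConfigurationProperties(prefix = "tess4j")
public class Tess4jConfiguration {

    private String dataPath;
    private String chineseTrainData;
    private String englishTrainData;

    public String getDataPath() { return dataPath; }
    public void setDataPath(String dataPath) { this.dataPath = dataPath; }
    public String getChineseTrainData() { return chineseTrainData; }
    public void setChineseTrainData(String chineseTrainData) { this.chineseTrainData = chineseTrainData; }
    public String getEnglishTrainData() { return englishTrainData; }
    public void setEnglishTrainData(String englishTrainData) { this.englishTrainData = englishTrainData; }
}

5. Test image recognition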
5.1 Chinese
import cn.edu.scau.config.Tess4jConfiguration;
import net.sourceforge.tess4j.ITesseract;
import net.sourceforge.tess4j.Tesseract;
import net.sourceforge.tess4j.TesseractException;
import org.junit.jupiter.api.Test;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.boot.test.context.SpringBootTest;

import java.io.File;

@SpringBootTest
public class Tess4jApplicationTests {

    @Autowired
    private Tess4jConfiguration tess4jConfiguration;

    @Test
    public void testChinese() throws TesseractException {
        long start = System.currentTimeMillis();
        ITesseract iTesseract = new Tesseract();
        // Directory that contains the .traineddata files
        iTesseract.setDatapath(tess4jConfiguration.getDataPath());
        // Language code, i.e. the traineddata file name without its suffix
        iTesseract.setLanguage(tess4jConfiguration.getChineseTrainData());
        File file = new File("F:/HeiMaTouTiao/tessdata/CaiXuKun-Chinese.png");
        String result = iTesseract.doOCR(file);
        long end = System.currentTimeMillis();
        System.err.println("Elapsed time: " + (end - start) + "ms");
        System.out.println(result);
    }
}

5.2 English
@Test
public void testEnglish() throws TesseractException {
    long start = System.currentTimeMillis();
    ITesseract iTesseract = new Tesseract();
    iTesseract.setDatapath(tess4jConfiguration.getDataPath());
    iTesseract.setLanguage(tess4jConfiguration.getEnglishTrainData());
    File file = new File("F:/HeiMaTouTiao/tessdata/CaiXuKun-English.png");
    String result = iTesseract.doOCR(file);
    long end = System.currentTimeMillis();
    System.err.println("Elapsed time: " + (end - start) + "ms");
    System.out.println(result);
}

5.3 Mixed Chinese‑English
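When the image contains both Chinese and English, this example uses the Chinese language data (chi_sim), which can also recognize Latin letters. Tesseract additionally accepts several language codes joined with +, for example chi_sim+eng, as long as every corresponding traineddata file is present in the data path.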
@Test
public void testChineseAndEnglish() throws TesseractException {
    long start = System.currentTimeMillis();
    ITesseract iTesseract = new Tesseract();
    iTesseract.setDatapath(tess4jConfiguration.getDataPath());
    iTesseract.setLanguage(tess4jConfiguration.getChineseTrainData());
    File file = new File("F:/HeiMaTouTiao/tessdata/ParagraphWithChineseAndEnglish.png");
    String result = iTesseract.doOCR(file);
    long end = System.currentTimeMillis();
    System.err.println("Elapsed time: " + (end - start) + "ms");
    System.out.println(result);
}

6. Important notes
The traineddata files must retain the .traineddata suffix, and the filename prefix (e.g., chi_sim or eng) must match the language code supplied to setLanguage in the Java code.
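A mismatch between the language code and the file name typically only surfaces when OCR runs, so it can be useful to check up front. The sketch below reports which traineddata files a language string would need but that are not present. LanguageCheck and missingFiles are hypothetical names introduced here; the splitting on + reflects Tesseract's convention for combining several languages in one string.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** Hypothetical helper: checks a language string against the files in a tessdata directory. */
final class LanguageCheck {

    private LanguageCheck() {}

    /**
     * Returns the traineddata file names required by the language string
     * (codes separated by '+') that are absent from the given set of file names.
     */
    static List<String> missingFiles(String language, Set<String> existingFileNames) {
        List<String> missing = new ArrayList<>();
        for (String code : language.split("\\+")) {
            String fileName = code.trim() + ".traineddata";
            if (!existingFileNames.contains(fileName)) {
                missing.add(fileName);
            }
        }
        return missing;
    }
}
```

Calling missingFiles with the configured language code and the file names found in the data path before invoking setLanguage gives an empty list when everything matches, and names the absent files otherwise.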
java1234
Former senior programmer at a Fortune Global 500 company, dedicated to sharing Java expertise. Visit Feng's site: Java Knowledge Sharing, www.java1234.com
