How to Integrate Tess4j OCR into a Spring Boot 3 Application
This guide covers the basics of OCR, introduces Tesseract and its Java wrapper Tess4j, and shows how to download language data files, configure a Spring Boot 3 project with a Maven dependency and YAML settings, and run tests that recognize Chinese, English, and mixed‑language images.
What is Tess4j
OCR
Optical Character Recognition (OCR) converts printed, handwritten, or image‑based text into editable and searchable digital formats.
Tesseract OCR
Tesseract is an open‑source OCR engine originally developed at Hewlett‑Packard (HP) and later sponsored by Google; it is now maintained as an open‑source project on GitHub. It is one of the most widely used OCR engines.
Tess4j
Tess4j is a Java wrapper for the Tesseract engine, providing a simple API for Java applications to perform OCR.
Easy integration: Simple API for Java projects.
Cross‑platform: Runs on the major operating systems that support Java (Windows, macOS, Linux).
Rich functionality: Exposes major Tesseract features such as multi‑language recognition and custom trained data.
Active community: Ongoing support and updates from the open‑source community.
Download language data files
Official repository: https://github.com/tesseract-ocr/tessdata
Download chi_sim.traineddata for Chinese and eng.traineddata for English. Place the files in a directory referenced by the application, for example F:/HeiMaTouTiao/tessdata.
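Before wiring the directory into the application, it can help to confirm which language files actually landed there. The following is a minimal sketch using only the JDK; the class name TessdataInspector and the method availableLanguages are hypothetical helpers introduced for illustration and are not part of Tess4j.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper (not part of Tess4j): lists the language identifiers
// for which a <lang>.traineddata file exists in a tessdata directory.
final class TessdataInspector {
    private static final String SUFFIX = ".traineddata";

    /** Returns the sorted language identifiers found in the given directory. */
    static List<String> availableLanguages(Path tessdataDir) throws IOException {
        try (Stream<Path> files = Files.list(tessdataDir)) {
            return files
                    .filter(Files::isRegularFile)
                    .map(p -> p.getFileName().toString())
                    .filter(name -> name.endsWith(SUFFIX))
                    // Strip the suffix so "eng.traineddata" becomes "eng".
                    .map(name -> name.substring(0, name.length() - SUFFIX.length()))
                    .sorted()
                    .collect(Collectors.toList());
        }
    }
}
```

Calling TessdataInspector.availableLanguages(Path.of("F:/HeiMaTouTiao/tessdata")) after the download should return a list containing chi_sim and eng if both files are in place.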
Integrate Tess4j into a Spring Boot project
Add Maven dependency
<dependency>
    <groupId>net.sourceforge.tess4j</groupId>
    <artifactId>tess4j</artifactId>
    <version>4.1.1</version>
</dependency>
Configure application.yml
server:
  port: 11014
tess4j:
  data-path: F:/HeiMaTouTiao/tessdata
  chinese-train-data: chi_sim
  english-train-data: eng
Create configuration class
import org.springframework.boot.context.properties.ConfigurationProperties;
import org.springframework.context.annotation.Configuration;

/** Binds the tess4j.* properties from application.yml. */
@Configuration
@ConfigurationProperties(prefix = "tess4j")
public class Tess4jConfiguration {
    // Directory that contains the .traineddata files (tess4j.data-path)
    private String dataPath;
    // Language identifier for Chinese, e.g. chi_sim (tess4j.chinese-train-data)
    private String chineseTrainData;
    // Language identifier for English, e.g. eng (tess4j.english-train-data)
    private String englishTrainData;

    public String getDataPath() { return dataPath; }
    public void setDataPath(String dataPath) { this.dataPath = dataPath; }
    public String getChineseTrainData() { return chineseTrainData; }
    public void setChineseTrainData(String chineseTrainData) { this.chineseTrainData = chineseTrainData; }
    public String getEnglishTrainData() { return englishTrainData; }
    public void setEnglishTrainData(String englishTrainData) { this.englishTrainData = englishTrainData; }
}
Test image recognition
Chinese
import cn.edu.scau.config.Tess4jConfiguration;
import net.sourceforge.tess4j.ITesseract;
import net.sourceforge.tess4j.Tesseract;
import net.sourceforge.tess4j.TesseractException;
import org.junit.jupiter.api.Test;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.boot.test.context.SpringBootTest;

import java.io.File;

@SpringBootTest
public class Tess4jApplicationTests {

    @Autowired
    private Tess4jConfiguration tess4jConfiguration;

    @Test
    public void testChinese() throws TesseractException {
        long start = System.currentTimeMillis();
        ITesseract iTesseract = new Tesseract();
        // Point Tesseract at the directory that holds the .traineddata files
        iTesseract.setDatapath(tess4jConfiguration.getDataPath());
        // Select the Chinese model (chi_sim)
        iTesseract.setLanguage(tess4jConfiguration.getChineseTrainData());
        File file = new File("F:/HeiMaTouTiao/tessdata/CaiXuKun-Chinese.png");
        String result = iTesseract.doOCR(file);
        long end = System.currentTimeMillis();
        System.err.println("Time elapsed: " + (end - start) + "ms");
        System.out.println(result);
    }
}
Result: OCR on the Chinese image succeeds; the elapsed time is printed to standard error.
English
// Add this method to Tess4jApplicationTests
@Test
public void testEnglish() throws TesseractException {
    long start = System.currentTimeMillis();
    ITesseract iTesseract = new Tesseract();
    iTesseract.setDatapath(tess4jConfiguration.getDataPath());
    // Select the English model (eng)
    iTesseract.setLanguage(tess4jConfiguration.getEnglishTrainData());
    File file = new File("F:/HeiMaTouTiao/tessdata/CaiXuKun-English.png");
    String result = iTesseract.doOCR(file);
    long end = System.currentTimeMillis();
    System.err.println("Time elapsed: " + (end - start) + "ms");
    System.out.println(result);
}
Result: English OCR produces the expected text with similar performance.
Mixed Chinese and English
// Add this method to Tess4jApplicationTests
@Test
public void testChineseAndEnglish() throws TesseractException {
    long start = System.currentTimeMillis();
    ITesseract iTesseract = new Tesseract();
    iTesseract.setDatapath(tess4jConfiguration.getDataPath());
    // Use Chinese trained data for mixed content
    iTesseract.setLanguage(tess4jConfiguration.getChineseTrainData());
    File file = new File("F:/HeiMaTouTiao/tessdata/ParagraphWithChineseAndEnglish.png");
    String result = iTesseract.doOCR(file);
    long end = System.currentTimeMillis();
    System.err.println("Time elapsed: " + (end - start) + "ms");
    System.out.println(result);
}
Result: In this test the mixed text is recognized with the Chinese trained data (chi_sim); the English trained data alone cannot recognize the Chinese characters. Tesseract also accepts several languages joined with +, for example chi_sim+eng, which loads both models.
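The three tests repeat the same start/end timing code around doOCR. One way to factor that out is a small generic helper; the sketch below uses only the JDK, and the names Timed, Result, and call are hypothetical, introduced here for illustration. It assumes Java 16 or later for the record syntax, which Spring Boot 3 satisfies since it requires Java 17.

```java
import java.util.concurrent.Callable;

// Hypothetical helper: runs a task and reports how long it took.
final class Timed {

    /** A computed value together with the elapsed wall-clock time in milliseconds. */
    record Result<T>(T value, long elapsedMillis) {}

    /** Runs the task once and measures its duration with System.nanoTime(). */
    static <T> Result<T> call(Callable<T> task) throws Exception {
        long start = System.nanoTime();
        T value = task.call();
        long elapsedMillis = (System.nanoTime() - start) / 1_000_000;
        return new Result<>(value, elapsedMillis);
    }
}
```

In a test this could be used as Timed.call(() -> iTesseract.doOCR(file)), reading value() for the recognized text and elapsedMillis() for the duration; Callable is chosen over Supplier because doOCR throws the checked TesseractException, so the test method would then declare throws Exception.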
Precautions
The language data files must have the .traineddata suffix, and the file name prefix (e.g., chi_sim, eng) must match the language identifier used in the Java code.
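The naming rule can be made explicit in code. The sketch below derives the file names that should exist in the data directory for a given language setting; it also handles Tesseract's convention of joining several languages with + (for example chi_sim+eng). The class TrainedDataNames and the method requiredFiles are hypothetical names used for illustration, not Tess4j API.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: maps a language setting to the expected data file names.
final class TrainedDataNames {

    /** "chi_sim+eng" yields ["chi_sim.traineddata", "eng.traineddata"]. */
    static List<String> requiredFiles(String languageExpression) {
        List<String> files = new ArrayList<>();
        // Languages may be combined with '+', so split on a literal plus sign.
        for (String lang : languageExpression.split("\\+")) {
            String trimmed = lang.trim();
            if (!trimmed.isEmpty()) {
                files.add(trimmed + ".traineddata");
            }
        }
        return files;
    }
}
```

Checking that every returned name exists under the configured data path before calling doOCR turns a mismatched identifier into an early, readable failure.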