How to Integrate Tess4J OCR into a Spring Boot Application
This article explains OCR fundamentals, introduces Tesseract and its Java wrapper Tess4J, guides you through downloading language data, shows step‑by‑step Spring Boot integration with Maven dependencies and configuration classes, and provides test code for Chinese, English, and mixed‑language image recognition.
1. Overview of OCR and Tess4J
Optical Character Recognition (OCR) transforms printed or handwritten text in documents, PDFs, or images into editable and searchable digital formats. A typical OCR pipeline consists of image preprocessing, text detection, feature extraction, character recognition, and post‑processing.
1.1 OCR workflow
Image preprocessing: acquisition, binarization, denoising, rotation correction, segmentation.
Text detection: locate text lines, words, or character boundaries.
Feature extraction: derive visual features used for recognition.
Character recognition: match extracted features against known character patterns.
Post‑processing: proofreading, formatting, and layout analysis to improve readability.
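To make the preprocessing step more concrete, here is a minimal sketch of binarization with a fixed threshold, using only the JDK. The class name BinarizeSketch and the threshold value are assumptions chosen for illustration. Tesseract applies its own thresholding internally, so this shows the idea and is not a required step.

```java
import java.awt.image.BufferedImage;

// Hypothetical helper, not part of Tess4J: fixed-threshold binarization.
class BinarizeSketch {

    /** Returns a black-and-white copy of src; pixels with luminance >= threshold (0-255) become white. */
    static BufferedImage binarize(BufferedImage src, int threshold) {
        BufferedImage out = new BufferedImage(
                src.getWidth(), src.getHeight(), BufferedImage.TYPE_BYTE_BINARY);
        for (int y = 0; y < src.getHeight(); y++) {
            for (int x = 0; x < src.getWidth(); x++) {
                int rgb = src.getRGB(x, y);
                int r = (rgb >> 16) & 0xFF;
                int g = (rgb >> 8) & 0xFF;
                int b = rgb & 0xFF;
                // Integer approximation of the usual luminance weights.
                int luma = (r * 299 + g * 587 + b * 114) / 1000;
                out.setRGB(x, y, luma >= threshold ? 0xFFFFFFFF : 0xFF000000);
            }
        }
        return out;
    }
}
```

A mid-range threshold such as 128 is only a starting point; unevenly lit scans usually call for an adaptive method instead.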
1.2 Tesseract OCR
Tesseract is an open‑source OCR engine originally developed by HP and later sponsored by Google. It is among the most widely used open‑source OCR engines; its accuracy depends heavily on image quality and on the language data in use.
1.3 Tess4J
Tess4J is a Java wrapper for the Tesseract engine. It calls the native library through JNA and exposes a simple API, so Java applications can perform OCR without writing native code themselves. Key characteristics include easy integration, cross‑platform support (Windows, macOS, Linux; on platforms other than Windows the native Tesseract library generally has to be installed separately), full access to Tesseract features (multi‑language, custom training), and an active open‑source community.
2. Download language data
Language data files are hosted at https://github.com/tesseract-ocr/tessdata. For Chinese, download chi_sim.traineddata; for English, download eng.traineddata. Store the files in a directory of your choice and keep the .traineddata extension unchanged.
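Before running OCR, it can help to confirm that the expected files are actually in place. The following is a small sketch under that assumption; the class TessdataCheck and its method are hypothetical names, and it only checks for files named after the language identifier plus the .traineddata suffix.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Tess4J: reports languages whose data file is missing.
class TessdataCheck {

    /** Returns the language identifiers that have no matching <lang>.traineddata file in tessdataDir. */
    static List<String> missingLanguages(Path tessdataDir, String... languages) {
        List<String> missing = new ArrayList<>();
        for (String lang : languages) {
            Path file = tessdataDir.resolve(lang + ".traineddata");
            if (!Files.isRegularFile(file)) {
                missing.add(lang);
            }
        }
        return missing;
    }
}
```

For the setup in this article, a call such as missingLanguages(dir, "chi_sim", "eng") should return an empty list once both downloads are in the directory.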
3. Integrate Tess4J into a Spring Boot project
3.1 Maven dependency
<dependency>
    <groupId>net.sourceforge.tess4j</groupId>
    <artifactId>tess4j</artifactId>
    <version>4.1.1</version>
</dependency>
3.2 Application configuration (application.yml)
server:
  port: 11014
tess4j:
  data-path: F:/HeiMaTouTiao/tessdata # absolute path to the directory containing .traineddata files
  chinese-train-data: chi_sim
  english-train-data: eng
3.3 Configuration properties class
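package cn.edu.scau.config;

import org.springframework.boot.context.properties.ConfigurationProperties;
import org.springframework.context.annotation.Configuration;

@Configuration
@ConfigurationProperties(prefix = "tess4j")
public class Tess4jConfiguration {
    private String dataPath;          // bound from tess4j.data-path
    private String chineseTrainData;  // bound from tess4j.chinese-train-data
    private String englishTrainData;  // bound from tess4j.english-train-data

    // Setters are required so that Spring Boot can bind the properties.
    public String getDataPath() { return dataPath; }
    public void setDataPath(String dataPath) { this.dataPath = dataPath; }
    public String getChineseTrainData() { return chineseTrainData; }
    public void setChineseTrainData(String chineseTrainData) { this.chineseTrainData = chineseTrainData; }
    public String getEnglishTrainData() { return englishTrainData; }
    public void setEnglishTrainData(String englishTrainData) { this.englishTrainData = englishTrainData; }
}
4. Test image recognition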
4.1 Chinese text
The following JUnit test loads a Chinese image and prints the OCR result together with the elapsed time.
import cn.edu.scau.config.Tess4jConfiguration;
import net.sourceforge.tess4j.ITesseract;
import net.sourceforge.tess4j.Tesseract;
import net.sourceforge.tess4j.TesseractException;
import org.junit.jupiter.api.Test;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.boot.test.context.SpringBootTest;
import java.io.File;

@SpringBootTest
public class Tess4jApplicationTests {
    @Autowired
    private Tess4jConfiguration tess4jConfiguration;

    @Test
    public void testChinese() throws TesseractException {
        long start = System.currentTimeMillis();
        ITesseract tesseract = new Tesseract();
        // Directory that contains the .traineddata files
        tesseract.setDatapath(tess4jConfiguration.getDataPath());
        // Language identifier, i.e. the file name without the .traineddata suffix
        tesseract.setLanguage(tess4jConfiguration.getChineseTrainData());
        File file = new File("F:/HeiMaTouTiao/tessdata/CaiXuKun-Chinese.png");
        String result = tesseract.doOCR(file);
        long end = System.currentTimeMillis();
        System.err.println("Time: " + (end - start) + " ms");
        System.out.println(result);
    }
}
[Sample image omitted]
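Recognized Chinese text can come back with stray spaces between characters. As a small post‑processing sketch, assuming that is the case for your images, the helper below removes spaces and tabs that sit directly between two Han characters and leaves everything else, including line breaks and spacing around Latin words, untouched. OcrTextCleanup is a hypothetical name, not a Tess4J class.

```java
import java.util.regex.Pattern;

// Hypothetical post-processing helper, not part of Tess4J.
class OcrTextCleanup {

    // Spaces or tabs that have a Han character on both sides.
    private static final Pattern SPACE_BETWEEN_HAN =
            Pattern.compile("(?<=\\p{IsHan})[ \\t]+(?=\\p{IsHan})");

    /** Removes spaces and tabs between adjacent Han characters; other whitespace is kept. */
    static String removeSpacesBetweenHan(String text) {
        return SPACE_BETWEEN_HAN.matcher(text).replaceAll("");
    }
}
```

It would be applied to the string returned by doOCR before printing or storing it.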
4.2 English text
A similar test method, added to the same test class, uses the English language data.
    @Test
    public void testEnglish() throws TesseractException {
        long start = System.currentTimeMillis();
        ITesseract tesseract = new Tesseract();
        tesseract.setDatapath(tess4jConfiguration.getDataPath());
        tesseract.setLanguage(tess4jConfiguration.getEnglishTrainData());
        File file = new File("F:/HeiMaTouTiao/tessdata/CaiXuKun-English.png");
        String result = tesseract.doOCR(file);
        long end = System.currentTimeMillis();
        System.err.println("Time: " + (end - start) + " ms");
        System.out.println(result);
    }
[Sample image omitted]
4.3 Mixed Chinese‑English text
When an image contains both Chinese and English text, the test below uses the Chinese language data on its own. That data set also covers Latin letters, so both scripts are recognized in a single pass.
    @Test
    public void testChineseAndEnglish() throws TesseractException {
        long start = System.currentTimeMillis();
        ITesseract tesseract = new Tesseract();
        tesseract.setDatapath(tess4jConfiguration.getDataPath());
        tesseract.setLanguage(tess4jConfiguration.getChineseTrainData());
        File file = new File("F:/HeiMaTouTiao/tessdata/ParagraphWithChineseAndEnglish.png");
        String result = tesseract.doOCR(file);
        long end = System.currentTimeMillis();
        System.err.println("Time: " + (end - start) + " ms");
        System.out.println(result);
    }
[Sample image omitted]
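As an alternative for mixed text, Tesseract also accepts several language identifiers joined with a plus sign, for example chi_sim+eng, provided every listed .traineddata file is present in the data path. The sketch below only builds that string; LanguageSpec is a hypothetical helper, and whether the combined setting improves results on a given image is something to measure, not assume.

```java
// Hypothetical helper, not part of Tess4J: builds a multi-language identifier.
class LanguageSpec {

    /** Joins language identifiers with '+', e.g. combine("chi_sim", "eng") gives "chi_sim+eng". */
    static String combine(String... languages) {
        if (languages.length == 0) {
            throw new IllegalArgumentException("at least one language is required");
        }
        for (String lang : languages) {
            if (lang == null || lang.trim().isEmpty()) {
                throw new IllegalArgumentException("language identifiers must not be blank");
            }
        }
        return String.join("+", languages);
    }

    public static void main(String[] args) {
        System.out.println(combine("chi_sim", "eng")); // prints chi_sim+eng
    }
}
```

The result would be passed to setLanguage in place of the single identifier used in the test above, for instance by combining getChineseTrainData() and getEnglishTrainData() from the configuration class.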
5. Precautions
The language data files must retain the .traineddata suffix, and the file name prefix (e.g., chi_sim, eng) must exactly match the language identifier passed to setLanguage. The data-path in application.yml must point to the directory that contains these files, and the application must be able to read it at runtime. An absolute path, as used here, avoids any dependence on the working directory.
Java Architect Essentials
