How to Integrate Tess4j OCR into a Spring Boot 3 Application

This guide explains the fundamentals of OCR and introduces Tesseract and its Java wrapper, Tess4j. It then shows how to download language data files, configure a Spring Boot 3 project with a Maven dependency and YAML settings, and run test code for Chinese, English, and mixed‑language image recognition.

Architecture Digest

What is Tess4j

OCR

Optical Character Recognition (OCR) converts printed, handwritten, or image‑based text into editable and searchable digital formats.

Tesseract OCR

Tesseract is an open‑source OCR engine originally developed at Hewlett‑Packard. Google sponsored its development for many years, and it is now maintained by the open‑source community. It is one of the most widely used open‑source OCR engines.

Tess4j

Tess4j is a Java wrapper for the Tesseract engine, providing a simple API for Java applications to perform OCR.

Easy integration: Simple API for Java projects.

Cross‑platform: Runs on the major operating systems that support Java (Windows, macOS, Linux).

Rich functionality: Exposes major Tesseract features such as multi‑language recognition and the use of custom‑trained models.

Active community: Ongoing support and updates from the open‑source community.

Download language data files

Official repository: https://github.com/tesseract-ocr/tessdata

Download chi_sim.traineddata for Chinese and eng.traineddata for English. Place the files in a directory referenced by the application, for example F:/HeiMaTouTiao/tessdata.

Integrate Tess4j into a Spring Boot project

Add Maven dependency

<dependency>
    <groupId>net.sourceforge.tess4j</groupId>
    <artifactId>tess4j</artifactId>
    <version>4.1.1</version>
</dependency>

Configure application.yml

server:
  port: 11014

tess4j:
  data-path: F:/HeiMaTouTiao/tessdata
  chinese-train-data: chi_sim
  english-train-data: eng

Create configuration class

import org.springframework.boot.context.properties.ConfigurationProperties;
import org.springframework.context.annotation.Configuration;

@Configuration
@ConfigurationProperties(prefix = "tess4j")
public class Tess4jConfiguration {
    private String dataPath;
    private String chineseTrainData;
    private String englishTrainData;

    public String getDataPath() { return dataPath; }
    public void setDataPath(String dataPath) { this.dataPath = dataPath; }
    public String getChineseTrainData() { return chineseTrainData; }
    public void setChineseTrainData(String chineseTrainData) { this.chineseTrainData = chineseTrainData; }
    public String getEnglishTrainData() { return englishTrainData; }
    public void setEnglishTrainData(String englishTrainData) { this.englishTrainData = englishTrainData; }
}

Test image recognition

Chinese

import cn.edu.scau.config.Tess4jConfiguration;
import net.sourceforge.tess4j.ITesseract;
import net.sourceforge.tess4j.Tesseract;
import net.sourceforge.tess4j.TesseractException;
import org.junit.jupiter.api.Test;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.boot.test.context.SpringBootTest;
import java.io.File;

@SpringBootTest
public class Tess4jApplicationTests {
    @Autowired
    private Tess4jConfiguration tess4jConfiguration;

    @Test
    public void testChinese() throws TesseractException {
        long start = System.currentTimeMillis();
        ITesseract iTesseract = new Tesseract();
        iTesseract.setDatapath(tess4jConfiguration.getDataPath());
        iTesseract.setLanguage(tess4jConfiguration.getChineseTrainData());
        File file = new File("F:/HeiMaTouTiao/tessdata/CaiXuKun-Chinese.png");
        String result = iTesseract.doOCR(file);
        long end = System.currentTimeMillis();
        System.err.println("Time elapsed: " + (end - start) + "ms");
        System.out.println(result);
    }
}

Result: OCR on the Chinese image succeeds; execution time is logged.

English

@Test
public void testEnglish() throws TesseractException {
    long start = System.currentTimeMillis();
    ITesseract iTesseract = new Tesseract();
    iTesseract.setDatapath(tess4jConfiguration.getDataPath());
    iTesseract.setLanguage(tess4jConfiguration.getEnglishTrainData());
    File file = new File("F:/HeiMaTouTiao/tessdata/CaiXuKun-English.png");
    String result = iTesseract.doOCR(file);
    long end = System.currentTimeMillis();
    System.err.println("Time elapsed: " + (end - start) + "ms");
    System.out.println(result);
}

Result: English OCR produces the expected text with similar performance.
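
The tests above repeat the same timing code around each OCR call. One possible way to factor that out is sketched below. The names OcrTimer, ThrowingTask, and Timed are hypothetical and not part of Tess4j or Spring; the sketch uses only the JDK, and the lambda in main is a stand-in for a real doOCR call.

```java
public final class OcrTimer {

    /** A task that may throw a checked exception, such as TesseractException. */
    @FunctionalInterface
    public interface ThrowingTask<T, E extends Exception> {
        T run() throws E;
    }

    /** Holds a result together with the wall-clock time it took to produce. */
    public static final class Timed<T> {
        public final T value;
        public final long elapsedMillis;

        Timed(T value, long elapsedMillis) {
            this.value = value;
            this.elapsedMillis = elapsedMillis;
        }
    }

    /** Runs the task once and measures it with the monotonic System.nanoTime clock. */
    public static <T, E extends Exception> Timed<T> time(ThrowingTask<T, E> task) throws E {
        long start = System.nanoTime();
        T value = task.run();
        long elapsedMillis = (System.nanoTime() - start) / 1_000_000;
        return new Timed<>(value, elapsedMillis);
    }

    public static void main(String[] args) {
        // Stand-in for an OCR call; a test would pass () -> iTesseract.doOCR(file) instead.
        Timed<String> timed = time(() -> "recognized text");
        System.err.println("Time elapsed: " + timed.elapsedMillis + "ms");
        System.out.println(timed.value);
    }
}
```

Inside a test method that already declares throws TesseractException, the call would look like OcrTimer.time(() -> iTesseract.doOCR(file)), which keeps each test down to setup, one timed call, and output.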

Mixed Chinese and English

@Test
public void testChineseAndEnglish() throws TesseractException {
    long start = System.currentTimeMillis();
    ITesseract iTesseract = new Tesseract();
    iTesseract.setDatapath(tess4jConfiguration.getDataPath());
    // Only chi_sim is loaded here; setLanguage also accepts a '+'-joined list such as "chi_sim+eng"
    iTesseract.setLanguage(tess4jConfiguration.getChineseTrainData());
    File file = new File("F:/HeiMaTouTiao/tessdata/ParagraphWithChineseAndEnglish.png");
    String result = iTesseract.doOCR(file);
    long end = System.currentTimeMillis();
    System.err.println("Time elapsed: " + (end - start) + "ms");
    System.out.println(result);
}

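Result: In this test, the Chinese trained data (chi_sim) alone recognizes the mixed text. Tesseract also accepts several languages joined with a plus sign, for example chi_sim+eng, which loads both models and requires both .traineddata files to be present in the data path.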
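
For mixed content, Tesseract's language setting can also name several languages joined by a plus sign. The sketch below builds such a string from the two configured identifiers. LanguageSpec and combine are hypothetical helper names, not part of Tess4j; only the resulting string would be passed to setLanguage.

```java
import java.util.Arrays;
import java.util.stream.Collectors;

public final class LanguageSpec {

    /** Joins language identifiers with '+', skipping nulls, blanks, and duplicates. */
    public static String combine(String... languages) {
        return Arrays.stream(languages)
                .filter(l -> l != null && !l.isBlank())
                .map(String::trim)
                .distinct()
                .collect(Collectors.joining("+"));
    }

    public static void main(String[] args) {
        // In the tests these values would come from Tess4jConfiguration.
        System.out.println(combine("chi_sim", "eng")); // prints chi_sim+eng
    }
}
```

In the mixed-language test, the call would become iTesseract.setLanguage(LanguageSpec.combine(tess4jConfiguration.getChineseTrainData(), tess4jConfiguration.getEnglishTrainData())). Whether this improves accuracy over chi_sim alone depends on the image, so it is worth comparing both on real samples.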

Precautions

The language data files must have the .traineddata suffix, and the file name prefix (e.g., chi_sim, eng) must match the language identifier used in the Java code.
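
A mismatch between the language identifier and the file name usually surfaces only when OCR runs. As a rough sketch of checking it up front, the code below reports which identifiers have no matching .traineddata file in a directory. TessdataCheck and missingLanguages are hypothetical names, the sketch uses only the JDK, and the language string may be a single identifier or several joined by a plus sign.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

public final class TessdataCheck {

    /** Returns the identifiers in languageSpec that lack a matching <id>.traineddata file. */
    public static List<String> missingLanguages(Path dataPath, String languageSpec) {
        List<String> missing = new ArrayList<>();
        for (String id : languageSpec.split("\\+")) {
            String trimmed = id.trim();
            if (trimmed.isEmpty()) {
                continue; // ignore empty segments such as a trailing '+'
            }
            if (!Files.isRegularFile(dataPath.resolve(trimmed + ".traineddata"))) {
                missing.add(trimmed);
            }
        }
        return missing;
    }

    public static void main(String[] args) throws IOException {
        // Demo directory containing only the English data file (an empty placeholder).
        Path dir = Files.createTempDirectory("tessdata-demo");
        Files.createFile(dir.resolve("eng.traineddata"));
        System.out.println(missingLanguages(dir, "chi_sim+eng")); // prints [chi_sim]
    }
}
```

Calling this with the configured data path before the first OCR request gives a clear list of missing files instead of a failure at recognition time.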
