How to Elegantly Perform OCR in Spring Boot 3 Using Tess4J

This tutorial explains OCR fundamentals, introduces the open‑source Tesseract engine and its Java wrapper Tess4J, shows how to download the required traineddata files, and provides step‑by‑step Spring Boot 3 integration, configuration, and test code for Chinese, English, and mixed‑language image recognition, plus important usage notes.

java1234

1. OCR workflow

OCR (Optical Character Recognition) converts printed or handwritten text in documents, PDFs, or images into editable, searchable text. The typical workflow consists of:

Image preprocessing – acquisition, binarization, denoising, rotation correction, segmentation.

Text detection – locating lines, words or character boundaries.

Feature extraction – extracting visual features for recognition.

Character recognition – matching features against known character patterns.

Post‑processing – proofreading, formatting and layout analysis.
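The binarization step of the preprocessing stage can be sketched with the JDK's own image classes. This is only a minimal illustration under stated assumptions: the class name OcrPreprocessor, the method binarize, and the fixed-threshold approach are hypothetical choices made here, not something prescribed by Tesseract or Tess4J.

```java
import java.awt.image.BufferedImage;

/** Illustrative preprocessing helper; class and method names are hypothetical. */
final class OcrPreprocessor {

    private OcrPreprocessor() { }

    /**
     * Converts an image to pure black and white using a fixed luminance threshold.
     * Pixels darker than the threshold become black, all others become white.
     */
    static BufferedImage binarize(BufferedImage source, int threshold) {
        int width = source.getWidth();
        int height = source.getHeight();
        BufferedImage result = new BufferedImage(width, height, BufferedImage.TYPE_INT_RGB);
        for (int y = 0; y < height; y++) {
            for (int x = 0; x < width; x++) {
                int rgb = source.getRGB(x, y);
                int r = (rgb >> 16) & 0xFF;
                int g = (rgb >> 8) & 0xFF;
                int b = rgb & 0xFF;
                // Integer approximation of the ITU-R BT.601 luma weights.
                int luminance = (299 * r + 587 * g + 114 * b) / 1000;
                result.setRGB(x, y, luminance < threshold ? 0x000000 : 0xFFFFFF);
            }
        }
        return result;
    }
}
```

A threshold of 128 is a common starting point for dark text on a light background. The resulting BufferedImage could then be handed to the OCR engine in place of the original file.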

2. Tesseract OCR and Tess4J

Tesseract OCR is an open‑source engine originally developed at HP, released as open source in 2005, and developed for many years afterwards with Google's sponsorship. Tess4J is a JNA‑based Java wrapper for Tesseract that provides a simple API, cross‑platform support and access to the major Tesseract features, including multilingual recognition and custom‑trained language data.

3. Download traineddata files

https://github.com/tesseract-ocr/tessdata

For Chinese use chi_sim.traineddata; for English use eng.traineddata. The file names (without the .traineddata suffix) are the language codes required by the Java API.
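Because the language code is simply the file name without its suffix, the codes available in a given folder can be derived by listing it. The following is a small sketch using only the JDK; the class TessdataLanguages and the method availableLanguages are hypothetical names introduced for illustration.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Illustrative helper; class and method names are hypothetical. */
final class TessdataLanguages {

    private static final String SUFFIX = ".traineddata";

    private TessdataLanguages() { }

    /** Returns the sorted language codes found in the given tessdata directory. */
    static List<String> availableLanguages(Path tessdataDir) throws IOException {
        try (Stream<Path> files = Files.list(tessdataDir)) {
            return files
                    .map(path -> path.getFileName().toString())
                    .filter(name -> name.endsWith(SUFFIX))
                    // Strip the suffix: "chi_sim.traineddata" becomes "chi_sim".
                    .map(name -> name.substring(0, name.length() - SUFFIX.length()))
                    .sorted()
                    .collect(Collectors.toList());
        }
    }
}
```

Files that do not end in .traineddata are ignored, so sample images or notes stored in the same folder do not show up as languages.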

4. Integrate Tess4J into a Spring Boot project

4.1 Maven dependency

<dependency>
    <groupId>net.sourceforge.tess4j</groupId>
    <artifactId>tess4j</artifactId>
    <version>4.1.1</version>
</dependency>

4.2 Application properties (application.yml)

server:
  port: 11014

tess4j:
  data-path: F:/HeiMaTouTiao/tessdata
  chinese-train-data: chi_sim
  english-train-data: eng

4.3 Configuration class

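import org.springframework.boot.context.properties.ConfigurationProperties;
import org.springframework.context.annotation.Configuration;

// Binds the tess4j.* keys from application.yml to the fields below
@Configuration
@ConfigurationProperties(prefix = "tess4j")
public class Tess4jConfiguration {
    private String dataPath;          // tess4j.data-path
    private String chineseTrainData;  // tess4j.chinese-train-data
    private String englishTrainData;  // tess4j.english-train-data

    public String getDataPath() { return dataPath; }
    public void setDataPath(String dataPath) { this.dataPath = dataPath; }
    public String getChineseTrainData() { return chineseTrainData; }
    public void setChineseTrainData(String chineseTrainData) { this.chineseTrainData = chineseTrainData; }
    public String getEnglishTrainData() { return englishTrainData; }
    public void setEnglishTrainData(String englishTrainData) { this.englishTrainData = englishTrainData; }
}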
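The kebab-case keys in application.yml (data-path, chinese-train-data) reach the camelCase fields of the configuration class through Spring Boot's relaxed binding. The naming correspondence can be illustrated with a plain function. This is only a sketch of the convention, not Spring's actual implementation, and RelaxedNames and kebabToCamel are hypothetical names.

```java
/** Illustrates the kebab-case to camelCase naming convention; names are hypothetical. */
final class RelaxedNames {

    private RelaxedNames() { }

    /** Converts a kebab-case key such as "data-path" to a camelCase name such as "dataPath". */
    static String kebabToCamel(String key) {
        StringBuilder out = new StringBuilder(key.length());
        boolean upperNext = false;
        for (char c : key.toCharArray()) {
            if (c == '-') {
                // Drop the hyphen and capitalize the character that follows it.
                upperNext = true;
            } else if (upperNext) {
                out.append(Character.toUpperCase(c));
                upperNext = false;
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }
}
```

Under this convention, chinese-train-data corresponds to chineseTrainData, which is why the field names in the configuration class must follow the YAML keys.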

5. Test image recognition

5.1 Chinese

import cn.edu.scau.config.Tess4jConfiguration;
import net.sourceforge.tess4j.ITesseract;
import net.sourceforge.tess4j.Tesseract;
import net.sourceforge.tess4j.TesseractException;
import org.junit.jupiter.api.Test;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.boot.test.context.SpringBootTest;
import java.io.File;

@SpringBootTest
public class Tess4jApplicationTests {
    @Autowired
    private Tess4jConfiguration tess4jConfiguration;

    @Test
    public void testChinese() throws TesseractException {
        long start = System.currentTimeMillis();
        ITesseract iTesseract = new Tesseract();
        // Directory that contains the *.traineddata files
        iTesseract.setDatapath(tess4jConfiguration.getDataPath());
        // Language code, i.e. the traineddata file name without its suffix
        iTesseract.setLanguage(tess4jConfiguration.getChineseTrainData());
        File file = new File("F:/HeiMaTouTiao/tessdata/CaiXuKun-Chinese.png");
        String result = iTesseract.doOCR(file);
        long end = System.currentTimeMillis();
        System.err.println("Elapsed time: " + (end - start) + "ms");
        System.out.println(result);
    }
}

5.2 English

// Add this method to the Tess4jApplicationTests class from section 5.1
@Test
public void testEnglish() throws TesseractException {
    long start = System.currentTimeMillis();
    ITesseract iTesseract = new Tesseract();
    iTesseract.setDatapath(tess4jConfiguration.getDataPath());
    iTesseract.setLanguage(tess4jConfiguration.getEnglishTrainData());
    File file = new File("F:/HeiMaTouTiao/tessdata/CaiXuKun-English.png");
    String result = iTesseract.doOCR(file);
    long end = System.currentTimeMillis();
    System.err.println("Elapsed time: " + (end - start) + "ms");
    System.out.println(result);
}
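Each test repeats the same start/end timing code around the recognition call. One possible way to factor that out is a small generic helper, sketched below with JDK classes only. Timed, Result and call are hypothetical names, and the record syntax assumes Java 17, which Spring Boot 3 requires anyway.

```java
import java.util.concurrent.Callable;

/** Illustrative timing helper; all names are hypothetical. */
final class Timed {

    /** The value produced by a task together with the time it took. */
    record Result<T>(T value, long elapsedMillis) { }

    private Timed() { }

    /** Runs the task once and measures its wall-clock duration in milliseconds. */
    static <T> Result<T> call(Callable<T> task) throws Exception {
        // nanoTime is monotonic, so it is not affected by system clock adjustments.
        long start = System.nanoTime();
        T value = task.call();
        long elapsedMillis = (System.nanoTime() - start) / 1_000_000;
        return new Result<>(value, elapsedMillis);
    }
}
```

Callable is used rather than Supplier because the recognition call declares a checked exception. A test could then wrap the call as Timed.call(() -> iTesseract.doOCR(file)) and print elapsedMillis() and value() from the returned result.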

5.3 Mixed Chinese‑English

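When the image contains both Chinese and English, use the Chinese language data (chi_sim) rather than eng: the English model cannot recognize Chinese characters, whereas the Chinese model in the test below also picks up the English text. Tesseract additionally accepts several language codes joined with a plus sign, for example chi_sim+eng, provided that every corresponding traineddata file is present in the data path.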

// Add this method to the Tess4jApplicationTests class from section 5.1
@Test
public void testChineseAndEnglish() throws TesseractException {
    long start = System.currentTimeMillis();
    ITesseract iTesseract = new Tesseract();
    iTesseract.setDatapath(tess4jConfiguration.getDataPath());
    // The Chinese language data is used for the mixed-language image
    iTesseract.setLanguage(tess4jConfiguration.getChineseTrainData());
    File file = new File("F:/HeiMaTouTiao/tessdata/ParagraphWithChineseAndEnglish.png");
    String result = iTesseract.doOCR(file);
    long end = System.currentTimeMillis();
    System.err.println("Elapsed time: " + (end - start) + "ms");
    System.out.println(result);
}
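Recognized Chinese text can contain stray spaces between characters, so the post-processing step from section 1 is often worth applying to the result string. The sketch below removes whitespace that sits between two Han characters and leaves the spacing of English words alone. OcrTextCleaner and removeSpacesBetweenHan are hypothetical names, and whether this cleanup is needed depends on the image and language data in use.

```java
import java.util.regex.Pattern;

/** Illustrative post-processing helper; class and method names are hypothetical. */
final class OcrTextCleaner {

    // A run of spaces or tabs that has a Han character directly before and after it.
    private static final Pattern GAP_BETWEEN_HAN =
            Pattern.compile("(?<=\\p{IsHan})[ \\t]+(?=\\p{IsHan})");

    private OcrTextCleaner() { }

    /** Removes spaces between Han characters; all other whitespace is kept as-is. */
    static String removeSpacesBetweenHan(String text) {
        return GAP_BETWEEN_HAN.matcher(text).replaceAll("");
    }
}
```

Line breaks are deliberately left untouched so that the paragraph structure of the recognized text survives the cleanup.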

6. Important notes

The traineddata files must retain the .traineddata suffix, and the filename prefix (e.g., chi_sim or eng) must match the language code supplied to setLanguage in the Java code.
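A mismatch between the language code and the file name is easiest to catch before the engine is invoked. The following sketch checks that every code in a language string has a matching file in the data directory. It assumes that several codes may be joined with a plus sign, as Tesseract allows; TessdataCheck and missingLanguages are hypothetical names.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

/** Illustrative pre-flight check; class and method names are hypothetical. */
final class TessdataCheck {

    private TessdataCheck() { }

    /**
     * Returns the language codes for which no "<code>.traineddata" file exists
     * in the given directory. An empty list means all requested languages are present.
     */
    static List<String> missingLanguages(Path tessdataDir, String languages) {
        List<String> missing = new ArrayList<>();
        // Codes may be combined with '+', for example "chi_sim+eng".
        for (String code : languages.split("\\+")) {
            String trimmed = code.trim();
            if (trimmed.isEmpty()) {
                continue;
            }
            if (!Files.isRegularFile(tessdataDir.resolve(trimmed + ".traineddata"))) {
                missing.add(trimmed);
            }
        }
        return missing;
    }
}
```

Calling this with the configured data path and language before setLanguage gives a clear message about which file is absent, instead of a failure inside the native library.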

Written by

java1234

Former senior programmer at a Fortune Global 500 company, dedicated to sharing Java expertise. Visit Feng's site: Java Knowledge Sharing, www.java1234.com
