
Integrating Tess4j OCR into a Spring Boot 3 Project

This guide explains OCR fundamentals, introduces Tesseract and Tess4j, shows how to download the required language data files, and provides step‑by‑step instructions with Maven configuration, Spring Boot properties, Java code, and test examples for Chinese, English, and mixed‑language image recognition.

Java Architect Handbook

Apr 1, 2026

1. Overview of OCR and Tess4j

Optical Character Recognition (OCR) transforms printed or handwritten text in images into editable, searchable digital text. A typical OCR pipeline consists of image preprocessing, text detection, feature extraction, character recognition, and post‑processing.
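As an illustration of the preprocessing stage, the following is a minimal sketch of grayscale thresholding (binarization) using only the JDK. The class name OcrPreprocessor, the method binarize, and the fixed threshold are assumptions made for this example and are not part of Tess4j. Tesseract also binarizes images internally, so a step like this may or may not improve results for a given image.

```java
import java.awt.image.BufferedImage;

/** Hypothetical helper (not part of Tess4j): turns an image into pure black and white. */
final class OcrPreprocessor {

    /** Returns a copy of source in which every pixel is either black or white. */
    static BufferedImage binarize(BufferedImage source, int threshold) {
        int width = source.getWidth();
        int height = source.getHeight();
        BufferedImage out = new BufferedImage(width, height, BufferedImage.TYPE_BYTE_BINARY);
        for (int y = 0; y < height; y++) {
            for (int x = 0; x < width; x++) {
                int rgb = source.getRGB(x, y);
                int r = (rgb >> 16) & 0xFF;
                int g = (rgb >> 8) & 0xFF;
                int b = rgb & 0xFF;
                // Integer approximation of the usual luma weights (0.299, 0.587, 0.114).
                int luma = (299 * r + 587 * g + 114 * b) / 1000;
                // Pixels at or above the threshold become white, all others black.
                out.setRGB(x, y, luma >= threshold ? 0xFFFFFFFF : 0xFF000000);
            }
        }
        return out;
    }
}
```

The returned image could then be passed to ITesseract.doOCR(BufferedImage) in place of a File; a threshold around 128 is only a starting point and would need tuning per image source.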

Tesseract OCR is an open‑source engine originally created at HP and later developed at Google for many years; it is now maintained as a community project on GitHub. Tess4j is a JNA‑based Java wrapper for Tesseract that exposes a simple API, so Java applications can run OCR without calling the native libraries directly.

2. Download language data files

The trained data files required by Tesseract are hosted at https://github.com/tesseract-ocr/tessdata. For a bilingual Chinese‑English example, download the following files and place them in a directory that the application will reference (e.g., F:/HeiMaTouTiao/tessdata):

chi_sim.traineddata (Simplified Chinese)
eng.traineddata (English)

3. Integrating Tess4j into a Spring Boot project

3.1 Maven dependency

<dependency>
  <groupId>net.sourceforge.tess4j</groupId>
  <artifactId>tess4j</artifactId>
  <version>4.1.1</version>
</dependency>

3.2 Application configuration (application.yml)

server:
  port: 11014

tess4j:
  data-path: F:/HeiMaTouTiao/tessdata
  chinese-train-data: chi_sim
  english-train-data: eng

Replace F:/HeiMaTouTiao/tessdata with the absolute path where the .traineddata files are stored.

3.3 Configuration properties class

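package cn.edu.scau.config;

import org.springframework.boot.context.properties.ConfigurationProperties;
import org.springframework.context.annotation.Configuration;

@Configuration
@ConfigurationProperties(prefix = "tess4j")
public class Tess4jConfiguration {
    private String dataPath;          // directory containing the .traineddata files
    private String chineseTrainData;  // language code, e.g. chi_sim
    private String englishTrainData;  // language code, e.g. eng

    public String getDataPath() { return dataPath; }
    public void setDataPath(String dataPath) { this.dataPath = dataPath; }
    public String getChineseTrainData() { return chineseTrainData; }
    public void setChineseTrainData(String chineseTrainData) { this.chineseTrainData = chineseTrainData; }
    public String getEnglishTrainData() { return englishTrainData; }
    public void setEnglishTrainData(String englishTrainData) { this.englishTrainData = englishTrainData; }
}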

3.4 Test code (JUnit 5)

Chinese example

import cn.edu.scau.config.Tess4jConfiguration;
import net.sourceforge.tess4j.ITesseract;
import net.sourceforge.tess4j.Tesseract;
import net.sourceforge.tess4j.TesseractException;
import org.junit.jupiter.api.Test;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.boot.test.context.SpringBootTest;
import java.io.File;

@SpringBootTest
public class Tess4jApplicationTests {
    @Autowired
    private Tess4jConfiguration tess4jConfiguration;

    @Test
    public void testChinese() throws TesseractException {
        long start = System.currentTimeMillis();
        ITesseract tesseract = new Tesseract();
        tesseract.setDatapath(tess4jConfiguration.getDataPath());
        tesseract.setLanguage(tess4jConfiguration.getChineseTrainData());
        File img = new File("F:/HeiMaTouTiao/tessdata/CaiXuKun-Chinese.png");
        String result = tesseract.doOCR(img);
        long end = System.currentTimeMillis();
        System.err.println("Time elapsed: " + (end - start) + " ms");
        System.out.println(result);
    }
}
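The start/end timing lines in the test can be factored into a small helper so that each test only states what it measures. This is a sketch under assumptions: the class name Timed and its call method are hypothetical, and System.nanoTime is swapped in for System.currentTimeMillis because it is the JDK's monotonic clock for measuring intervals.

```java
import java.util.concurrent.Callable;

/** Hypothetical helper: runs a task and reports its elapsed time on stderr. */
final class Timed {

    static <T> T call(String label, Callable<T> task) throws Exception {
        long start = System.nanoTime();
        try {
            return task.call();
        } finally {
            // Reported even when the task throws, so failed runs are timed too.
            long elapsedMs = (System.nanoTime() - start) / 1_000_000;
            System.err.println(label + ": " + elapsedMs + " ms");
        }
    }
}
```

A test could then be written as String result = Timed.call("Chinese OCR", () -> tesseract.doOCR(img)); with the test method declared as throws Exception.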

English example (change language to eng and use an English test image).

    @Test
    public void testEnglish() throws TesseractException {
        long start = System.currentTimeMillis();
        ITesseract tesseract = new Tesseract();
        tesseract.setDatapath(tess4jConfiguration.getDataPath());
        tesseract.setLanguage(tess4jConfiguration.getEnglishTrainData());
        File img = new File("F:/HeiMaTouTiao/tessdata/CaiXuKun-English.png");
        String result = tesseract.doOCR(img);
        long end = System.currentTimeMillis();
        System.err.println("Time elapsed: " + (end - start) + " ms");
        System.out.println(result);
    }

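Mixed Chinese‑English example. This test loads only the Chinese language data, which according to this guide can recognise both scripts. Tesseract also accepts several language codes joined with a plus sign (for example chi_sim+eng) when all of the corresponding files are in the data path, which may give better results on mixed text.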

    @Test
    public void testChineseAndEnglish() throws TesseractException {
        long start = System.currentTimeMillis();
        ITesseract tesseract = new Tesseract();
        tesseract.setDatapath(tess4jConfiguration.getDataPath());
        tesseract.setLanguage(tess4jConfiguration.getChineseTrainData());
        File img = new File("F:/HeiMaTouTiao/tessdata/ParagraphWithChineseAndEnglish.png");
        String result = tesseract.doOCR(img);
        long end = System.currentTimeMillis();
        System.err.println("Time elapsed: " + (end - start) + " ms");
        System.out.println(result);
    }
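For mixed text, an alternative to a single model is a combined language string such as chi_sim+eng, which Tesseract reads as "load both language packs". Below is a minimal sketch of building that string from the two configured values; the class OcrLanguages and its combine method are hypothetical names introduced for this example.

```java
import java.util.Arrays;
import java.util.stream.Collectors;

/** Hypothetical helper: joins language codes into Tesseract's "a+b" form. */
final class OcrLanguages {

    static String combine(String... codes) {
        return Arrays.stream(codes)
                .filter(code -> code != null && !code.isBlank()) // skip unset properties
                .map(String::trim)
                .distinct()                                      // avoid "eng+eng"
                .collect(Collectors.joining("+"));
    }
}
```

In the test above, the setLanguage call could then become tesseract.setLanguage(OcrLanguages.combine(tess4jConfiguration.getChineseTrainData(), tess4jConfiguration.getEnglishTrainData())). Whether this improves accuracy over chi_sim alone depends on the image and should be checked against real samples.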

4. Test images and expected results

[Figures: Chinese test image; English test image; mixed Chinese‑English test image]

5. Important notes

The language data files must retain the .traineddata suffix, and the filename prefix (e.g., chi_sim or eng) must exactly match the language code supplied to ITesseract.setLanguage(). Using mismatched names will cause Tesseract to fail loading the language pack.
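The naming rule above can be checked before any OCR call is made. The following is a sketch only: TessdataCheck and missingLanguages are hypothetical names, and the check assumes the files sit directly in the configured data directory, as in this guide's layout.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: finds language codes that have no matching .traineddata file. */
final class TessdataCheck {

    /**
     * @param dataDir   directory that would be passed to setDatapath
     * @param languages language string that would be passed to setLanguage, e.g. "chi_sim+eng"
     * @return codes for which dataDir/<code>.traineddata does not exist
     */
    static List<String> missingLanguages(Path dataDir, String languages) {
        List<String> missing = new ArrayList<>();
        for (String code : languages.split("\\+")) {
            String trimmed = code.trim();
            if (trimmed.isEmpty()) {
                continue;
            }
            if (!Files.isRegularFile(dataDir.resolve(trimmed + ".traineddata"))) {
                missing.add(trimmed);
            }
        }
        return missing;
    }
}
```

Calling TessdataCheck.missingLanguages(Path.of(tess4jConfiguration.getDataPath()), "chi_sim+eng") at startup and failing fast on a non-empty result would surface a misnamed file with a clearer message than the native loader's error.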
