
How to Integrate Tess4J OCR into a Spring Boot Application

This article explains OCR fundamentals, introduces Tesseract and its Java wrapper Tess4J, guides you through downloading language data, shows step‑by‑step Spring Boot integration with Maven dependencies and configuration classes, and provides test code for Chinese, English, and mixed‑language image recognition.

Java Architect Essentials

Apr 17, 2026

1. Overview of OCR and Tess4J

Optical Character Recognition (OCR) transforms printed or handwritten text in documents, PDFs, or images into editable and searchable digital formats. A typical OCR pipeline consists of image preprocessing, text detection, feature extraction, character recognition, and post‑processing.

1.1 OCR workflow

Image preprocessing: acquisition, binarization, denoising, rotation correction, segmentation.

Text detection: locate text lines, words, or character boundaries.

Feature extraction: derive visual features used for recognition.

Character recognition: match extracted features against known character patterns.

Post-processing: proofreading, formatting, and layout analysis to improve readability.
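To make the preprocessing stage concrete, here is a minimal sketch of grayscale conversion followed by fixed-threshold binarization, using only the JDK. The class name SimpleBinarizer and the threshold parameter are hypothetical and not part of Tess4J or Tesseract. Tesseract performs its own binarization internally, so a step like this is optional and mainly useful for experimenting with difficult images.

```java
import java.awt.image.BufferedImage;

/** Hypothetical helper (not part of Tess4J): fixed-threshold binarization. */
final class SimpleBinarizer {

    private SimpleBinarizer() {}

    /**
     * Returns a black-and-white copy of src. Pixels whose luminance is
     * greater than or equal to threshold (0-255) become white, the rest black.
     */
    static BufferedImage binarize(BufferedImage src, int threshold) {
        BufferedImage out = new BufferedImage(
                src.getWidth(), src.getHeight(), BufferedImage.TYPE_BYTE_BINARY);
        for (int y = 0; y < src.getHeight(); y++) {
            for (int x = 0; x < src.getWidth(); x++) {
                int rgb = src.getRGB(x, y);
                int r = (rgb >> 16) & 0xFF;
                int g = (rgb >> 8) & 0xFF;
                int b = rgb & 0xFF;
                // Integer approximation of the common luma weights 0.299, 0.587, 0.114
                int luma = (299 * r + 587 * g + 114 * b) / 1000;
                out.setRGB(x, y, luma >= threshold ? 0xFFFFFFFF : 0xFF000000);
            }
        }
        return out;
    }
}
```

An image loaded with ImageIO.read(file) could be passed through SimpleBinarizer.binarize(image, 128) before recognition. A fixed threshold is the simplest possible choice; unevenly lit scans usually need an adaptive method instead.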

1.2 Tesseract OCR

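Tesseract is an open-source OCR engine originally developed at Hewlett-Packard in the 1980s and 1990s, released as open source in 2005, and subsequently sponsored by Google. It is among the most widely used open-source OCR engines and supports more than 100 languages through downloadable language data files.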

1.3 Tess4J

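Tess4J is a Java wrapper for the Tesseract engine. It calls the native Tesseract library through JNA and exposes a simple API, so application code can perform OCR without writing any native code itself. The Tess4J artifact bundles the native libraries for Windows; on Linux and macOS, Tesseract typically has to be installed separately so that the native library can be found at runtime. Key characteristics include easy integration, cross-platform support (Windows, macOS, Linux), access to Tesseract features such as multi-language recognition and custom-trained data, and an active open-source community.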

2. Download language data

Language data files are hosted at https://github.com/tesseract-ocr/tessdata. For Chinese, download chi_sim.traineddata; for English, download eng.traineddata. Store the files in a directory of your choice and keep the .traineddata extension unchanged.

3. Integrate Tess4J into a Spring Boot project

3.1 Maven dependency

<dependency>
    <groupId>net.sourceforge.tess4j</groupId>
    <artifactId>tess4j</artifactId>
    <version>4.1.1</version>
</dependency>

3.2 Application configuration (application.yml)

server:
  port: 11014

tess4j:
  data-path: F:/HeiMaTouTiao/tessdata   # absolute path to the directory containing .traineddata files
  chinese-train-data: chi_sim
  english-train-data: eng

3.3 Configuration properties class

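package cn.edu.scau.config;

import org.springframework.boot.context.properties.ConfigurationProperties;
import org.springframework.context.annotation.Configuration;

// Binds the tess4j.* keys from application.yml (data-path maps to dataPath, and so on)
@Configuration
@ConfigurationProperties(prefix = "tess4j")
public class Tess4jConfiguration {
    private String dataPath;          // directory containing the .traineddata files
    private String chineseTrainData;  // language code, e.g. chi_sim
    private String englishTrainData;  // language code, e.g. eng

    // Setters are required for property binding; the tests read values through the getters
    public String getDataPath() { return dataPath; }
    public void setDataPath(String dataPath) { this.dataPath = dataPath; }
    public String getChineseTrainData() { return chineseTrainData; }
    public void setChineseTrainData(String chineseTrainData) { this.chineseTrainData = chineseTrainData; }
    public String getEnglishTrainData() { return englishTrainData; }
    public void setEnglishTrainData(String englishTrainData) { this.englishTrainData = englishTrainData; }
}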

4. Test image recognition

4.1 Chinese text

The following JUnit test loads an image containing Chinese text and prints the OCR result together with the elapsed time.

import cn.edu.scau.config.Tess4jConfiguration;
import net.sourceforge.tess4j.ITesseract;
import net.sourceforge.tess4j.Tesseract;
import net.sourceforge.tess4j.TesseractException;
import org.junit.jupiter.api.Test;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.boot.test.context.SpringBootTest;
import java.io.File;

@SpringBootTest
public class Tess4jApplicationTests {
    @Autowired
    private Tess4jConfiguration tess4jConfiguration;

    @Test
    public void testChinese() throws TesseractException {
        long start = System.currentTimeMillis();
        ITesseract tesseract = new Tesseract();
        tesseract.setDatapath(tess4jConfiguration.getDataPath());
        tesseract.setLanguage(tess4jConfiguration.getChineseTrainData());
        File file = new File("F:/HeiMaTouTiao/tessdata/CaiXuKun-Chinese.png");
        String result = tesseract.doOCR(file);
        long end = System.currentTimeMillis();
        System.err.println("Time: " + (end - start) + " ms");
        System.out.println(result);
    }
}
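Depending on the language data and Tesseract version, the recognized Chinese text can contain spaces between adjacent characters. As an illustration of the post-processing stage described in section 1.1, the sketch below removes horizontal whitespace that sits directly between two Han characters while leaving spaces around Latin words untouched. The class OcrTextCleaner and its method name are hypothetical, and whether this cleanup is needed at all depends on your output.

```java
import java.util.regex.Pattern;

/** Hypothetical post-processing helper (not part of Tess4J). */
final class OcrTextCleaner {

    // Spaces or tabs that have a Han character immediately before and after them.
    // Line breaks are deliberately not matched, so the paragraph layout is kept.
    private static final Pattern SPACE_BETWEEN_HAN =
            Pattern.compile("(?<=\\p{IsHan})[ \\t]+(?=\\p{IsHan})");

    private OcrTextCleaner() {}

    /** Returns ocrText with whitespace between adjacent Han characters removed. */
    static String removeSpacesBetweenHan(String ocrText) {
        return SPACE_BETWEEN_HAN.matcher(ocrText).replaceAll("");
    }
}
```

In the test above, the result could be passed through OcrTextCleaner.removeSpacesBetweenHan(result) before printing.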


4.2 English text

The English test is identical except that it selects the English language data. Add the method to the test class shown in section 4.1.

@Test
public void testEnglish() throws TesseractException {
    long start = System.currentTimeMillis();
    ITesseract tesseract = new Tesseract();
    tesseract.setDatapath(tess4jConfiguration.getDataPath());
    tesseract.setLanguage(tess4jConfiguration.getEnglishTrainData());
    File file = new File("F:/HeiMaTouTiao/tessdata/CaiXuKun-English.png");
    String result = tesseract.doOCR(file);
    long end = System.currentTimeMillis();
    System.err.println("Time: " + (end - start) + " ms");
    System.out.println(result);
}


4.3 Mixed Chinese‑English text

When a document contains both Chinese and English characters, the Chinese language data can be used on its own, because it also recognizes Latin letters. The test below takes that approach and belongs in the same test class as the previous two methods.

@Test
public void testChineseAndEnglish() throws TesseractException {
    long start = System.currentTimeMillis();
    ITesseract tesseract = new Tesseract();
    tesseract.setDatapath(tess4jConfiguration.getDataPath());
    tesseract.setLanguage(tess4jConfiguration.getChineseTrainData());
    File file = new File("F:/HeiMaTouTiao/tessdata/ParagraphWithChineseAndEnglish.png");
    String result = tesseract.doOCR(file);
    long end = System.currentTimeMillis();
    System.err.println("Time: " + (end - start) + " ms");
    System.out.println(result);
}
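Tesseract also accepts several language codes joined with a plus sign, for example chi_sim+eng, and the string given to setLanguage is passed through to the engine. If English words come out poorly with chi_sim alone, combining both languages may be worth trying, provided both .traineddata files are in the data directory. The helper below is a hypothetical sketch (the class OcrLanguages is not part of Tess4J) that builds such a string from configured codes.

```java
import java.util.Arrays;
import java.util.stream.Collectors;

/** Hypothetical helper (not part of Tess4J) for building a multi-language string. */
final class OcrLanguages {

    private OcrLanguages() {}

    /**
     * Joins language codes with '+', skipping null or blank entries and
     * duplicates, e.g. combine("chi_sim", "eng") gives "chi_sim+eng".
     */
    static String combine(String... codes) {
        return Arrays.stream(codes)
                .filter(code -> code != null && !code.trim().isEmpty())
                .map(String::trim)
                .distinct()
                .collect(Collectors.joining("+"));
    }
}
```

In the test above, the setLanguage argument could then be OcrLanguages.combine(tess4jConfiguration.getChineseTrainData(), tess4jConfiguration.getEnglishTrainData()). Loading two languages generally makes recognition slower, so compare the elapsed time as well as the accuracy.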


5. Precautions

The language data files must retain the .traineddata suffix, and the file name prefix (e.g., chi_sim, eng) must exactly match the language identifier used in the Java configuration. The data-path in application.yml must be an absolute directory path accessible to the application at runtime.
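These precautions can be checked in code before the first OCR call. The following is a minimal sketch using only the JDK; the class TessdataValidator and its messages are hypothetical and not part of Tess4J. It reports whether the data path is absolute, whether it is a directory, and whether a .traineddata file exists for each configured language code.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical startup check (not part of Tess4J) for the tessdata directory. */
final class TessdataValidator {

    private TessdataValidator() {}

    /** Returns a list of human-readable problems; an empty list means the setup looks fine. */
    static List<String> findProblems(String dataPath, String... languages) {
        List<String> problems = new ArrayList<>();
        Path dir = Paths.get(dataPath);
        if (!dir.isAbsolute()) {
            problems.add("data path is not absolute: " + dataPath);
        }
        if (!Files.isDirectory(dir)) {
            problems.add("data path is not a directory: " + dataPath);
            return problems; // no point in looking for files in a missing directory
        }
        for (String language : languages) {
            // The file name prefix must match the language code exactly
            Path file = dir.resolve(language + ".traineddata");
            if (!Files.isRegularFile(file)) {
                problems.add("missing language file: " + file.getFileName());
            }
        }
        return problems;
    }
}
```

It could be called with the values from Tess4jConfiguration, for example TessdataValidator.findProblems(config.getDataPath(), config.getChineseTrainData(), config.getEnglishTrainData()), and the application could log the problems or fail fast if the list is not empty.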
