How to Perform OCR in Spring Boot Using Tess4j

This tutorial explains OCR fundamentals, introduces Tesseract and its Java wrapper Tess4j, and shows how to download language data and integrate Tess4j into a Spring Boot 3 project with Maven. It also provides test code for Chinese, English, and mixed-language image recognition, and notes how long recognition takes.

SpringMeng

OCR workflow

Optical Character Recognition (OCR) converts printed or handwritten text in documents, PDFs, or images into editable and searchable digital text. The typical workflow consists of:

Image preprocessing: acquisition, binarization, denoising, rotation correction, segmentation, etc.

Text detection: locating text lines, words, or character boundaries.

Feature extraction: extracting visual features for recognition.

Character recognition: matching extracted features against known character patterns.

Post-processing: proofreading, formatting, and layout analysis to improve readability.
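As an illustration of the preprocessing step, here is a minimal sketch of fixed-threshold binarization using only the JDK's BufferedImage. The class name Binarizer and the threshold value are assumptions made for this example; the article does not prescribe a preprocessing method, and the best threshold depends on the image.

```java
import java.awt.image.BufferedImage;

/** Hypothetical helper (not part of Tess4j): fixed-threshold binarization. */
final class Binarizer {

    private static final int WHITE = 0xFFFFFFFF;
    private static final int BLACK = 0xFF000000;

    private Binarizer() {}

    /**
     * Returns a black-and-white copy of the source image.
     * Pixels whose luminance (0-255) is at or above the threshold become white,
     * all others become black.
     */
    static BufferedImage binarize(BufferedImage source, int threshold) {
        int width = source.getWidth();
        int height = source.getHeight();
        BufferedImage result = new BufferedImage(width, height, BufferedImage.TYPE_BYTE_BINARY);
        for (int y = 0; y < height; y++) {
            for (int x = 0; x < width; x++) {
                int rgb = source.getRGB(x, y);
                int r = (rgb >> 16) & 0xFF;
                int g = (rgb >> 8) & 0xFF;
                int b = rgb & 0xFF;
                // Integer form of the common luma weights 0.299, 0.587, 0.114.
                int luminance = (299 * r + 587 * g + 114 * b) / 1000;
                result.setRGB(x, y, luminance >= threshold ? WHITE : BLACK);
            }
        }
        return result;
    }
}
```

A call such as Binarizer.binarize(image, 128) could be applied to an image before it is handed to the OCR engine; whether it improves accuracy has to be checked per image.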

Tesseract OCR engine

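Tesseract is an open-source OCR engine originally developed at Hewlett-Packard and open-sourced in 2005. Google sponsored its development from 2006 to 2018, and it is now maintained by the open-source community on GitHub. It is one of the most widely used open-source OCR engines.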

Tess4j Java wrapper

Tess4j provides a Java API that wraps the Tesseract engine, enabling Java applications to invoke OCR functions directly. Its main characteristics are:

Easy integration: simple API for Java projects.

Cross-platform: runs on Windows, macOS, and Linux, provided the native Tesseract libraries are available to the Java runtime.

Rich functionality: supports the major Tesseract features, including multi-language recognition and custom training data.

Active community: open-source project with regular updates.

Download language data

Language data files are required by Tesseract. They can be obtained from the official repository:

https://github.com/tesseract-ocr/tessdata

For Chinese, download chi_sim.traineddata; for English, download eng.traineddata. Place the files in a local directory, e.g. F:/HeiMaTouTiao/tessdata, and note the absolute path.

Figure: Chinese traineddata download
Figure: English traineddata download

Integrate Tess4j into a Spring Boot project

Environment: JDK 17.0.7 + Spring Boot 3.0.2.

Add Maven dependency

<dependency>
  <groupId>net.sourceforge.tess4j</groupId>
  <artifactId>tess4j</artifactId>
  <version>4.1.1</version>
</dependency>

Configure application.yml

server:
  port: 11014

tess4j:
  data-path: F:/HeiMaTouTiao/tessdata
  chinese-train-data: chi_sim
  english-train-data: eng

Create configuration class

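package cn.edu.scau.config;

import org.springframework.boot.context.properties.ConfigurationProperties;
import org.springframework.context.annotation.Configuration;

// Binds the tess4j.* keys from application.yml (data-path maps to dataPath, and so on).
@Configuration
@ConfigurationProperties(prefix = "tess4j")
public class Tess4jConfiguration {
    private String dataPath;
    private String chineseTrainData;
    private String englishTrainData;

    public String getDataPath() { return dataPath; }
    public void setDataPath(String dataPath) { this.dataPath = dataPath; }
    public String getChineseTrainData() { return chineseTrainData; }
    public void setChineseTrainData(String chineseTrainData) { this.chineseTrainData = chineseTrainData; }
    public String getEnglishTrainData() { return englishTrainData; }
    public void setEnglishTrainData(String englishTrainData) { this.englishTrainData = englishTrainData; }
}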

Test image recognition

Chinese OCR

import cn.edu.scau.config.Tess4jConfiguration;
import net.sourceforge.tess4j.ITesseract;
import net.sourceforge.tess4j.Tesseract;
import net.sourceforge.tess4j.TesseractException;
import org.junit.jupiter.api.Test;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.boot.test.context.SpringBootTest;
import java.io.File;

@SpringBootTest
public class Tess4jApplicationTests {
    @Autowired
    private Tess4jConfiguration tess4jConfiguration;

    @Test
    public void testChinese() throws TesseractException {
        long start = System.currentTimeMillis();
        ITesseract iTesseract = new Tesseract();
        iTesseract.setDatapath(tess4jConfiguration.getDataPath());
        iTesseract.setLanguage(tess4jConfiguration.getChineseTrainData());
        File file = new File("F:/HeiMaTouTiao/tessdata/CaiXuKun-Chinese.png");
        String result = iTesseract.doOCR(file);
        long end = System.currentTimeMillis();
        System.err.println("Elapsed time: " + (end - start) + "ms");
        System.out.println(result);
    }
}

The test prints the OCR result and the elapsed time, demonstrating that OCR can be time‑consuming.
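The start/end bookkeeping is repeated in every test, so it can be factored out. The following is a minimal sketch of a timing helper; the names Timed, measure, and Result are hypothetical and not part of Tess4j or Spring. It takes a Callable rather than a Supplier so that a call throwing a checked exception, such as doOCR, can be passed in directly.

```java
import java.util.concurrent.Callable;

/** Hypothetical helper: runs a task and reports how long it took. */
final class Timed {

    /** Holds the task's return value together with the elapsed wall-clock time. */
    static final class Result<T> {
        private final T value;
        private final long elapsedMillis;

        Result(T value, long elapsedMillis) {
            this.value = value;
            this.elapsedMillis = elapsedMillis;
        }

        public T value() { return value; }

        public long elapsedMillis() { return elapsedMillis; }
    }

    private Timed() {}

    /** Runs the task once and measures it with the monotonic System.nanoTime clock. */
    static <T> Result<T> measure(Callable<T> task) throws Exception {
        long start = System.nanoTime();
        T value = task.call();
        long elapsedMillis = (System.nanoTime() - start) / 1_000_000;
        return new Result<>(value, elapsedMillis);
    }
}
```

In a test method declared with throws Exception, Timed.measure(() -> iTesseract.doOCR(file)) would return both the recognized text and the elapsed milliseconds.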

Figure: Chinese OCR timing

English OCR

@Test
public void testEnglish() throws TesseractException {
    long start = System.currentTimeMillis();
    ITesseract iTesseract = new Tesseract();
    iTesseract.setDatapath(tess4jConfiguration.getDataPath());
    iTesseract.setLanguage(tess4jConfiguration.getEnglishTrainData());
    File file = new File("F:/HeiMaTouTiao/tessdata/CaiXuKun-English.png");
    String result = iTesseract.doOCR(file);
    long end = System.currentTimeMillis();
    System.err.println("Elapsed time: " + (end - start) + "ms");
    System.out.println(result);
}

Chinese‑English mixed OCR

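When the image contains both Chinese and English, the language setting must include the Chinese traineddata, because eng alone cannot recognize Chinese characters. The test below uses chi_sim on its own.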

@Test
public void testChineseAndEnglish() throws TesseractException {
    long start = System.currentTimeMillis();
    ITesseract iTesseract = new Tesseract();
    iTesseract.setDatapath(tess4jConfiguration.getDataPath());
    iTesseract.setLanguage(tess4jConfiguration.getChineseTrainData());
    File file = new File("F:/HeiMaTouTiao/tessdata/ParagraphWithChineseAndEnglish.png");
    String result = iTesseract.doOCR(file);
    long end = System.currentTimeMillis();
    System.err.println("Elapsed time: " + (end - start) + "ms");
    System.out.println(result);
}
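As an alternative to loading chi_sim alone, Tesseract's language option accepts several codes joined with a plus sign, for example chi_sim+eng, and the resulting string can be passed to setLanguage. The sketch below only builds that string; the class name Languages is hypothetical, and whether the combined setting improves accuracy on the article's sample images has not been verified here.

```java
/** Hypothetical helper: builds a combined Tesseract language string. */
final class Languages {

    private Languages() {}

    /**
     * Joins language codes with '+', the separator Tesseract uses for
     * multi-language recognition (for example "chi_sim+eng").
     */
    static String combine(String... codes) {
        if (codes == null || codes.length == 0) {
            throw new IllegalArgumentException("at least one language code is required");
        }
        for (String code : codes) {
            if (code == null || code.isBlank()) {
                throw new IllegalArgumentException("language codes must not be blank");
            }
        }
        return String.join("+", codes);
    }
}
```

With the configuration class from this article, the call would look like iTesseract.setLanguage(Languages.combine(tess4jConfiguration.getChineseTrainData(), tess4jConfiguration.getEnglishTrainData())). Both traineddata files must be present in the data path.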

Important notes

Language data files must have the .traineddata suffix and the filename prefix (e.g., chi_sim, eng) must match the value set in the Java configuration.
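The naming rule can be checked before any OCR call is made. The following is a minimal sketch, assuming the configured language codes are available as a list; the class TrainedDataChecker is hypothetical and uses only java.nio.file.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: verifies that each language code has a matching data file. */
final class TrainedDataChecker {

    private TrainedDataChecker() {}

    /**
     * Returns the codes for which dataDir does not contain a file named
     * "<code>.traineddata". An empty list means every code is covered.
     */
    static List<String> missingLanguages(Path dataDir, List<String> codes) {
        List<String> missing = new ArrayList<>();
        for (String code : codes) {
            // The file name prefix must equal the code passed to setLanguage.
            Path expected = dataDir.resolve(code + ".traineddata");
            if (!Files.isRegularFile(expected)) {
                missing.add(code);
            }
        }
        return missing;
    }
}
```

For the setup in this article, TrainedDataChecker.missingLanguages(Path.of("F:/HeiMaTouTiao/tessdata"), List.of("chi_sim", "eng")) should return an empty list once both files are in place.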

Figure: Traineddata naming