Backend Development 6 min read

Integrating Apache Tika into a Spring Boot Application for Document Parsing

This guide shows how to integrate Apache Tika into a Spring Boot application, covering Maven dependencies, XML configuration, a Spring @Configuration class, and usage of Tika’s detection and parsing APIs for processing various document formats.

Code Ape Tech Column
Code Ape Tech Column
Code Ape Tech Column
Integrating Apache Tika into a Spring Boot Application for Document Parsing

Apache Tika is an open‑source document parsing library that can extract content from over a thousand file formats, and it can be used via a GUI, a server, or embedded in applications.

This article demonstrates how to integrate Tika into a Spring Boot project, starting with the required Maven dependencies.

First, add the Tika BOM and the core and parser packages in the dependencyManagement section and as regular dependencies:

<dependencyManagement>
  <dependencies>
    <dependency>
      <groupId>org.apache.tika</groupId>
      <artifactId>tika-bom</artifactId>
      <version>2.8.0</version>
      <type>pom</type>
      <scope>import</scope>
    </dependency>
  </dependencies>
</dependencyManagement>

<dependency>
  <groupId>org.apache.tika</groupId>
  <artifactId>tika-core</artifactId>
</dependency>
<dependency>
  <groupId>org.apache.tika</groupId>
  <artifactId>tika-parsers-standard-package</artifactId>
</dependency>

Next, place a tika-config.xml file under src/main/resources with custom encoding detectors configuration:

<?xml version="1.0" encoding="UTF-8"?>
<properties>
    <encodingDetectors>
        <encodingDetector class="org.apache.tika.parser.html.HtmlEncodingDetector">
            <params>
                <param name="markLimit" type="int">64000</param>
            </params>
        </encodingDetector>
        <encodingDetector class="org.apache.tika.parser.txt.UniversalEncodingDetector">
            <params>
                <param name="markLimit" type="int">64001</param>
            </params>
        </encodingDetector>
        <encodingDetector class="org.apache.tika.parser.txt.Icu4jEncodingDetector">
            <params>
                <param name="markLimit" type="int">64002</param>
            </params>
        </encodingDetector>
    </encodingDetectors>
</properties>

Then create a Spring configuration class MyTikaConfig that loads the XML, builds a TikaConfig , obtains a Detector and an AutoDetectParser , and registers a Tika bean for injection:

import java.io.IOException;
import java.io.InputStream;
import org.apache.tika.Tika;
import org.apache.tika.config.TikaConfig;
import org.apache.tika.detect.Detector;
import org.apache.tika.exception.TikaException;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.Parser;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.core.io.Resource;
import org.springframework.core.io.ResourceLoader;
import org.xml.sax.SAXException;

/**
 * Tika configuration class
 */
@Configuration
public class MyTikaConfig {

    @Autowired
    private ResourceLoader resourceLoader;

    @Bean
    public Tika tika() throws TikaException, IOException, SAXException {
        Resource resource = resourceLoader.getResource("classpath:tika-config.xml");
        InputStream inputStream = resource.getInputStream();
        TikaConfig config = new TikaConfig(inputStream);
        Detector detector = config.getDetector();
        Parser autoDetectParser = new AutoDetectParser(config);
        return new Tika(detector, autoDetectParser);
    }
}

With the Tika bean injected, the application can call Tika’s detect , translate , and parse methods to process documents.

An example usage screenshot is shown below:

Finally, the author invites readers to like, share, and follow the article, and promotes a paid knowledge community for deeper learning.

Javabackend developmentSpring BootApache TikaDocument Parsing
Code Ape Tech Column
Written by

Code Ape Tech Column

Former Ant Group P8 engineer, pure technologist, sharing full‑stack Java, job interview and career advice through a column. Site: java-family.cn

0 followers
Reader feedback

How this landed with the community

login Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.