How to Parse Documents in Spring Boot with Apache Tika

Learn how to integrate Apache Tika into a Spring Boot application to parse a wide range of document formats. This guide covers the Maven dependencies, the tika-config.xml file, a custom Spring configuration class, and basic usage, so you can extract and process document content in your Java backend.

Java High-Performance Architecture

Apache Tika is an open-source toolkit that detects the type of, and extracts text and metadata from, over a thousand file types, such as PPT, XLS, and PDF. It can be run as a desktop application with a graphical UI, run as a server, or embedded as a library in a Java project.
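
As a rough illustration of the embedded mode, the sketch below uses the `Tika` facade with its default configuration to detect a media type and extract text from an in-memory document. The class name `TikaQuickStart` and the sample bytes are assumptions made for this example. It also assumes a parser module such as `tika-parsers-standard-package` is on the classpath, since text extraction relies on the parser modules rather than on `tika-core` alone.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

import org.apache.tika.Tika;
import org.apache.tika.exception.TikaException;

// Hypothetical helper class, named for this example only.
class TikaQuickStart {

    // The facade with Tika's default detector and parser.
    private static final Tika TIKA = new Tika();

    // Detects the media type from the leading bytes of a document.
    static String detectType(byte[] content) {
        return TIKA.detect(content);
    }

    // Extracts the plain text of a document held in memory.
    static String extractText(byte[] content) {
        try (InputStream in = new ByteArrayInputStream(content)) {
            return TIKA.parseToString(in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (TikaException e) {
            throw new IllegalStateException("Tika could not parse the document", e);
        }
    }

    public static void main(String[] args) {
        byte[] sample = "Hello, Tika".getBytes(StandardCharsets.UTF_8);
        System.out.println(detectType(sample));
        System.out.println(extractText(sample).trim());
    }
}
```

Note that `parseToString` truncates its result at the facade's maximum string length (100,000 characters by default); `setMaxStringLength` changes that limit.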

Add Maven Dependencies

<dependencyManagement>
  <dependencies>
    <dependency>
      <groupId>org.apache.tika</groupId>
      <artifactId>tika-bom</artifactId>
      <version>2.8.0</version>
      <type>pom</type>
      <scope>import</scope>
    </dependency>
  </dependencies>
</dependencyManagement>

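<!-- Versions are omitted here because the tika-bom import above manages them. -->
<dependencies>
  <dependency>
    <groupId>org.apache.tika</groupId>
    <artifactId>tika-core</artifactId>
  </dependency>
  <dependency>
    <groupId>org.apache.tika</groupId>
    <artifactId>tika-parsers-standard-package</artifactId>
  </dependency>
</dependencies>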

Create tika-config.xml

<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <encodingDetectors>
    <encodingDetector class="org.apache.tika.parser.html.HtmlEncodingDetector">
      <params>
        <param name="markLimit" type="int">64000</param>
      </params>
    </encodingDetector>
    <encodingDetector class="org.apache.tika.parser.txt.UniversalEncodingDetector">
      <params>
        <param name="markLimit" type="int">64001</param>
      </params>
    </encodingDetector>
    <encodingDetector class="org.apache.tika.parser.txt.Icu4jEncodingDetector">
      <params>
        <param name="markLimit" type="int">64002</param>
      </params>
    </encodingDetector>
  </encodingDetectors>
</properties>

Define a Spring configuration class

import java.io.IOException;
import java.io.InputStream;
import org.apache.tika.Tika;
import org.apache.tika.config.TikaConfig;
import org.apache.tika.detect.Detector;
import org.apache.tika.exception.TikaException;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.Parser;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.core.io.Resource;
import org.springframework.core.io.ResourceLoader;
import org.xml.sax.SAXException;

@Configuration
public class MyTikaConfig {

    @Autowired
    private ResourceLoader resourceLoader;

    @Bean
    public Tika tika() throws TikaException, IOException, SAXException {
        Resource resource = resourceLoader.getResource("classpath:tika-config.xml");

        // try-with-resources closes the stream once TikaConfig has read the XML.
        try (InputStream inputStream = resource.getInputStream()) {
            TikaConfig config = new TikaConfig(inputStream);
            // Build the detector and parser from the custom configuration.
            Detector detector = config.getDetector();
            Parser autoDetectParser = new AutoDetectParser(config);

            return new Tika(detector, autoDetectParser);
        }
    }
}

With the Tika bean injected, you can call its methods directly in your Spring Boot services: detect identifies a document's media type, parse and parseToString extract its text (and populate a Metadata object when you pass one), and translate translates text.
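
The sketch below shows one way a service might use the injected bean. The class `DocumentTextService` and its method names are hypothetical, introduced only for this example; the `Tika` calls (`detect`, `parseToString`) and the `Metadata` class come from the library.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

import org.apache.tika.Tika;
import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.springframework.stereotype.Service;

// Hypothetical service; the class and method names are illustrative only.
@Service
class DocumentTextService {

    private final Tika tika;

    // Spring injects the Tika bean defined in MyTikaConfig.
    DocumentTextService(Tika tika) {
        this.tika = tika;
    }

    // Detects the media type from the leading bytes plus the file name as a hint.
    String detectType(byte[] prefix, String fileName) {
        return tika.detect(prefix, fileName);
    }

    // Extracts plain text and fills the supplied Metadata with what the parser reports.
    String extractText(InputStream document, Metadata metadata) {
        try {
            // parseToString closes the stream when it finishes.
            return tika.parseToString(document, metadata);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (TikaException e) {
            throw new IllegalStateException("Tika could not parse the document", e);
        }
    }
}
```

A caller could pass a fresh `Metadata` instance to `extractText` and afterwards read entries such as `Content-Type` from it. Wrapping the checked exceptions in unchecked ones is a design choice for this sketch, not something Tika requires.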

Run the example

The following screenshot shows the result of parsing a sample file after the configuration is complete.
