How to Parse Documents in Spring Boot with Apache Tika

Learn how to integrate Apache Tika into a Spring Boot application: the Maven dependencies, an XML configuration file, a Spring configuration class, and how to use the resulting bean to extract content from a wide range of document formats in your Java backend.

Java High-Performance Architecture

Jun 7, 2024

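Apache Tika is an open-source library that detects and extracts text and metadata from more than a thousand file types, including PPT, XLS, and PDF. You can run it as a desktop or command-line application (tika-app), as a REST server (tika-server), or embed it as a library in a Java project, which is the approach used here.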

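Add Maven Dependencies

Import the Tika BOM in dependencyManagement so that every Tika module resolves to the same version, then declare the modules you need without a version.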

<dependencyManagement>
  <dependencies>
    <dependency>
      <groupId>org.apache.tika</groupId>
      <artifactId>tika-bom</artifactId>
      <version>2.8.0</version>
      <type>pom</type>
      <scope>import</scope>
    </dependency>
  </dependencies>
</dependencyManagement>

<dependencies>
  <!-- Versions are managed by the tika-bom import above -->
  <dependency>
    <groupId>org.apache.tika</groupId>
    <artifactId>tika-core</artifactId>
  </dependency>
  <dependency>
    <groupId>org.apache.tika</groupId>
    <artifactId>tika-parsers-standard-package</artifactId>
  </dependency>
</dependencies>
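
Once these dependencies resolve, a quick check outside Spring can confirm that detection and parsing work. The following is a minimal sketch, not part of the original setup: the class name `TikaQuickCheck` and the sample inputs are made up for illustration, and it uses Tika's default configuration rather than the XML file created in the next step.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import org.apache.tika.Tika;

// Hypothetical helper class for a quick sanity check; not from the original article.
public class TikaQuickCheck {

    // Detects a media type from the file name alone (no file access needed).
    static String typeOf(String fileName) {
        return new Tika().detect(fileName);
    }

    // Extracts plain text from an in-memory document using the default parsers.
    static String textOf(byte[] content) throws Exception {
        try (InputStream in = new ByteArrayInputStream(content)) {
            return new Tika().parseToString(in).trim();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(typeOf("report.pdf"));
        System.out.println(textOf("Hello, Tika".getBytes(StandardCharsets.UTF_8)));
    }
}
```

If the parser package is missing from the classpath, detection by name still works but text extraction returns little or nothing, which makes this a convenient way to spot a dependency problem early.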

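Create tika-config.xml

Place the file under src/main/resources so that it is available on the classpath. This example registers three encoding detectors, which are tried in the listed order, each with its own markLimit.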

<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <encodingDetectors>
    <encodingDetector class="org.apache.tika.parser.html.HtmlEncodingDetector">
      <params>
        <param name="markLimit" type="int">64000</param>
      </params>
    </encodingDetector>
    <encodingDetector class="org.apache.tika.parser.txt.UniversalEncodingDetector">
      <params>
        <param name="markLimit" type="int">64001</param>
      </params>
    </encodingDetector>
    <encodingDetector class="org.apache.tika.parser.txt.Icu4jEncodingDetector">
      <params>
        <param name="markLimit" type="int">64002</param>
      </params>
    </encodingDetector>
  </encodingDetectors>
</properties>
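
Each `markLimit` value caps how many bytes a detector may read ahead before the stream is rewound, so the parser still sees the document from its first byte. The sketch below illustrates that mark, read, reset pattern with plain `java.io`. It is not Tika's own code, and `MarkLimitDemo` and `peek` are hypothetical names chosen for this illustration.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Illustration only: shows the read-ahead-and-rewind idea behind markLimit.
public class MarkLimitDemo {

    // Reads at most markLimit bytes for inspection, then rewinds the stream.
    static byte[] peek(InputStream in, int markLimit) throws IOException {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        in.mark(markLimit);
        try {
            byte[] buffer = new byte[markLimit];
            int read = in.readNBytes(buffer, 0, markLimit);
            return Arrays.copyOf(buffer, read);
        } finally {
            in.reset(); // later readers start again at the first byte
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] doc = "<html><meta charset=\"utf-8\"><body>hi</body></html>"
                .getBytes(StandardCharsets.UTF_8);
        InputStream in = new BufferedInputStream(new ByteArrayInputStream(doc));

        byte[] head = peek(in, 16);          // what a detector would inspect
        byte[] whole = in.readAllBytes();    // what a parser would still receive

        System.out.println(head.length + " bytes inspected, "
                + whole.length + " bytes still available");
    }
}
```

A larger limit lets a detector see more of the document before deciding on an encoding, at the cost of buffering more bytes per stream.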

Define a Spring configuration class

import java.io.IOException;
import java.io.InputStream;
import org.apache.tika.Tika;
import org.apache.tika.config.TikaConfig;
import org.apache.tika.detect.Detector;
import org.apache.tika.exception.TikaException;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.Parser;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.core.io.Resource;
import org.springframework.core.io.ResourceLoader;
import org.xml.sax.SAXException;

@Configuration
public class MyTikaConfig {

    // Builds one shared Tika facade from tika-config.xml on the classpath.
    // Spring passes the ResourceLoader in as a bean-method parameter.
    @Bean
    public Tika tika(ResourceLoader resourceLoader) throws TikaException, IOException, SAXException {
        Resource resource = resourceLoader.getResource("classpath:tika-config.xml");

        // try-with-resources closes the stream once the configuration has been read
        try (InputStream inputStream = resource.getInputStream()) {
            TikaConfig config = new TikaConfig(inputStream);
            Detector detector = config.getDetector();
            Parser autoDetectParser = new AutoDetectParser(config);

            return new Tika(detector, autoDetectParser);
        }
    }
}

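With the Tika bean injected into your Spring Boot services, you can call detect to identify a document's media type, parseToString to extract its text as a String, or parse to obtain a Reader over the extracted text. Passing a Metadata object to these methods collects the document's metadata as well. The translate method only works when a translator implementation is available and configured.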
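
As a rough sketch of how a service might use the bean, assuming constructor injection: `DocumentTextService` and its method names are hypothetical and not part of the original article, while the `Tika` calls themselves (`detect(byte[], String)` and `parseToString(InputStream, Metadata)`) belong to the facade.

```java
import java.io.IOException;
import java.io.InputStream;
import org.apache.tika.Tika;
import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.springframework.stereotype.Service;

// Hypothetical service; the class and method names are illustrative.
@Service
public class DocumentTextService {

    private final Tika tika;

    // Spring supplies the Tika bean defined in MyTikaConfig.
    public DocumentTextService(Tika tika) {
        this.tika = tika;
    }

    // Detects the media type from the leading bytes, using the file name as a hint.
    public String detectType(byte[] content, String fileName) {
        return tika.detect(content, fileName);
    }

    // Extracts the body text; Tika fills the supplied Metadata while parsing
    // and closes the stream when it is done.
    public String extractText(InputStream in, Metadata metadata)
            throws IOException, TikaException {
        return tika.parseToString(in, metadata);
    }
}
```

Note that the facade's parseToString returns at most 100,000 characters by default. For longer documents, calling `setMaxStringLength` on the bean raises or removes that limit.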

Run the example

[Screenshot: result of parsing a sample file after the configuration is complete.]
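
To produce comparable output locally, one option is a small command-line runner that parses the files named on the command line. This is a sketch under assumptions: `ParseRunner` and `describe` are hypothetical names, and the application is assumed to be started with file paths as its only arguments.

```java
import java.io.IOException;
import java.nio.file.Path;
import org.apache.tika.Tika;
import org.apache.tika.exception.TikaException;
import org.springframework.boot.CommandLineRunner;
import org.springframework.stereotype.Component;

// Hypothetical runner; prints the detected type and extracted text per file.
@Component
public class ParseRunner implements CommandLineRunner {

    private final Tika tika;

    public ParseRunner(Tika tika) {
        this.tika = tika;
    }

    // Formats the file name, detected media type, and extracted text of one file.
    String describe(Path path) throws IOException, TikaException {
        return path.getFileName() + " [" + tika.detect(path) + "]\n"
                + tika.parseToString(path);
    }

    @Override
    public void run(String... args) throws Exception {
        // Each argument is treated as a path to a document to parse.
        for (String arg : args) {
            System.out.println(describe(Path.of(arg)));
        }
    }
}
```

Spring Boot hands the raw command-line arguments to the runner, so any `--option` style arguments would need to be filtered out before being treated as paths.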
