
Integrating Apache Tika with Spring Boot for Document Parsing

This guide shows how to add Apache Tika to a Spring Boot project: import the tika-bom, declare the core and parser dependencies, provide a custom tika-config.xml, create a @Configuration class that builds a Tika bean, and inject that bean to detect, parse, or translate documents.

Java Tech Enthusiast

Apache Tika is an open-source toolkit that detects file types and extracts metadata and text from over a thousand file formats. The steps below wire it into a Spring Boot application for document parsing.

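Dependencies

Import the tika-bom in dependencyManagement so that all Tika modules share one version, then declare tika-core and tika-parsers-standard-package without explicit versions.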

<dependencyManagement>
  <dependencies>
    <dependency>
      <groupId>org.apache.tika</groupId>
      <artifactId>tika-bom</artifactId>
      <version>2.8.0</version>
      <type>pom</type>
      <scope>import</scope>
    </dependency>
  </dependencies>
</dependencyManagement>

<dependencies>
  <!-- Versions are managed by the imported tika-bom -->
  <dependency>
    <groupId>org.apache.tika</groupId>
    <artifactId>tika-core</artifactId>
  </dependency>
  <dependency>
    <groupId>org.apache.tika</groupId>
    <artifactId>tika-parsers-standard-package</artifactId>
  </dependency>
</dependencies>

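Configuration file (tika-config.xml)

Place this file in src/main/resources so that it can be loaded from the classpath as tika-config.xml. Each markLimit parameter sets how many bytes that encoding detector may read from the start of the stream.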

<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <encodingDetectors>
    <encodingDetector class="org.apache.tika.parser.html.HtmlEncodingDetector">
      <params>
        <param name="markLimit" type="int">64000</param>
      </params>
    </encodingDetector>
    <encodingDetector class="org.apache.tika.parser.txt.UniversalEncodingDetector">
      <params>
        <param name="markLimit" type="int">64001</param>
      </params>
    </encodingDetector>
    <encodingDetector class="org.apache.tika.parser.txt.Icu4jEncodingDetector">
      <params>
        <param name="markLimit" type="int">64002</param>
      </params>
    </encodingDetector>
  </encodingDetectors>
</properties>

Configuration class (MyTikaConfig)

import java.io.IOException;
import java.io.InputStream;
import org.apache.tika.Tika;
import org.apache.tika.config.TikaConfig;
import org.apache.tika.detect.Detector;
import org.apache.tika.exception.TikaException;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.Parser;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.core.io.Resource;
import org.springframework.core.io.ResourceLoader;
import org.xml.sax.SAXException;

/**
 * Exposes a Tika facade built from the custom tika-config.xml on the classpath.
 */
@Configuration
public class MyTikaConfig {

    private final ResourceLoader resourceLoader;

    // Constructor injection; Spring supplies the ResourceLoader automatically.
    public MyTikaConfig(ResourceLoader resourceLoader) {
        this.resourceLoader = resourceLoader;
    }

    @Bean
    public Tika tika() throws TikaException, IOException, SAXException {
        Resource resource = resourceLoader.getResource("classpath:tika-config.xml");

        // try-with-resources closes the config stream once TikaConfig has read it.
        try (InputStream inputStream = resource.getInputStream()) {
            TikaConfig config = new TikaConfig(inputStream);
            // Detector and parser both come from the custom configuration.
            Detector detector = config.getDetector();
            Parser autoDetectParser = new AutoDetectParser(config);
            return new Tika(detector, autoDetectParser);
        }
    }
}

After the bean is configured, inject Tika into your services and use its detect, parse, parseToString, or translate methods to process documents.
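
As a minimal sketch of that injection step, the service below receives the Tika bean through its constructor and wraps two calls on the Tika facade: detect(InputStream, String) and parseToString(InputStream). The class name DocumentService and the method names detectType and extractText are hypothetical and chosen only for illustration; they do not come from the original article.

```java
import java.io.IOException;
import java.io.InputStream;
import org.apache.tika.Tika;
import org.apache.tika.exception.TikaException;
import org.springframework.stereotype.Service;

/**
 * Hypothetical service for illustration; the class and method names are assumptions.
 */
@Service
public class DocumentService {

    private final Tika tika;

    // Spring injects the Tika bean defined in MyTikaConfig.
    public DocumentService(Tika tika) {
        this.tika = tika;
    }

    /**
     * Detects the media type (for example "application/pdf") from the
     * leading bytes of the stream, using the file name as an extra hint.
     * The caller remains responsible for closing the stream.
     */
    public String detectType(InputStream in, String fileName) throws IOException {
        return tika.detect(in, fileName);
    }

    /**
     * Extracts the document's text as a single string.
     * Tika closes the stream when parseToString returns.
     */
    public String extractText(InputStream in) throws IOException, TikaException {
        return tika.parseToString(in);
    }
}
```

Note that parseToString truncates its result at the facade's maximum string length (100,000 characters by default); tika.setMaxStringLength can raise that limit if larger documents are expected.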
