Integrating Apache Tika with Spring Boot for Document Parsing
This article demonstrates how to add Apache Tika dependencies to a Spring Boot project, configure tika-config.xml, create a Java configuration class, and use the injected Tika bean to detect, translate, and parse various document formats such as PDF, PPT, and XLS.
Apache Tika is an open-source Apache project that can detect, parse, and extract content from more than a thousand file types (e.g., PPT, XLS, PDF). It can be used as a standalone command-line/GUI application (tika-app), as a server (tika-server), or embedded directly in a Java application.
The required Maven dependencies are added to the Spring Boot project by importing tika-bom in <dependencyManagement>, which lets the individual Tika artifacts be declared without explicit versions:
<dependencyManagement>
    <dependencies>
        <dependency>
            <groupId>org.apache.tika</groupId>
            <artifactId>tika-bom</artifactId>
            <version>2.8.0</version>
            <type>pom</type>
            <scope>import</scope>
        </dependency>
    </dependencies>
</dependencyManagement>
<!-- Versions of the artifacts below are managed by tika-bom -->
<dependencies>
    <dependency>
        <groupId>org.apache.tika</groupId>
        <artifactId>tika-core</artifactId>
    </dependency>
    <dependency>
        <groupId>org.apache.tika</groupId>
        <artifactId>tika-parsers-standard-package</artifactId>
    </dependency>
</dependencies>
A tika-config.xml file is placed in the resources directory to customize the encoding detectors. The example configuration defines three detectors (HtmlEncodingDetector, UniversalEncodingDetector, Icu4jEncodingDetector), each with its own markLimit parameter:
<?xml version="1.0" encoding="UTF-8"?>
<properties>
    <encodingDetectors>
        <encodingDetector class="org.apache.tika.parser.html.HtmlEncodingDetector">
            <params>
                <param name="markLimit" type="int">64000</param>
            </params>
        </encodingDetector>
        <encodingDetector class="org.apache.tika.parser.txt.UniversalEncodingDetector">
            <params>
                <param name="markLimit" type="int">64001</param>
            </params>
        </encodingDetector>
        <encodingDetector class="org.apache.tika.parser.txt.Icu4jEncodingDetector">
            <params>
                <param name="markLimit" type="int">64002</param>
            </params>
        </encodingDetector>
    </encodingDetectors>
</properties>
The Java configuration class MyTikaConfig loads this XML, creates a TikaConfig, and registers a Tika bean that combines the configured detector with an AutoDetectParser:
import java.io.IOException;
import java.io.InputStream;
import org.apache.tika.Tika;
import org.apache.tika.config.TikaConfig;
import org.apache.tika.detect.Detector;
import org.apache.tika.exception.TikaException;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.Parser;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.core.io.Resource;
import org.springframework.core.io.ResourceLoader;
import org.xml.sax.SAXException;

/**
 * Tika configuration class: builds the Tika facade from classpath:tika-config.xml.
 */
@Configuration
public class MyTikaConfig {

    @Autowired
    private ResourceLoader resourceLoader;

    @Bean
    public Tika tika() throws TikaException, IOException, SAXException {
        Resource resource = resourceLoader.getResource("classpath:tika-config.xml");
        // try-with-resources closes the stream once TikaConfig has read the XML
        try (InputStream inputStream = resource.getInputStream()) {
            TikaConfig config = new TikaConfig(inputStream);
            Detector detector = config.getDetector();
            Parser autoDetectParser = new AutoDetectParser(config);
            return new Tika(detector, autoDetectParser);
        }
    }
}
With the Tika bean injected, the application can call its detect, translate, and parse methods to process documents of various formats directly within the Spring Boot service.
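As a rough illustration of that usage, the sketch below injects the Tika bean into a service and calls detect and parseToString on an input stream. The class name DocumentTextService and its method names are hypothetical and do not appear in the original project; the example assumes the Tika bean defined above is available in the application context.

```java
import java.io.IOException;
import java.io.InputStream;
import org.apache.tika.Tika;
import org.apache.tika.exception.TikaException;
import org.springframework.stereotype.Service;

// Hypothetical service for illustration; only the Tika calls come from the library.
@Service
public class DocumentTextService {

    private final Tika tika;

    // Spring supplies the Tika bean through constructor injection.
    public DocumentTextService(Tika tika) {
        this.tika = tika;
    }

    /** Returns the detected media type, using both the content and the file name as hints. */
    public String detectType(InputStream content, String fileName) throws IOException {
        return tika.detect(content, fileName);
    }

    /** Extracts the plain text of the document, whatever its format. */
    public String extractText(InputStream content) throws IOException, TikaException {
        return tika.parseToString(content);
    }
}
```

One design point worth knowing: parseToString truncates its result at the facade's maximum string length (100,000 characters by default), so long documents may need tika.setMaxStringLength(...) adjusted, or the streaming parse method instead.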
An illustration (not reproduced here) shows how the bean is used in the project after configuration.
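A closing note on the markLimit values in tika-config.xml: the parameter caps how many bytes an encoding detector may read ahead before it rewinds the stream, so that the parser still sees the document from its first byte. The sketch below illustrates that mark/reset mechanism using only java.io. MarkLimitDemo and peek are hypothetical names, and this is not Tika's own implementation.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class MarkLimitDemo {

    /** Reads at most markLimit bytes for inspection, then rewinds the stream. */
    static byte[] peek(InputStream in, int markLimit) throws IOException {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        in.mark(markLimit); // remember the current position
        try {
            byte[] buf = new byte[markLimit];
            int total = 0;
            int n;
            while (total < markLimit
                    && (n = in.read(buf, total, markLimit - total)) != -1) {
                total += n;
            }
            return Arrays.copyOf(buf, total);
        } finally {
            in.reset(); // rewind so the next reader starts from the same position
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "<meta charset=\"UTF-8\">hello".getBytes(StandardCharsets.UTF_8);
        InputStream in = new BufferedInputStream(new ByteArrayInputStream(data));

        byte[] head = peek(in, 8);
        System.out.println(new String(head, StandardCharsets.US_ASCII)); // prints "<meta ch"
        System.out.println(in.read() == '<'); // prints "true": the stream was rewound
    }
}
```

A larger markLimit lets a detector inspect more of the document before deciding on a charset, at the cost of buffering more bytes per stream.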
Architect's Guide
Dedicated to sharing programmer-architect skills—Java backend, system, microservice, and distributed architectures—to help you become a senior architect.