Integrating Apache Tika into a Spring Boot Application for Document Parsing
This guide shows how to integrate Apache Tika into a Spring Boot application, covering Maven dependencies, XML configuration, a Spring @Configuration class, and usage of Tika’s detection and parsing APIs for processing various document formats.
Apache Tika is an open‑source document parsing library that can extract content from over a thousand file formats, and it can be used via a GUI, a server, or embedded in applications.
This article demonstrates how to integrate Tika into a Spring Boot project, starting with the required Maven dependencies.
First, add the Tika BOM and the core and parser packages in the dependencyManagement section and as regular dependencies:
<dependencyManagement>
<dependencies>
<dependency>
<groupId>org.apache.tika</groupId>
<artifactId>tika-bom</artifactId>
<version>2.8.0</version>
<type>pom</type>
<scope>import</scope>
</dependency>
</dependencies>
</dependencyManagement>
<dependency>
<groupId>org.apache.tika</groupId>
<artifactId>tika-core</artifactId>
</dependency>
<dependency>
<groupId>org.apache.tika</groupId>
<artifactId>tika-parsers-standard-package</artifactId>
</dependency>Next, place a tika-config.xml file under src/main/resources with custom encoding detectors configuration:
<?xml version="1.0" encoding="UTF-8"?>
<properties>
<encodingDetectors>
<encodingDetector class="org.apache.tika.parser.html.HtmlEncodingDetector">
<params>
<param name="markLimit" type="int">64000</param>
</params>
</encodingDetector>
<encodingDetector class="org.apache.tika.parser.txt.UniversalEncodingDetector">
<params>
<param name="markLimit" type="int">64001</param>
</params>
</encodingDetector>
<encodingDetector class="org.apache.tika.parser.txt.Icu4jEncodingDetector">
<params>
<param name="markLimit" type="int">64002</param>
</params>
</encodingDetector>
</encodingDetectors>
</properties>Then create a Spring configuration class MyTikaConfig that loads the XML, builds a TikaConfig , obtains a Detector and an AutoDetectParser , and registers a Tika bean for injection:
import java.io.IOException;
import java.io.InputStream;
import org.apache.tika.Tika;
import org.apache.tika.config.TikaConfig;
import org.apache.tika.detect.Detector;
import org.apache.tika.exception.TikaException;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.Parser;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.core.io.Resource;
import org.springframework.core.io.ResourceLoader;
import org.xml.sax.SAXException;
/**
* Tika configuration class
*/
@Configuration
public class MyTikaConfig {
@Autowired
private ResourceLoader resourceLoader;
@Bean
public Tika tika() throws TikaException, IOException, SAXException {
Resource resource = resourceLoader.getResource("classpath:tika-config.xml");
InputStream inputStream = resource.getInputStream();
TikaConfig config = new TikaConfig(inputStream);
Detector detector = config.getDetector();
Parser autoDetectParser = new AutoDetectParser(config);
return new Tika(detector, autoDetectParser);
}
}With the Tika bean injected, the application can call Tika’s detect , translate , and parse methods to process documents.
An example usage screenshot is shown below:
Finally, the author invites readers to like, share, and follow the article, and promotes a paid knowledge community for deeper learning.
Code Ape Tech Column
Former Ant Group P8 engineer, pure technologist, sharing full‑stack Java, job interview and career advice through a column. Site: java-family.cn
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.