Parse 1000+ Document Formats in Spring Boot with Apache Tika in Just 20 Lines
This article shows how to integrate Apache Tika into a Spring Boot application, enabling automatic detection and extraction of text and metadata from over a thousand file formats with only a few configuration steps and concise Java code.
Why a Unified Document Parser?
Projects often need to handle PDFs, Word, Excel, and many other formats, each requiring a different library (iText, PDFBox, POI, etc.). Maintaining separate code paths leads to duplicated logic, higher maintenance cost, and bugs when users upload unexpected file types.
Introducing Apache Tika
Apache Tika is an Apache‑licensed open‑source library that can parse more than 1,000 file formats. Its core capabilities are:
Automatic file‑type detection – reads the file content instead of relying on the extension.
Text extraction – returns the plain text of PDFs, Word documents, Excel sheets, and many others.
Metadata extraction – provides author, creation date, modification date, and other properties.
The library exposes a simple API that hides the complexity of dealing with many parsers.
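To give a feel for the facade before wiring it into Spring, the sketch below detects a file's type and extracts its text in two calls. It assumes the Tika dependencies from the next section are on the classpath; report.pdf is a hypothetical input file.

```java
import java.io.File;
import org.apache.tika.Tika;

public class TikaQuickStart {
    public static void main(String[] args) throws Exception {
        Tika tika = new Tika();                 // default detector and auto-detect parser
        File file = new File("report.pdf");     // hypothetical input file
        System.out.println(tika.detect(file));        // MIME type sniffed from content, e.g. application/pdf
        System.out.println(tika.parseToString(file)); // extracted plain text
    }
}
```

Note that detect() reads the file's leading bytes rather than trusting the extension, so a renamed .docx is still recognized as a Word document.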
Spring Boot Integration (Three Simple Steps)
1. Add Maven Dependencies
<dependency>
<groupId>org.apache.tika</groupId>
<artifactId>tika-core</artifactId>
<version>2.9.2</version>
</dependency>
<dependency>
<groupId>org.apache.tika</groupId>
<artifactId>tika-parsers-standard-package</artifactId>
<version>2.9.2</version>
</dependency>
The first artifact provides the core detection engine; the second bundles parsers for common formats such as PDF, DOCX, and XLSX.
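For Gradle builds, the same coordinates translate to:

```groovy
implementation 'org.apache.tika:tika-core:2.9.2'
implementation 'org.apache.tika:tika-parsers-standard-package:2.9.2'
```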
2. Create tika-config.xml
<?xml version="1.0" encoding="UTF-8"?>
<properties>
<encodingDetectors>
<encodingDetector class="org.apache.tika.parser.html.HtmlEncodingDetector">
<params>
<param name="markLimit" type="int">64000</param>
</params>
</encodingDetector>
<encodingDetector class="org.apache.tika.parser.txt.UniversalEncodingDetector">
<params>
<param name="markLimit" type="int">64001</param>
</params>
</encodingDetector>
</encodingDetectors>
</properties>
This configuration registers two encoding detectors: one for HTML files and a universal detector for most other text formats. The markLimit values bound how many bytes each detector may read ahead while sniffing the character encoding.
3. Register a Tika Bean
@Configuration
public class TikaConfig {

    @Autowired
    private ResourceLoader resourceLoader;

    @Bean
    public Tika tika() throws Exception {
        Resource resource = resourceLoader.getResource("classpath:tika-config.xml");
        // try-with-resources so the config stream is closed after parsing
        try (InputStream inputStream = resource.getInputStream()) {
            // fully qualified to avoid a clash with this @Configuration class's name
            org.apache.tika.config.TikaConfig config =
                    new org.apache.tika.config.TikaConfig(inputStream);
            Detector detector = config.getDetector();
            Parser parser = new AutoDetectParser(config);
            return new Tika(detector, parser);
        }
    }
}
The bean reads the XML, builds a TikaConfig, obtains a detector and an auto-detect parser, and returns a ready-to-use Tika instance that can be injected elsewhere. The fully qualified org.apache.tika.config.TikaConfig avoids colliding with the Spring @Configuration class of the same name.
Practical Demonstration
Service Layer
@Service
public class FileParserService {

    @Autowired
    private Tika tika;

    public String parseFile(String filePath) {
        try {
            File file = new File(filePath);
            return tika.parseToString(file);
        } catch (IOException | TikaException e) {
            return "Parse failed: " + e.getMessage();
        }
    }
}
The parseFile method simply delegates to tika.parseToString, which automatically detects the file type and extracts its textual content. Note that parseToString declares TikaException in addition to IOException, so catching only IOException would not compile.
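One caveat worth knowing: the Tika facade truncates parseToString output at 100,000 characters by default. If larger documents must be extracted in full, the limit can be raised on the injected instance, for example in a @PostConstruct hook (a sketch, not part of the original service):

```java
@PostConstruct
void raiseExtractionLimit() {
    // The facade's default maxStringLength is 100,000 characters; raise it for big documents.
    tika.setMaxStringLength(5_000_000);
}
```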
REST Controller
@RestController
@RequestMapping("/api/parse")
public class FileParseController {

    @Autowired
    private FileParserService fileParserService;

    @PostMapping("/upload")
    public String uploadFile(@RequestParam("file") MultipartFile file) {
        try {
            // Demo only: reusing the client-supplied filename can collide or escape /tmp
            String tempPath = "/tmp/" + file.getOriginalFilename();
            file.transferTo(new File(tempPath));
            return fileParserService.parseFile(tempPath);
        } catch (Exception e) {
            return "Processing failed: " + e.getMessage();
        }
    }
}
The endpoint accepts any uploaded file, stores it temporarily, and returns the extracted text without any format-specific branching logic.
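Writing uploads to /tmp under the client-supplied name works for a demo, but the original filename can collide with other uploads or contain path separators. A safer sketch using only the JDK (class and method names here are illustrative):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class UploadPaths {
    // Create a unique temp file, keeping only a sanitized extension from the client name.
    public static Path safeTempFile(String originalFilename) throws IOException {
        String ext = "";
        int dot = originalFilename.lastIndexOf('.');
        if (dot >= 0 && dot < originalFilename.length() - 1) {
            ext = "." + originalFilename.substring(dot + 1).replaceAll("[^A-Za-z0-9]", "");
        }
        return Files.createTempFile("upload-", ext);   // e.g. /tmp/upload-1234.pdf
    }

    public static void main(String[] args) throws IOException {
        Path p = safeTempFile("../../etc/passwd.pdf"); // traversal attempt is neutralized
        System.out.println(Files.exists(p));           // true
        Files.deleteIfExists(p);
    }
}
```

The controller would call safeTempFile(file.getOriginalFilename()) instead of concatenating the raw name, and delete the temp file after parsing.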
Metadata Extraction Example
public Map<String, String> parseMetadata(String filePath) {
    try {
        File file = new File(filePath);
        Metadata metadata = new Metadata();
        Parser parser = new AutoDetectParser();
        ContentHandler handler = new BodyContentHandler();
        // try-with-resources so the file stream is closed even if parsing fails
        try (InputStream stream = new FileInputStream(file)) {
            parser.parse(stream, handler, metadata, new ParseContext());
        }
        Map<String, String> result = new HashMap<>();
        for (String name : metadata.names()) {
            result.put(name, metadata.get(name));
        }
        return result;
    } catch (Exception e) {
        return Collections.emptyMap();
    }
}
This method returns a map of all metadata fields the parser discovered, such as the author, creation date, and last-modified date.
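A related caveat: BodyContentHandler also caps buffered text at 100,000 characters and aborts parsing beyond that. When handling large files, pass -1 to lift the cap, or use a handler that discards the body entirely if only metadata matters:

```java
// Unlimited buffering (default BodyContentHandler caps at 100,000 characters):
ContentHandler unlimited = new BodyContentHandler(-1);
// When only metadata matters, ignore the body altogether:
ContentHandler ignoreBody = new org.xml.sax.helpers.DefaultHandler();
```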
Typical Use Cases
Document Management Systems – extract text from every uploaded document to build a searchable index.
Data Analysis Pipelines – convert heterogeneous source files into a uniform text stream for downstream processing.
Sensitive Information Detection – feed extracted text into regex‑based scanners for IDs, credit‑card numbers, etc.
Content Migration – pull both content and metadata from legacy documents before moving them to a new platform.
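For the sensitive-information use case above, the extracted text can be fed straight into a regex scan. The sketch below uses only the JDK; the pattern is deliberately crude (13 to 16 consecutive digits as a rough credit-card shape) and a real scanner would add checks such as the Luhn algorithm:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SensitiveScanner {
    // Illustrative pattern only: 13-16 consecutive digits, a rough credit-card shape.
    private static final Pattern CARD = Pattern.compile("\\b\\d{13,16}\\b");

    // Scan text (e.g. the output of tika.parseToString) for candidate matches.
    public static List<String> findCandidates(String text) {
        List<String> hits = new ArrayList<>();
        Matcher m = CARD.matcher(text);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        String extracted = "Invoice 42: card 4111111111111111, ref 12345.";
        System.out.println(findCandidates(extracted)); // [4111111111111111]
    }
}
```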
Conclusion
Apache Tika provides a single, consistent way to handle thousands of document formats. By adding two Maven dependencies, a tiny XML configuration, and a Spring bean, developers can replace a multitude of format‑specific libraries with a few lines of code. The resulting parseToString and metadata APIs keep the codebase clean, maintainable, and ready for a variety of real‑world scenarios.