Master Document Parsing in Spring Boot 3 with Apache Tika: Code Samples & Tips
This article introduces Apache Tika for document parsing, outlines its key advantages, and walks through step-by-step Spring Boot 3 examples covering facade parsing, text and PDF parsing, auto-detection, HTML conversion, custom configuration, and file-upload integration, each with a code snippet.
1. Introduction
Document parsing is a common requirement in enterprise software, especially when valuable information has to be extracted from many different file formats. As digital transformation accelerates, organizations rely on automation tools to handle large volumes of documents. Apache Tika is an open-source library for extracting text and metadata from many file types, and Spring AI integrates Tika as a document parser.
Using Tika simplifies document processing workflows because a single API covers type detection, text extraction, and metadata extraction for every supported format.
Advantages of Tika
Broad format support (over 1000 types, including DOCX, XLSX, PPTX, PDF, HTML, audio, video, images)
Easy integration via a simple Java API, suitable for any Java or Spring Boot application
Content and metadata extraction (title, author, creation date, etc.)
Text analysis helpers such as language detection, provided by optional modules (for example tika-langdetect) rather than by tika-core
Batch processing and automation for large‑scale document handling
Cross‑platform compatibility as a pure Java library
Active community support from the Apache Foundation
Support for encrypted documents when a password is supplied; extracted content should still be treated as untrusted input before it is rendered (for example, to avoid XSS)
Extensibility through plugins and a modular architecture
Lightweight core (tika-core); the full parser bundle adds format libraries such as PDFBox and Apache POI
2. Practical Cases
2.1 Using Tika Facade
Parse a Word document to plain text.
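<code>import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;
import org.apache.tika.Tika;

public static String parseToString() throws Exception {
    Tika tika = new Tika();
    // The facade detects the file type and returns the extracted plain text
    try (InputStream stream = new FileInputStream(new File("e:\\technology.docx"))) {
        return tika.parseToString(stream);
    }
}
</code>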
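One detail worth knowing about the facade: `parseToString` truncates its result at 100,000 characters unless told otherwise. The following is a minimal sketch of lifting or tightening that cap with `setMaxStringLength`. The class name `FacadeLimitSketch` and the in-memory sample are illustrative assumptions, and the example assumes both `tika-core` and the standard parser package (`tika-parsers-standard-package` in Tika 2.x) are on the classpath.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

import org.apache.tika.Tika;
import org.apache.tika.exception.TikaException;

// Hypothetical helper class, not part of the original article
public class FacadeLimitSketch {

    /** Extracts plain text, keeping at most maxChars characters (-1 means no limit). */
    public static String extract(InputStream stream, int maxChars)
            throws IOException, TikaException {
        Tika tika = new Tika();
        tika.setMaxStringLength(maxChars);
        // parseToString closes the stream when it is done
        return tika.parseToString(stream);
    }

    public static void main(String[] args) throws Exception {
        byte[] sample = "Hello Tika".getBytes(StandardCharsets.UTF_8);
        // In-memory input keeps the sketch independent of any file path
        System.out.println(extract(new ByteArrayInputStream(sample), -1));
    }
}
```

For very large documents, an unlimited result can exhaust memory, so a generous explicit limit is often a safer choice than -1.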
2.2 Parsing Text Files
Use TXTParser with a handler and metadata.
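<code>import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.txt.TXTParser;
import org.apache.tika.sax.BodyContentHandler;

TXTParser parser = new TXTParser();
BodyContentHandler handler = new BodyContentHandler(); // collects the body text
Metadata metadata = new Metadata();                    // filled in by the parser
ParseContext context = new ParseContext();
try (InputStream stream = new FileInputStream(new File("C:\\execute script.txt"))) {
    parser.parse(stream, handler, metadata, context);
}
System.out.println(handler.toString());
System.out.println(metadata.toString());
</code>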
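Printing `metadata.toString()` gives one compact string, which is hard to read once a parser reports many fields. As a rough sketch, the metadata can instead be walked name by name with `names()` and `getValues()`. The class and method names here (`MetadataDumpSketch`, `toSortedMap`) are hypothetical, and the sample values are set by hand purely for illustration.

```java
import java.util.Map;
import java.util.TreeMap;

import org.apache.tika.metadata.Metadata;

// Hypothetical helper class, not part of the original article
public class MetadataDumpSketch {

    /** Copies all metadata into a name-sorted map, joining multi-valued fields with ", ". */
    public static Map<String, String> toSortedMap(Metadata metadata) {
        Map<String, String> result = new TreeMap<>();
        for (String name : metadata.names()) {
            result.put(name, String.join(", ", metadata.getValues(name)));
        }
        return result;
    }

    public static void main(String[] args) {
        // Hand-made sample; in practice the Metadata object comes back from parser.parse(...)
        Metadata metadata = new Metadata();
        metadata.set("Content-Type", "text/plain; charset=UTF-8");
        metadata.add("dc:creator", "Alice");
        metadata.add("dc:creator", "Bob");
        toSortedMap(metadata).forEach((name, value) -> System.out.println(name + " = " + value));
    }
}
```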
2.3 Parsing PDF Documents
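<code>import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.pdf.PDFParser;
import org.apache.tika.sax.BodyContentHandler;

PDFParser parser = new PDFParser();
BodyContentHandler handler = new BodyContentHandler(); // collects the page text
Metadata metadata = new Metadata();                    // receives title, author, page count, etc.
ParseContext context = new ParseContext();
try (InputStream stream = new FileInputStream(new File("D:\\setups\\ReferenceCard.pdf"))) {
    parser.parse(stream, handler, metadata, context);
}
System.out.println(handler.toString());
System.out.println(metadata.toString());
</code>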
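Password-protected PDFs need one extra step: the parser looks for a `PasswordProvider` in the `ParseContext`. The sketch below shows one way to register it. The class name `PdfPasswordSketch` and the idea of passing a single fixed password are assumptions for illustration; a real application would look the password up per document.

```java
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.PasswordProvider;

// Hypothetical helper class, not part of the original article
public class PdfPasswordSketch {

    /** Builds a ParseContext that answers every password request with the given value. */
    public static ParseContext contextWithPassword(String password) {
        ParseContext context = new ParseContext();
        // PasswordProvider has a single method, getPassword(Metadata), so a lambda suffices
        context.set(PasswordProvider.class, metadata -> password);
        return context;
    }
}
```

The returned context would then replace the empty one in the call, for example `parser.parse(stream, handler, metadata, PdfPasswordSketch.contextWithPassword("secret"))`.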
2.4 AutoDetectParser
Automatically detects the document type and delegates to the appropriate parser.
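The detection step can also be tried on its own through the `Tika` facade, which helps when debugging why a file ends up with a particular parser. This is a small sketch; the class name `DetectSketch` and the sample inputs are illustrative assumptions.

```java
import java.nio.charset.StandardCharsets;

import org.apache.tika.Tika;

// Hypothetical helper class, not part of the original article
public class DetectSketch {

    private static final Tika TIKA = new Tika();

    /** Detects the media type from the leading bytes of a document. */
    public static String detectByContent(byte[] prefix) {
        return TIKA.detect(prefix);
    }

    /** Detects the media type from the file name alone. */
    public static String detectByName(String fileName) {
        return TIKA.detect(fileName);
    }

    public static void main(String[] args) {
        byte[] pdfHeader = "%PDF-1.7\n".getBytes(StandardCharsets.US_ASCII);
        System.out.println(detectByContent(pdfHeader));   // application/pdf
        System.out.println(detectByName("report.pdf"));   // application/pdf
    }
}
```

AutoDetectParser runs the same kind of detection internally before delegating, as in the method that follows.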
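<code>public static String parseAutoDetect() throws Exception {
    // AutoDetectParser sniffs the content type, then delegates to the matching parser
    AutoDetectParser parser = new AutoDetectParser();
    // The no-arg handler throws once output exceeds 100,000 characters;
    // new BodyContentHandler(-1) lifts that cap
    BodyContentHandler handler = new BodyContentHandler();
    Metadata metadata = new Metadata();
    try (InputStream stream = new FileInputStream(new File("e:\\technology.docx"))) {
        parser.parse(stream, handler, metadata);
        return handler.toString();
    }
}
</code>
2.5 Converting to HTML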
Use ToXMLContentHandler to obtain XHTML content.
<code>public static String parserToXHTML() throws Exception {
    // ToXMLContentHandler serializes the parser's SAX events as XHTML markup
    ToXMLContentHandler handler = new ToXMLContentHandler();
    AutoDetectParser parser = new AutoDetectParser();
    Metadata metadata = new Metadata();
    try (InputStream stream = new FileInputStream(new File("e:\\technology.docx"))) {
        parser.parse(stream, handler, metadata);
        return handler.toString();
    }
}
</code>
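When both the plain text and the XHTML are needed, parsing the file twice is avoidable. One possible approach, sketched here, is to fan the SAX events out to two handlers with `TeeContentHandler`. The class name `TextAndXhtmlSketch` is hypothetical, and the example assumes the standard parser package is on the classpath.

```java
import java.io.InputStream;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;
import org.apache.tika.sax.TeeContentHandler;
import org.apache.tika.sax.ToXMLContentHandler;

// Hypothetical helper class, not part of the original article
public class TextAndXhtmlSketch {

    /** Returns {plainText, xhtml}, both produced by a single parse of the stream. */
    public static String[] parseBoth(InputStream stream) throws Exception {
        BodyContentHandler text = new BodyContentHandler(-1); // -1 disables the write limit
        ToXMLContentHandler xhtml = new ToXMLContentHandler();
        // TeeContentHandler forwards every SAX event to each wrapped handler
        TeeContentHandler both = new TeeContentHandler(text, xhtml);
        new AutoDetectParser().parse(stream, both, new Metadata(), new ParseContext());
        return new String[] { text.toString(), xhtml.toString() };
    }
}
```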
2.6 Customizing Tika
Control which parsers are used and their priority via tika-config.xml.
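A configuration file only takes effect once it is handed to Tika. As a minimal sketch, a `TikaConfig` can be built from the XML and passed to `AutoDetectParser`. The class name `CustomConfigSketch` is hypothetical, and where the XML comes from (classpath, file system, or elsewhere) is left to the caller.

```java
import java.io.InputStream;

import org.apache.tika.config.TikaConfig;
import org.apache.tika.parser.AutoDetectParser;

// Hypothetical helper class, not part of the original article
public class CustomConfigSketch {

    /** Builds an AutoDetectParser that honors the parser rules in the given config XML. */
    public static AutoDetectParser parserFrom(InputStream configXml) throws Exception {
        TikaConfig config = new TikaConfig(configXml);
        return new AutoDetectParser(config);
    }
}
```

Assuming the file sits at the root of the classpath, the stream could come from `CustomConfigSketch.class.getResourceAsStream("/tika-config.xml")`. The configuration file itself follows.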
<code><?xml version="1.0" encoding="UTF-8"?>
<properties>
  <parsers>
    <!-- Exclude PDF parsing -->
    <parser class="org.apache.tika.parser.DefaultParser">
      <mime-exclude>application/pdf</mime-exclude>
    </parser>
  </parsers>
</properties>
</code>
2.7 Integration with Spring Boot
Define a bean for AutoDetectParser with a fallback parser, then expose a REST controller to upload files and return extracted text.
<code>// Declare this bean method inside a @Configuration class
@Bean
Parser parser() {
    AutoDetectParser parser = new AutoDetectParser();
    // Used when no registered parser matches the detected type
    parser.setFallback(new TXTParser());
    return parser;
}

@RestController
@RequestMapping("/tika")
public class TikaController {
    private final Parser parser;

    public TikaController(Parser parser) { this.parser = parser; }

    @PostMapping("/upload")
    public String upload(@RequestParam("file") MultipartFile file) throws Exception {
        BodyContentHandler handler = new BodyContentHandler();
        // try-with-resources closes the upload stream even if parsing fails
        try (InputStream stream = file.getInputStream()) {
            parser.parse(stream, handler, new Metadata(), new ParseContext());
        }
        return handler.toString();
    }
}
</code>
Invoke the endpoint with Postman by sending a POST request to /tika/upload with the document in a multipart form field named file; the response contains the parsed document content.
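For uploads, the original file name is a useful detection hint that the controller can pass along. The sketch below keeps the extraction logic in a plain class so it can be exercised without starting Spring. The class name `UploadExtractionSketch` is hypothetical, and `TikaCoreProperties.RESOURCE_NAME_KEY` assumes Tika 2.x or later.

```java
import java.io.InputStream;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.metadata.TikaCoreProperties;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;

// Hypothetical helper class, not part of the original article
public class UploadExtractionSketch {

    private static final Parser PARSER = new AutoDetectParser();

    /** Extracts text from an upload, using the file name as an extra type-detection hint. */
    public static String extract(InputStream stream, String originalFilename) throws Exception {
        Metadata metadata = new Metadata();
        if (originalFilename != null) {
            // The detector combines this name with the content's magic bytes
            metadata.set(TikaCoreProperties.RESOURCE_NAME_KEY, originalFilename);
        }
        BodyContentHandler handler = new BodyContentHandler(-1); // -1 disables the write limit
        PARSER.parse(stream, handler, metadata, new ParseContext());
        return handler.toString();
    }
}
```

Inside the controller, the call would look like `UploadExtractionSketch.extract(stream, file.getOriginalFilename())`.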