Master Document Parsing in Spring Boot 3 with Apache Tika: Code Samples & Tips

This article introduces Apache Tika for document parsing, outlines its key advantages, and walks through Spring Boot 3 examples: the Tika facade, text and PDF parsing, auto-detection, HTML conversion, custom configuration, and file-upload integration, each with a code snippet.

Spring Full-Stack Practical Cases

Oct 31, 2024

1. Introduction

Document parsing, the extraction of text and metadata from files in many formats, is a common requirement in enterprise software. As digital transformation accelerates, organizations rely on automation to process large volumes of documents. Apache Tika is an open-source Java library that extracts text and metadata from many file types, and Spring AI integrates Tika as a document parser.

Because a single API covers all of these formats, an application does not need a separate library, and separate extraction code, for each file type.

Advantages of Tika

Broad format support (over 1000 types, including DOCX, XLSX, PPTX, PDF, HTML, audio, video, images)

Easy integration via a simple Java API, suitable for any Java or Spring Boot application

Content and metadata extraction (title, author, creation date, etc.)

Text analysis helpers such as language detection, provided by optional modules

Batch processing and automation for large‑scale document handling

Cross‑platform compatibility as a pure Java library

Active community support from the Apache Foundation

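Support for password-protected documents when a password is supplied through a PasswordProvider; Tika does not sanitize what it extracts, so treat the output as untrusted input (for example, escape it before rendering it in a web page)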

Extensibility through plugins and a modular architecture

Small core module (tika-core); the format parsers ship separately (tika-parsers-standard-package), so an application can choose how much it pulls in

2. Practical Cases

2.1 Using Tika Facade

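Parse a Word document to plain text. Note that parseToString keeps at most 100,000 characters by default; call tika.setMaxStringLength(...) to change the limit.

import org.apache.tika.Tika;
import java.io.FileInputStream;
import java.io.InputStream;

// The facade detects the file type and returns the extracted text.
public static String parseToString() throws Exception {
    Tika tika = new Tika();
    try (InputStream stream = new FileInputStream("e:\\technology.docx")) {
        return tika.parseToString(stream);
    }
}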

2.2 Parsing Text Files

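Use TXTParser directly when the input is known to be plain text. The parser writes the text to a content handler and fills in a Metadata object.

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.txt.TXTParser;
import org.apache.tika.sax.BodyContentHandler;
import java.io.FileInputStream;
import java.io.InputStream;

public static void parseText() throws Exception {
    TXTParser parser = new TXTParser();
    BodyContentHandler handler = new BodyContentHandler(); // collects the body text
    Metadata metadata = new Metadata();                     // filled in by the parser
    ParseContext context = new ParseContext();
    try (InputStream stream = new FileInputStream("C:\\execute script.txt")) {
        parser.parse(stream, handler, metadata, context);
    }
    System.out.println(handler.toString());
    System.out.println(metadata.toString());
}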

2.3 Parsing PDF Documents

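Use PDFParser (package org.apache.tika.parser.pdf) when the input is known to be a PDF. The handler, metadata, and context are used exactly as with any other parser.

PDFParser parser = new PDFParser();
// The no-argument BodyContentHandler stops after 100,000 characters;
// pass -1 to the constructor to remove the limit for long PDFs.
BodyContentHandler handler = new BodyContentHandler();
Metadata metadata = new Metadata();
ParseContext context = new ParseContext();
try (InputStream stream = new FileInputStream("D:\\setups\\ReferenceCard.pdf")) {
    parser.parse(stream, handler, metadata, context);
}
System.out.println(handler.toString());  // extracted text
System.out.println(metadata.toString()); // document properties found by the parser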

2.4 AutoDetectParser

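AutoDetectParser detects the document type and delegates to the matching parser, so the same code handles DOCX, PDF, and other supported formats.

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.sax.BodyContentHandler;
import java.io.FileInputStream;
import java.io.InputStream;

public static String parseAutoDetect() throws Exception {
    AutoDetectParser parser = new AutoDetectParser();
    BodyContentHandler handler = new BodyContentHandler();
    Metadata metadata = new Metadata(); // also receives the detected Content-Type
    try (InputStream stream = new FileInputStream("e:\\technology.docx")) {
        parser.parse(stream, handler, metadata);
        return handler.toString();
    }
}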
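
To give a feel for what "detecting the document type" involves, the sketch below sniffs a file's leading bytes using only the JDK. It is not Tika's implementation: Tika's detectors combine byte signatures with file-name hints and container inspection. The class name MagicSniffer and its two-entry signature table are assumptions made for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class MagicSniffer {

    // Well-known signatures: PDF files start with "%PDF-"; ZIP containers
    // (which include DOCX, XLSX, and PPTX) start with 'P' 'K' 0x03 0x04.
    private static final byte[] PDF = "%PDF-".getBytes(StandardCharsets.US_ASCII);
    private static final byte[] ZIP = {'P', 'K', 3, 4};

    // Returns a media type guessed from the first bytes of a file.
    public static String sniff(byte[] head) {
        if (startsWith(head, PDF)) return "application/pdf";
        if (startsWith(head, ZIP)) return "application/zip";
        return "application/octet-stream"; // unknown: generic binary
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) return false;
        return Arrays.equals(Arrays.copyOf(data, prefix.length), prefix);
    }

    public static void main(String[] args) {
        byte[] pdfHead = "%PDF-1.7\n".getBytes(StandardCharsets.US_ASCII);
        System.out.println(sniff(pdfHead));                              // application/pdf
        System.out.println(sniff(new byte[] {'P', 'K', 3, 4, 20, 0}));   // application/zip
        System.out.println(sniff("hello".getBytes(StandardCharsets.US_ASCII))); // application/octet-stream
    }
}
```

A DOCX file only shows up as a generic ZIP here. Telling it apart from an XLSX or a plain archive requires looking inside the container, which is one reason to leave detection to AutoDetectParser rather than hand-rolling it.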

2.5 Converting to HTML

Pass a ToXMLContentHandler (package org.apache.tika.sax) instead of a BodyContentHandler to get the document structure as XHTML rather than plain text.

public static String parserToXHTML() throws Exception {
    // Serializes the parser's SAX events as XHTML markup.
    ToXMLContentHandler handler = new ToXMLContentHandler();
    AutoDetectParser parser = new AutoDetectParser();
    Metadata metadata = new Metadata();
    try (InputStream stream = new FileInputStream("e:\\technology.docx")) {
        parser.parse(stream, handler, metadata);
        return handler.toString(); // the XHTML document as a string
    }
}
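
Because the result is well-formed XHTML, it can be post-processed with the JDK's standard XML tooling. The following is a minimal sketch under that assumption; the class name XhtmlParagraphs and the hand-written sample markup are hypothetical stand-ins, not actual Tika output.

```java
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

import javax.xml.parsers.DocumentBuilderFactory;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

public class XhtmlParagraphs {

    // Returns the trimmed text of every non-empty <p> element in the markup.
    public static List<String> paragraphs(String xhtml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // The markup is derived from uploaded files, so refuse DOCTYPE declarations.
            factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            NodeList nodes = doc.getElementsByTagName("p");
            List<String> result = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                String text = nodes.item(i).getTextContent().trim();
                if (!text.isEmpty()) {
                    result.add(text);
                }
            }
            return result;
        } catch (Exception e) {
            throw new IllegalArgumentException("Input is not well-formed XHTML", e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical sample standing in for the string returned by parserToXHTML().
        String sample = """
                <html xmlns="http://www.w3.org/1999/xhtml">
                <head><title>Sample</title></head>
                <body><p>First paragraph.</p><p>Second paragraph.</p></body>
                </html>
                """;
        paragraphs(sample).forEach(System.out::println);
    }
}
```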

2.6 Customizing Tika

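Control which parsers are used, and in what order, through a tika-config.xml file. The configuration below keeps the default parser set but excludes the application/pdf media type, so PDF files are no longer handled by the default parsers. Load the file explicitly with TikaConfig, or point the tika.config system property at it.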

<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <parsers>
    <!-- Exclude PDF parsing -->
    <parser class="org.apache.tika.parser.DefaultParser">
      <mime-exclude>application/pdf</mime-exclude>
    </parser>
  </parsers>
</properties>
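
The configuration above only takes effect once it is loaded into the parser. The sketch below shows one way to do that, assuming tika-core is on the classpath. The class name CustomTikaConfigExample and the temporary file are illustrative choices; in an application the XML would normally be a real file or classpath resource.

```java
import org.apache.tika.config.TikaConfig;
import org.apache.tika.mime.MediaType;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;

import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Set;

public class CustomTikaConfigExample {

    // Same configuration as shown above, inlined so the sketch is self-contained.
    public static final String CONFIG_XML = """
            <?xml version="1.0" encoding="UTF-8"?>
            <properties>
              <parsers>
                <parser class="org.apache.tika.parser.DefaultParser">
                  <mime-exclude>application/pdf</mime-exclude>
                </parser>
              </parsers>
            </properties>
            """;

    // Builds an AutoDetectParser that follows the given tika-config.xml.
    public static AutoDetectParser buildParser(Path configFile) throws Exception {
        TikaConfig config = new TikaConfig(configFile);
        return new AutoDetectParser(config);
    }

    public static void main(String[] args) throws Exception {
        Path configFile = Files.createTempFile("tika-config", ".xml");
        Files.writeString(configFile, CONFIG_XML);

        AutoDetectParser parser = buildParser(configFile);
        Set<MediaType> supported = parser.getSupportedTypes(new ParseContext());
        // With the exclusion in place, application/pdf should not be listed.
        System.out.println("PDF handled: " + supported.contains(MediaType.application("pdf")));
    }
}
```

Checking getSupportedTypes is a quick way to confirm that an exclusion was picked up before wiring the parser into the application.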

2.7 Integration with Spring Boot

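Define an AutoDetectParser bean with a fallback parser, then expose a REST controller that accepts an uploaded file and returns the extracted text.

// TikaConfiguration.java
@Configuration
public class TikaConfiguration {
    @Bean
    Parser parser() {
        AutoDetectParser parser = new AutoDetectParser();
        // Used when no registered parser matches the detected type.
        parser.setFallback(new TXTParser());
        return parser;
    }
}

// TikaController.java
@RestController
@RequestMapping("/tika")
public class TikaController {
    private final Parser parser;

    public TikaController(Parser parser) {
        this.parser = parser;
    }

    @PostMapping("/upload")
    public String upload(@RequestParam("file") MultipartFile file) throws Exception {
        // The default handler stops after 100,000 characters; pass a larger
        // limit to the constructor if longer documents are expected.
        BodyContentHandler handler = new BodyContentHandler();
        // Close the upload stream once parsing has finished.
        try (InputStream stream = file.getInputStream()) {
            parser.parse(stream, handler, new Metadata(), new ParseContext());
        }
        return handler.toString();
    }
}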

Invoke the endpoint with Postman; the response contains the parsed document content.
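
As an alternative to Postman, the endpoint can be called from Java with the JDK's built-in HTTP client. The sketch below builds the multipart body by hand. The class name TikaUploadClient and the address http://localhost:8080 (Spring Boot's default port) are assumptions; the part name file matches the controller's MultipartFile parameter.

```java
import java.io.ByteArrayOutputStream;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class TikaUploadClient {

    // Builds a multipart/form-data body containing a single part named "file".
    public static byte[] multipartBody(String boundary, String fileName, byte[] content) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        String header = "--" + boundary + "\r\n"
                + "Content-Disposition: form-data; name=\"file\"; filename=\"" + fileName + "\"\r\n"
                + "Content-Type: application/octet-stream\r\n\r\n";
        out.writeBytes(header.getBytes(StandardCharsets.UTF_8));
        out.writeBytes(content);
        // The closing delimiter ends the body.
        out.writeBytes(("\r\n--" + boundary + "--\r\n").getBytes(StandardCharsets.UTF_8));
        return out.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        Path file = Path.of(args[0]); // path of the document to upload
        String boundary = "----TikaUpload" + System.nanoTime();
        byte[] body = multipartBody(boundary, file.getFileName().toString(),
                Files.readAllBytes(file));

        // Assumed address: a local Spring Boot application on its default port.
        HttpRequest request = HttpRequest.newBuilder(URI.create("http://localhost:8080/tika/upload"))
                .header("Content-Type", "multipart/form-data; boundary=" + boundary)
                .POST(HttpRequest.BodyPublishers.ofByteArray(body))
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body()); // the text extracted by the controller
    }
}
```

Reading the whole file into memory keeps the sketch short. For large documents, a streaming body publisher would be the better choice.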
