Parse 1000+ Document Formats in Spring Boot with Apache Tika in Just 20 Lines

This article shows how to integrate Apache Tika into a Spring Boot application, enabling automatic detection and extraction of text and metadata from over a thousand file formats with only a few configuration steps and concise Java code.

Why a Unified Document Parser?

Projects often need to handle PDFs, Word, Excel, and many other formats, each requiring a different library (iText, PDFBox, POI, etc.). Maintaining separate code paths leads to duplicated logic, higher maintenance cost, and bugs when users upload unexpected file types.

Introducing Apache Tika

Apache Tika is an Apache‑licensed open‑source library that can parse more than 1,000 file formats. Its core capabilities are:

Automatic file‑type detection – reads the file content instead of relying on the extension.

Text extraction – returns the plain text of PDFs, Word documents, Excel sheets, and many others.

Metadata extraction – provides author, creation date, modification date, and other properties.

The library exposes a simple API that hides the complexity of dealing with many parsers.
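As a quick illustration of that API, here is a minimal sketch (assuming the Tika jars are on the classpath; the file name is hypothetical):

```java
import org.apache.tika.Tika;

public class TikaQuickStart {
    public static void main(String[] args) throws Exception {
        Tika tika = new Tika();

        // Detection can work from a file name alone...
        System.out.println(tika.detect("report.pdf")); // application/pdf

        // ...or from the actual bytes, ignoring a misleading extension.
        byte[] pdfMagic = {'%', 'P', 'D', 'F', '-', '1', '.', '7'};
        System.out.println(tika.detect(pdfMagic));
    }
}
```

The facade also offers `parseToString` overloads for files and streams, which is what the service layer below relies on.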

Spring Boot Integration (Three Simple Steps)

1. Add Maven Dependencies

<dependency>
    <groupId>org.apache.tika</groupId>
    <artifactId>tika-core</artifactId>
    <version>2.9.2</version>
</dependency>
<dependency>
    <groupId>org.apache.tika</groupId>
    <artifactId>tika-parsers-standard-package</artifactId>
    <version>2.9.2</version>
</dependency>

The first artifact provides the core detection engine; the second bundles parsers for common formats such as PDF, DOCX, XLSX, etc.
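For Gradle builds, the equivalent declarations (Kotlin DSL) would be:

```kotlin
dependencies {
    implementation("org.apache.tika:tika-core:2.9.2")
    implementation("org.apache.tika:tika-parsers-standard-package:2.9.2")
}
```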

2. Create tika-config.xml

<?xml version="1.0" encoding="UTF-8"?>
<properties>
    <encodingDetectors>
        <encodingDetector class="org.apache.tika.parser.html.HtmlEncodingDetector">
            <params>
                <param name="markLimit" type="int">64000</param>
            </params>
        </encodingDetector>
        <encodingDetector class="org.apache.tika.parser.txt.UniversalEncodingDetector">
            <params>
                <param name="markLimit" type="int">64001</param>
            </params>
        </encodingDetector>
    </encodingDetectors>
</properties>

This configuration registers two encoding detectors: one specialized for HTML files and a universal detector for most other text formats. Place the file under src/main/resources so that it is available on the classpath, where the next step loads it from.

3. Register a Tika Bean

@Configuration
public class TikaConfig {
    @Autowired
    private ResourceLoader resourceLoader;

    @Bean
    public Tika tika() throws Exception {
        Resource resource = resourceLoader.getResource("classpath:tika-config.xml");
        // try-with-resources ensures the config stream is closed after parsing
        try (InputStream inputStream = resource.getInputStream()) {
            org.apache.tika.config.TikaConfig config = new org.apache.tika.config.TikaConfig(inputStream);
            Detector detector = config.getDetector();
            Parser parser = new AutoDetectParser(config);
            return new Tika(detector, parser);
        }
    }
}

The bean reads the XML, builds a TikaConfig, obtains a detector and an auto‑detect parser, and finally returns a ready‑to‑use Tika instance that can be injected elsewhere.

Practical Demonstration

Service Layer

@Service
public class FileParserService {
    @Autowired
    private Tika tika;

    public String parseFile(String filePath) {
        try {
            File file = new File(filePath);
            return tika.parseToString(file);
        } catch (IOException | TikaException e) {
            // parseToString declares both IOException and TikaException
            return "Parse failed: " + e.getMessage();
        }
    }
}

The parseFile method simply delegates to tika.parseToString, which automatically detects the file type and extracts its textual content.
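Tika can also read directly from a stream, so a service variant could accept the upload's InputStream instead of a path on disk. A hedged sketch (the StreamParser class and parseStream method are illustrative, not part of the article's service):

```java
import java.io.IOException;
import java.io.InputStream;

import org.apache.tika.Tika;
import org.apache.tika.exception.TikaException;

public class StreamParser {
    private final Tika tika = new Tika();

    // Parses text straight from a stream, so no temporary file is needed.
    public String parseStream(InputStream in) {
        try {
            // parseToString closes the stream and truncates the result at
            // Tika's default limit of 100,000 characters.
            return tika.parseToString(in);
        } catch (IOException | TikaException e) {
            return "Parse failed: " + e.getMessage();
        }
    }
}
```

This avoids writing uploads to disk at all, which also sidesteps temp-file cleanup.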

REST Controller

@RestController
@RequestMapping("/api/parse")
public class FileParseController {
    @Autowired
    private FileParserService fileParserService;

    @PostMapping("/upload")
    public String uploadFile(@RequestParam("file") MultipartFile file) {
        try {
            // Use a generated temp file instead of the client-supplied name,
            // which avoids path-traversal and name-collision issues.
            File tempFile = File.createTempFile("upload-", ".tmp");
            file.transferTo(tempFile);
            try {
                return fileParserService.parseFile(tempFile.getAbsolutePath());
            } finally {
                tempFile.delete();
            }
        } catch (Exception e) {
            return "Processing failed: " + e.getMessage();
        }
    }
}

The endpoint accepts any uploaded file, stores it temporarily, and returns the extracted text without any format‑specific branching logic.

Metadata Extraction Example

public Map<String, String> parseMetadata(String filePath) {
    // try-with-resources closes the file stream even when parsing fails
    try (InputStream stream = new FileInputStream(filePath)) {
        Metadata metadata = new Metadata();
        Parser parser = new AutoDetectParser();
        ContentHandler handler = new BodyContentHandler();
        parser.parse(stream, handler, metadata, new ParseContext());
        Map<String, String> result = new HashMap<>();
        for (String name : metadata.names()) {
            result.put(name, metadata.get(name));
        }
        return result;
    } catch (Exception e) {
        return Collections.emptyMap();
    }
}

This method returns a map of all metadata fields; the exact key names depend on the document type (for many office formats, for example, dc:creator, dcterms:created, and dcterms:modified).
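Rather than iterating over raw key names, common fields can also be read through Tika's typed TikaCoreProperties constants, which normalize naming across formats. A small sketch (the MetadataSummary class is illustrative):

```java
import org.apache.tika.metadata.Metadata;
import org.apache.tika.metadata.TikaCoreProperties;

public class MetadataSummary {
    // Pulls a few normalized fields out of an already-populated Metadata object.
    public static String summarize(Metadata metadata) {
        return "author=" + metadata.get(TikaCoreProperties.CREATOR)
             + ", created=" + metadata.get(TikaCoreProperties.CREATED)
             + ", modified=" + metadata.get(TikaCoreProperties.MODIFIED);
    }
}
```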

Typical Use Cases

Document Management Systems – extract text from every uploaded document to build a searchable index.

Data Analysis Pipelines – convert heterogeneous source files into a uniform text stream for downstream processing.

Sensitive Information Detection – feed extracted text into regex‑based scanners for IDs, credit‑card numbers, etc.

Content Migration – pull both content and metadata from legacy documents before moving them to a new platform.
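For the sensitive-information case, the text that Tika extracts can be fed to plain regex scanners. A minimal sketch (the pattern is deliberately naive, for illustration only; real card detection would add a Luhn check):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SensitiveTextScanner {
    // Naive pattern: 13-16 consecutive digits, roughly a card number.
    private static final Pattern CARD = Pattern.compile("\\b\\d{13,16}\\b");

    // Returns every match found in text already extracted by Tika.
    public static List<String> findCardLikeNumbers(String text) {
        List<String> hits = new ArrayList<>();
        Matcher m = CARD.matcher(text);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }
}
```

Because Tika flattens every format to plain text first, one scanner like this covers PDFs, spreadsheets, and Word documents alike.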

Conclusion

Apache Tika provides a single, consistent way to handle thousands of document formats. By adding two Maven dependencies, a tiny XML configuration, and a Spring bean, developers can replace a multitude of format‑specific libraries with a few lines of code. The resulting parseToString and metadata APIs keep the codebase clean, maintainable, and ready for a variety of real‑world scenarios.

Tags: Backend, Java, File Upload, Spring Boot, Apache Tika, Document Parsing
Written by

Selected Java Interview Questions

A professional Java tech channel sharing common knowledge to help developers fill gaps. Follow us!
