How to Parse PDFs and Extract Metadata with Apache Tika and Spring Boot

This guide explains Apache Tika's document parsing capabilities, shows how to download and run the Tika app, demonstrates extracting text and metadata from a PDF, and provides step‑by‑step instructions for integrating Tika into a Spring Boot project with full code examples.

Lobster Programming

Nov 1, 2024

Apache Tika Overview

Apache Tika is an open-source toolkit that detects and extracts text content and metadata from more than a thousand file types, including PPT, XLS, PDF, TXT, and DOC. For images and videos it typically retrieves only metadata.

Key components are:

Parser – extracts document content.

Language Detector – identifies the language of the text.

Metadata Extractor – pulls metadata from the file.
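
As a rough illustration of how these pieces surface in code, the sketch below uses the Tika facade class to detect a media type and to extract text. The class name TikaFacadeDemo and the sample inputs are made up for this example, and it assumes tika-core is on the classpath. Language detection is left out.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

import org.apache.tika.Tika;
import org.apache.tika.exception.TikaException;

// Hypothetical demo class, not part of the original article.
public class TikaFacadeDemo {

    // Tika with the default configuration; parsers are discovered from the classpath.
    private static final Tika TIKA = new Tika();

    // Detects a media type from the file name alone (the file is never opened).
    static String detectByName(String fileName) {
        return TIKA.detect(fileName);
    }

    // Detects a media type from the leading bytes of the content.
    static String detectByContent(byte[] prefix) {
        return TIKA.detect(prefix);
    }

    // Extracts plain text from a stream using whichever parser matches the content.
    static String extractText(InputStream in) throws IOException, TikaException {
        return TIKA.parseToString(in);
    }

    public static void main(String[] args) throws Exception {
        byte[] sample = "Hello, Tika".getBytes(StandardCharsets.UTF_8);
        System.out.println(detectByName("report.pdf"));
        System.out.println(detectByContent(sample));
        System.out.println(extractText(new ByteArrayInputStream(sample)).trim());
    }
}
```

The same three calls work for any supported format; only the parsers available on the classpath decide how much text comes back.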

Running the Tika Application

Download the Tika app JAR from a mirror:

https://mirrors.tuna.tsinghua.edu.cn/apache/tika/2.9.2/tika-app-2.9.2.jar

Run the JAR with Java:

java -jar tika-app-2.9.2.jar

The graphical UI appears, allowing you to upload files and view the extracted text and metadata.

After uploading a simple test PDF, Tika displays the extracted metadata and formatted text, as well as a JSON representation of the metadata.

Integrating Tika with Spring Boot

Add the required Maven dependencies:

<dependencyManagement>
    <dependencies>
        <!-- tika-bom is a BOM: import it here so all Tika modules share one version -->
        <dependency>
            <groupId>org.apache.tika</groupId>
            <artifactId>tika-bom</artifactId>
            <version>2.8.0</version>
            <type>pom</type>
            <scope>import</scope>
        </dependency>
    </dependencies>
</dependencyManagement>

<dependencies>
    <!-- Tika versions are supplied by tika-bom above -->
    <dependency>
        <groupId>org.apache.tika</groupId>
        <artifactId>tika-core</artifactId>
    </dependency>

    <dependency>
        <groupId>org.apache.tika</groupId>
        <artifactId>tika-parsers-standard-package</artifactId>
    </dependency>

    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-web</artifactId>
    </dependency>
</dependencies>

Create a tika-config.xml file (placed in src/main/resources) to configure encoding detectors:

<?xml version="1.0" encoding="UTF-8"?>
<properties>
    <encodingDetectors>
        <encodingDetector class="org.apache.tika.parser.html.HtmlEncodingDetector">
            <params>
                <param name="markLimit" type="int">64000</param>
            </params>
        </encodingDetector>
        <encodingDetector class="org.apache.tika.parser.txt.UniversalEncodingDetector">
            <params>
                <param name="markLimit" type="int">64001</param>
            </params>
        </encodingDetector>
        <encodingDetector class="org.apache.tika.parser.txt.Icu4jEncodingDetector">
            <params>
                <param name="markLimit" type="int">64002</param>
            </params>
        </encodingDetector>
    </encodingDetectors>
</properties>

Configure a Spring bean that loads this XML and creates a Tika instance:

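import java.io.IOException;
import java.io.InputStream;

import org.apache.tika.Tika;
import org.apache.tika.config.TikaConfig;
import org.apache.tika.detect.Detector;
import org.apache.tika.exception.TikaException;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.Parser;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.core.io.ResourceLoader;
import org.xml.sax.SAXException;

@Configuration
public class MyTikaConfig {

    // Spring passes the ResourceLoader in as a bean-method argument.
    @Bean
    public Tika tika(ResourceLoader resourceLoader) throws TikaException, IOException, SAXException {
        // try-with-resources closes the config stream once it has been read.
        try (InputStream inputStream = resourceLoader.getResource("classpath:tika-config.xml").getInputStream()) {
            TikaConfig config = new TikaConfig(inputStream);
            Detector detector = config.getDetector();
            Parser autoDetectParser = new AutoDetectParser(config);
            return new Tika(detector, autoDetectParser);
        }
    }
}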

Expose a REST endpoint to parse a PDF file:

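import java.io.File;
import java.io.IOException;

import org.apache.tika.Tika;
import org.apache.tika.exception.TikaException;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RestController;

@RestController
@RequestMapping("/parse")
public class TikaController {

    private final Tika tika;

    // Constructor injection of the Tika bean defined in the configuration class.
    public TikaController(Tika tika) {
        this.tika = tika;
    }

    @GetMapping("/pdf")
    public String parsePdf() throws TikaException, IOException {
        // Sample location of the test PDF; adjust it to your environment.
        String filePath = "C:\\Users\\1\\Desktop" + File.separator + "test.pdf";
        File file = new File(filePath);
        String data = tika.parseToString(file);
        System.out.println("Content of the PDF file: " + data);
        return data;
    }
}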

Start the Spring Boot application and request http://localhost:8081/parse/pdf (this assumes server.port is set to 8081; Spring Boot's default port is 8080). The response contains the text extracted from the PDF. Metadata can be retrieved in a similar way.
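
The article does not show the metadata call itself, so here is a minimal sketch of one way to do it: pass a Metadata object to parseToString and read the collected fields afterwards. The class name TikaMetadataDemo, the helper extractMetadata, and the in-memory sample text are assumptions made for illustration. In the Spring Boot project you would pass the injected Tika bean and a stream opened on the PDF instead.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.TreeMap;

import org.apache.tika.Tika;
import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;

// Hypothetical demo class, not part of the original article.
public class TikaMetadataDemo {

    // Parses the stream and returns the metadata Tika collected, sorted by key.
    // Multi-valued fields are joined with ", " to keep the result easy to print.
    static Map<String, String> extractMetadata(Tika tika, InputStream in)
            throws IOException, TikaException {
        Metadata metadata = new Metadata();
        // parseToString fills `metadata` as a side effect; the returned text is ignored here.
        tika.parseToString(in, metadata);
        Map<String, String> result = new TreeMap<>();
        for (String name : metadata.names()) {
            result.put(name, String.join(", ", metadata.getValues(name)));
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        byte[] sample = "Hello, Tika".getBytes(StandardCharsets.UTF_8);
        Map<String, String> meta = extractMetadata(new Tika(), new ByteArrayInputStream(sample));
        meta.forEach((key, value) -> System.out.println(key + " = " + value));
    }
}
```

Which keys appear depends on the file and on the parsers available on the classpath; Content-Type is normally among them.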

In summary, Apache Tika can detect and extract both text content and metadata from a wide range of document formats, making it valuable for search engines, content analysis, and translation pipelines, and it integrates smoothly into Java‑based backend services such as Spring Boot.
