How to Parse PDFs and Extract Metadata with Apache Tika and Spring Boot
This guide explains Apache Tika's document parsing capabilities, shows how to download and run the Tika app, demonstrates extracting text and metadata from a PDF, and provides step‑by‑step instructions for integrating Tika into a Spring Boot project with full code examples.
Apache Tika Overview
Apache Tika is an open-source document parsing library that can detect and extract content and metadata from more than a thousand file types, including PPT, XLS, PDF, TXT, and DOC. For images and videos, it retrieves only metadata by default.
Key components are:
Parser – extracts document content.
Language Detector – identifies the language of the text.
Metadata Extractor – pulls metadata from the file.
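As a rough illustration of the parser and metadata components, the sketch below uses Tika's `Tika` facade class to detect a media type and extract text while collecting metadata. It assumes `tika-core` and `tika-parsers-standard-package` are on the classpath; the class name `TikaFacadeSketch` and the sample text are made up for this example.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

import org.apache.tika.Tika;
import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;

// Hypothetical class name, used only for illustration.
public class TikaFacadeSketch {

    // Detects the media type from the leading bytes of the content.
    static String detectType(byte[] content) {
        return new Tika().detect(content);
    }

    // Extracts plain text; the supplied Metadata object is filled as a side effect.
    static String extractText(byte[] content, Metadata metadata)
            throws IOException, TikaException {
        try (InputStream in = new ByteArrayInputStream(content)) {
            return new Tika().parseToString(in, metadata);
        }
    }

    public static void main(String[] args) throws Exception {
        byte[] sample = "Hello from Tika".getBytes(StandardCharsets.UTF_8);
        Metadata metadata = new Metadata();

        System.out.println("Detected type: " + detectType(sample));
        System.out.println("Text: " + extractText(sample, metadata).trim());

        // Print every metadata field the parser reported.
        for (String name : metadata.names()) {
            System.out.println(name + " = " + metadata.get(name));
        }
    }
}
```

The same two calls work on a PDF if you pass its bytes or an `InputStream` over the file instead of the sample string.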
Running the Tika Application
Download the Tika app JAR from a mirror:
<code>https://mirrors.tuna.tsinghua.edu.cn/apache/tika/2.9.2/tika-app-2.9.2.jar</code>
Run the JAR with Java:
<code>java -jar tika-app-2.9.2.jar</code>
The graphical UI appears, allowing you to upload files and view extracted text and metadata.
After uploading a simple test PDF, Tika displays the extracted metadata and formatted text, as well as a JSON representation of the metadata.
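The views in the GUI have programmatic counterparts. As a minimal sketch, and assuming the formatted-text view corresponds to Tika's structured XHTML output (an assumption, not something the app documents here), the code below feeds a document through `AutoDetectParser` into a `ToXMLContentHandler`. The class name `XhtmlSketch` and the sample input are hypothetical, and the Tika parser modules are assumed to be on the classpath.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.ToXMLContentHandler;

// Hypothetical class name, used only for illustration.
public class XhtmlSketch {

    // Parses the stream and returns Tika's XHTML rendering of the content.
    static String toXhtml(InputStream in) throws Exception {
        ToXMLContentHandler handler = new ToXMLContentHandler();
        new AutoDetectParser().parse(in, handler, new Metadata(), new ParseContext());
        return handler.toString();
    }

    public static void main(String[] args) throws Exception {
        byte[] sample = "Hello from Tika".getBytes(StandardCharsets.UTF_8);
        try (InputStream in = new ByteArrayInputStream(sample)) {
            System.out.println(toXhtml(in));
        }
    }
}
```

Replacing the byte array with a `FileInputStream` over a PDF would give the structured output for that file.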
Integrating Tika with Spring Boot
Add the required Maven dependencies:
<code><dependencies>
    <!-- Keep all Tika artifacts on one version. tika-bom is a BOM (pom packaging), so it
         belongs in dependencyManagement with type pom and scope import, not in this list. -->
    <dependency>
        <groupId>org.apache.tika</groupId>
        <artifactId>tika-core</artifactId>
        <version>2.6.0</version>
    </dependency>
    <dependency>
        <groupId>org.apache.tika</groupId>
        <artifactId>tika-parsers-standard-package</artifactId>
        <version>2.6.0</version>
    </dependency>
    <!-- No version here: it is assumed to be managed by the Spring Boot parent POM. -->
    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-web</artifactId>
    </dependency>
</dependencies></code>
Create a tika-config.xml file (placed in src/main/resources) to configure encoding detectors:
<code><?xml version="1.0" encoding="UTF-8"?>
<properties>
<encodingDetectors>
<encodingDetector class="org.apache.tika.parser.html.HtmlEncodingDetector">
<params>
<param name="markLimit" type="int">64000</param>
</params>
</encodingDetector>
<encodingDetector class="org.apache.tika.parser.txt.UniversalEncodingDetector">
<params>
<param name="markLimit" type="int">64001</param>
</params>
</encodingDetector>
<encodingDetector class="org.apache.tika.parser.txt.Icu4jEncodingDetector">
<params>
<param name="markLimit" type="int">64002</param>
</params>
</encodingDetector>
</encodingDetectors>
</properties></code>
Configure a Spring bean that loads this XML and creates a Tika instance:
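<code>import java.io.IOException;
import java.io.InputStream;
import javax.annotation.Resource; // jakarta.annotation.Resource on Spring Boot 3+
import org.apache.tika.Tika;
import org.apache.tika.config.TikaConfig;
import org.apache.tika.exception.TikaException;
import org.apache.tika.parser.AutoDetectParser;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.core.io.ResourceLoader;
import org.xml.sax.SAXException;

@Configuration
public class MyTikaConfig {
    @Resource
    private ResourceLoader resourceLoader;

    @Bean
    public Tika tika() throws TikaException, IOException, SAXException {
        org.springframework.core.io.Resource resource = resourceLoader.getResource("classpath:tika-config.xml");
        // try-with-resources closes the stream once the configuration has been read.
        try (InputStream inputStream = resource.getInputStream()) {
            TikaConfig config = new TikaConfig(inputStream);
            return new Tika(config.getDetector(), new AutoDetectParser(config));
        }
    }
}
</code>
Expose a REST endpoint to parse a PDF file: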
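<code>import java.io.File;
import java.io.IOException;
import javax.annotation.Resource; // jakarta.annotation.Resource on Spring Boot 3+
import org.apache.tika.Tika;
import org.apache.tika.exception.TikaException;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RestController;

@RestController
@RequestMapping("/parse")
public class TikaController {
    @Resource
    private Tika tika;

    @GetMapping("/pdf")
    public String parsePdf() throws TikaException, IOException {
        // Hard-coded sample path; adjust it for your machine.
        String filePath = "C:\\Users\\1\\Desktop" + File.separator + "test.pdf";
        File file = new File(filePath);
        // parseToString detects the type and returns plain text (capped at 100,000 characters by default).
        String data = tika.parseToString(file);
        System.out.println("Contents of the PDF file: " + data);
        return data;
    }
}
</code>
Start the Spring Boot application and request http://localhost:8081/parse/pdf (this assumes server.port is set to 8081; Spring Boot's default is 8080). The response contains the extracted text of the PDF. Metadata can be retrieved in a similar way by passing a Metadata object to the parse call.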
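The article does not show metadata retrieval, so here is one possible sketch. It parses a stream with `AutoDetectParser` and copies the resulting `Metadata` fields into a map, which a controller method could return as JSON. The class name `MetadataSketch`, the method name `extractMetadata`, and the sample input are hypothetical; the Tika dependencies from the Maven section are assumed.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;

// Hypothetical helper, used only for illustration.
public class MetadataSketch {

    // Parses the stream and returns each metadata field with its values joined by ", ".
    static Map<String, String> extractMetadata(InputStream in) throws Exception {
        AutoDetectParser parser = new AutoDetectParser();
        BodyContentHandler handler = new BodyContentHandler(-1); // -1 disables the write limit
        Metadata metadata = new Metadata();
        parser.parse(in, handler, metadata, new ParseContext());

        Map<String, String> result = new LinkedHashMap<>();
        for (String name : metadata.names()) {
            result.put(name, String.join(", ", metadata.getValues(name)));
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        byte[] sample = "plain text sample".getBytes(StandardCharsets.UTF_8);
        try (InputStream in = new ByteArrayInputStream(sample)) {
            extractMetadata(in).forEach((k, v) -> System.out.println(k + " = " + v));
        }
    }
}
```

For a PDF, the returned map would typically include the content type along with whatever document properties the file carries; the exact field names depend on the Tika version and the file itself.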
In summary, Apache Tika can detect and extract both text content and metadata from a wide range of document formats, making it valuable for search engines, content analysis, and translation pipelines, and it integrates smoothly into Java‑based backend services such as Spring Boot.
Lobster Programming
Sharing insights on technical analysis and exchange, making life better through technology.