Unlock Apache Tika: Extract Text, Metadata, and Detect Sensitive Data in Java
This article introduces Apache Tika, a powerful Java library for parsing many file formats, extracting text and metadata, performing OCR and language detection, and shows how to integrate it with Spring Boot to automatically detect sensitive information such as ID numbers, credit cards, and phone numbers.
Apache Tika Overview
Apache Tika is a content analysis toolkit that detects file types and extracts text, metadata, and other structured information from a wide variety of file formats.
Supported Formats
Office Documents : Word (.doc, .docx), Excel (.xls, .xlsx), PowerPoint (.ppt, .pptx), OpenOffice formats, etc.
PDF : Extracts text and metadata from PDF files.
HTML / XML : Parses HTML and XML content.
Plain Text : .txt and similar files.
Images and Media : JPEG, PNG, MP3, MP4, WAV, with extraction of related metadata.
Email : EML files.
Compressed Archives : ZIP, TAR, GZ and their contents.
Tika achieves this by delegating to open‑source libraries such as Apache POI and PDFBox, and to the external Tesseract OCR engine when it is installed.
Key Features
Automatic File Type Detection : Determines MIME type based on file content, not just the extension.
Text and Metadata Extraction : Retrieves document text and metadata like author, creation date, size, and copyright.
OCR Support : Uses Tesseract to extract text from scanned images or PDFs.
Language Detection : Identifies the language of extracted text for multilingual processing.
Multithreading : Supports parallel processing of large batches of files.
Unified Output : Every parser emits XHTML through one common interface; Tika App and Tika Server can also return plain text or JSON for easy integration.
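To make the first feature concrete, here is a minimal sketch of content-based type detection: it compares a file's leading bytes with a few well-known signatures instead of trusting the extension. This is an illustration only, not Tika's implementation. The class MagicByteSniffer and its sniff method are hypothetical names, and Tika's real detector knows far more formats and also looks inside containers (a .docx file, for instance, starts with the same bytes as a plain ZIP archive). With Tika itself, detection is a call to detect on the Tika facade.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper: a toy version of content-based MIME detection.
final class MagicByteSniffer {

    private static final byte[] PDF = "%PDF-".getBytes(StandardCharsets.US_ASCII);
    private static final byte[] PNG = {(byte) 0x89, 'P', 'N', 'G'};
    private static final byte[] ZIP = {'P', 'K', 3, 4};

    // True when data begins with the given signature bytes.
    private static boolean startsWith(byte[] data, byte[] prefix) {
        return data.length >= prefix.length
                && Arrays.equals(data, 0, prefix.length, prefix, 0, prefix.length);
    }

    // Returns a MIME type guessed from the first bytes of a file.
    public static String sniff(byte[] head) {
        if (startsWith(head, PDF)) return "application/pdf";
        if (startsWith(head, PNG)) return "image/png";
        // OOXML documents (.docx, .xlsx) also start with these bytes;
        // telling them apart requires reading the archive entries.
        if (startsWith(head, ZIP)) return "application/zip";
        return "application/octet-stream"; // unknown content
    }
}
```

Because only the bytes are inspected, a PDF renamed to report.txt is still reported as application/pdf.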
Embedding Tika
Tika is written in Java and can be used as a standalone command‑line tool (Tika App), as a RESTful service (Tika Server), or embedded directly via its Java API.
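As an illustration of the Tika Server option, the sketch below sends a document to a running server with the JDK's built-in HTTP client (Java 11 or later) and reads back plain text. It assumes a server listening on its default port 9998 and uses the server's PUT /tika endpoint. The class and method names (TikaServerClient, buildExtractRequest, extractText) are hypothetical.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical client for a separately started Tika Server instance.
final class TikaServerClient {

    // Builds a PUT request to <serverBase>/tika that asks for plain-text output.
    public static HttpRequest buildExtractRequest(URI serverBase, byte[] document) {
        return HttpRequest.newBuilder(serverBase.resolve("/tika"))
                .header("Accept", "text/plain")
                .PUT(HttpRequest.BodyPublishers.ofByteArray(document))
                .build();
    }

    // Uploads the file and returns the extracted text.
    public static String extractText(URI serverBase, Path file) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = buildExtractRequest(serverBase, Files.readAllBytes(file));
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
        if (response.statusCode() != 200) {
            throw new IllegalStateException("Tika Server returned HTTP " + response.statusCode());
        }
        return response.body();
    }
}
```

A call such as TikaServerClient.extractText(URI.create("http://localhost:9998"), Path.of("report.pdf")) keeps parsing out of the application's own JVM, which can be useful when untrusted files may crash or stall a parser.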
Architecture Components
Tika Core provides basic parsing, MIME detection, and content extraction. Tika Parsers are specialized modules for different formats, including text, media, document, and metadata parsers.
Configuration is managed through tika-config.xml, allowing custom parsers and extraction strategies.
Integration Example: Sensitive Information Detection in Spring Boot
The following Maven dependencies are required:
<dependencies>
    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-web</artifactId>
    </dependency>
    <dependency>
        <groupId>org.apache.tika</groupId>
        <artifactId>tika-core</artifactId>
        <version>2.6.0</version>
    </dependency>
    <!-- In Tika 2.x the bundled parsers ship as tika-parsers-standard-package;
         tika-parsers itself is only a parent POM. -->
    <dependency>
        <groupId>org.apache.tika</groupId>
        <artifactId>tika-parsers-standard-package</artifactId>
        <version>2.6.0</version>
    </dependency>
</dependencies>
SensitiveInfoService.java uses Tika to parse an uploaded file and applies regular expressions to find ID numbers, credit‑card numbers, and phone numbers.
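package com.example.tikademo.service;

import org.apache.tika.Tika;
import org.springframework.stereotype.Service;

import java.io.InputStream;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

@Service
public class SensitiveInfoService {

    // parseToString keeps at most 100,000 characters by default;
    // call tika.setMaxStringLength(-1) to lift the limit.
    private final Tika tika = new Tika();

    // The (?<!\d) and (?!\d) guards stop a pattern from matching inside a longer digit run.
    // 18-character ID number (last character may be X) or the older 15-digit form.
    private static final String ID_CARD_REGEX = "(?<!\\d)(\\d{17}[\\dXx]|\\d{15})(?!\\d)";
    // 16 digits, optionally hyphenated in groups of four.
    private static final String CREDIT_CARD_REGEX = "(?<!\\d)(\\d{4}-?\\d{4}-?\\d{4}-?\\d{4})(?!\\d)";
    // 11-digit number, or 10 digits in 3-3-4 form; the longer alternative is tried first.
    private static final String PHONE_REGEX = "(?<!\\d)(\\d{11}|\\d{3}-?\\d{3}-?\\d{4})(?!\\d)";

    // Extracts the document text with Tika and reports every pattern match.
    public String checkSensitiveInfo(InputStream is) throws Exception {
        String content = tika.parseToString(is);
        StringBuilder sb = new StringBuilder();
        detectAndAppend(content, ID_CARD_REGEX, "ID Number", sb);
        detectAndAppend(content, CREDIT_CARD_REGEX, "Credit Card", sb);
        detectAndAppend(content, PHONE_REGEX, "Phone Number", sb);
        return sb.length() > 0 ? sb.toString() : "No sensitive information found";
    }

    // Appends one "label: match" line per occurrence of the regex in the text.
    private void detectAndAppend(String text, String regex, String label, StringBuilder out) {
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            out.append(label).append(": ").append(m.group()).append("\n");
        }
    }
}
FileController.java provides a REST endpoint to upload files.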
package com.example.tikademo.controller;

import com.example.tikademo.service.SensitiveInfoService;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.web.bind.annotation.*;
import org.springframework.web.multipart.MultipartFile;

import java.io.IOException;

@RestController
@RequestMapping("/api/files")
public class FileController {

    @Autowired
    private SensitiveInfoService service;

    @PostMapping("/upload")
    public String uploadFile(@RequestParam("file") MultipartFile file) {
        try {
            return service.checkSensitiveInfo(file.getInputStream());
        } catch (IOException e) {
            return "File processing error: " + e.getMessage();
        } catch (Exception e) {
            return "Parsing error: " + e.getMessage();
        }
    }
}
A simple index.html page can be placed under src/main/resources/static/ to test the upload functionality.
<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>Upload File for Sensitive Information Detection</title>
</head>
<body>
    <h2>Upload a File</h2>
    <form action="/api/files/upload" method="post" enctype="multipart/form-data">
        <input type="file" name="file" required>
        <button type="submit">Upload</button>
    </form>
</body>
</html>
Typical usage scenarios include enterprise document management, content management systems, big‑data pipelines, search engine indexing, digital asset management, and information‑security checks for sensitive data leakage.
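For the information‑security scenario in particular, pattern matching alone tends to over-report: any 16-digit run looks like a card number. One common refinement, sketched below, is to validate candidates with the Luhn checksum and to mask the value before it is returned, so the response does not repeat the sensitive data. This is a sketch under stated assumptions rather than part of the service above: the class StrictSensitiveMatcher and its methods are hypothetical names, and only card numbers are covered.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: stricter credit-card detection than a bare regex.
final class StrictSensitiveMatcher {

    // 16 digits in groups of four, hyphens optional, not embedded in a longer digit run.
    private static final Pattern CARD =
            Pattern.compile("(?<!\\d)\\d{4}-?\\d{4}-?\\d{4}-?\\d{4}(?!\\d)");

    // Luhn checksum: double every second digit from the right and sum the digits.
    public static boolean passesLuhn(String digits) {
        int sum = 0;
        boolean doubleIt = false;
        for (int i = digits.length() - 1; i >= 0; i--) {
            int d = digits.charAt(i) - '0';
            if (doubleIt) {
                d *= 2;
                if (d > 9) d -= 9;
            }
            sum += d;
            doubleIt = !doubleIt;
        }
        return sum % 10 == 0;
    }

    // Replaces all but the last four digits with asterisks.
    public static String mask(String digits) {
        int visible = 4;
        return "*".repeat(digits.length() - visible)
                + digits.substring(digits.length() - visible);
    }

    // Returns masked card numbers that match the pattern and pass the Luhn check.
    public static List<String> findCards(String text) {
        List<String> hits = new ArrayList<>();
        Matcher m = CARD.matcher(text);
        while (m.find()) {
            String digits = m.group().replace("-", "");
            if (passesLuhn(digits)) {
                hits.add(mask(digits));
            }
        }
        return hits;
    }
}
```

In this sketch an 18-digit ID number is no longer reported as a card, and a 16-digit string that fails the checksum is dropped. The text passed to findCards would be the string returned by tika.parseToString.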
Summary
By embedding Apache Tika in a Spring Boot application, developers can automatically parse diverse file types, extract text and metadata, and apply custom logic, such as regular‑expression‑based sensitive‑data detection, to help guard against data leaks.
