Integrating Apache Tika with Spring Boot for Sensitive Information Detection and Data Leakage Prevention
This article explains Apache Tika's core features, architecture, and multiple application scenarios, then provides a step‑by‑step guide to embed Tika in a Spring Boot project to extract file content, detect personal data such as ID numbers, credit cards and phone numbers using regular expressions, and protect against data leakage.
Apache Tika Main Features
Apache Tika is a powerful content analysis library that can extract text, metadata, and structured information from a wide range of file formats, including office documents, PDFs, HTML/XML, plain text, images, audio/video, emails, and compressed archives. It relies on many open‑source parsers such as Apache POI, PDFBox and Tesseract OCR.
Key capabilities
Multi‑format support for documents, PDFs, HTML, text, media and archives.
Automatic MIME type detection based on file content.
Text and metadata extraction (author, creation date, size, etc.).
OCR for scanned images and PDFs through integration with Tesseract, which must be installed separately.
Language detection for multilingual content.
Embeddable Java API, command‑line tool (Tika App) and RESTful server (Tika Server).
Multithreaded processing for large batches.
Uniform output regardless of input format: content as XHTML or plain text, with metadata also available as JSON.
Extensible architecture allowing custom parsers and configuration.
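Detecting a type "based on file content" generally means inspecting a file's leading bytes instead of trusting its extension. The sketch below illustrates that idea with three well-known signatures. It is a simplified stand-in rather than Tika's own detector, and the class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Illustrative only: a tiny signature table, not Tika's detector. */
final class MagicByteSniffer {

    private static final byte[] PDF = "%PDF-".getBytes(StandardCharsets.US_ASCII);
    private static final byte[] ZIP = {'P', 'K', 0x03, 0x04};
    private static final byte[] PNG = {(byte) 0x89, 'P', 'N', 'G', 0x0D, 0x0A, 0x1A, 0x0A};

    /** Guesses a MIME type from the first bytes of a file. */
    static String sniff(byte[] head) {
        if (startsWith(head, PDF)) return "application/pdf";
        if (startsWith(head, PNG)) return "image/png";
        // ZIP is also the container used by .docx, .xlsx and .jar files
        if (startsWith(head, ZIP)) return "application/zip";
        return "application/octet-stream"; // unknown: fall back to generic binary
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        return data.length >= prefix.length
                && Arrays.equals(Arrays.copyOf(data, prefix.length), prefix);
    }

    public static void main(String[] args) {
        byte[] sample = "%PDF-1.7\n".getBytes(StandardCharsets.US_ASCII);
        System.out.println(sniff(sample));
    }
}
```

A file renamed from report.pdf to report.txt would still be classified as a PDF by this kind of check, which is why content-based detection matters when scanning uploads.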
Tika Architecture Components
The framework consists of core parsing, a set of parsers for different media types, configuration management, a CLI application, and a server component.
1. Tika Core
File parsing (Parser)
Content extraction (text, images, audio, video)
MIME type detection
2. Tika Parsers
Text parsers for .txt, .xml, .html, etc.
Media parsers for images, audio, video.
Document parsers for Word, Excel, PowerPoint, PDF.
Metadata parsers for author, title, timestamps.
3. Tika Config
Allows custom parser selection and behavior via tika-config.xml.
4. Tika App (CLI)
Command‑line interface for extracting text and metadata from files.
5. Tika Server
RESTful service that accepts file uploads over HTTP and returns parsed content.
6. Language Detection, OCR, Extractor, Metadata, etc.
Additional modules provide language identification, OCR integration, a unified extraction API, and metadata handling.
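As a rough illustration of the Tika Server component described above, the sketch below builds the HTTP request a client might send. It assumes a server is already running locally on port 9998 and that a PUT to /tika with an Accept: text/plain header returns the extracted text. The class name, method name and sample payload are hypothetical, and the java.net.http client requires Java 11 or later.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

final class TikaServerRequestSketch {

    /** Builds a PUT /tika request that asks for plain-text output. */
    static HttpRequest buildExtractRequest(URI serverBase, byte[] fileBytes) {
        return HttpRequest.newBuilder(serverBase.resolve("/tika"))
                .header("Accept", "text/plain")
                .PUT(HttpRequest.BodyPublishers.ofByteArray(fileBytes))
                .build();
    }

    public static void main(String[] args) throws Exception {
        // Assumption: a Tika Server instance is listening on localhost:9998
        URI server = URI.create("http://localhost:9998");
        byte[] body = "Phone: 138-1234-5678".getBytes(StandardCharsets.UTF_8);

        HttpRequest request = buildExtractRequest(server, body);
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body());
    }
}
```

Running the server as a separate process keeps the parsers out of the application's own JVM, at the cost of an extra network hop per file.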
Typical Application Scenarios
Enterprise document management and full‑text search.
Content Management Systems for automatic file processing.
Big‑data pipelines to turn unstructured files into structured data for analysis.
Legal and compliance document review.
Digital Asset Management (DAM) for media metadata extraction.
Information security – scanning files for sensitive personal data.
Automated email classification.
Implementing Information Security and Data Leakage Prevention with Tika in Spring Boot
The following steps show how to integrate Apache Tika into a Spring Boot application to detect sensitive information such as ID numbers, credit‑card numbers and phone numbers during file upload.
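Pattern matching alone tends to over-report, because any sufficiently long digit run can look like a card or ID number. A common complement is checksum validation. The sketch below is a minimal example under two assumptions: card numbers follow the Luhn checksum, and ID numbers are 18-character Chinese resident IDs whose last character is an ISO 7064 MOD 11-2 check character. The class and method names are hypothetical and are not used by the steps that follow.

```java
final class SensitiveValueValidators {

    private static final int[] ID_WEIGHTS = {7, 9, 10, 5, 8, 4, 2, 1, 6, 3, 7, 9, 10, 5, 8, 4, 2};
    private static final String ID_CHECK_CHARS = "10X98765432";

    /** Luhn checksum for payment card numbers; non-digit separators are ignored. */
    static boolean passesLuhn(String candidate) {
        String digits = candidate.replaceAll("[^0-9]", "");
        if (digits.isEmpty()) return false;
        int sum = 0;
        boolean doubleIt = false; // every second digit, counted from the right, is doubled
        for (int i = digits.length() - 1; i >= 0; i--) {
            int d = digits.charAt(i) - '0';
            if (doubleIt) {
                d *= 2;
                if (d > 9) d -= 9;
            }
            sum += d;
            doubleIt = !doubleIt;
        }
        return sum % 10 == 0;
    }

    /** Verifies the check character of an 18-character resident ID number. */
    static boolean passesResidentIdCheck(String id) {
        if (id == null || !id.matches("\\d{17}[\\dXx]")) return false;
        int sum = 0;
        for (int i = 0; i < 17; i++) {
            sum += (id.charAt(i) - '0') * ID_WEIGHTS[i];
        }
        char expected = ID_CHECK_CHARS.charAt(sum % 11);
        return expected == Character.toUpperCase(id.charAt(17));
    }

    public static void main(String[] args) {
        System.out.println(passesLuhn("4111 1111 1111 1111"));
        System.out.println(passesLuhn("1234-5678-9876-5432"));
        System.out.println(passesResidentIdCheck("123456789012345678"));
    }
}
```

The widely used test card number 4111 1111 1111 1111 passes the Luhn check. The placeholder values 1234-5678-9876-5432 and 123456789012345678 used in this article's test document do not pass their respective checks, so a detector that validates checksums would ignore them while a purely pattern-based one reports them. Which behavior is preferable depends on how costly false positives are compared with missed matches.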
1. Add dependencies
<dependencies>
    <!-- Spring Boot Web -->
    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-web</artifactId>
    </dependency>
    <!-- Apache Tika core: detection and the Tika facade -->
    <dependency>
        <groupId>org.apache.tika</groupId>
        <artifactId>tika-core</artifactId>
        <version>2.6.0</version>
    </dependency>
    <!-- Format parsers. In Tika 2.x these ship as tika-parsers-standard-package;
         the tika-parsers artifact is only a parent POM and provides no jar. -->
    <dependency>
        <groupId>org.apache.tika</groupId>
        <artifactId>tika-parsers-standard-package</artifactId>
        <version>2.6.0</version>
    </dependency>
</dependencies>
2. Create the sensitive‑info detection service
package com.example.tikademo.service;

import org.apache.tika.Tika;
import org.apache.tika.exception.TikaException;
import org.springframework.stereotype.Service;

import java.io.IOException;
import java.io.InputStream;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

@Service
public class SensitiveInfoService {

    // The Tika facade detects the file type and selects a parser automatically
    private final Tika tika = new Tika();

    // Patterns are compiled once. The lookarounds keep a pattern from matching
    // inside a longer digit run (for example, 16 digits of an 18-digit ID number).
    // ID card: 18 characters (the last may be X) or the older 15-digit form
    private static final Pattern ID_CARD =
            Pattern.compile("(?<!\\d)(?:\\d{17}[\\dXx]|\\d{15})(?!\\d)");
    // Credit card: 16 digits, optionally hyphenated in groups of four
    private static final Pattern CREDIT_CARD =
            Pattern.compile("(?<![\\d-])\\d{4}-?\\d{4}-?\\d{4}-?\\d{4}(?![\\d-])");
    // Phone: 11 digits (optionally 3-4-4) or 10 digits (optionally 3-3-4)
    private static final Pattern PHONE =
            Pattern.compile("(?<![\\d-])(?:\\d{3}-?\\d{4}-?\\d{4}|\\d{3}-?\\d{3}-?\\d{4})(?![\\d-])");

    // Extract the file content and report any sensitive values found in it
    public String checkSensitiveInfo(InputStream fileInputStream) throws IOException {
        // 1. Use Tika to extract the text
        String fileContent;
        try {
            fileContent = tika.parseToString(fileInputStream);
        } catch (TikaException e) {
            // parseToString also throws TikaException; wrap it so callers only handle IOException
            throw new IOException("Tika could not parse the file", e);
        }
        // 2. Run each detector. The labels are Chinese for "ID card number",
        //    "credit card number" and "phone number".
        StringBuilder sensitiveInfoDetected = new StringBuilder();
        detectAndAppend(fileContent, ID_CARD, "身份证号", sensitiveInfoDetected);
        detectAndAppend(fileContent, CREDIT_CARD, "信用卡号", sensitiveInfoDetected);
        detectAndAppend(fileContent, PHONE, "电话号码", sensitiveInfoDetected);
        // "未检测到敏感信息" means "No sensitive information detected"
        return sensitiveInfoDetected.length() > 0 ? sensitiveInfoDetected.toString() : "未检测到敏感信息";
    }

    // Generic detection helper: appends one "label: match" line per occurrence
    private void detectAndAppend(String content, Pattern pattern, String label, StringBuilder result) {
        Matcher matcher = pattern.matcher(content);
        while (matcher.find()) {
            result.append(label).append(": ").append(matcher.group()).append("\n");
        }
    }
}
3. Create the file‑upload controller
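package com.example.tikademo.controller;

import com.example.tikademo.service.SensitiveInfoService;
import org.springframework.web.bind.annotation.PostMapping;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RequestParam;
import org.springframework.web.bind.annotation.RestController;
import org.springframework.web.multipart.MultipartFile;

import java.io.IOException;

@RestController
@RequestMapping("/api/files")
public class FileController {

    private final SensitiveInfoService sensitiveInfoService;

    // Constructor injection: Spring supplies the service bean
    public FileController(SensitiveInfoService sensitiveInfoService) {
        this.sensitiveInfoService = sensitiveInfoService;
    }

    @PostMapping("/upload")
    public String uploadFile(@RequestParam("file") MultipartFile file) {
        try {
            // The detection report becomes the plain-text response body
            return sensitiveInfoService.checkSensitiveInfo(file.getInputStream());
        } catch (IOException e) {
            // "文件处理错误" means "File processing error"
            return "文件处理错误: " + e.getMessage();
        }
    }
}
4. Optional front‑end page (index.html)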
<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>Upload File for Sensitive Information Detection</title>
</head>
<body>
    <h2>Upload a File for Sensitive Information Detection</h2>
    <form action="/api/files/upload" method="post" enctype="multipart/form-data">
        <input type="file" name="file" required>
        <button type="submit">Upload</button>
    </form>
</body>
</html>
5. Test document (test.txt)
The sample is a short customer letter in Chinese that contains an ID number (身份证号), a credit‑card number (信用卡号) and a phone number (电话号码):
尊敬的用户:
您好!感谢您使用我们的服务。以下是您的账户信息:
身份证号:123456789012345678
信用卡号:1234-5678-9876-5432
电话号码:138-1234-5678
如果您对我们的服务有任何问题,请随时联系客户支持团队。
谢谢!
此致,
敬礼!
6. Running the demo
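Start the Spring Boot application and open http://localhost:8080 (this assumes index.html was placed in src/main/resources/static, where Spring Boot looks for static pages). Upload test.txt, and the service returns the detected sensitive fields, one per line, for example:
身份证号: 123456789012345678
信用卡号: 1234-5678-9876-5432
电话号码: 138-1234-5678
The labels are the Chinese strings for ID card number, credit‑card number and phone number. This demonstrates how Apache Tika can be combined with regular expressions to extract file content automatically and flag sensitive data, which gives backend systems a practical starting point for data‑leakage prevention.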
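One caveat for a leakage-prevention service is that echoing the full matched values back in the response repeats the sensitive data. The following sketch shows one way matches could be masked before they are reported or logged. It is an illustration only: the class, its methods and the choice to keep the last four characters visible are hypothetical and not part of the demo above. It uses the StringBuilder overloads of Matcher, which require Java 9 or later.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class MatchMasker {

    /** Keeps the last `visible` characters and replaces earlier letters and digits with '*'. */
    static String mask(String value, int visible) {
        int keepFrom = Math.max(0, value.length() - visible);
        StringBuilder out = new StringBuilder(value.length());
        for (int i = 0; i < value.length(); i++) {
            char c = value.charAt(i);
            // Separators such as '-' are left in place so the shape stays recognizable
            out.append(i < keepFrom && Character.isLetterOrDigit(c) ? '*' : c);
        }
        return out.toString();
    }

    /** Replaces every match of the pattern in the text with its masked form. */
    static String redact(String text, Pattern pattern, int visible) {
        Matcher m = pattern.matcher(text);
        StringBuilder out = new StringBuilder();
        while (m.find()) {
            m.appendReplacement(out, Matcher.quoteReplacement(mask(m.group(), visible)));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        Pattern card = Pattern.compile("\\d{4}-\\d{4}-\\d{4}-\\d{4}");
        System.out.println(redact("Card: 1234-5678-9876-5432.", card, 4));
        // prints "Card: ****-****-****-5432."
    }
}
```

Keeping a few trailing characters lets a reviewer locate the value in the source file without the report itself becoming a copy of the data.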