Integrating Apache Tika with Spring Boot for Sensitive Information Detection and Data Leak Prevention
This guide shows how to integrate Apache Tika into a Spring Boot application to extract text from uploaded files, detect sensitive data such as ID numbers, credit-card numbers, and phone numbers with regular expressions, and expose the check through a REST API as a building block for data-leak prevention. Code examples are included for each step.
Apache Tika is a powerful content analysis library that can extract text, metadata, and structured information from a wide variety of file formats. Its key features include multi‑format support, automatic MIME type detection, text and metadata extraction, OCR integration, language detection, multithreaded processing, and unified JSON/XML output.
1. Add Required Dependencies
Include the following Maven dependencies (or the equivalent Gradle ones) in your project:
<dependencies>
    <!-- Spring Boot Web -->
    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-web</artifactId>
    </dependency>
    <!-- Apache Tika core (facade, detection) -->
    <dependency>
        <groupId>org.apache.tika</groupId>
        <artifactId>tika-core</artifactId>
        <version>2.6.0</version>
    </dependency>
    <!-- Apache Tika parsers: in Tika 2.x the parser bundle is published as
         tika-parsers-standard-package; tika-parsers is only a parent POM -->
    <dependency>
        <groupId>org.apache.tika</groupId>
        <artifactId>tika-parsers-standard-package</artifactId>
        <version>2.6.0</version>
    </dependency>
</dependencies>
2. Create Sensitive Information Detection Service
The service uses Tika to parse the uploaded file into a plain‑text string and then applies regular‑expression patterns to locate ID numbers, credit‑card numbers, and phone numbers.
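package com.example.tikademo.service;

import org.apache.tika.Tika;
import org.apache.tika.exception.TikaException;
import org.springframework.stereotype.Service;

import java.io.IOException;
import java.io.InputStream;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

@Service
public class SensitiveInfoService {

    private final Tika tika = new Tika();

    // The lookarounds stop a pattern from matching inside a longer number,
    // e.g. the first 16 digits of an 18-digit ID number.
    // 18-digit ID number (last character may be X) or the older 15-digit form
    private static final Pattern ID_CARD =
            Pattern.compile("(?<!\\d)(\\d{17}[\\dXx]|\\d{15})(?!\\d)");
    // 16 digits, optionally hyphenated in groups of four
    private static final Pattern CREDIT_CARD =
            Pattern.compile("(?<![\\d-])\\d{4}-?\\d{4}-?\\d{4}-?\\d{4}(?![\\d-])");
    // 11-digit number (3-4-4) or 10-digit number (3-3-4), hyphens optional
    private static final Pattern PHONE =
            Pattern.compile("(?<![\\d-])(\\d{3}-?\\d{4}-?\\d{4}|\\d{3}-?\\d{3}-?\\d{4})(?![\\d-])");

    public String checkSensitiveInfo(InputStream fileInputStream) throws IOException {
        String fileContent;
        try {
            // Tika detects the file type and extracts its text
            fileContent = tika.parseToString(fileInputStream);
        } catch (TikaException e) {
            // parseToString also throws the checked TikaException;
            // wrap it so callers only need to handle IOException
            throw new IOException("Tika could not parse the file", e);
        }
        StringBuilder result = new StringBuilder();
        detectAndAppend(fileContent, ID_CARD, "身份证号", result);     // label: "ID number"
        detectAndAppend(fileContent, CREDIT_CARD, "信用卡号", result); // label: "credit-card number"
        detectAndAppend(fileContent, PHONE, "电话号码", result);       // label: "phone number"
        // Fallback message: "No sensitive information detected"
        return result.length() > 0 ? result.toString() : "未检测到敏感信息";
    }

    // Appends one "label: match" line for every match of the pattern
    private void detectAndAppend(String content, Pattern pattern, String label, StringBuilder result) {
        Matcher matcher = pattern.matcher(content);
        while (matcher.find()) {
            result.append(label).append(": ").append(matcher.group()).append("\n");
        }
    }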
}
3. Create File Upload Controller
A simple REST controller receives multipart file uploads, forwards the input stream to the service, and returns the detection result.
package com.example.tikademo.controller;

import com.example.tikademo.service.SensitiveInfoService;
import org.springframework.web.bind.annotation.*;
import org.springframework.web.multipart.MultipartFile;

import java.io.IOException;

@RestController
@RequestMapping("/api/files")
public class FileController {

    private final SensitiveInfoService sensitiveInfoService;

    // Constructor injection: Spring supplies the service bean
    public FileController(SensitiveInfoService sensitiveInfoService) {
        this.sensitiveInfoService = sensitiveInfoService;
    }

    // Accepts a multipart upload in the "file" field and returns the detection report
    @PostMapping("/upload")
    public String uploadFile(@RequestParam("file") MultipartFile file) {
        try {
            return sensitiveInfoService.checkSensitiveInfo(file.getInputStream());
        } catch (IOException e) {
            // Error prefix: "File processing error"
            return "文件处理错误: " + e.getMessage();
        }
    }
}
4. Optional Front‑End Page
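A minimal HTML page (saved, for example, as index.html under src/main/resources/static/) lets users select a file and submit it to the endpoint above. Spring Boot serves static/index.html as the welcome page at the application root.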
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>Upload File for Sensitive Information Detection</title>
</head>
<body>
<h2>Upload a File for Sensitive Information Detection</h2>
<form action="/api/files/upload" method="post" enctype="multipart/form-data">
<input type="file" name="file" required>
<button type="submit">Upload</button>
</form>
</body>
</html>
5. Test the Application
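Start the Spring Boot application and open http://localhost:8080. Upload a test document (e.g., test.txt) that contains an ID number, a credit‑card number, and a phone number. The service parses the file, runs the regex checks, and returns a plain-text response such as the following, where the labels are the Chinese strings for "ID number", "credit-card number", and "phone number" used in the service: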
身份证号: 123456789012345678
信用卡号: 1234-5678-9876-5432
电话号码: 138-1234-5678
6. Extending the Solution
Additional regex patterns can be added to detect emails, addresses, social‑security numbers, and similar data. Detected values can then be encrypted, masked, or logged for audit purposes as part of a broader data‑leak‑prevention strategy.
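As a rough illustration of both ideas, the sketch below adds an e-mail pattern and masks whatever a pattern matches. It uses only the Java standard library. The class SensitiveDataMasker, its method names, and the deliberately simple e-mail regex are assumptions made for this example, not part of the service shown earlier or of Tika.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: finds matches of a pattern and masks them in place. */
class SensitiveDataMasker {

    // Simplified e-mail pattern (an assumption; it does not cover every valid address)
    static final Pattern EMAIL =
            Pattern.compile("[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}");

    /** Returns every substring of content that matches pattern, in order of appearance. */
    static List<String> findAll(String content, Pattern pattern) {
        List<String> hits = new ArrayList<>();
        Matcher m = pattern.matcher(content);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    /** Replaces all but the last `visible` characters of value with '*'. */
    static String maskValue(String value, int visible) {
        int keep = Math.max(0, Math.min(visible, value.length()));
        int hidden = value.length() - keep;
        char[] stars = new char[hidden];
        Arrays.fill(stars, '*');
        return new String(stars) + value.substring(hidden);
    }

    /** Returns content with every match of pattern replaced by its masked form. */
    static String maskAll(String content, Pattern pattern, int visible) {
        Matcher m = pattern.matcher(content);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            // quoteReplacement keeps '$' and '\' in the masked text literal
            m.appendReplacement(out, Matcher.quoteReplacement(maskValue(m.group(), visible)));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String text = "Contact alice@example.com about card 1234-5678-9876-5432.";
        System.out.println(findAll(text, EMAIL));   // prints [alice@example.com]
        System.out.println(maskAll(text, EMAIL, 0));
        System.out.println(maskValue("1234-5678-9876-5432", 4));
    }
}
```

In a setup like the one above, maskAll could be applied to the extracted text before it is stored or logged, with findAll feeding an audit record. Whether to keep any trailing characters visible depends on the data type, so the visible count is left as a parameter.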
7. Why Apache Tika?
Tika’s ability to handle dozens of file formats—including Office documents, PDFs, images, audio/video, archives, and emails—makes it ideal for extracting unstructured content in security‑oriented workflows such as document management, legal review, big‑data ingestion, and digital‑asset management.
Architecture Digest
Focusing on Java backend development, covering application architecture from top-tier internet companies (high availability, high performance, high stability), big data, machine learning, Java architecture, and other popular fields.