Integrating Apache Tika with Spring Boot for Sensitive Information Detection and Data Leakage Prevention
This article shows how to integrate Apache Tika into a Spring Boot application to extract file content automatically, detect sensitive data such as ID numbers, credit card numbers, and phone numbers with regular expressions, and guard against data leakage through a RESTful file upload endpoint and an optional front‑end page.
Tika Main Features
Apache Tika is a powerful content analysis library that can extract text, metadata, and other structured information from a wide variety of file formats.
1. Multi‑format Support
Office documents (Word, Excel, PowerPoint, OpenOffice)
HTML / XML
Plain text files
Images and audio/video (JPEG, PNG, MP3, MP4, WAV, etc.)
Email (EML)
Compressed archives (ZIP, TAR, GZ)
2. Automatic File Type Detection
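Tika identifies a file’s MIME type from its content (for example, the leading "magic" bytes) rather than relying on its extension alone, so a file with a misleading or missing extension is much less likely to be misclassified.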
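The idea behind content-based detection can be illustrated with a handful of well-known file signatures. The sketch below uses only the JDK and is not Tika's detector: the class name MagicSniffer and its tiny signature table are assumptions made for illustration, and Tika's own detection covers far more formats. In Tika itself, the detect methods of the Tika facade play this role.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical, simplified illustration of magic-byte detection (not Tika's code).
final class MagicSniffer {

    // Guesses a MIME type from the leading bytes only; the file name is never consulted.
    static String sniff(byte[] head) {
        if (startsWith(head, new byte[] {'%', 'P', 'D', 'F', '-'})) {
            return "application/pdf";
        }
        if (startsWith(head, new byte[] {'P', 'K', 0x03, 0x04})) {
            return "application/zip"; // also the container format of DOCX/XLSX/PPTX
        }
        if (startsWith(head, new byte[] {(byte) 0x89, 'P', 'N', 'G'})) {
            return "image/png";
        }
        if (startsWith(head, new byte[] {(byte) 0xFF, (byte) 0xD8, (byte) 0xFF})) {
            return "image/jpeg";
        }
        return "application/octet-stream"; // unknown content
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        return Arrays.equals(Arrays.copyOf(data, prefix.length), prefix);
    }

    public static void main(String[] args) {
        // A PDF header is recognised even if the file was saved as "notes.txt".
        byte[] renamedPdf = "%PDF-1.7\n".getBytes(StandardCharsets.US_ASCII);
        System.out.println(sniff(renamedPdf)); // prints application/pdf
    }
}
```

Because the decision depends only on the bytes, renaming a file does not change the result, which is the property that matters when scanning uploads for leaks.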
3. Text and Metadata Extraction
It extracts both the textual content and metadata such as author, creation date, modification date, file size, and copyright information.
4. OCR Support
Through integration with Tesseract OCR, Tika can extract text from scanned images or PDFs. Tesseract itself is a separate program and must be installed on the host for this to work.
5. Language Detection
Tika can automatically detect the language of the extracted text, which is useful for multilingual processing.
6. Embedded Application Support
Available as a Java library, Tika can be used as a standalone tool (Tika App) or embedded in other Java applications via its API.
Tika App – command‑line tool for extracting content.
Tika Server – RESTful service for remote file parsing.
7. Multi‑threaded Processing
Tika supports parallel processing to improve performance when handling large batches of files.
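One way an application might process a batch in parallel is to hand each file to a fixed thread pool. The sketch below is JDK-only and makes several assumptions: the class BatchExtractor is hypothetical, and the extractor function is a stand-in for whatever call actually performs the extraction (for example, a Tika parse). It shows the fan-out pattern, not a Tika API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Hypothetical helper: runs an extraction function over many inputs in parallel.
final class BatchExtractor {

    static <T> List<String> extractAll(List<T> inputs, Function<T, String> extractor, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Callable<String>> tasks = new ArrayList<>();
            for (T input : inputs) {
                tasks.add(() -> extractor.apply(input));
            }
            List<String> results = new ArrayList<>();
            // invokeAll waits for all tasks and returns futures in input order.
            for (Future<String> future : pool.invokeAll(tasks)) {
                results.add(future.get());
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Batch extraction interrupted", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("Extraction task failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList("a.pdf", "b.docx", "c.txt");
        // The lambda stands in for a real extraction call.
        System.out.println(extractAll(names, n -> "text of " + n, 2));
    }
}
```

Results come back in the same order as the inputs, so each detection report can be matched to its file without extra bookkeeping.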
8. Unified Output Formats
Extraction results can be returned as plain text, XHTML (XML), or JSON, providing a consistent structure for downstream processing.
9. Large File Handling
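Tika hands extracted content to the caller as a stream of SAX events rather than as one large string, which helps with large or multi‑page documents, although memory use still depends on the underlying parser. Note that the Tika facade's parseToString method returns at most 100,000 characters by default; the limit can be raised with setMaxStringLength.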
10. Integration with Other Tools
Lucene / Solr / Elasticsearch for full‑text indexing.
Apache POI for Office formats.
PDFBox for PDF parsing.
Tesseract OCR for image text extraction.
11. Extensibility
Users can customize parsers, add new format support, and adjust extraction strategies via configuration files (e.g., tika-config.xml).
Tika Architecture Components
1. Tika Core
Provides basic parsing, MIME type detection, and content extraction.
2. Tika Parsers
A collection of parsers for text, media, documents, and metadata, built on libraries such as POI, PDFBox, and Tesseract.
3. Tika Config
Manages configuration (e.g., tika-config.xml) to customize parser selection and behavior.
4. Tika App
CLI tool for extracting text and metadata from files.
5. Tika Server
RESTful API service that accepts file uploads and returns extracted content.
6. Tika Language Detection
Detects the language of extracted text for multilingual workflows.
7. Tika Extractor
Unified interface that abstracts different parsers, simplifying content extraction.
8. Tika Metadata
Standardized metadata extraction and representation.
9. Tika OCR
Integrates OCR to extract text from images and scanned documents.
Application Scenarios
Enterprise document management – automatic extraction and indexing of PDFs, Word, Excel, etc.
Content Management Systems – ingesting uploaded files and converting them to searchable text.
Big data analytics – converting unstructured files into structured data for ML pipelines.
Legal and compliance – scanning contracts and emails for key clauses and personal data.
Digital Asset Management – extracting metadata from images, videos, and audio files.
Information security – detecting sensitive data (ID numbers, credit cards, phone numbers) to prevent data leakage.
Email classification – extracting and categorizing email content and attachments.
Implementing Sensitive Information Detection with Spring Boot
1. Add Dependencies
<dependencies>
    <!-- Spring Boot Web -->
    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-web</artifactId>
    </dependency>
    <!-- Apache Tika -->
    <dependency>
        <groupId>org.apache.tika</groupId>
        <artifactId>tika-core</artifactId>
        <version>2.6.0</version>
    </dependency>
    <!-- In Tika 2.x the bundled parsers ship as tika-parsers-standard-package -->
    <dependency>
        <groupId>org.apache.tika</groupId>
        <artifactId>tika-parsers-standard-package</artifactId>
        <version>2.6.0</version>
    </dependency>
</dependencies>
2. SensitiveInfoService (Regex‑based detection)
package com.example.tikademo.service;

import org.apache.tika.Tika;
import org.apache.tika.exception.TikaException;
import org.springframework.stereotype.Service;

import java.io.IOException;
import java.io.InputStream;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

@Service
public class SensitiveInfoService {

    // The Tika facade detects the file type and selects a parser automatically.
    private final Tika tika = new Tika();

    // Compiled once. The lookarounds keep a pattern from matching inside a longer
    // digit run, e.g. reporting part of an 18-digit ID as a card or phone number.
    private static final Pattern ID_CARD_PATTERN =
            Pattern.compile("(?<!\\d)(\\d{17}[\\dXx]|\\d{15})(?!\\d)");
    private static final Pattern CREDIT_CARD_PATTERN =
            Pattern.compile("(?<!\\d)\\d{4}-?\\d{4}-?\\d{4}-?\\d{4}(?!\\d)");
    private static final Pattern PHONE_PATTERN =
            Pattern.compile("(?<!\\d)(\\d{3}-?\\d{3}-?\\d{4}|\\d{11})(?!\\d)");

    public String checkSensitiveInfo(InputStream fileInputStream) throws IOException {
        String fileContent;
        try {
            // parseToString returns at most 100,000 characters by default;
            // raise the cap with tika.setMaxStringLength(...) for larger documents.
            fileContent = tika.parseToString(fileInputStream);
        } catch (TikaException e) {
            // TikaException is checked; wrap it so callers only handle IOException.
            throw new IOException("Tika could not parse the file", e);
        }
        StringBuilder sensitiveInfoDetected = new StringBuilder();
        detectAndAppend(fileContent, ID_CARD_PATTERN, "ID number", sensitiveInfoDetected);
        detectAndAppend(fileContent, CREDIT_CARD_PATTERN, "Credit card number", sensitiveInfoDetected);
        detectAndAppend(fileContent, PHONE_PATTERN, "Phone number", sensitiveInfoDetected);
        return sensitiveInfoDetected.length() > 0
                ? sensitiveInfoDetected.toString()
                : "No sensitive information detected";
    }

    // Appends one "label: match" line for every occurrence of the pattern.
    private void detectAndAppend(String content, Pattern pattern, String label, StringBuilder result) {
        Matcher matcher = pattern.matcher(content);
        while (matcher.find()) {
            result.append(label).append(": ").append(matcher.group()).append("\n");
        }
    }
}
3. File Upload Controller
package com.example.tikademo.controller;

import com.example.tikademo.service.SensitiveInfoService;
import org.springframework.web.bind.annotation.*;
import org.springframework.web.multipart.MultipartFile;

import java.io.IOException;
import java.io.InputStream;

@RestController
@RequestMapping("/api/files")
public class FileController {

    private final SensitiveInfoService sensitiveInfoService;

    // Constructor injection: Spring supplies the service when it creates the controller.
    public FileController(SensitiveInfoService sensitiveInfoService) {
        this.sensitiveInfoService = sensitiveInfoService;
    }

    @PostMapping("/upload")
    public String uploadFile(@RequestParam("file") MultipartFile file) {
        // try-with-resources guarantees the upload stream is closed.
        try (InputStream in = file.getInputStream()) {
            return sensitiveInfoService.checkSensitiveInfo(in);
        } catch (IOException e) {
            return "File processing error: " + e.getMessage();
        }
    }
}
4. Optional Front‑end Page (index.html)
<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>Upload File for Sensitive Information Detection</title>
</head>
<body>
    <h2>Upload a File for Sensitive Information Detection</h2>
    <form action="/api/files/upload" method="post" enctype="multipart/form-data">
        <input type="file" name="file" required>
        <button type="submit">Upload</button>
    </form>
</body>
</html>
5. Testing the Project
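Run the Spring Boot application and open http://localhost:8080 (this assumes index.html is placed under src/main/resources/static, from which Spring Boot serves static content by default). Upload a document (e.g., test.txt) containing ID numbers, credit‑card numbers, and phone numbers, and check the detection results returned by the API.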
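The regular expressions can also be sanity-checked without starting the server or involving Tika at all. The sketch below is a hypothetical stand-alone class (RegexSanityCheck is not part of the project) whose patterns mirror the service's regexes with digit-boundary lookarounds, so that a long ID number is not also reported as a card or phone number. The sample values are made up for the test.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for trying the detection patterns on plain strings.
final class RegexSanityCheck {

    static final Pattern ID_CARD =
            Pattern.compile("(?<!\\d)(\\d{17}[\\dXx]|\\d{15})(?!\\d)");
    static final Pattern CREDIT_CARD =
            Pattern.compile("(?<!\\d)\\d{4}-?\\d{4}-?\\d{4}-?\\d{4}(?!\\d)");
    static final Pattern PHONE =
            Pattern.compile("(?<!\\d)(\\d{3}-?\\d{3}-?\\d{4}|\\d{11})(?!\\d)");

    // Collects every match of the pattern in the text, in order of appearance.
    static List<String> findAll(Pattern pattern, String text) {
        List<String> hits = new ArrayList<>();
        Matcher matcher = pattern.matcher(text);
        while (matcher.find()) {
            hits.add(matcher.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        // Made-up sample values, similar to what test.txt might contain.
        String sample = "ID 11010519491231002X, card 4111-1111-1111-1111, phone 13800138000";
        System.out.println("ID:    " + findAll(ID_CARD, sample));     // [11010519491231002X]
        System.out.println("Card:  " + findAll(CREDIT_CARD, sample)); // [4111-1111-1111-1111]
        System.out.println("Phone: " + findAll(PHONE, sample));       // [13800138000]
    }
}
```

Each value is reported by exactly one pattern here. Without the lookarounds, the first sixteen digits of the ID would also match the credit card pattern.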
6. Extending the Solution
Add more regex patterns for emails, addresses, social‑security numbers, etc.
Encrypt or mask detected sensitive data before storage.
Log detection events or send alerts to administrators for audit purposes.
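The first two extensions can be sketched with the JDK alone. Everything below is an assumption made for illustration: the class DetectionExtensions, the deliberately simplified email pattern (real address syntax is broader), and the masking rule that keeps only the last few characters visible.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helpers: an extra pattern plus masking of detected values.
final class DetectionExtensions {

    // Simplified email pattern; it will miss some valid addresses.
    static final Pattern EMAIL =
            Pattern.compile("[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}");
    static final Pattern CREDIT_CARD =
            Pattern.compile("(?<!\\d)\\d{4}-?\\d{4}-?\\d{4}-?\\d{4}(?!\\d)");

    // Replaces letters and digits with '*', except in the last `visible` characters.
    // Separators such as '-' are left in place so the shape stays readable.
    static String mask(String value, int visible) {
        StringBuilder out = new StringBuilder(value);
        int cut = Math.max(0, value.length() - visible);
        for (int i = 0; i < cut; i++) {
            if (Character.isLetterOrDigit(out.charAt(i))) {
                out.setCharAt(i, '*');
            }
        }
        return out.toString();
    }

    // Returns the text with every match of the pattern masked.
    static String redact(String text, Pattern pattern, int visible) {
        Matcher matcher = pattern.matcher(text);
        StringBuffer out = new StringBuffer();
        while (matcher.find()) {
            String masked = mask(matcher.group(), visible);
            matcher.appendReplacement(out, Matcher.quoteReplacement(masked));
        }
        matcher.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String text = "card 4111-1111-1111-1111, contact alice@example.com";
        System.out.println(redact(text, CREDIT_CARD, 4));
        // prints: card ****-****-****-1111, contact alice@example.com
        System.out.println(EMAIL.matcher(text).find()); // prints: true
    }
}
```

Masking before storage or logging means an audit trail can record that a card number was found without itself becoming a copy of the sensitive value.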
Conclusion
By embedding Apache Tika into a Spring Boot project, developers can automate content extraction from diverse file types and apply regex‑based rules to identify sensitive information. The result is a practical data‑leakage‑prevention mechanism that can be exposed through a simple REST API and, optionally, a lightweight front‑end page.
Java Architect Essentials