How to Use Apache Tika in Spring Boot for Sensitive Data Detection and DLP
This article explains Apache Tika's core features, architecture, and common use cases, then provides a step-by-step Spring Boot tutorial that integrates Tika to extract file content, detect personal identifiers with regular expressions, and return the results via a REST API for data-loss prevention (DLP).
Tika Core Features
Apache Tika is a powerful content‑analysis library that extracts text, metadata, and other structured information from many file formats.
Supported Formats
Office documents (Word, Excel, PowerPoint, OpenOffice)
HTML / XML
Plain text files
Images and audio/video (JPEG, PNG, MP3, MP4, WAV, etc.)
Email (EML)
Compressed archives (ZIP, TAR, GZ)
Automatic MIME‑Type Detection
Tika determines a file's true type from its content rather than its extension, providing high‑accuracy format recognition.
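The idea of content-based detection can be illustrated with a minimal, stdlib-only sketch. This is a toy stand-in, not Tika's implementation: Tika's real detector consults a large database of magic bytes, glob patterns, and XML root elements.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Toy content sniffer: identify a file by its leading "magic" bytes
// rather than its extension. Checks just two signatures for illustration.
public class MagicSniffer {
    public static String sniff(byte[] head) {
        // PDF files begin with the ASCII bytes "%PDF-"
        if (head.length >= 5 && new String(Arrays.copyOf(head, 5),
                StandardCharsets.US_ASCII).equals("%PDF-")) {
            return "application/pdf";
        }
        // ZIP archives (including OOXML containers like .docx) begin with "PK"
        if (head.length >= 2 && head[0] == 'P' && head[1] == 'K') {
            return "application/zip";
        }
        return "application/octet-stream";
    }

    public static void main(String[] args) {
        // A file named report.txt that actually contains PDF bytes
        // is still identified as a PDF.
        System.out.println(sniff("%PDF-1.7".getBytes(StandardCharsets.US_ASCII)));
    }
}
```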
Text and Metadata Extraction
Text extraction works for any supported format.
Metadata extraction includes author, creation date, modification date, size, copyright and other properties.
OCR Support
Through the integrated Tesseract OCR engine, Tika can extract text from scanned images or PDFs that contain pictures.
Language Detection
Tika can automatically identify the language of extracted text (e.g., English, Chinese, French), which is useful for multilingual processing.
Usage Modes
Tika App – a command-line tool for extracting content and metadata.
Tika Server – a RESTful API service for remote file parsing.
Multithreading
Tika supports parallel processing, allowing batch file parsing to be accelerated with multiple threads.
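A batch-parsing fan-out can be sketched with a plain ExecutorService. The `extract` method below is a stand-in placeholder; in a real project it would call `tika.parseToString(...)` on each file.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.stream.Collectors;

// Sketch: parse a batch of files on a fixed-size thread pool.
public class BatchParser {
    // Placeholder for real extraction, e.g. tika.parseToString(new File(path))
    static String extract(String path) {
        return "text-of-" + path;
    }

    public static List<String> parseAll(List<String> paths) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        try {
            // Submit one extraction task per file
            List<Future<String>> futures = paths.stream()
                    .map(p -> pool.submit(() -> extract(p)))
                    .collect(Collectors.toList());
            // Collect results in submission order
            List<String> out = new ArrayList<>();
            for (Future<String> f : futures) {
                out.add(f.get());
            }
            return out;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(parseAll(List.of("a.docx", "b.pdf")));
    }
}
```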
Unified Output Formats
JSON output for easy integration.
XML output for more structured data needs.
Large‑File Handling
Using streaming, SAX-style content handlers, Tika can process large or multi-page documents without holding the entire extracted text in memory at once.
Integration with Other Tools
Lucene / Solr / Elasticsearch for full‑text indexing.
Apache POI for Office formats.
PDFBox for PDF parsing.
Tesseract OCR for image text extraction.
Tika Architecture
Tika Core
Parser – parses various file formats and returns text and metadata.
Content Extraction – extracts text, images, audio, video, etc.
MIME‑Type Detection – determines the real file type from its content.
Tika Parsers
Text Parsers – handle .txt, .xml, .html and similar.
Media Parsers – handle images, audio, video.
Document Parsers – handle Word, Excel, PowerPoint, PDF and other office documents.
Metadata Parsers – extract file attributes such as author, creation date, size.
Tika Config
Configuration file (tika‑config.xml) lets users customize parser selection, extraction strategies, character sets, etc.
Custom parsers can be added and referenced in the config.
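As a rough sketch of what such a configuration can look like (the exact parser class names depend on which Tika parser modules are on the classpath), a tika-config.xml might keep the stock DefaultParser while excluding one parser:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <parsers>
    <!-- Keep the default parser for all supported formats... -->
    <parser class="org.apache.tika.parser.DefaultParser">
      <!-- ...but exclude OCR, for example, to avoid the Tesseract dependency -->
      <parser-exclude class="org.apache.tika.parser.ocr.TesseractOCRParser"/>
    </parser>
  </parsers>
</properties>
```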
Tika App
A CLI tool that can be run directly to extract text and metadata from files or be embedded in Java applications.
Tika Server
A RESTful service that accepts file uploads via HTTP, parses them remotely, and returns the extracted information.
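Assuming a Tika Server instance started locally on its default port 9998, its text and metadata endpoints can be exercised with curl (report.pdf is a hypothetical sample file):

```shell
# Extract plain text from a document
curl -T report.pdf http://localhost:9998/tika

# Extract metadata as JSON
curl -T report.pdf -H "Accept: application/json" http://localhost:9998/meta
```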
Additional Components
Tika Language Detection – automatic language identification.
Tika Extractor – a unified interface that abstracts different parsers, allowing custom extensions.
Tika Metadata – provides a standardized metadata structure.
Tika OCR – integrates Tesseract to recognize text in images.
Typical Application Scenarios
Enterprise document‑management systems – automatic extraction and indexing of contracts, reports, etc.
Content Management Systems – extract and convert uploaded files for editing and search.
Big‑Data platforms – transform unstructured files into structured data for cleaning, classification, clustering or text mining.
Legal and compliance review – pull key clauses, dates, amounts from contracts and emails.
Digital Asset Management – extract metadata from images, videos, and audio for cataloguing.
Information‑security and DLP – scan files for personal identifiers such as ID numbers, credit‑card numbers, or phone numbers.
Automated email classification – extract content and attachments for routing or archiving.
Step‑by‑Step Guide: Sensitive Information Detection with Spring Boot
1. Add Dependencies
<dependencies>
<!-- Spring Boot Web -->
<dependency>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-web</artifactId>
</dependency>
<!-- Apache Tika -->
<dependency>
<groupId>org.apache.tika</groupId>
<artifactId>tika-core</artifactId>
<version>2.6.0</version>
</dependency>
<dependency>
<groupId>org.apache.tika</groupId>
<!-- In Tika 2.x the parsers live in the "standard package" artifact;
     the bare tika-parsers artifact is a POM aggregator only -->
<artifactId>tika-parsers-standard-package</artifactId>
<version>2.6.0</version>
</dependency>
</dependencies>

2. Implement SensitiveInfoService
package com.example.tikademo.service;
import org.apache.tika.Tika;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.sax.BodyContentHandler;
import org.springframework.stereotype.Service;
import java.io.IOException;
import java.io.InputStream;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
@Service
public class SensitiveInfoService {
private final Tika tika = new Tika();
// Chinese resident ID: 15 digits, or 17 digits plus a digit/X check character
private static final String ID_CARD_REGEX = "(\\d{17}[\\dXx]|\\d{15})";
// 16-digit card number, optionally dash-separated in groups of four
private static final String CREDIT_CARD_REGEX = "(\\d{4}-?\\d{4}-?\\d{4}-?\\d{4})";
// 3-3-4 formatted numbers, 11-digit mobile numbers, or 3+7-digit numbers
private static final String PHONE_REGEX = "(\\d{3}-?\\d{3}-?\\d{4})|((\\d{11})|(\\d{3})\\d{7})";
public String checkSensitiveInfo(InputStream fileInputStream) throws IOException {
// 1. Extract file content with Tika. Note: parseToString truncates output
// at 100,000 characters by default; for large documents, raise the limit
// with tika.setMaxStringLength(...)
String fileContent = tika.parseToString(fileInputStream);
StringBuilder result = new StringBuilder();
// Detect ID card numbers
detectAndAppend(fileContent, ID_CARD_REGEX, "ID card number", result);
// Detect credit-card numbers
detectAndAppend(fileContent, CREDIT_CARD_REGEX, "Credit card number", result);
// Detect phone numbers
detectAndAppend(fileContent, PHONE_REGEX, "Phone number", result);
return result.length() > 0 ? result.toString() : "No sensitive information detected";
}
private void detectAndAppend(String content, String regex, String label, StringBuilder sb) {
Pattern pattern = Pattern.compile(regex);
Matcher matcher = pattern.matcher(content);
while (matcher.find()) {
sb.append(label).append(": ").append(matcher.group()).append("\n");
}
}
}

3. Create FileController
package com.example.tikademo.controller;
import com.example.tikademo.service.SensitiveInfoService;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.web.bind.annotation.*;
import org.springframework.web.multipart.MultipartFile;
import java.io.IOException;
@RestController
@RequestMapping("/api/files")
public class FileController {
@Autowired
private SensitiveInfoService sensitiveInfoService;
@PostMapping("/upload")
public String uploadFile(@RequestParam("file") MultipartFile file) {
try {
return sensitiveInfoService.checkSensitiveInfo(file.getInputStream());
} catch (IOException e) {
return "File processing error: " + e.getMessage();
}
}
}

4. Optional Front‑end Page
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>Upload File for Sensitive Information Detection</title>
</head>
<body>
<h2>Upload a File for Sensitive Information Detection</h2>
<form action="/api/files/upload" method="post" enctype="multipart/form-data">
<input type="file" name="file" required>
<button type="submit">Upload</button>
</form>
</body>
</html>

5. Test the Project
Run the Spring Boot application, open http://localhost:8080, upload a test file (e.g., a .txt containing an ID number, credit‑card number, and phone number), and the API returns the detected sensitive values.
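The endpoint can also be exercised from the command line (sample.txt is a hypothetical test file):

```shell
# Assumes the Spring Boot app from this guide is running on port 8080
curl -F "file=@sample.txt" http://localhost:8080/api/files/upload
```

Note that Spring Boot caps multipart uploads at 1 MB by default; for larger files, raise spring.servlet.multipart.max-file-size and spring.servlet.multipart.max-request-size in application.properties.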
6. Extend Functionality
Add more regular‑expression patterns for emails, addresses, social‑security numbers, etc.
Encrypt or mask detected data before storage.
Log detections or send notifications to administrators for audit and DLP enforcement.
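The "encrypt or mask" extension can reuse the same regex machinery. A minimal stdlib sketch (Masker is a hypothetical helper, mirroring the credit-card pattern used in SensitiveInfoService):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch: mask credit-card-like numbers before storage,
// keeping only the last four digits.
public class Masker {
    private static final Pattern CARD =
            Pattern.compile("\\d{4}-?\\d{4}-?\\d{4}-?(\\d{4})");

    public static String maskCards(String text) {
        Matcher m = CARD.matcher(text);
        StringBuilder sb = new StringBuilder();
        while (m.find()) {
            // Replace each match, preserving only the final group of digits
            m.appendReplacement(sb, "****-****-****-" + m.group(1));
        }
        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(maskCards("card: 4111-1111-1111-1234"));
        // card: ****-****-****-1234
    }
}
```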
Conclusion
Integrating Apache Tika into a Spring Boot project enables automatic content extraction and regex‑based sensitive‑information detection, providing a lightweight yet powerful data‑loss‑prevention solution for enterprise applications.
Su San Talks Tech
Su San, former staff at several leading tech companies, is a top creator on Juejin and a premium creator on CSDN, and runs the free coding practice site www.susan.net.cn.