
Integrating Apache Tika with Spring Boot for Sensitive Information Detection and Data Leakage Prevention

This article demonstrates how to integrate Apache Tika into a Spring Boot application to automatically extract file content, detect sensitive data such as ID numbers, credit cards, and phone numbers using regex, and implement data leakage protection through RESTful file upload endpoints and optional front‑end UI.


Tika Main Features

Apache Tika is a powerful content analysis library that can extract text, metadata, and other structured information from a wide variety of file formats.

1. Multi‑format Support

Office documents (Word, Excel, PowerPoint, OpenOffice)

PDF

HTML / XML

Plain text files

Images and audio/video (JPEG, PNG, MP3, MP4, WAV, etc.)

Email (EML)

Compressed archives (ZIP, TAR, GZ)

2. Automatic File Type Detection

Tika can identify a file’s MIME type based on its content rather than its extension, ensuring accurate format recognition.
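Conceptually, content-based detection compares the leading "magic bytes" of a stream against a signature database (Tika's is far larger and configurable). A stdlib-only sketch of the idea, using a hypothetical `detect` helper and three hard-coded signatures:

```java
import java.util.Map;

public class MagicBytesDemo {
    // A tiny subset of the magic-byte signatures Tika consults; Tika itself
    // ships a much larger, configurable database.
    private static final Map<String, byte[]> SIGNATURES = Map.of(
            "application/pdf", new byte[]{0x25, 0x50, 0x44, 0x46},   // "%PDF"
            "image/png", new byte[]{(byte) 0x89, 0x50, 0x4E, 0x47},  // \x89PNG
            "application/zip", new byte[]{0x50, 0x4B, 0x03, 0x04});  // "PK\x03\x04"

    static String detect(byte[] leadingBytes) {
        for (Map.Entry<String, byte[]> e : SIGNATURES.entrySet()) {
            byte[] sig = e.getValue();
            if (leadingBytes.length < sig.length) continue;
            boolean match = true;
            for (int i = 0; i < sig.length; i++) {
                if (leadingBytes[i] != sig[i]) { match = false; break; }
            }
            if (match) return e.getKey();
        }
        return "application/octet-stream";  // unknown: fall back to a generic type
    }

    public static void main(String[] args) {
        // A file named report.txt whose content starts with "%PDF" is a PDF.
        System.out.println(detect("%PDF-1.7".getBytes()));  // application/pdf
    }
}
```

Tika applies the same principle via `new Tika().detect(...)`, falling back to other heuristics when no signature matches.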

3. Text and Metadata Extraction

It extracts both the textual content and metadata such as author, creation date, modification date, file size, and copyright information.

4. OCR Support

Through integration with Tesseract OCR, Tika can extract text from scanned images or PDFs.

5. Language Detection

Tika can automatically detect the language of the extracted text, which is useful for multilingual processing.

6. Embedded Application Support

Tika is a Java library that can be embedded in other Java applications via its API, and it also ships as standalone tools:

Tika App – command‑line tool for extracting content.

Tika Server – RESTful service for remote file parsing.
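Tika Server listens on port 9998 by default and returns plain text for a `PUT /tika` upload. A minimal JDK `HttpClient` sketch that builds such a request (the localhost URL and file path are assumptions; sending is left commented out so the snippet runs without a server):

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Files;
import java.nio.file.Path;

public class TikaServerClient {
    // Builds a PUT /tika request that uploads the file body and asks for
    // plain text back; Tika Server detects the content type itself.
    static HttpRequest buildExtractRequest(Path file) throws Exception {
        return HttpRequest.newBuilder(URI.create("http://localhost:9998/tika"))
                .header("Accept", "text/plain")
                .PUT(HttpRequest.BodyPublishers.ofFile(file))
                .build();
    }

    public static void main(String[] args) throws Exception {
        Path sample = Files.createTempFile("sample", ".txt");
        HttpRequest request = buildExtractRequest(sample);
        System.out.println(request.method() + " " + request.uri());
        // With a Tika Server running locally, send it like this:
        // HttpResponse<String> resp = HttpClient.newHttpClient()
        //         .send(request, HttpResponse.BodyHandlers.ofString());
        // System.out.println(resp.body());
    }
}
```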

7. Multi‑threaded Processing

Tika supports parallel processing to improve performance when handling large batches of files.

8. Unified Output Formats

Extraction results can be returned in JSON or XML, providing a consistent structure for downstream processing.

9. Large File Handling

Tika streams content through SAX-style handlers, so large or multi-page documents can be processed without loading the entire file into memory. (Note that the default BodyContentHandler caps extracted text at 100,000 characters unless configured otherwise.)

10. Integration with Other Tools

Lucene / Solr / Elasticsearch for full‑text indexing.

Apache POI for Office formats.

PDFBox for PDF parsing.

Tesseract OCR for image text extraction.

11. Extensibility

Users can customize parsers, add new format support, and adjust extraction strategies via configuration files (e.g., tika-config.xml).

Tika Architecture Components

1. Tika Core

Provides basic parsing, MIME type detection, and content extraction.

2. Tika Parsers

A collection of parsers for text, media, documents, and metadata, built on libraries such as POI, PDFBox, and Tesseract.

3. Tika Config

Manages configuration (e.g., tika-config.xml) to customize parser selection and behavior.
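As a sketch of what such a file looks like, the following Tika 2.x configuration keeps the default parsers but excludes the Tesseract OCR parser, a common tweak when Tesseract is not installed (class names per the Tika 2.x parser modules):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <parsers>
    <!-- Keep the default parser chain... -->
    <parser class="org.apache.tika.parser.DefaultParser">
      <!-- ...but skip OCR, which requires a local Tesseract install -->
      <parser-exclude class="org.apache.tika.parser.ocr.TesseractOCRParser"/>
    </parser>
  </parsers>
</properties>
```

The file can be loaded in code, e.g. `new TikaConfig(new File("tika-config.xml"))`, or passed to Tika App/Server via the `--config` option.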

4. Tika App

CLI tool for extracting text and metadata from files.

5. Tika Server

RESTful API service that accepts file uploads and returns extracted content.

6. Tika Language Detection

Detects the language of extracted text for multilingual workflows.

7. Tika Extractor

Unified interface that abstracts different parsers, simplifying content extraction.

8. Tika Metadata

Standardized metadata extraction and representation.

9. Tika OCR

Integrates OCR to extract text from images and scanned documents.

Application Scenarios

Enterprise document management – automatic extraction and indexing of PDFs, Word, Excel, etc.

Content Management Systems – ingesting uploaded files and converting them to searchable text.

Big data analytics – converting unstructured files into structured data for ML pipelines.

Legal and compliance – scanning contracts and emails for key clauses and personal data.

Digital Asset Management – extracting metadata from images, videos, and audio files.

Information security – detecting sensitive data (ID numbers, credit cards, phone numbers) to prevent data leakage.

Email classification – extracting and categorizing email content and attachments.

Implementing Sensitive Information Detection with Spring Boot

1. Add Dependencies

<dependencies>
    <!-- Spring Boot Web -->
    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-web</artifactId>
    </dependency>
    
    <!-- Apache Tika -->
    <dependency>
        <groupId>org.apache.tika</groupId>
        <artifactId>tika-core</artifactId>
        <version>2.6.0</version>
    </dependency>
    <!-- In Tika 2.x the parser implementations ship as
         tika-parsers-standard-package (the old tika-parsers artifact
         is only an aggregator POM) -->
    <dependency>
        <groupId>org.apache.tika</groupId>
        <artifactId>tika-parsers-standard-package</artifactId>
        <version>2.6.0</version>
    </dependency>
</dependencies>

2. SensitiveInfoService (Regex‑based detection)

package com.example.tikademo.service;

import org.apache.tika.Tika;
import org.springframework.stereotype.Service;

import java.io.IOException;
import java.io.InputStream;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

@Service
public class SensitiveInfoService {
    private final Tika tika = new Tika();
    // 18-digit (last digit may be X) or legacy 15-digit Chinese resident ID numbers
    private static final String ID_CARD_REGEX = "(\\d{17}[\\dXx]|\\d{15})";
    // 16-digit card numbers, optionally dash-separated in groups of four
    private static final String CREDIT_CARD_REGEX = "(\\d{4}-?\\d{4}-?\\d{4}-?\\d{4})";
    // 3-3-4 numbers with optional dashes, 11-digit mobile numbers, or 3+7-digit numbers
    private static final String PHONE_REGEX = "(\\d{3}-?\\d{3}-?\\d{4})|((\\d{11})|(\\d{3})\\d{7})";

    public String checkSensitiveInfo(InputStream fileInputStream) throws IOException {
        String fileContent;
        try {
            // parseToString also throws TikaException; wrap it so callers
            // only need to handle IOException
            fileContent = tika.parseToString(fileInputStream);
        } catch (org.apache.tika.exception.TikaException e) {
            throw new IOException("Failed to parse file content", e);
        }
        StringBuilder sensitiveInfoDetected = new StringBuilder();
        detectAndAppend(fileContent, ID_CARD_REGEX, "ID number", sensitiveInfoDetected);
        detectAndAppend(fileContent, CREDIT_CARD_REGEX, "Credit card number", sensitiveInfoDetected);
        detectAndAppend(fileContent, PHONE_REGEX, "Phone number", sensitiveInfoDetected);
        return sensitiveInfoDetected.length() > 0 ? sensitiveInfoDetected.toString() : "No sensitive information detected";
    }

    private void detectAndAppend(String content, String regex, String label, StringBuilder result) {
        Pattern pattern = Pattern.compile(regex);
        Matcher matcher = pattern.matcher(content);
        while (matcher.find()) {
            result.append(label).append(": ").append(matcher.group()).append("\n");
        }
    }
}
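The detection step itself is plain java.util.regex and can be exercised without Tika at all; a self-contained demo of the credit-card and phone patterns from the service above:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexDetectionDemo {
    static final Pattern CREDIT_CARD = Pattern.compile("(\\d{4}-?\\d{4}-?\\d{4}-?\\d{4})");
    static final Pattern PHONE = Pattern.compile("(\\d{3}-?\\d{3}-?\\d{4})|((\\d{11})|(\\d{3})\\d{7})");

    // Collects every non-overlapping match of the pattern in the content.
    static List<String> findAll(Pattern pattern, String content) {
        List<String> hits = new ArrayList<>();
        Matcher m = pattern.matcher(content);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        String content = "Card: 4111-1111-1111-1111, phone: 555-123-4567";
        System.out.println("Credit cards: " + findAll(CREDIT_CARD, content));  // [4111-1111-1111-1111]
        System.out.println("Phones: " + findAll(PHONE, content));              // [555-123-4567]
    }
}
```

Exercising the patterns against in-memory strings like this is also a convenient way to unit-test new rules before wiring them into the service.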

3. File Upload Controller

package com.example.tikademo.controller;

import com.example.tikademo.service.SensitiveInfoService;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.web.bind.annotation.*;
import org.springframework.web.multipart.MultipartFile;
import java.io.IOException;

@RestController
@RequestMapping("/api/files")
public class FileController {
    @Autowired
    private SensitiveInfoService sensitiveInfoService;

    @PostMapping("/upload")
    public String uploadFile(@RequestParam("file") MultipartFile file) {
        try {
            String result = sensitiveInfoService.checkSensitiveInfo(file.getInputStream());
            return result;
        } catch (IOException e) {
            return "File processing error: " + e.getMessage();
        }
    }
}

4. Optional Front‑end Page (index.html)

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>Upload File for Sensitive Information Detection</title>
</head>
<body>
    <h2>Upload a File for Sensitive Information Detection</h2>
    <form action="/api/files/upload" method="post" enctype="multipart/form-data">
        <input type="file" name="file" required>
        <button type="submit">Upload</button>
    </form>
</body>
</html>

5. Testing the Project

Run the Spring Boot application, open http://localhost:8080 (place index.html under src/main/resources/static so Spring Boot serves it), upload a document (e.g., test.txt) containing ID numbers, credit-card numbers, and phone numbers, and observe the detection results returned by the API.

6. Extending the Solution

Add more regex patterns for emails, addresses, social‑security numbers, etc.

Encrypt or mask detected sensitive data before storage.

Log detection events or send alerts to administrators for audit purposes.
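The first two extension points can be sketched with stdlib regex alone: a deliberately loose, illustrative email pattern (not a vetted validator) and a masking helper that keeps only the last four digits of a detected card number:

```java
import java.util.regex.Pattern;

public class ExtensionDemo {
    // Loose email pattern for illustration only; production code should use
    // a vetted validator rather than a hand-rolled regex.
    private static final Pattern EMAIL =
            Pattern.compile("[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}");
    // Capture the trailing four digits so they can survive masking.
    private static final Pattern CREDIT_CARD =
            Pattern.compile("\\d{4}-?\\d{4}-?\\d{4}-?(\\d{4})");

    static boolean containsEmail(String content) {
        return EMAIL.matcher(content).find();
    }

    // Replace everything except the last four digits of each card number.
    static String maskCards(String content) {
        return CREDIT_CARD.matcher(content).replaceAll("****-****-****-$1");
    }

    public static void main(String[] args) {
        String text = "Contact alice@example.com, card 4111-1111-1111-1234";
        System.out.println(containsEmail(text));  // true
        System.out.println(maskCards(text));
        // Contact alice@example.com, card ****-****-****-1234
    }
}
```

Masking before persistence means raw card numbers never reach logs or storage, which is usually the easier compliance posture than encrypting and managing keys.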

Conclusion

By embedding Apache Tika into a Spring Boot project, developers can automate content extraction from diverse file types and apply regex‑based rules to identify sensitive information, providing an effective data‑leakage‑prevention mechanism that can be exposed via simple REST APIs and optionally a lightweight front‑end interface.

Tags: Java, Spring Boot, File Upload, Information Security, Apache Tika, Sensitive Data Detection
Written by Java Architect Essentials