
Integrating Apache Tika with Spring Boot for Sensitive Information Detection and Data Leakage Prevention

This article explains Apache Tika's core features, architecture, and multiple application scenarios, then provides a step‑by‑step guide to embed Tika in a Spring Boot project to extract file content, detect personal data such as ID numbers, credit cards and phone numbers using regular expressions, and protect against data leakage.

Selected Java Interview Questions

Apache Tika Main Features

Apache Tika is a content analysis library that extracts text, metadata, and structured information from a wide range of file formats, including office documents, PDFs, HTML/XML, plain text, images, audio/video, emails, and compressed archives. It delegates format-specific work to open‑source libraries such as Apache POI (Office formats), PDFBox (PDF) and Tesseract (OCR).

Key capabilities

Multi‑format support for documents, PDFs, HTML, text, media and archives.

Automatic MIME type detection based on file content.

Text and metadata extraction (author, creation date, size, etc.).

OCR for scanned images and PDFs through integration with Tesseract, which must be installed separately.

Language detection for multilingual content.

Embeddable Java API, command‑line tool (Tika App) and RESTful server (Tika Server).

Multithreaded processing for large batches.

Unified JSON or XML output.

Extensible architecture allowing custom parsers and configuration.
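
As a rough illustration of the detection and extraction capabilities listed above, the following sketch uses the `Tika` facade class to detect a file's MIME type and extract its text. It assumes tika-core and the Tika parser modules are on the classpath; the class name `TikaFacadeSketch` and the command-line argument are illustrative only.

```java
import org.apache.tika.Tika;
import org.apache.tika.exception.TikaException;

import java.io.IOException;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical demo class; pass the path of any local file as the first argument.
public class TikaFacadeSketch {
    public static void main(String[] args) throws IOException, TikaException {
        Path file = Paths.get(args[0]);
        Tika tika = new Tika();

        // Detection looks at the file's leading bytes and uses the file name as a hint.
        String mimeType = tika.detect(file);

        // Extraction picks a parser for the detected type and returns plain text.
        String text = tika.parseToString(file);

        System.out.println("Detected type: " + mimeType);
        System.out.println("Extracted characters: " + text.length());
    }
}
```

Note that `parseToString` truncates its result at 100,000 characters by default; `setMaxStringLength` on the facade changes that limit.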

Tika Architecture Components

The framework consists of core parsing, a set of parsers for different media types, configuration management, a CLI application, and a server component.

1. Tika Core

File parsing (Parser)

Content extraction (text, images, audio, video)

MIME type detection

2. Tika Parsers

Text parsers for .txt, .xml, .html, etc.

Media parsers for images, audio, video.

Document parsers for Word, Excel, PowerPoint, PDF.

Metadata parsers for author, title, timestamps.
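
To show how the core and the parsers fit together, here is a minimal sketch that calls `AutoDetectParser` directly and prints whatever metadata the selected parser reports. It assumes the Tika parser modules are on the classpath; the class name `ParserMetadataSketch` is illustrative, and the metadata keys printed depend on the file type.

```java
import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;

import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical demo class; pass the path of any local file as the first argument.
public class ParserMetadataSketch {
    public static void main(String[] args) throws IOException, SAXException, TikaException {
        AutoDetectParser parser = new AutoDetectParser();         // chooses a parser by detected type
        BodyContentHandler handler = new BodyContentHandler(-1);  // -1 removes the character limit
        Metadata metadata = new Metadata();

        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            parser.parse(in, handler, metadata, new ParseContext());
        }

        // Print every metadata field the parser filled in.
        for (String name : metadata.names()) {
            System.out.println(name + " = " + metadata.get(name));
        }
        System.out.println("Body length: " + handler.toString().length());
    }
}
```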

3. Tika Config

Allows custom parser selection and behavior via tika-config.xml.
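
As an example of what such a file can look like, the fragment below keeps the default parser set but excludes the Tesseract OCR parser. This is a sketch of one possible configuration, not a recommended default; whether to exclude OCR depends on the deployment.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <parsers>
    <!-- Use all default parsers except OCR -->
    <parser class="org.apache.tika.parser.DefaultParser">
      <parser-exclude class="org.apache.tika.parser.ocr.TesseractOCRParser"/>
    </parser>
  </parsers>
</properties>
```

In Java, a file like this can be loaded into a `TikaConfig` object and passed to the `Tika` constructor.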

4. Tika App (CLI)

Command‑line interface for extracting text and metadata from files.

5. Tika Server

RESTful service that accepts file uploads over HTTP and returns parsed content.

6. Language Detection, OCR, Extractor, Metadata, etc.

Additional modules provide language identification, OCR integration, a unified extraction API, and metadata handling.

Typical Application Scenarios

Enterprise document management and full‑text search.

Content Management Systems for automatic file processing.

Big‑data pipelines to turn unstructured files into structured data for analysis.

Legal and compliance document review.

Digital Asset Management (DAM) for media metadata extraction.

Information security – scanning files for sensitive personal data.

Automated email classification.

Implementing Information Security and Data Leakage Prevention with Tika in Spring Boot

The following steps show how to integrate Apache Tika into a Spring Boot application to detect sensitive information such as ID numbers, credit‑card numbers and phone numbers during file upload.

1. Add dependencies

<dependencies>
  <!-- Spring Boot Web -->
  <dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-web</artifactId>
  </dependency>

  <!-- Apache Tika -->
  <dependency>
    <groupId>org.apache.tika</groupId>
    <artifactId>tika-core</artifactId>
    <version>2.6.0</version>
  </dependency>
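  <!-- In Tika 2.x the bundled parsers are published as tika-parsers-standard-package -->
  <dependency>
    <groupId>org.apache.tika</groupId>
    <artifactId>tika-parsers-standard-package</artifactId>
    <version>2.6.0</version>
  </dependency>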
</dependencies>

2. Create the sensitive‑info detection service

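package com.example.tikademo.service;

import org.apache.tika.Tika;
import org.apache.tika.exception.TikaException;
import org.springframework.stereotype.Service;

import java.io.IOException;
import java.io.InputStream;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

@Service
public class SensitiveInfoService {
    private final Tika tika = new Tika(); // Tika facade instance

    // Patterns are compiled once. The lookarounds (?<![\d-]) and (?![\d-]) keep a pattern from
    // matching inside a longer number, e.g. 16 digits of an 18-digit ID reported as a card number.
    // ID card: 18 characters (the last may be X) or the older 15-digit form
    private static final Pattern ID_CARD = Pattern.compile("(?<![\\d-])(?:\\d{17}[\\dXx]|\\d{15})(?![\\d-])");
    // Credit card: 16 digits, optionally hyphenated in groups of four
    private static final Pattern CREDIT_CARD = Pattern.compile("(?<![\\d-])\\d{4}-?\\d{4}-?\\d{4}-?\\d{4}(?![\\d-])");
    // Phone: 11 digits (optionally 3-4-4) or 10 digits (optionally 3-3-4)
    private static final Pattern PHONE = Pattern.compile("(?<![\\d-])(?:\\d{3}-?\\d{4}-?\\d{4}|\\d{3}-?\\d{3}-?\\d{4})(?![\\d-])");

    // Extract file content and detect sensitive info
    public String checkSensitiveInfo(InputStream fileInputStream) throws IOException {
        // 1. Use Tika to extract text. parseToString returns at most 100,000 characters by
        //    default; tika.setMaxStringLength(-1) lifts that limit if whole files must be scanned.
        String fileContent;
        try {
            fileContent = tika.parseToString(fileInputStream);
        } catch (TikaException e) {
            // parseToString also throws the checked TikaException; wrap it so callers only handle IOException
            throw new IOException("Tika could not parse the file", e);
        }
        // 2. Perform detection. Labels: 身份证号 = ID number, 信用卡号 = credit card number, 电话号码 = phone number
        StringBuilder sensitiveInfoDetected = new StringBuilder();
        detectAndAppend(fileContent, ID_CARD, "身份证号", sensitiveInfoDetected);
        detectAndAppend(fileContent, CREDIT_CARD, "信用卡号", sensitiveInfoDetected);
        detectAndAppend(fileContent, PHONE, "电话号码", sensitiveInfoDetected);
        // 未检测到敏感信息 = "No sensitive information detected"
        return sensitiveInfoDetected.length() > 0 ? sensitiveInfoDetected.toString() : "未检测到敏感信息";
    }

    // Generic detection helper: appends one "label: match" line per occurrence
    private void detectAndAppend(String content, Pattern pattern, String label, StringBuilder result) {
        Matcher matcher = pattern.matcher(content);
        while (matcher.find()) {
            result.append(label).append(": ").append(matcher.group()).append("\n");
        }
    }
}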
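
A regular expression only checks the shape of a number, so any 16-digit run is reported as a credit card. One common way to cut false positives is to run each candidate through the Luhn checksum that payment card numbers carry. The sketch below is a standalone helper; the names `LuhnCheck` and `isValid` are hypothetical and not part of Tika or the service above.

```java
// Hypothetical helper: second-stage filter for credit-card candidates found by a regex.
final class LuhnCheck {
    private LuhnCheck() {}

    /** Returns true if the digits in candidate pass the Luhn checksum; spaces and hyphens are ignored. */
    static boolean isValid(String candidate) {
        String digits = candidate.replaceAll("[\\s-]", "");
        if (!digits.matches("\\d{13,19}")) {
            return false; // not a plausible card-number length
        }
        int sum = 0;
        boolean doubleIt = false; // every second digit, counted from the right, is doubled
        for (int i = digits.length() - 1; i >= 0; i--) {
            int d = digits.charAt(i) - '0';
            if (doubleIt) {
                d *= 2;
                if (d > 9) {
                    d -= 9; // same as adding the two digits of the product
                }
            }
            sum += d;
            doubleIt = !doubleIt;
        }
        return sum % 10 == 0;
    }
}
```

A call such as `LuhnCheck.isValid(matcher.group())` could gate what the detection helper appends for card numbers. The placeholder card number used in test.txt later in this article does not pass the Luhn check, so a demo with this filter enabled would need a Luhn-valid test number such as 4111 1111 1111 1111.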

3. Create the file‑upload controller

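package com.example.tikademo.controller;

import com.example.tikademo.service.SensitiveInfoService;
import org.springframework.web.bind.annotation.PostMapping;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RequestParam;
import org.springframework.web.bind.annotation.RestController;
import org.springframework.web.multipart.MultipartFile;

import java.io.IOException;

@RestController
@RequestMapping("/api/files")
public class FileController {

    private final SensitiveInfoService sensitiveInfoService;

    // Constructor injection: Spring supplies the service bean
    public FileController(SensitiveInfoService sensitiveInfoService) {
        this.sensitiveInfoService = sensitiveInfoService;
    }

    @PostMapping("/upload")
    public String uploadFile(@RequestParam("file") MultipartFile file) {
        try {
            // Hand the uploaded stream to the service and return its report as the response body
            return sensitiveInfoService.checkSensitiveInfo(file.getInputStream());
        } catch (IOException e) {
            // 文件处理错误 = "File processing error"
            return "文件处理错误: " + e.getMessage();
        }
    }
}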
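
The controller returns every matched value verbatim, which means the scan report itself contains the sensitive data. In a data-leakage-prevention setting it may be preferable to mask matches before they leave the service. The sketch below keeps only the last four letters or digits and preserves separators; `Masking` and `maskAllButLast4` are hypothetical names introduced for illustration.

```java
// Hypothetical helper: masks a detected value so that a report does not repeat it in full.
final class Masking {
    private Masking() {}

    /** Replaces all but the last four letters/digits with '*'; separators such as '-' are kept. */
    static String maskAllButLast4(String value) {
        // Count the characters that are eligible for masking.
        int remaining = 0;
        for (int i = 0; i < value.length(); i++) {
            if (Character.isLetterOrDigit(value.charAt(i))) {
                remaining++;
            }
        }
        StringBuilder out = new StringBuilder(value.length());
        for (int i = 0; i < value.length(); i++) {
            char c = value.charAt(i);
            if (Character.isLetterOrDigit(c)) {
                out.append(remaining > 4 ? '*' : c); // keep only the final four
                remaining--;
            } else {
                out.append(c); // hyphens, spaces and other separators pass through
            }
        }
        return out.toString();
    }
}
```

For example, `Masking.maskAllButLast4("1234-5678-9876-5432")` yields `****-****-****-5432`.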

4. Optional front‑end page (index.html, placed under src/main/resources/static so that Spring Boot serves it at the root URL)

<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="UTF-8">
  <meta name="viewport" content="width=device-width, initial-scale=1.0">
  <title>Upload File for Sensitive Information Detection</title>
</head>
<body>
  <h2>Upload a File for Sensitive Information Detection</h2>
  <form action="/api/files/upload" method="post" enctype="multipart/form-data">
    <input type="file" name="file" required>
    <button type="submit">Upload</button>
  </form>
</body>
</html>

5. Test document (test.txt)

尊敬的用户:

您好!感谢您使用我们的服务。以下是您的账户信息:

身份证号:123456789012345678
信用卡号:1234-5678-9876-5432
电话号码:138-1234-5678

如果您对我们的服务有任何问题,请随时联系客户支持团队。

谢谢!

此致,
敬礼!

6. Running the demo

Start the Spring Boot application, open http://localhost:8080, upload test.txt, and the service will return the detected sensitive fields, e.g.:

身份证号: 123456789012345678
信用卡号: 1234-5678-9876-5432
电话号码: 138-1234-5678
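
The values in test.txt are placeholders that merely have the right shape. Real 18-character resident ID numbers end in a check character computed with the ISO 7064 MOD 11-2 scheme specified in GB 11643-1999, so a checksum test can serve as a second-stage filter after the regex. The following is a minimal sketch under that assumption; `ResidentIdCheck` and `hasValidChecksum` are hypothetical names, and the older 15-digit form carries no check character and is therefore rejected here.

```java
// Hypothetical helper: validates the check character of an 18-character resident ID number.
final class ResidentIdCheck {
    // Weights for the first 17 digits, as defined by GB 11643-1999.
    private static final int[] WEIGHTS = {7, 9, 10, 5, 8, 4, 2, 1, 6, 3, 7, 9, 10, 5, 8, 4, 2};
    // Check character for each possible value of (weighted sum mod 11).
    private static final String CHECK_CHARS = "10X98765432";

    private ResidentIdCheck() {}

    static boolean hasValidChecksum(String id) {
        if (id == null || !id.matches("\\d{17}[\\dXx]")) {
            return false; // wrong shape, including the 15-digit legacy form
        }
        int sum = 0;
        for (int i = 0; i < 17; i++) {
            sum += (id.charAt(i) - '0') * WEIGHTS[i];
        }
        char expected = CHECK_CHARS.charAt(sum % 11);
        return Character.toUpperCase(id.charAt(17)) == expected;
    }
}
```

With this check, the placeholder 123456789012345678 from test.txt is rejected, while a well-formed number such as the commonly cited sample 11010519491231002X is accepted.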

This demonstrates how Apache Tika can be leveraged for automatic content extraction and sensitive data identification, providing a practical data‑leakage‑prevention solution for backend systems.
