How to Use Apache Tika in Spring Boot for Automatic Sensitive Data Detection

This article explains Apache Tika’s core features and architecture, outlines common use cases, and provides a step‑by‑step Spring Boot tutorial covering Maven/Gradle setup, a service that extracts text with Tika, regex‑based sensitive‑info detection, a REST controller, an optional front end, testing instructions, expected output, and extension ideas.

Key Features of Apache Tika

Apache Tika is a powerful content‑analysis library that can extract text, metadata, and structured information from a wide range of file formats. Its main capabilities include:

Support for many document types (Word, Excel, PowerPoint, OpenOffice, PDF, HTML/XML, plain text, images, audio/video, email, archives, etc.) via integrated open‑source parsers such as Apache POI, PDFBox, and Tesseract OCR.

Automatic MIME‑type detection based on file content rather than file extension.

Text and metadata extraction (author, creation date, size, copyright, etc.).

Built‑in OCR support for scanned images and PDFs.

Language detection for multilingual processing.

Embeddable Java API, command‑line tool (Tika App) and RESTful service (Tika Server).

Multithreaded processing for high‑throughput batch jobs.

Unified JSON or XML output formats.

Ability to handle large documents without excessive memory consumption.

Extensible architecture allowing custom parsers and configuration via tika-config.xml.
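Several of these capabilities are exposed through a single facade class. As a quick illustration, a minimal sketch using the `org.apache.tika.Tika` facade from `tika-core` (it assumes the standard parsers are on the classpath; the sample byte strings are made up):

```java
import java.io.ByteArrayInputStream;
import org.apache.tika.Tika;

public class TikaQuickStart {
    public static void main(String[] args) throws Exception {
        Tika tika = new Tika();

        // Content-based MIME detection: the PDF magic bytes are recognized
        // even though no file name or extension is supplied
        String type = tika.detect(new ByteArrayInputStream("%PDF-1.7".getBytes()));
        System.out.println(type); // application/pdf

        // One-call text extraction, regardless of the underlying format
        String text = tika.parseToString(new ByteArrayInputStream("Hello Tika".getBytes()));
        System.out.println(text.trim());
    }
}
```

The same `detect`/`parseToString` pair works unchanged for Word, PDF, or image input; only the parser selected behind the facade differs.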

Tika Architecture Components

The framework is composed of several core modules that work together:

Tika Core: Provides basic parsing, MIME detection, and content extraction APIs.

Tika Parsers: A collection of format‑specific parsers (text, media, document, metadata) built on libraries like POI, PDFBox, and Tesseract.

Tika Config: Allows users to customize which parsers are active and to define extraction strategies.

Tika App: Command‑line interface for quick file inspection.

Tika Server: RESTful service that accepts file uploads and returns extracted content.

Tika Language Detection: Detects the language of extracted text.

Tika Extractor: Uniform abstraction that hides parser differences.

Tika Metadata: Normalizes and exposes metadata in a standard structure.

Tika OCR: Integrates Tesseract to read text from images.
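To see how the parser and metadata modules cooperate beneath the facade, a small sketch using `AutoDetectParser`, `BodyContentHandler`, and `Metadata` (again assuming the standard parsers are on the classpath; the input string is made up):

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.sax.BodyContentHandler;

public class MetadataSketch {
    public static void main(String[] args) throws Exception {
        AutoDetectParser parser = new AutoDetectParser();
        BodyContentHandler handler = new BodyContentHandler(); // collects plain text
        Metadata metadata = new Metadata();                    // filled in by the parser

        try (InputStream in = new ByteArrayInputStream("Hello metadata".getBytes())) {
            parser.parse(in, handler, metadata);
        }

        System.out.println("Text: " + handler.toString().trim());
        for (String name : metadata.names()) {
            System.out.println(name + " = " + metadata.get(name));
        }
    }
}
```

The parser writes normalized keys such as `Content-Type` into the `Metadata` object, which is the structure the "Tika Metadata" module standardizes across formats.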

Typical Application Scenarios

Enterprise document‑management systems – automatic indexing of contracts, invoices, and reports.

Content Management Systems – converting uploaded files to searchable text.

Big‑data pipelines – turning unstructured files into structured records for analytics or machine‑learning models.

Legal and compliance – extracting clauses, dates, and parties from large volumes of contracts.

Digital Asset Management – harvesting metadata from images, videos, and audio files.

Information‑security tooling – scanning files for sensitive data such as IDs, credit‑card numbers, or phone numbers.

Automated email classification – parsing attachments and body text for routing.

Integrating Tika with Spring Boot for Sensitive Information Detection

1. Add Maven/Gradle Dependencies

<dependencies>
    <!-- Spring Boot Web -->
    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-web</artifactId>
    </dependency>

    <!-- Apache Tika Core -->
    <dependency>
        <groupId>org.apache.tika</groupId>
        <artifactId>tika-core</artifactId>
        <version>2.6.0</version>
    </dependency>
    <!-- Apache Tika standard parsers (POI, PDFBox, Tesseract bindings, etc.).
         In Tika 2.x the runnable parser bundle is tika-parsers-standard-package;
         the plain tika-parsers artifact is only an aggregator POM. -->
    <dependency>
        <groupId>org.apache.tika</groupId>
        <artifactId>tika-parsers-standard-package</artifactId>
        <version>2.6.0</version>
    </dependency>
</dependencies>
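The heading above also mentions Gradle; a rough `build.gradle` equivalent would look like this (note that in Tika 2.x the standard parser set is published as `tika-parsers-standard-package`):

```groovy
dependencies {
    implementation 'org.springframework.boot:spring-boot-starter-web'
    implementation 'org.apache.tika:tika-core:2.6.0'
    // Standard parsers: POI, PDFBox, Tesseract bindings, etc.
    implementation 'org.apache.tika:tika-parsers-standard-package:2.6.0'
}
```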

2. Implement the SensitiveInfoService

package com.example.tikademo.service;

import org.apache.tika.Tika;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.sax.BodyContentHandler;
import org.springframework.stereotype.Service;

import java.io.IOException;
import java.io.InputStream;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

@Service
public class SensitiveInfoService {
    private final Tika tika = new Tika(); // Tika instance

    // Naive, word-bounded demo patterns: Chinese ID-card numbers (15 or 18 digits,
    // the last of which may be X), 16-digit card numbers with optional dashes,
    // and 11-digit phone numbers written plain or as 3-4-4 (e.g. 138-1234-5678).
    // The \b anchors stop one pattern from firing inside a longer digit run.
    private static final String ID_CARD_REGEX = "\\b(\\d{17}[\\dXx]|\\d{15})\\b";
    private static final String CREDIT_CARD_REGEX = "\\b\\d{4}-?\\d{4}-?\\d{4}-?\\d{4}\\b";
    private static final String PHONE_REGEX = "\\b\\d{3}-?\\d{4}-?\\d{4}\\b";

    /**
     * Extracts the file content with Tika and checks it against the regex patterns.
     */
    public String checkSensitiveInfo(InputStream fileInputStream) throws IOException {
        // 1. Use Tika to obtain the full textual representation of the file
        String fileContent = tika.parseToString(fileInputStream);

        // 2. Run the regex checks and collect matches
        StringBuilder sensitiveInfoDetected = new StringBuilder();
        detectAndAppend(fileContent, ID_CARD_REGEX, "ID card number", sensitiveInfoDetected);
        detectAndAppend(fileContent, CREDIT_CARD_REGEX, "Credit card number", sensitiveInfoDetected);
        detectAndAppend(fileContent, PHONE_REGEX, "Phone number", sensitiveInfoDetected);

        return sensitiveInfoDetected.length() > 0 ? sensitiveInfoDetected.toString() : "No sensitive information detected";
    }

    /**
     * Helper that applies a single regex and appends any matches to the result buffer.
     */
    private void detectAndAppend(String content, String regex, String label, StringBuilder result) {
        Pattern pattern = Pattern.compile(regex);
        Matcher matcher = pattern.matcher(content);
        while (matcher.find()) {
            result.append(label).append(": ").append(matcher.group()).append("\n");
        }
    }
}
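The helper above is plain `java.util.regex`, so its behavior can be checked in isolation before wiring it into the service. A small standalone sketch (the sample ID value here is made up for illustration):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DetectionDemo {
    // Same style of pattern as the service: 15- or 18-digit Chinese ID-card numbers
    private static final Pattern ID_CARD = Pattern.compile("\\d{17}[\\dXx]|\\d{15}");

    public static void main(String[] args) {
        String content = "Applicant ID 11010519491231002X, ref code 42.";
        Matcher m = ID_CARD.matcher(content);
        while (m.find()) {
            // Prints each hit, mirroring detectAndAppend in the service
            System.out.println("ID card number: " + m.group());
        }
    }
}
```

`Matcher.find()` scans for every non-overlapping hit, which is why the service can report several sensitive values from a single document.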

3. Create the FileController

package com.example.tikademo.controller;

import com.example.tikademo.service.SensitiveInfoService;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.web.bind.annotation.*;
import org.springframework.web.multipart.MultipartFile;

import java.io.IOException;

@RestController
@RequestMapping("/api/files")
public class FileController {
    @Autowired
    private SensitiveInfoService sensitiveInfoService;

    @PostMapping("/upload")
    public String uploadFile(@RequestParam("file") MultipartFile file) {
        try {
            // Retrieve the input stream of the uploaded file and run the detection logic
            String result = sensitiveInfoService.checkSensitiveInfo(file.getInputStream());
            return result;
        } catch (IOException e) {
            return "File processing error: " + e.getMessage();
        }
    }
}

4. Optional Front‑end Page for Manual Testing

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>Upload File for Sensitive Information Detection</title>
</head>
<body>
    <h2>Upload a File for Sensitive Information Detection</h2>
    <form action="/api/files/upload" method="post" enctype="multipart/form-data">
        <input type="file" name="file" required>
        <button type="submit">Upload</button>
    </form>
</body>
</html>

5. Test the Project

Start the Spring Boot application and open http://localhost:8080 in a browser (save the page above as src/main/resources/static/index.html so Spring Boot serves it at the root). Upload a test document, e.g., a test.txt containing sample ID‑card, credit‑card, and phone numbers. The service will parse the file with Tika, run the regex checks, and return any detected sensitive items.

6. Expected Return Result

ID card number: 123456789012345678
Credit card number: 1234-5678-9876-5432
Phone number: 138-1234-5678

7. Possible Extensions

Add more regular expressions to detect e‑mail addresses, postal codes, social‑security numbers, etc.

Encrypt or mask detected data before storing it.

Log detections and send alerts (e‑mail, Slack) for audit purposes.

Integrate with a message queue to process files asynchronously at scale.
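The masking extension, for instance, could be sketched with `Matcher.appendReplacement`, keeping only the last four digits of each card number (the class and method names here are illustrative, not from the article):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MaskingSketch {
    private static final Pattern CARD = Pattern.compile("\\b\\d{4}-?\\d{4}-?\\d{4}-?(\\d{4})\\b");

    // Replace everything but the last four digits with asterisks
    static String maskCards(String text) {
        Matcher m = CARD.matcher(text);
        StringBuilder out = new StringBuilder();
        while (m.find()) {
            m.appendReplacement(out, "****-****-****-" + m.group(1));
        }
        m.appendTail(out); // copy the unmatched remainder
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(maskCards("card 1234-5678-9876-5432 on file"));
        // → card ****-****-****-5432 on file
    }
}
```

`appendReplacement`/`appendTail` (the `StringBuilder` overloads require Java 9+) rebuild the text in one pass, so the masked output can be stored or logged in place of the raw content.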

Conclusion

By embedding Apache Tika into a Spring Boot service, developers can automatically extract textual content from virtually any file type and apply regex‑based rules to locate personally identifiable information. This approach provides a lightweight yet extensible data‑leak‑prevention solution that can be expanded with additional patterns, encryption, or audit‑logging as needed.

Tags: Java, Maven, Spring Boot, Information Security, Apache Tika, Content Extraction, Sensitive Data Detection
Written by

Java Web Project

Focused on Java backend technologies, trending internet tech, and the latest industry developments. The platform serves over 200,000 Java developers, inviting you to learn and exchange ideas together.
