How to Use Apache Tika in Spring Boot for Sensitive Data Detection and DLP

This article explains Apache Tika's core features, architecture, and common use cases, then provides a step-by-step Spring Boot tutorial that integrates Tika to extract file content, detect personal identifiers with regular expressions, and return results via a REST API for data loss prevention (DLP).

Su San Talks Tech

Tika Core Features

Apache Tika is a powerful content‑analysis library that extracts text, metadata, and other structured information from many file formats.

Supported Formats

Office documents (Word, Excel, PowerPoint, OpenOffice)

PDF

HTML / XML

Plain text files

Images and audio/video (JPEG, PNG, MP3, MP4, WAV, etc.)

Email (EML)

Compressed archives (ZIP, TAR, GZ)

Automatic MIME‑Type Detection

Tika determines a file's true type from its content rather than its extension, providing high‑accuracy format recognition.
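As a small sketch of this (assuming tika-core is on the classpath; report.bin is a hypothetical file name), the Tika facade exposes detection directly:

```java
import org.apache.tika.Tika;

import java.io.File;

// Sketch: content-based MIME type detection with the Tika facade.
// "report.bin" is a hypothetical file; detection inspects its leading
// bytes ("magic numbers"), so a PDF renamed to .bin is still reported
// as application/pdf.
public class DetectDemo {
    public static void main(String[] args) throws Exception {
        Tika tika = new Tika();
        String type = tika.detect(new File("report.bin"));
        System.out.println(type);
    }
}
```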

Text and Metadata Extraction

Text extraction works for any supported format.

Metadata extraction includes author, creation date, modification date, size, copyright and other properties.

OCR Support

Through the integrated Tesseract OCR engine, Tika can extract text from scanned images or PDFs that contain pictures.
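This relies on a local Tesseract installation. As a sketch of how OCR can be tuned through tika-config.xml (the param names follow the Tika 2.x TesseractOCRParser and should be verified against the version in use):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <parsers>
    <!-- keep the default parsers, but configure OCR explicitly -->
    <parser class="org.apache.tika.parser.DefaultParser">
      <parser-exclude class="org.apache.tika.parser.ocr.TesseractOCRParser"/>
    </parser>
    <parser class="org.apache.tika.parser.ocr.TesseractOCRParser">
      <params>
        <!-- OCR language model; requires the matching Tesseract
             language pack to be installed -->
        <param name="language" type="string">eng</param>
        <!-- give up on a single file after this many seconds -->
        <param name="timeoutSeconds" type="int">120</param>
      </params>
    </parser>
  </parsers>
</properties>
```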

Language Detection

Tika can automatically identify the language of extracted text (e.g., English, Chinese, French), which is useful for multilingual processing.

Usage Modes

Tika App: a command-line tool for extracting content and metadata.

Tika Server: a RESTful API service for remote file parsing.

Multithreading

Tika supports parallel processing, allowing batch file parsing to be accelerated with multiple threads.
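A sketch of that pattern with a fixed thread pool, sharing one Tika facade instance across worker threads (the file names below are made-up placeholders):

```java
import org.apache.tika.Tika;

import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Sketch: batch parsing on a fixed-size thread pool. The Tika facade is
// thread-safe, so a single instance can serve every worker.
public class BatchParseDemo {
    public static void main(String[] args) throws Exception {
        Tika tika = new Tika();
        ExecutorService pool = Executors.newFixedThreadPool(4);
        List<Path> files = List.of(Paths.get("a.pdf"), Paths.get("b.docx"));

        List<Future<String>> results = new ArrayList<>();
        for (Path p : files) {
            // each task extracts the text of one file
            results.add(pool.submit(() -> tika.parseToString(p.toFile())));
        }
        for (Future<String> f : results) {
            System.out.println(f.get().length() + " characters extracted");
        }
        pool.shutdown();
    }
}
```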

Unified Output Formats

JSON output for easy integration.

XML output for more structured data needs.

Large‑File Handling

Tika can process large or multi-page documents efficiently without excessive memory consumption. Note, however, that the Tika facade's parseToString method caps extracted text at 100,000 characters by default; raise the limit with setMaxStringLength, or stream output through a custom ContentHandler, when larger files must be read in full.

Integration with Other Tools

Lucene / Solr / Elasticsearch for full‑text indexing.

Apache POI for Office formats.

PDFBox for PDF parsing.

Tesseract OCR for image text extraction.

Tika Architecture

Tika Core

Parser – parses various file formats and returns text and metadata.

Content Extraction – extracts text, images, audio, video, etc.

MIME‑Type Detection – determines the real file type from its content.

Tika Parsers

Text Parsers – handle .txt, .xml, .html and similar.

Media Parsers – handle images, audio, video.

Document Parsers – handle Word, Excel, PowerPoint, PDF and other office documents.

Metadata Parsers – extract file attributes such as author, creation date, size.

Tika Config

Configuration file (tika‑config.xml) lets users customize parser selection, extraction strategies, character sets, etc.

Custom parsers can be added and referenced in the config.
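For example, a minimal tika-config.xml that keeps the default parser set but turns OCR off (a common speed optimization) might look like this sketch; the class names follow Tika 2.x:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <parsers>
    <!-- keep every default parser except the Tesseract OCR parser -->
    <parser class="org.apache.tika.parser.DefaultParser">
      <parser-exclude class="org.apache.tika.parser.ocr.TesseractOCRParser"/>
    </parser>
  </parsers>
</properties>
```

The file can be picked up by constructing a TikaConfig from its path, or by pointing the tika.config system property (or the TIKA_CONFIG environment variable) at it.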

Tika App

A CLI tool that can be run directly to extract text and metadata from files or be embedded in Java applications.

Tika Server

A RESTful service that accepts file uploads via HTTP, parses them remotely, and returns the extracted information.

Additional Components

Tika Language Detection – automatic language identification.

Tika Extractor – a unified interface that abstracts different parsers, allowing custom extensions.

Tika Metadata – provides a standardized metadata structure.

Tika OCR – integrates Tesseract to recognize text in images.

Typical Application Scenarios

Enterprise document‑management systems – automatic extraction and indexing of contracts, reports, etc.

Content Management Systems – extract and convert uploaded files for editing and search.

Big‑Data platforms – transform unstructured files into structured data for cleaning, classification, clustering or text mining.

Legal and compliance review – pull key clauses, dates, amounts from contracts and emails.

Digital Asset Management – extract metadata from images, videos, and audio for cataloguing.

Information‑security and DLP – scan files for personal identifiers such as ID numbers, credit‑card numbers, or phone numbers.

Automated email classification – extract content and attachments for routing or archiving.

Step‑by‑Step Guide: Sensitive Information Detection with Spring Boot

1. Add Dependencies

<dependencies>
    <!-- Spring Boot Web -->
    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-web</artifactId>
    </dependency>
    <!-- Apache Tika -->
    <dependency>
        <groupId>org.apache.tika</groupId>
        <artifactId>tika-core</artifactId>
        <version>2.6.0</version>
    </dependency>
    <dependency>
        <groupId>org.apache.tika</groupId>
        <!-- In Tika 2.x the parser implementations ship in the "standard
             package" artifact; the bare tika-parsers artifact is a POM only -->
        <artifactId>tika-parsers-standard-package</artifactId>
        <version>2.6.0</version>
    </dependency>
</dependencies>

2. Implement SensitiveInfoService

package com.example.tikademo.service;

import org.apache.tika.Tika;
import org.apache.tika.exception.TikaException;
import org.springframework.stereotype.Service;

import java.io.IOException;
import java.io.InputStream;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

@Service
public class SensitiveInfoService {
    // The Tika facade is thread-safe, so one shared instance is enough
    private final Tika tika = new Tika();

    // 18-digit Chinese national ID (last character may be X) or legacy 15-digit form
    private static final String ID_CARD_REGEX = "(\\d{17}[\\dXx]|\\d{15})";
    // 16-digit card number, optionally grouped with hyphens
    private static final String CREDIT_CARD_REGEX = "(\\d{4}-?\\d{4}-?\\d{4}-?\\d{4})";
    // 10-digit numbers (optionally hyphenated) or 11-digit mobile numbers
    private static final String PHONE_REGEX = "(\\d{3}-?\\d{3}-?\\d{4})|(\\d{11})|(\\d{10})";

    public String checkSensitiveInfo(InputStream fileInputStream) throws IOException {
        // 1. Extract file content with Tika; wrap parser failures so callers
        //    only have to handle IOException
        String fileContent;
        try {
            fileContent = tika.parseToString(fileInputStream);
        } catch (TikaException e) {
            throw new IOException("Failed to parse file content", e);
        }
        StringBuilder result = new StringBuilder();
        // 2. Detect ID numbers
        detectAndAppend(fileContent, ID_CARD_REGEX, "ID number", result);
        // 3. Detect credit-card numbers
        detectAndAppend(fileContent, CREDIT_CARD_REGEX, "Credit card number", result);
        // 4. Detect phone numbers
        detectAndAppend(fileContent, PHONE_REGEX, "Phone number", result);
        return result.length() > 0 ? result.toString() : "No sensitive information detected";
    }

    private void detectAndAppend(String content, String regex, String label, StringBuilder sb) {
        Pattern pattern = Pattern.compile(regex);
        Matcher matcher = pattern.matcher(content);
        while (matcher.find()) {
            sb.append(label).append(": ").append(matcher.group()).append("\n");
        }
    }
}
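Because detectAndAppend is plain java.util.regex, the scanning step can be checked in isolation, without Tika or Spring. A standalone sketch with made-up sample values:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Standalone version of the service's regex scan, runnable on its own.
// The sample text and labels below are illustrative only.
public class RegexScanDemo {
    static String scan(String content, String regex, String label) {
        StringBuilder sb = new StringBuilder();
        Matcher matcher = Pattern.compile(regex).matcher(content);
        while (matcher.find()) {
            sb.append(label).append(": ").append(matcher.group()).append("\n");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String text = "Order notes: card 4111-1111-1111-1111, call 555-123-4567.";
        // prints the card number, then the phone number
        System.out.print(scan(text, "(\\d{4}-?\\d{4}-?\\d{4}-?\\d{4})", "Credit card number"));
        System.out.print(scan(text, "(\\d{3}-?\\d{3}-?\\d{4})", "Phone number"));
    }
}
```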

3. Create FileController

package com.example.tikademo.controller;

import com.example.tikademo.service.SensitiveInfoService;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.web.bind.annotation.*;
import org.springframework.web.multipart.MultipartFile;

import java.io.IOException;

@RestController
@RequestMapping("/api/files")
public class FileController {
    @Autowired
    private SensitiveInfoService sensitiveInfoService;

    @PostMapping("/upload")
    public String uploadFile(@RequestParam("file") MultipartFile file) {
        try {
            return sensitiveInfoService.checkSensitiveInfo(file.getInputStream());
        } catch (IOException e) {
            return "File processing error: " + e.getMessage();
        }
    }
}

4. Optional Front‑end Page

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>Upload File for Sensitive Information Detection</title>
</head>
<body>
    <h2>Upload a File for Sensitive Information Detection</h2>
    <form action="/api/files/upload" method="post" enctype="multipart/form-data">
        <input type="file" name="file" required>
        <button type="submit">Upload</button>
    </form>
</body>
</html>

5. Test the Project

Run the Spring Boot application, open http://localhost:8080, upload a test file (e.g., a .txt containing an ID number, credit‑card number, and phone number), and the API returns the detected sensitive values.
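As an alternative to the HTML form, the endpoint can be exercised from the command line. A sketch assuming the application is running locally on the default port; sample.txt and its contents are made up for the test:

```shell
# Create a small file containing fake identifiers
printf 'card 4111-1111-1111-1111, call 555-123-4567\n' > sample.txt

# Post it to the upload endpoint; the response body lists any matches
curl -F "file=@sample.txt" http://localhost:8080/api/files/upload
```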

6. Extend Functionality

Add more regular‑expression patterns for emails, addresses, social‑security numbers, etc.

Encrypt or mask detected data before storage.

Log detections or send notifications to administrators for audit and DLP enforcement.

Conclusion

Integrating Apache Tika into a Spring Boot project enables automatic content extraction and regex‑based sensitive‑information detection, providing a lightweight yet powerful data‑loss‑prevention solution for enterprise applications.

Tags: Java, OCR, Spring Boot, Apache Tika, Sensitive Data Detection, File Parsing, DLP
Written by

Su San Talks Tech

Su San, former staff at several leading tech companies, is a top creator on Juejin and a premium creator on CSDN, and runs the free coding practice site www.susan.net.cn.
