Apache Tika: Extract Multi-Format Content & Detect Sensitive Data in Spring Boot

This article surveys what Apache Tika can do: parse a wide range of file formats, detect file types automatically, run OCR, and identify languages. It then shows how to integrate Tika into a Spring Boot service that extracts text and flags sensitive information such as ID numbers, credit card numbers, and phone numbers.

Architect

Apache Tika Overview

Apache Tika is a Java library for detecting file types and extracting text, metadata, and structured information from a wide range of formats.

Supported Formats

Office documents: Microsoft Word (.doc, .docx), Excel (.xls, .xlsx), PowerPoint (.ppt, .pptx), OpenOffice (.odt, .ods) and similar.

PDF: Text and metadata extraction.

HTML / XML: Parsing of HTML and XML content.

Plain text: .txt and other simple text files.

Images, audio, video: JPEG, PNG, MP3, MP4, WAV, etc., with metadata extraction.

Email: EML files.

Compressed archives: ZIP, TAR, GZ and other archive types.

Key Capabilities

Automatic MIME‑type detection based on file content rather than extension.

Extraction of raw text and common metadata fields (author, creation date, modification date, file size, copyright, etc.).

OCR support via Tesseract for scanned images or PDFs containing pictures of text.

Language detection for the extracted text.

Thread‑safe parsers that can be shared across worker threads in high‑throughput batch jobs.

Unified output: every parser emits the same XHTML structure, and Tika App and Tika Server can return plain text, XML, or JSON to simplify downstream integration.
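To make "detection based on file content rather than extension" concrete, here is a minimal sketch of magic-byte sniffing in plain Java. It is an illustration of the idea only, not Tika's detector: the class name MagicSniffer and the three signatures it checks are chosen for this example, and Tika's own signature database is far larger.

```java
import java.nio.charset.StandardCharsets;

/** Illustrative only: a tiny magic-byte sniffer, not Tika's actual detector. */
final class MagicSniffer {
    private MagicSniffer() {}

    /** Guesses a MIME type from the leading bytes, ignoring any file name. */
    static String sniff(byte[] head) {
        if (startsWith(head, "%PDF-".getBytes(StandardCharsets.US_ASCII))) {
            return "application/pdf";
        }
        if (startsWith(head, new byte[] {(byte) 0x89, 'P', 'N', 'G', 0x0D, 0x0A, 0x1A, 0x0A})) {
            return "image/png";
        }
        if (startsWith(head, new byte[] {'P', 'K', 0x03, 0x04})) {
            // ZIP container: also the outer wrapper of .docx, .xlsx, .pptx and .odt files
            return "application/zip";
        }
        // Unknown signature: fall back to the generic binary type
        return "application/octet-stream";
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }
}
```

A file renamed from report.pdf to report.txt would still be reported as application/pdf by this check. With Tika itself the equivalent is a single call on the facade, such as new Tika().detect(inputStream).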

Core Architecture

Tika Core: Provides parsing, MIME detection, and content‑extraction APIs.

Tika Parsers: A collection of format‑specific parsers built on libraries such as Apache POI, PDFBox, and Tesseract.

Tika Config: XML configuration file (tika-config.xml) for custom parser selection, character‑set handling, and extraction policies.

Tika App: Command‑line tool for quick extraction.

Tika Server: RESTful service exposing a /tika endpoint for remote parsing.
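As a rough sketch of how a client might talk to Tika Server, the snippet below builds a PUT request for the /tika endpoint using the JDK's java.net.http API (Java 11 or later). The class name TikaServerRequests is hypothetical, and the base URL is an assumption: it presumes a server already running locally on port 9998, which is Tika Server's default.

```java
import java.net.URI;
import java.net.http.HttpRequest;

/** Hypothetical helper: builds, but does not send, a request to Tika Server. */
final class TikaServerRequests {
    private TikaServerRequests() {}

    /** PUTs the raw document bytes to /tika and asks for plain text back. */
    static HttpRequest plainTextRequest(String baseUrl, byte[] document) {
        return HttpRequest.newBuilder(URI.create(baseUrl + "/tika"))
                .header("Accept", "text/plain")
                .PUT(HttpRequest.BodyPublishers.ofByteArray(document))
                .build();
    }
}
```

Sending it would look like HttpClient.newHttpClient().send(TikaServerRequests.plainTextRequest("http://localhost:9998", bytes), HttpResponse.BodyHandlers.ofString()), with the extracted text arriving as the response body.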

Spring Boot Integration – Sensitive‑Info Detection

This example demonstrates embedding Tika in a Spring Boot application to extract file content and scan it for personal identifiers using regular expressions.

1. Maven Dependencies (pom.xml)

<dependencies>
    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-web</artifactId>
    </dependency>
    <dependency>
        <groupId>org.apache.tika</groupId>
        <artifactId>tika-core</artifactId>
        <version>2.6.0</version>
    </dependency>
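    <!-- In Tika 2.x the bundled parsers ship as tika-parsers-standard-package;
         the old tika-parsers artifact is now only a POM aggregator. -->
    <dependency>
        <groupId>org.apache.tika</groupId>
        <artifactId>tika-parsers-standard-package</artifactId>
        <version>2.6.0</version>
    </dependency>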
</dependencies>

2. Service Class

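package com.example.tikademo.service;

import org.apache.tika.Tika;
import org.apache.tika.exception.TikaException;
import org.springframework.stereotype.Service;

import java.io.IOException;
import java.io.InputStream;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

@Service
public class SensitiveInfoService {
    // The Tika facade auto-detects the file type and picks a matching parser.
    private final Tika tika = new Tika();

    // Patterns are compiled once. The (?<!\d) and (?!\d) guards stop a rule from
    // matching a fragment of a longer digit run, such as 16 digits of an 18-digit ID.

    // Resident ID: 17 digits plus a digit or X, or the older 15-digit form.
    private static final Pattern ID_CARD =
            Pattern.compile("(?<!\\d)(\\d{17}[\\dXx]|\\d{15})(?!\\d)");
    // 16-digit card number, optionally grouped in fours with hyphens.
    private static final Pattern CREDIT_CARD =
            Pattern.compile("(?<!\\d)\\d{4}-?\\d{4}-?\\d{4}-?\\d{4}(?!\\d)");
    // 10-digit (3-3-4) or 11-digit (3-4-4) phone number, hyphens optional.
    private static final Pattern PHONE =
            Pattern.compile("(?<!\\d)(\\d{3}-?\\d{3}-?\\d{4}|\\d{3}-?\\d{4}-?\\d{4})(?!\\d)");

    public String checkSensitiveInfo(InputStream in) throws IOException {
        String text;
        try {
            // Note: parseToString truncates at the facade's maximum string length
            // (100,000 characters by default).
            text = tika.parseToString(in);
        } catch (TikaException e) {
            // TikaException is checked; wrap it so callers only handle IOException.
            throw new IOException("Tika could not parse the document", e);
        }
        StringBuilder sb = new StringBuilder();
        detect(text, ID_CARD, "ID Card", sb);
        detect(text, CREDIT_CARD, "Credit Card", sb);
        detect(text, PHONE, "Phone Number", sb);
        return sb.length() > 0 ? sb.toString() : "No sensitive information detected";
    }

    // Appends one "label: value" line per match found in the content.
    private void detect(String content, Pattern pattern, String label, StringBuilder out) {
        Matcher m = pattern.matcher(content);
        while (m.find()) {
            out.append(label).append(": ").append(m.group()).append("\n");
        }
    }
}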
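A 16-digit pattern alone will flag order numbers and other long digit runs as credit cards. One common way to cut those false positives is to run the Luhn checksum on each candidate. The sketch below is an optional addition that is not part of the service above, and the class name LuhnCheck is hypothetical.

```java
/** Optional post-filter for credit-card candidates: the standard Luhn checksum. */
final class LuhnCheck {
    private LuhnCheck() {}

    /** Returns true if the digits in the candidate (hyphens and spaces ignored) pass Luhn. */
    static boolean isValid(String candidate) {
        String digits = candidate.replaceAll("\\D", "");
        if (digits.isEmpty()) {
            return false;
        }
        int sum = 0;
        boolean doubleIt = false; // every second digit, counting from the right, is doubled
        for (int i = digits.length() - 1; i >= 0; i--) {
            int d = digits.charAt(i) - '0';
            if (doubleIt) {
                d *= 2;
                if (d > 9) {
                    d -= 9; // same as adding the two digits of the product
                }
            }
            sum += d;
            doubleIt = !doubleIt;
        }
        return sum % 10 == 0;
    }
}
```

A caller could apply LuhnCheck.isValid(m.group()) before appending a "Credit Card" line. Be aware that a placeholder such as 1234-5678-9876-5432 does not pass the checksum, so test files would need a Luhn-valid value, for instance the widely used test number 4111-1111-1111-1111.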

3. REST Controller

package com.example.tikademo.controller;

import com.example.tikademo.service.SensitiveInfoService;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.web.bind.annotation.*;
import org.springframework.web.multipart.MultipartFile;

import java.io.IOException;

@RestController
@RequestMapping("/api/files")
public class FileController {

    @Autowired
    private SensitiveInfoService service;

    @PostMapping("/upload")
    public String upload(@RequestParam("file") MultipartFile file) {
        try {
            return service.checkSensitiveInfo(file.getInputStream());
        } catch (IOException e) {
            return "File processing error: " + e.getMessage();
        }
    }
}

4. Test Procedure

Create a text file test.txt containing sample ID numbers, credit‑card numbers and phone numbers.

Start the Spring Boot application (default port 8080).

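POST the file to http://localhost:8080/api/files/upload as a multipart field named file, using curl, Postman, or any HTML form. With curl, for example: curl -F "file=@test.txt" http://localhost:8080/api/files/upload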

The service returns each detected value, for example:

ID Card: 123456789012345678
Credit Card: 1234-5678-9876-5432
Phone Number: 138-1234-5678

Notes and Extensions

Additional regular expressions can be added to detect emails, addresses, SSNs, etc.

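For scanned PDFs and images, the Tesseract executable must be installed and reachable on the PATH; Tika then invokes it when it encounters image content. Whether PDF pages are OCRed also depends on the PDF parser's OCR strategy setting, so check that configuration if scanned pages come back empty.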

The same extraction logic can be exposed via Tika Server to obtain JSON or XML output without writing custom code.
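As a sketch of how further rules might be added, the snippet below keeps the patterns in an ordered map so that a new detector is one extra entry. The class name ExtraDetectors, the labels, and both patterns are assumptions made for illustration; the email and SSN expressions are deliberately simple and will miss edge cases.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical rule table: each entry maps a label to a precompiled pattern. */
final class ExtraDetectors {
    private ExtraDetectors() {}

    // LinkedHashMap keeps insertion order, so results come out rule by rule.
    static final Map<String, Pattern> RULES = new LinkedHashMap<>();
    static {
        RULES.put("Email", Pattern.compile("[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}"));
        RULES.put("US SSN", Pattern.compile("(?<!\\d)\\d{3}-\\d{2}-\\d{4}(?!\\d)"));
    }

    /** Returns one "label: value" entry per match, in rule order. */
    static List<String> scan(String text) {
        List<String> hits = new ArrayList<>();
        for (Map.Entry<String, Pattern> rule : RULES.entrySet()) {
            Matcher m = rule.getValue().matcher(text);
            while (m.find()) {
                hits.add(rule.getKey() + ": " + m.group());
            }
        }
        return hits;
    }
}
```

The text passed to scan would be whatever Tika extracted from the uploaded file, so the same rule table works for every supported format.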
