Unlock Apache Tika: Extract Text, Metadata, and Detect Sensitive Data in Java

This article introduces Apache Tika, a powerful Java library for parsing many file formats, extracting text and metadata, performing OCR and language detection, and shows how to integrate it with Spring Boot to automatically detect sensitive information such as ID numbers, credit cards, and phone numbers.

Java Backend Technology

Apache Tika Overview

Apache Tika is a powerful content analysis toolkit that can extract text, metadata, and other structured information from a wide variety of file formats.

Supported Formats

Office Documents : Word (.doc, .docx), Excel (.xls, .xlsx), PowerPoint (.ppt, .pptx), OpenOffice formats, etc.

PDF : Extracts text and metadata from PDF files.

HTML / XML : Parses HTML and XML content.

Plain Text : .txt and similar files.

Images and Media : JPEG, PNG, MP3, MP4, WAV, along with their related metadata.

Email : EML files.

Compressed Archives : ZIP, TAR, GZ and their contents.

Tika achieves this by integrating many open‑source libraries such as Apache POI, PDFBox, and Tesseract OCR.

Key Features

Automatic File Type Detection : Determines MIME type based on file content, not just the extension.

Text and Metadata Extraction : Retrieves document text and metadata like author, creation date, size, and copyright.

OCR Support : Uses Tesseract to extract text from scanned images or PDFs. Tesseract is a separate program and must be installed on the host for OCR to work.

Language Detection : Identifies the language of extracted text for multilingual processing.

Multithreading : Supports parallel processing of large batches of files.

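Unified Output : Parsers report content as XHTML through a common interface, and Tika App and Tika Server can return results as plain text, XML/XHTML, or JSON for easy integration.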
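
To make the idea of content-based type detection concrete, here is a minimal sketch in plain Java that inspects a file's leading "magic" bytes. It is only an illustration of the principle and not Tika's actual detection logic; the class name MagicByteSniffer and its small signature table are assumptions made for this example.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative only: a toy content sniffer, not Tika's detection logic. */
final class MagicByteSniffer {
    // Hypothetical signature table: leading bytes -> MIME type.
    // The map is only iterated, so array keys are fine here.
    private static final Map<byte[], String> SIGNATURES = new LinkedHashMap<>();
    static {
        SIGNATURES.put("%PDF-".getBytes(StandardCharsets.US_ASCII), "application/pdf");
        SIGNATURES.put(new byte[] {(byte) 0x89, 'P', 'N', 'G'}, "image/png");
        SIGNATURES.put(new byte[] {(byte) 0xFF, (byte) 0xD8, (byte) 0xFF}, "image/jpeg");
        SIGNATURES.put(new byte[] {'P', 'K', 0x03, 0x04}, "application/zip");
    }

    /** Returns the MIME type of the first matching signature, or a generic fallback. */
    static String sniff(byte[] head) {
        for (Map.Entry<byte[], String> entry : SIGNATURES.entrySet()) {
            if (startsWith(head, entry.getKey())) {
                return entry.getValue();
            }
        }
        return "application/octet-stream";
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }
}
```

A renamed file such as report.txt that starts with %PDF- would still be reported as application/pdf by this sketch. Note that .docx and .xlsx files are ZIP containers and share the ZIP signature, so a real detector has to look beyond the first few bytes to tell them apart.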

Embedding Tika

Tika is written in Java and can be used as a standalone command‑line tool (Tika App), as a RESTful service (Tika Server), or embedded directly via its Java API.

Architecture Components

Tika Core provides basic parsing, MIME detection, and content extraction. Tika Parsers are specialized modules for different formats, including text, media, document, and metadata parsers.

Configuration is managed through tika-config.xml, allowing custom parsers and extraction strategies.
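
As a rough illustration of what such a file can look like, the fragment below keeps the default parser set but excludes the Tesseract OCR parser, which may be useful when OCR is not wanted. Treat it as a sketch to adapt rather than a recommended configuration, and check the element names against the documentation for your Tika version.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <parsers>
    <!-- Use all default parsers except the OCR parser -->
    <parser class="org.apache.tika.parser.DefaultParser">
      <parser-exclude class="org.apache.tika.parser.ocr.TesseractOCRParser"/>
    </parser>
  </parsers>
</properties>
```

When Tika is embedded, a file like this can be loaded into a TikaConfig object, which is then passed to the Tika constructor in place of the default configuration.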

Integration Example: Sensitive Information Detection in Spring Boot

The following Maven dependencies are required:

<dependencies>
    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-web</artifactId>
    </dependency>
    <dependency>
        <groupId>org.apache.tika</groupId>
        <artifactId>tika-core</artifactId>
        <version>2.6.0</version>
    </dependency>
    <dependency>
        <groupId>org.apache.tika</groupId>
        <artifactId>tika-parsers-standard-package</artifactId>
        <version>2.6.0</version>
    </dependency>
</dependencies>

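SensitiveInfoService.java uses Tika to parse an uploaded file and applies regular expressions to find ID numbers, credit‑card numbers, and phone numbers. Note that Tika.parseToString truncates its output after 100,000 characters by default; call setMaxStringLength on the Tika instance if larger documents must be scanned in full.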

package com.example.tikademo.service;

import org.apache.tika.Tika;
import org.springframework.stereotype.Service;
import java.io.InputStream;
import java.util.regex.*;

@Service
public class SensitiveInfoService {
    private final Tika tika = new Tika();
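    // (?<!\d) and (?!\d) stop a pattern from matching inside a longer digit run,
    // so an 18-digit ID is not also reported as a card number or phone number.
    private static final String ID_CARD_REGEX = "(?<!\\d)(\\d{17}[\\dXx]|\\d{15})(?!\\d)";
    private static final String CREDIT_CARD_REGEX = "(?<!\\d)(\\d{4}-?\\d{4}-?\\d{4}-?\\d{4})(?!\\d)";
    private static final String PHONE_REGEX = "(?<!\\d)(\\d{3}-?\\d{3}-?\\d{4}|\\d{11})(?!\\d)";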

    public String checkSensitiveInfo(InputStream is) throws Exception {
        String content = tika.parseToString(is);
        StringBuilder sb = new StringBuilder();
        detectAndAppend(content, ID_CARD_REGEX, "ID Number", sb);
        detectAndAppend(content, CREDIT_CARD_REGEX, "Credit Card", sb);
        detectAndAppend(content, PHONE_REGEX, "Phone Number", sb);
        return sb.length() > 0 ? sb.toString() : "No sensitive information found";
    }

    private void detectAndAppend(String text, String regex, String label, StringBuilder out) {
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            out.append(label).append(": ").append(m.group()).append("\n");
        }
    }
}
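
Regular expressions alone tend to flag any digit run of the right length. One way to reduce false positives is to run a checksum on each match before reporting it. The sketch below shows two such filters in plain Java: the Luhn check used by payment card numbers, and a check-character test for 18-digit IDs. The second one assumes that ID_CARD_REGEX targets Chinese resident identity numbers, which the article does not state; the 15-digit form has no check character and is not covered. The class and method names are hypothetical.

```java
/** Illustrative post-filters for regex matches; names are hypothetical. */
final class MatchValidators {
    // Position weights and check characters for 18-digit resident IDs
    // (assumption: the ID pattern targets this format).
    private static final int[] ID_WEIGHTS =
            {7, 9, 10, 5, 8, 4, 2, 1, 6, 3, 7, 9, 10, 5, 8, 4, 2};
    private static final String ID_CHECK_CHARS = "10X98765432";

    /** Luhn checksum over the digits of a candidate card number; separators are ignored. */
    static boolean passesLuhn(String candidate) {
        String digits = candidate.replaceAll("\\D", "");
        if (digits.length() < 12) {
            return false;
        }
        int sum = 0;
        boolean doubleIt = false; // every second digit from the right is doubled
        for (int i = digits.length() - 1; i >= 0; i--) {
            int d = digits.charAt(i) - '0';
            if (doubleIt) {
                d *= 2;
                if (d > 9) {
                    d -= 9;
                }
            }
            sum += d;
            doubleIt = !doubleIt;
        }
        return sum % 10 == 0;
    }

    /** Verifies the final check character of an 18-character ID. */
    static boolean passesIdChecksum(String id) {
        if (id == null || !id.matches("\\d{17}[\\dXx]")) {
            return false;
        }
        int sum = 0;
        for (int i = 0; i < 17; i++) {
            sum += (id.charAt(i) - '0') * ID_WEIGHTS[i];
        }
        char expected = ID_CHECK_CHARS.charAt(sum % 11);
        return Character.toUpperCase(id.charAt(17)) == expected;
    }
}
```

In detectAndAppend, a match could then be reported only when the corresponding validator returns true. A checksum pass does not prove that a number is real, only that it is well formed, so this narrows the results rather than confirming them.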

FileController.java provides a REST endpoint to upload files.

package com.example.tikademo.controller;

import com.example.tikademo.service.SensitiveInfoService;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.web.bind.annotation.*;
import org.springframework.web.multipart.MultipartFile;
import java.io.IOException;

@RestController
@RequestMapping("/api/files")
public class FileController {
    @Autowired
    private SensitiveInfoService service;

    @PostMapping("/upload")
    public String uploadFile(@RequestParam("file") MultipartFile file) {
        try {
            return service.checkSensitiveInfo(file.getInputStream());
        } catch (IOException e) {
            return "File processing error: " + e.getMessage();
        } catch (Exception e) {
            return "Parsing error: " + e.getMessage();
        }
    }
}

A simple index.html page can be placed under src/main/resources/static/ to test the upload functionality.

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>Upload File for Sensitive Information Detection</title>
</head>
<body>
    <h2>Upload a File</h2>
    <form action="/api/files/upload" method="post" enctype="multipart/form-data">
        <input type="file" name="file" required>
        <button type="submit">Upload</button>
    </form>
</body>
</html>

Typical usage scenarios include enterprise document management, content management systems, big‑data pipelines, search engine indexing, digital asset management, and information‑security checks for sensitive data leakage.

Summary

By embedding Apache Tika into a Spring Boot application, developers can automatically parse diverse file types, extract text and metadata, and apply custom logic, such as regular‑expression‑based sensitive‑data detection, to help catch data leaks before they spread.
