Apache Tika: Extract Multi-Format Content & Detect Sensitive Data in Spring Boot
This article introduces Apache Tika's core capabilities (parsing a wide range of file formats, automatic type detection, OCR, and language detection), then shows how to integrate Tika into a Spring Boot service that extracts text and flags sensitive information such as ID numbers, credit card numbers, and phone numbers.
Apache Tika Overview
Apache Tika is a Java library for detecting file types and extracting text, metadata, and structured information from a wide range of formats.
Supported Formats
Office documents: Microsoft Word (.doc, .docx), Excel (.xls, .xlsx), PowerPoint (.ppt, .pptx), OpenOffice (.odt, .ods) and similar.
PDF: Text and metadata extraction.
HTML / XML: Parsing of HTML and XML content.
Plain text: .txt and other simple text files.
Images, audio, video: JPEG, PNG, MP3, MP4, WAV, etc., with metadata extraction.
Email: EML files.
Compressed archives: ZIP, TAR, GZ and other archive types.
Key Capabilities
Automatic MIME‑type detection based on file content rather than extension.
Extraction of raw text and common metadata fields (author, creation date, modification date, file size, copyright, etc.).
OCR support via Tesseract for scanned images or PDFs containing pictures of text.
Language detection for the extracted text.
Multithreaded processing for high‑throughput batch jobs.
Unified output as plain text, XHTML, or JSON to simplify downstream integration.
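To make "detection based on file content rather than extension" concrete, the sketch below checks the leading "magic bytes" of a file using only the JDK. It is an illustration of the idea, not Tika's implementation: the class name MagicByteSniffer and its short signature table are hypothetical, and Tika's own detector (reachable through the detect methods on the Tika facade) covers far more formats and also inspects container contents.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper that illustrates content-based type detection.
public class MagicByteSniffer {

    // Guesses a MIME type from the first bytes of a file; the file name is never consulted.
    public static String sniff(byte[] head) {
        if (startsWith(head, "%PDF-".getBytes(StandardCharsets.US_ASCII))) {
            return "application/pdf";
        }
        if (startsWith(head, new byte[] {(byte) 0x89, 'P', 'N', 'G'})) {
            return "image/png";
        }
        if (startsWith(head, new byte[] {(byte) 0xFF, (byte) 0xD8, (byte) 0xFF})) {
            return "image/jpeg";
        }
        if (startsWith(head, new byte[] {'P', 'K', 0x03, 0x04})) {
            // ZIP container; .docx, .xlsx, and .odt files also begin with these bytes.
            return "application/zip";
        }
        // Unknown signature: fall back to the generic binary type.
        return "application/octet-stream";
    }

    // True if data begins with the given prefix.
    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        return Arrays.equals(Arrays.copyOf(data, prefix.length), prefix);
    }
}
```

A PDF renamed to report.txt would still be reported as application/pdf by this kind of check, which is why content-based detection is preferable for uploads from untrusted sources.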
Core Architecture
Tika Core: Provides parsing, MIME detection, and content‑extraction APIs.
Tika Parsers: A collection of format‑specific parsers built on libraries such as Apache POI, PDFBox, and Tesseract.
Tika Config: XML configuration file (tika-config.xml) for custom parser selection, character‑set handling, and extraction policies.
Tika App: Command‑line tool for quick extraction.
Tika Server: RESTful service exposing a /tika endpoint for remote parsing.
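As a rough sketch of how a client might call the /tika endpoint mentioned above, the following uses the JDK's java.net.http client (Java 11 or later). The class name TikaServerClient is hypothetical, and the URL assumes a Tika Server running locally on its default port 9998; adjust both for a real deployment.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Hypothetical client for a locally running Tika Server.
public class TikaServerClient {

    // Assumption: server on localhost with the default port 9998.
    static final URI TIKA_ENDPOINT = URI.create("http://localhost:9998/tika");

    // Builds a PUT request that uploads the raw file bytes and asks for plain text back.
    public static HttpRequest buildRequest(URI endpoint, byte[] fileBytes) {
        return HttpRequest.newBuilder(endpoint)
                .header("Accept", "text/plain")
                .PUT(HttpRequest.BodyPublishers.ofByteArray(fileBytes))
                .build();
    }

    // Sends the file to the server and returns the extracted text.
    public static String extractText(byte[] fileBytes) throws IOException, InterruptedException {
        HttpClient client = HttpClient.newHttpClient();
        HttpResponse<String> response =
                client.send(buildRequest(TIKA_ENDPOINT, fileBytes), HttpResponse.BodyHandlers.ofString());
        if (response.statusCode() != 200) {
            throw new IOException("Tika Server returned HTTP " + response.statusCode());
        }
        return response.body();
    }
}
```

Running parsing in a separate server process keeps parser crashes and memory spikes out of the calling application, at the cost of an extra network hop.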
Spring Boot Integration – Sensitive‑Info Detection
This example demonstrates embedding Tika in a Spring Boot application to extract file content and scan it for personal identifiers using regular expressions.
1. Maven Dependencies (pom.xml)
<dependencies>
    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-web</artifactId>
    </dependency>
    <dependency>
        <groupId>org.apache.tika</groupId>
        <artifactId>tika-core</artifactId>
        <version>2.6.0</version>
    </dependency>
    <!-- In Tika 2.x the bundled parsers ship as tika-parsers-standard-package;
         the old tika-parsers artifact is no longer a jar dependency. -->
    <dependency>
        <groupId>org.apache.tika</groupId>
        <artifactId>tika-parsers-standard-package</artifactId>
        <version>2.6.0</version>
    </dependency>
</dependencies>
2. Service Class
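package com.example.tikademo.service;

import org.apache.tika.Tika;
import org.apache.tika.exception.TikaException;
import org.springframework.stereotype.Service;

import java.io.IOException;
import java.io.InputStream;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

@Service
public class SensitiveInfoService {

    private final Tika tika = new Tika();

    // The (?<!\d) and (?!\d) lookarounds stop a pattern from matching inside a longer run of digits.
    // 18-character ID (last character may be X) or 15-digit ID.
    private static final String ID_CARD_REGEX = "(?<!\\d)(\\d{17}[\\dXx]|\\d{15})(?!\\d)";
    // 16 digits, optionally grouped in fours with hyphens.
    private static final String CREDIT_CARD_REGEX = "(?<!\\d)\\d{4}-?\\d{4}-?\\d{4}-?\\d{4}(?!\\d)";
    // 11-digit number (3-4-4) or 10-digit number (3-3-4), hyphens optional.
    private static final String PHONE_REGEX = "(?<!\\d)(\\d{3}-?\\d{4}-?\\d{4}|\\d{3}-?\\d{3}-?\\d{4})(?!\\d)";

    public String checkSensitiveInfo(InputStream in) throws IOException {
        String text;
        try {
            // Detects the file type and returns the extracted plain text.
            // By default the Tika facade truncates the result at 100,000 characters.
            text = tika.parseToString(in);
        } catch (TikaException e) {
            // Rethrow parser failures as IOException so callers handle one exception type.
            throw new IOException("Tika could not parse the file", e);
        }
        StringBuilder sb = new StringBuilder();
        detect(text, ID_CARD_REGEX, "ID Card", sb);
        detect(text, CREDIT_CARD_REGEX, "Credit Card", sb);
        detect(text, PHONE_REGEX, "Phone Number", sb);
        return sb.length() > 0 ? sb.toString() : "No sensitive information detected";
    }

    // Appends one "label: value" line to out for every match of regex in content.
    private void detect(String content, String regex, String label, StringBuilder out) {
        Matcher m = Pattern.compile(regex).matcher(content);
        while (m.find()) {
            out.append(label).append(": ").append(m.group()).append("\n");
        }
    }
}
3. REST Controller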
package com.example.tikademo.controller;

import com.example.tikademo.service.SensitiveInfoService;
import org.springframework.web.bind.annotation.*;
import org.springframework.web.multipart.MultipartFile;

import java.io.IOException;
import java.io.InputStream;

@RestController
@RequestMapping("/api/files")
public class FileController {

    private final SensitiveInfoService service;

    // Constructor injection: Spring supplies the service bean.
    public FileController(SensitiveInfoService service) {
        this.service = service;
    }

    @PostMapping("/upload")
    public String upload(@RequestParam("file") MultipartFile file) {
        // try-with-resources closes the upload stream once parsing is done.
        try (InputStream in = file.getInputStream()) {
            return service.checkSensitiveInfo(in);
        } catch (IOException e) {
            return "File processing error: " + e.getMessage();
        }
    }
}
4. Test Procedure
Create a text file test.txt containing sample ID numbers, credit‑card numbers and phone numbers.
Start the Spring Boot application (default port 8080).
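POST the file as multipart form data (field name "file") to http://localhost:8080/api/files/upload using curl, Postman, or an HTML upload form, for example: curl -F "file=@test.txt" http://localhost:8080/api/files/upload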
The service returns each detected value, for example:
ID Card: 123456789012345678
Credit Card: 1234-5678-9876-5432
Phone Number: 138-1234-5678
Notes and Extensions
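A pattern that accepts any 16 digits will also flag order numbers and other identifiers that merely look like card numbers. One common way to cut such false positives is to run each candidate through the Luhn checksum used by payment card numbers. The sketch below is a possible add-on, not part of the original service; the class name LuhnCheck is hypothetical.

```java
// Hypothetical helper: filters credit-card candidates with the Luhn checksum.
public class LuhnCheck {

    // Returns true if the digits in the input pass the Luhn checksum.
    // Hyphens and whitespace are ignored; any other non-digit makes the input invalid.
    public static boolean isValid(String candidate) {
        String digits = candidate.replaceAll("[\\s-]", "");
        if (!digits.matches("\\d+")) {
            return false;
        }
        int sum = 0;
        boolean doubleIt = false; // the rightmost digit is not doubled
        for (int i = digits.length() - 1; i >= 0; i--) {
            int d = digits.charAt(i) - '0';
            if (doubleIt) {
                d *= 2;
                if (d > 9) {
                    d -= 9; // same as adding the two digits of the product
                }
            }
            sum += d;
            doubleIt = !doubleIt;
        }
        return sum % 10 == 0;
    }
}
```

With this check, the sample value 1234-5678-9876-5432 from the test output would be rejected because it fails the checksum, while a standard test number such as 4111-1111-1111-1111 passes. Whether to drop or merely down-rank failing candidates is a design choice for the application.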
Additional regular expressions can be added to detect emails, addresses, SSNs, etc.
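For scanned PDFs and images, install the Tesseract command‑line program and make sure it is on the PATH; Tika's OCR parser calls the tesseract executable and is skipped when it cannot be found. Whether pages inside a PDF are sent to OCR depends on the PDF parser's OCR strategy setting in the Tika configuration.
Tika Server exposes the same text extraction over HTTP and can return plain text, XHTML, or JSON, so other services can use it without embedding the library; the regular‑expression scan itself would still run in the calling code.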
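For the first note, one way to keep additional patterns manageable is a small registry of labeled, precompiled patterns that is scanned in a loop. The sketch below uses only the JDK; the class name PatternRegistry and both patterns are illustrative assumptions (a deliberately loose email pattern and the hyphenated US SSN layout), not rules taken from the original service.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical registry of extra detectors; add entries as new data types are needed.
public class PatternRegistry {

    // Label -> precompiled pattern; LinkedHashMap keeps insertion order for stable output.
    private static final Map<String, Pattern> PATTERNS = new LinkedHashMap<>();

    static {
        // Deliberately simple email pattern; not a full RFC 5322 validator.
        PATTERNS.put("Email", Pattern.compile("[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}"));
        // US Social Security number in the hyphenated 3-2-4 form only.
        PATTERNS.put("SSN", Pattern.compile("(?<!\\d)\\d{3}-\\d{2}-\\d{4}(?!\\d)"));
    }

    // Returns one "Label: value" entry per match, grouped by label in registry order.
    public static List<String> scan(String text) {
        List<String> findings = new ArrayList<>();
        for (Map.Entry<String, Pattern> entry : PATTERNS.entrySet()) {
            Matcher m = entry.getValue().matcher(text);
            while (m.find()) {
                findings.add(entry.getKey() + ": " + m.group());
            }
        }
        return findings;
    }
}
```

Compiling each pattern once, rather than on every call, also avoids repeated compilation cost when many files are scanned.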
Architect
Professional architect sharing high‑quality architecture insights. Topics include high‑availability, high‑performance, high‑stability architectures, big data, machine learning, Java, system and distributed architecture, AI, and practical large‑scale architecture case studies. Open to ideas‑driven architects who enjoy sharing and learning.