Extracting Personal Information from PDF, DOC, DOCX, and TXT Files Using Apache Tika

This tutorial shows how to use Apache Tika in a Java project to parse PDF, Word, and plain-text documents and extract specific fields such as a name and an ID number. It covers the required Maven dependencies and sample code for the extraction.

Java Captain

Apr 27, 2025

This article explains how to extract specific fields, such as a person's name and ID number, from PDF, DOC, DOCX, and TXT documents using Apache Tika. The steps below were tested by the author.

1. Add Maven dependencies

<!-- apache tika package for parsing pdf, word, txt -->
<dependency>
    <groupId>org.apache.tika</groupId>
    <artifactId>tika-core</artifactId>
    <version>2.8.0</version>
</dependency>
<!-- tika-parsers-standard-package: bundles the parsers for the supported formats -->
<dependency>
    <groupId>org.apache.tika</groupId>
    <artifactId>tika-parsers-standard-package</artifactId>
    <version>2.8.0</version>
</dependency>
<!-- xmlbeans needed for Word parsing -->
<dependency>
    <groupId>org.apache.xmlbeans</groupId>
    <artifactId>xmlbeans</artifactId>
    <version>5.1.1</version>
</dependency>

2. Write the Java code

package org.example.wordcontent;

import org.apache.tika.Tika;
import org.apache.tika.exception.TikaException;
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/**
 * Extracts a name and an ID number from PDF, DOC, DOCX, or TXT files using Apache Tika.
 * Dependencies: tika-core 2.8.0 and tika-parsers-standard-package 2.8.0
 * (plus xmlbeans 5.1.1 for Word documents).
 * Assumes the document contains labeled fields such as:
 *   授权人（签字）：张三                 (authorizer's signature: Zhang San)
 *   身份证号码: 322025199902256056    (ID number)
 */
public class TikaExtractor {
    public static void main(String[] args) {
        // Replace with an actual file path; this example loads a classpath resource.
        String resource = "综合信息查询授权书测试.docx";
        try (InputStream input = TikaExtractor.class.getClassLoader().getResourceAsStream(resource)) {
            if (input == null) {
                // getResourceAsStream returns null when the resource is not on the classpath.
                throw new IOException("Resource not found: " + resource);
            }
            String text = extractTextFromFile(input);
            System.out.println("text: " + text);
            String name = extractName(text);
            String idNumber = extractIdNumber(text);
            System.out.println("Authorizer name: " + name);
            System.out.println("ID number: " + idNumber);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

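    private static String extractTextFromFile(InputStream inputStream) throws IOException {
        // The Tika facade detects the file type and picks a matching parser.
        // parseToString returns at most 100,000 characters by default;
        // tika.setMaxStringLength(-1) removes that limit for longer documents.
        Tika tika = new Tika();
        try {
            return tika.parseToString(inputStream);
        } catch (TikaException e) {
            // Report parse failures as IOException so the caller's catch block handles them.
            throw new IOException("Tika could not parse the document", e);
        }
    }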

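    // Label, optional horizontal whitespace (\h), full-width or ASCII colon, then the Chinese name.
    private static final Pattern NAME_PATTERN =
            Pattern.compile("授权人（签字）\\h*[：:]\\h*([\\u4e00-\\u9fa5]+)");

    // 18-character ID (17 digits plus a digit or X check character) or legacy 15-digit ID.
    private static final Pattern ID_NUMBER_PATTERN =
            Pattern.compile("身份证号码\\h*[：:]\\h*(\\d{17}[\\dXx]|\\d{15})");

    /** Returns the name after the 授权人（签字） label, or "" if no match is found. */
    private static String extractName(String text) {
        Matcher matcher = NAME_PATTERN.matcher(text);
        return matcher.find() ? matcher.group(1) : "";
    }

    /** Returns the ID number after the 身份证号码 label, or "" if no match is found. */
    private static String extractIdNumber(String text) {
        Matcher matcher = ID_NUMBER_PATTERN.matcher(text);
        return matcher.find() ? matcher.group(1) : "";
    }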
}
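
To tune the two regular expressions without a Word or PDF file at hand, they can be exercised on plain strings. The following is a minimal JDK-only sketch, not part of the original program: `FieldPatternDemo` and its method names are hypothetical, and the patterns assume that optional horizontal whitespace may surround the colon and that an 18-character ID may end in X. The labels are written as Unicode escapes so the file compiles under any source encoding; they spell 授权人（签字） and 身份证号码, and the sample name is 张三.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: tries the two field patterns on plain strings, no Tika needed. */
final class FieldPatternDemo {

    // Field labels, written as Unicode escapes to stay independent of the source encoding.
    private static final String NAME_LABEL = "\u6388\u6743\u4eba\uff08\u7b7e\u5b57\uff09";
    private static final String ID_LABEL = "\u8eab\u4efd\u8bc1\u53f7\u7801";

    // Assumption: horizontal whitespace (\h) may appear around a full-width or ASCII colon.
    // \h never matches a line break, so an empty field cannot swallow the next line.
    private static final Pattern NAME =
            Pattern.compile(NAME_LABEL + "\\h*[\uff1a:]\\h*([\\u4e00-\\u9fa5]+)");

    // 17 digits plus a digit or X is tried before the legacy 15-digit form.
    private static final Pattern ID_NUMBER =
            Pattern.compile(ID_LABEL + "\\h*[\uff1a:]\\h*(\\d{17}[\\dXx]|\\d{15})");

    private static Optional<String> firstGroup(Pattern pattern, String text) {
        Matcher matcher = pattern.matcher(text);
        return matcher.find() ? Optional.of(matcher.group(1)) : Optional.empty();
    }

    static Optional<String> extractName(String text) {
        return firstGroup(NAME, text);
    }

    static Optional<String> extractIdNumber(String text) {
        return firstGroup(ID_NUMBER, text);
    }

    public static void main(String[] args) {
        // The same two fields as in the sample document described in the Javadoc.
        String sample = NAME_LABEL + "\uff1a\u5f20\u4e09\n"
                + ID_LABEL + ": 322025199902256056\n";
        System.out.println(extractName(sample).orElse("<no match>"));
        System.out.println(extractIdNumber(sample).orElse("<no match>"));
    }
}
```

Returning `Optional` instead of an empty string is a design choice of this sketch; it makes a missing field explicit to the caller.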

3. Execution result

Running the program prints the extracted text, the name (e.g., 张三), and the ID number (e.g., 322025199902256056). The original article includes a screenshot of the console output.
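
An 18-character ID number carries the holder's birth date in characters 7 to 14 (yyyyMMdd), so a light plausibility check on the extracted value is possible with the JDK alone. The following is a sketch under that assumption and is not part of the original program: `IdNumberBirthDate` is a hypothetical name, and the check neither verifies the final check character nor handles 15-digit IDs.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.util.Optional;

/** Hypothetical helper: reads the birth date embedded in an 18-character ID number. */
final class IdNumberBirthDate {

    /** Returns the birth date, or empty if the length is wrong or the date is invalid. */
    static Optional<LocalDate> birthDate(String idNumber) {
        if (idNumber == null || idNumber.length() != 18) {
            return Optional.empty();
        }
        try {
            // Characters 7-14 (zero-based 6..13) hold yyyyMMdd.
            // BASIC_ISO_DATE parses that form and rejects impossible dates.
            String datePart = idNumber.substring(6, 14);
            return Optional.of(LocalDate.parse(datePart, DateTimeFormatter.BASIC_ISO_DATE));
        } catch (DateTimeParseException e) {
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        System.out.println(birthDate("322025199902256056")); // sample ID from the article
        System.out.println(birthDate("322025199902306056")); // 30 February, so no date
    }
}
```

A value that fails this check is more likely a regex false positive or a typo in the document than a usable ID number, which makes it a cheap filter before any further processing.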
