
Extracting Personal Information from PDF, DOC, DOCX, and TXT Files Using Apache Tika

This tutorial shows how to use Apache Tika in a Java project to parse PDF, Word, and plain-text documents and extract specific fields such as a name and an ID number. It covers the required Maven dependencies and sample code for the extraction.

Java Captain

This article explains how to extract specific fields, such as a person's name and ID number, from documents in several formats (PDF, DOC, DOCX, TXT) using Apache Tika. The steps below were tested by the author.

1. Add Maven dependencies

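<!-- Apache Tika packages for parsing PDF, Word, and TXT files -->
<dependency>
    <groupId>org.apache.tika</groupId>
    <artifactId>tika-core</artifactId>
    <version>2.8.0</version>
</dependency>
<dependency>
    <groupId>org.apache.tika</groupId>
    <artifactId>tika-parsers-standard-package</artifactId>
    <version>2.8.0</version>
</dependency>
<dependency>
    <groupId>org.apache.xmlbeans</groupId>
    <artifactId>xmlbeans</artifactId>
    <version>5.1.1</version>
</dependency>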

2. Write the Java code

package org.example.wordcontent;

import org.apache.tika.Tika;
import org.apache.tika.exception.TikaException;
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/**
 * Extracts data from PDF, DOC, DOCX, and TXT files using Apache Tika.
 * Core jars: tika-core 2.8.0 and tika-parsers-standard-package 2.8.0
 * (Word parsing also requires xmlbeans 5.1.1).
 * Assumes documents contain fields like:
 *   授权人(签字):张三                  (Authorizer (signature): Zhang San)
 *   身份证号码: 322025199902256056     (ID number: ...)
 */
public class TikaExtrator {
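    public static void main(String[] args) {
        // Replace with the actual file path; this example loads a file from the classpath.
        // try-with-resources closes the stream when parsing is done.
        try (InputStream input = TikaExtrator.class.getClassLoader()
                .getResourceAsStream("综合信息查询授权书测试.docx")) {
            if (input == null) {
                // getResourceAsStream returns null when the resource is missing.
                throw new IOException("Resource not found on the classpath");
            }
            String text = extractTextFromFile(input);
            System.out.println("text: " + text);
            String name = extractName(text);
            String idNumber = extractIdNumber(text);
            System.out.println("授权人姓名: " + name);     // authorizer name
            System.out.println("身份证号码: " + idNumber); // ID number
        } catch (IOException e) {
            e.printStackTrace();
        }
    }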

    private static String extractTextFromFile(InputStream inputStream) throws IOException {
        Tika tika = new Tika();
        try {
            return tika.parseToString(inputStream);
        } catch (TikaException e) {
            throw new RuntimeException(e);
        }
    }

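    private static String extractName(String text) {
        // The label's parentheses sit inside character classes so they match literally
        // (ASCII or full-width) instead of forming a capture group. Whitespace around
        // the colon is allowed. Group 1 is the name (a run of CJK characters).
        Pattern pattern = Pattern.compile("授权人[(（]签字[)）]\\s*[:：]\\s*([\\u4e00-\\u9fa5]+)");
        Matcher matcher = pattern.matcher(text);
        if (matcher.find()) {
            return matcher.group(1);
        }
        return "";
    }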

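    private static String extractIdNumber(String text) {
        // Accepts an ASCII or full-width colon with optional whitespace around it.
        // Group 1 is an 18-character ID (the last character may be X) or a 15-digit ID.
        Pattern pattern = Pattern.compile("身份证号码\\s*[:：]\\s*(\\d{17}[\\dXx]|\\d{15})");
        Matcher matcher = pattern.matcher(text);
        if (matcher.find()) {
            return matcher.group(1);
        }
        return "";
    }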
}
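
The two patterns can be tried out without Tika or a document file. The following is a minimal sketch, not the author's code: the class name `FieldRegexDemo` and its helper methods are hypothetical, and the NFKC normalization step is an added assumption (it folds full-width punctuation such as （ ） ： into ASCII before matching). The labels 授权人(签字) and 身份证号码 and the sample name 张三 are written as Unicode escapes so the file compiles regardless of source encoding.

```java
import java.text.Normalizer;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FieldRegexDemo {
    // Label "authorizer (signature)", a colon, then a run of CJK characters (the name).
    // Text is normalized first, so only ASCII parentheses and colons need to be handled.
    private static final Pattern NAME = Pattern.compile(
            "\u6388\u6743\u4eba\\(\u7b7e\u5b57\\)\\s*:\\s*([\\u4e00-\\u9fa5]+)");

    // Label "ID number", a colon, then 18 characters (last may be X) or 15 digits.
    // The negative lookahead rejects longer digit runs such as 16 or 19 digits.
    private static final Pattern ID = Pattern.compile(
            "\u8eab\u4efd\u8bc1\u53f7\u7801\\s*:\\s*(\\d{17}[\\dXx]|\\d{15})(?!\\d)");

    // NFKC maps full-width forms to their ASCII equivalents.
    static String normalize(String text) {
        return Normalizer.normalize(text, Normalizer.Form.NFKC);
    }

    // Returns capture group 1 of the first match, or "" when nothing matches.
    static String firstGroup(Pattern pattern, String text) {
        Matcher m = pattern.matcher(normalize(text));
        return m.find() ? m.group(1) : "";
    }

    static String extractName(String text) {
        return firstGroup(NAME, text);
    }

    static String extractIdNumber(String text) {
        return firstGroup(ID, text);
    }

    public static void main(String[] args) {
        // Sample mirrors the Javadoc fields, using full-width punctuation on the first line.
        String sample = "\u6388\u6743\u4eba\uff08\u7b7e\u5b57\uff09\uff1a\u5f20\u4e09\n"
                + "\u8eab\u4efd\u8bc1\u53f7\u7801: 322025199902256056\n";
        System.out.println(extractName(sample));
        System.out.println(extractIdNumber(sample));
    }
}
```

If the labels in real documents differ from the sample, only the two pattern constants should need to change.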

3. Execution result

Running the program prints the extracted text, the name (e.g., 张三), and the ID number (e.g., 322025199902256056). The original article includes a screenshot of the console output.
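
A regex match only confirms that the extracted value has the right shape. As an optional extra step that is not part of the original code, an 18-character ID can also be checked against the check-character rule of GB 11643-1999. The sketch below assumes that rule applies to the IDs in your documents; `IdChecksum` and `isValid18` are hypothetical names. The sample number used in this article does not satisfy the rule, which is expected for a made-up value.

```java
class IdChecksum {
    // Per-position weights for the first 17 digits.
    private static final int[] WEIGHTS = {7, 9, 10, 5, 8, 4, 2, 1, 6, 3, 7, 9, 10, 5, 8, 4, 2};
    // Check character indexed by (weighted sum mod 11).
    private static final String CHECK_CHARS = "10X98765432";

    // Returns true when the 18th character matches the computed check character.
    static boolean isValid18(String id) {
        if (id == null || !id.matches("\\d{17}[\\dXx]")) {
            return false;
        }
        int sum = 0;
        for (int i = 0; i < 17; i++) {
            sum += (id.charAt(i) - '0') * WEIGHTS[i];
        }
        char expected = CHECK_CHARS.charAt(sum % 11);
        return expected == Character.toUpperCase(id.charAt(17));
    }

    public static void main(String[] args) {
        System.out.println(isValid18("11010519491231002X")); // prints true
        System.out.println(isValid18("322025199902256056")); // prints false
    }
}
```

A failed check does not necessarily mean the extraction went wrong; it may simply flag a placeholder or mistyped number for manual review.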
