
How to Read Excel, Word, PDF, and Text Files in Java

This article explains how to use Java libraries such as Apache POI, PDFBox, and EasyExcel to read Excel, DOC/DOCX, PDF, and plain text files, providing complete code examples, required Maven dependencies, and step‑by‑step usage instructions for each file type.

Java Architect Essentials

Java backend code often has to read several file formats: Excel spreadsheets, Word documents (both .doc and .docx), PDF files, and plain text. Apache POI is the most widely used library for Office documents, while Apache PDFBox handles PDFs.

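The following utility method dispatches on the file extension supplied by the caller and reads the content accordingly. It throws an exception for empty files and returns null for unsupported formats. MultipartFile comes from Spring; BusinessException, ResultCodeEnum, and log belong to the surrounding project and are not shown here: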

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;
import org.apache.poi.hwpf.extractor.WordExtractor;
import org.apache.poi.xwpf.extractor.XWPFWordExtractor;
import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.springframework.web.multipart.MultipartFile;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

private String readFileContent(MultipartFile file, String fileExtension) throws IOException {
    byte[] fileBytes = file.getBytes();
    if (fileBytes.length == 0) {
        throw new BusinessException(ResultCodeEnum.FILE_CONTENT_IS_EMPTY);
    }
    switch (fileExtension) {
        case "txt":
            // Assumes the uploaded text is UTF-8 encoded.
            return new String(fileBytes, StandardCharsets.UTF_8);
        case "pdf":
            // PDDocument.load is the PDFBox 2.x API; PDFBox 3.x replaced it with Loader.loadPDF.
            try (PDDocument doc = PDDocument.load(file.getInputStream())) {
                PDFTextStripper textStripper = new PDFTextStripper();
                return textStripper.getText(doc);
            }
        case "docx":
            // Closing the extractor also releases the document it wraps.
            try (InputStream stream = file.getInputStream();
                 XWPFWordExtractor extractor = new XWPFWordExtractor(new XWPFDocument(stream))) {
                return extractor.getText();
            }
        case "doc":
            // WordExtractor lives in the poi-scratchpad artifact (HWPF).
            try (InputStream stream = file.getInputStream();
                 WordExtractor extractor = new WordExtractor(stream)) {
                return extractor.getText();
            }
        default:
            log.error("Unsupported file format");
            return null;
    }
}
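
The method above expects the caller to supply the extension already normalized to lower case. A minimal sketch of how that value could be derived from a file name using only the JDK is shown here. The class and method names (FileExtensions, extensionOf) are illustrative assumptions and do not come from the original code:

```java
import java.util.Locale;

final class FileExtensions {
    private FileExtensions() {}

    /** Returns the lower-case extension without the dot, or "" if there is none. */
    static String extensionOf(String fileName) {
        if (fileName == null) {
            return "";
        }
        // Some clients send a full path; keep only the last path segment.
        int slash = Math.max(fileName.lastIndexOf('/'), fileName.lastIndexOf('\\'));
        String name = fileName.substring(slash + 1);
        int dot = name.lastIndexOf('.');
        // dot <= 0 covers "no dot" and dotfiles such as ".gitignore";
        // a trailing dot means there is no extension either.
        if (dot <= 0 || dot == name.length() - 1) {
            return "";
        }
        return name.substring(dot + 1).toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        System.out.println(extensionOf("Report.PDF"));       // pdf
        System.out.println(extensionOf("notes.final.docx")); // docx
        System.out.println(extensionOf("README").isEmpty()); // true
    }
}
```

With Spring's MultipartFile, the result of extensionOf(file.getOriginalFilename()) could then be passed as the fileExtension argument. An extension is only a hint supplied by the client, so the parsers may still reject a mislabeled file with an exception.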

To compile the above code, add the following Maven dependencies to your pom.xml. The Tika and iText entries are optional and are not used by any example in this article:

<dependencies>
    <!-- Apache POI for Office documents -->
    <dependency>
        <groupId>org.apache.poi</groupId>
        <artifactId>poi</artifactId>
        <version>5.0.0</version>
    </dependency>
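    <dependency>
        <groupId>org.apache.poi</groupId>
        <artifactId>poi-ooxml</artifactId>
        <version>5.0.0</version>
    </dependency>
    <!-- poi-scratchpad provides the HWPF classes (HWPFDocument, WordExtractor) for legacy .doc files -->
    <dependency>
        <groupId>org.apache.poi</groupId>
        <artifactId>poi-scratchpad</artifactId>
        <version>5.0.0</version>
    </dependency>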
    <!-- Apache PDFBox for PDF files -->
    <dependency>
        <groupId>org.apache.pdfbox</groupId>
        <artifactId>pdfbox</artifactId>
        <version>2.0.24</version>
    </dependency>
    <!-- Apache Tika (optional) -->
    <dependency>
        <groupId>org.apache.tika</groupId>
        <artifactId>tika-core</artifactId>
        <version>2.1.0</version>
    </dependency>
    <!-- iText (optional) -->
    <dependency>
        <groupId>com.itextpdf</groupId>
        <artifactId>itextpdf</artifactId>
        <version>5.5.13</version>
    </dependency>
</dependencies>
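
Plain text needs none of these libraries; in the utility method it is handled by the txt branch alone. As a minimal standalone sketch using only the JDK, the following reads a text file from disk. It assumes the file is UTF-8 encoded, and the names TextFileReader and readUtf8 are illustrative rather than taken from the original code:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

final class TextFileReader {
    private TextFileReader() {}

    /** Reads the whole file as UTF-8 text and drops a leading byte order mark if present. */
    static String readUtf8(Path path) throws IOException {
        byte[] bytes = Files.readAllBytes(path);
        String text = new String(bytes, StandardCharsets.UTF_8);
        // Some editors prepend a BOM (U+FEFF); it is not part of the content.
        return text.startsWith("\uFEFF") ? text.substring(1) : text;
    }

    public static void main(String[] args) throws IOException {
        // Write a small temporary file so the example runs without external input.
        Path tmp = Files.createTempFile("sample", ".txt");
        try {
            Files.write(tmp, "line one\nline two\n".getBytes(StandardCharsets.UTF_8));
            System.out.print(readUtf8(tmp)); // prints the two lines written above
        } finally {
            Files.deleteIfExists(tmp);
        }
    }
}
```

Files.readAllBytes loads the entire file into memory, which matches the behavior of the utility method. For very large files, reading line by line with Files.newBufferedReader would be the more cautious choice.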

The following simple example reads a PDF file: it loads the document, extracts the text with PDFTextStripper, prints the content, and closes the document:

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;
import java.io.File;
import java.io.IOException;

public class PdfReaderExample {
    public static void main(String[] args) {
        File file = new File("path_to_your_pdf_file.pdf");
        // try-with-resources closes the document even if text extraction fails.
        // PDDocument.load is the PDFBox 2.x API; PDFBox 3.x replaced it with Loader.loadPDF.
        try (PDDocument document = PDDocument.load(file)) {
            PDFTextStripper textStripper = new PDFTextStripper();
            String content = textStripper.getText(document);
            System.out.println(content);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

For DOCX files, Apache POI’s XWPF classes are used. The example below loads a DOCX file via FileInputStream, iterates over its paragraphs, builds a string, prints it, and closes the resources:

import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.apache.poi.xwpf.usermodel.XWPFParagraph;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

public class DocxReaderExample {
    public static void main(String[] args) {
        File file = new File("path_to_your_docx_file.docx");
        // try-with-resources closes the document and the stream, even on failure.
        try (InputStream fis = new FileInputStream(file);
             XWPFDocument document = new XWPFDocument(fis)) {
            StringBuilder content = new StringBuilder();
            // Only body paragraphs are visited; text inside tables is not included.
            for (XWPFParagraph paragraph : document.getParagraphs()) {
                content.append(paragraph.getText()).append("\n");
            }
            System.out.println(content);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

Reading legacy .doc files uses POI’s HWPF module, which ships in the separate poi-scratchpad artifact. The code creates an HWPFDocument, extracts text with WordExtractor, and returns the result:

import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.hwpf.extractor.WordExtractor;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;

public class DocTextExtractor {
    public static String extractTextFromDoc(String filePath) {
        File file = new File(filePath);
        // Resources are closed in reverse order: extractor, document, then stream.
        try (FileInputStream fis = new FileInputStream(file);
             HWPFDocument document = new HWPFDocument(fis);
             WordExtractor extractor = new WordExtractor(document)) {
            return extractor.getText();
        } catch (IOException e) {
            e.printStackTrace();
        }
        // Reached only when reading failed.
        return null;
    }
    public static void main(String[] args) {
        String filePath = "path_to_your_doc_file.doc";
        String extractedText = extractTextFromDoc(filePath);
        System.out.println(extractedText);
    }
}

Reading Excel files with POI involves opening the workbook, iterating over rows and cells, and printing each cell’s value. The following snippet demonstrates this process:

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import org.apache.poi.ss.usermodel.Cell;
import org.apache.poi.ss.usermodel.Row;
import org.apache.poi.ss.usermodel.Sheet;
import org.apache.poi.xssf.usermodel.XSSFWorkbook;

public class ExcelReader {
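    public static void main(String[] args) throws IOException {
        // XSSFWorkbook reads the .xlsx format; legacy .xls files need HSSFWorkbook instead.
        File file = new File("path/to/excel/file");
        // try-with-resources closes both the workbook and the input stream.
        try (FileInputStream inputStream = new FileInputStream(file);
             XSSFWorkbook workbook = new XSSFWorkbook(inputStream)) {
            Sheet sheet = workbook.getSheetAt(0);
            // Iteration visits only rows and cells that physically exist in the sheet.
            for (Row row : sheet) {
                for (Cell cell : row) {
                    System.out.print(cell.toString() + "\t");
                }
                System.out.println();
            }
        }
    }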
}

As an alternative, the EasyExcel library provides a more concise API for Excel operations. Add its dependency:

<dependency>
    <groupId>com.alibaba</groupId>
    <artifactId>easyexcel</artifactId>
    <version>2.4.3</version>
</dependency>

Then read an Excel file by registering a custom ReadListener implementation (YourReadListener is a placeholder for a class you write yourself):

import com.alibaba.excel.EasyExcel;
import com.alibaba.excel.read.builder.ExcelReaderBuilder;
import com.alibaba.excel.read.listener.ReadListener;

public class ExcelReader {
    public static void main(String[] args) {
        String filePath = "path_to_your_excel_file.xlsx";
        ExcelReaderBuilder readerBuilder = EasyExcel.read(filePath);
        // YourReadListener is a placeholder: supply your own ReadListener implementation.
        ReadListener<?> listener = new YourReadListener();
        readerBuilder.registerReadListener(listener);
        readerBuilder.sheet().doRead();
    }
}

These examples cover the most common scenarios for reading text, PDF, Word, and Excel files in Java, allowing developers to integrate file‑parsing capabilities into backend services or data‑processing pipelines.
