
How to Read Excel, Word, PDF, and Text Files in Java

This article explains how to use Java libraries such as Apache POI, PDFBox, and EasyExcel to read Excel, DOC/DOCX, PDF, and plain text files, providing complete code examples, required Maven dependencies, and step‑by‑step usage instructions for each file type.

Java Architect Essentials

Oct 31, 2024

In Java development, reading different file formats is a frequent requirement: Excel spreadsheets, Word documents (both .doc and .docx), PDF files, and plain text. Apache POI is a widely used library for Office documents, while Apache PDFBox handles PDFs.

The following utility method demonstrates how to detect the file extension and read its content accordingly, throwing an exception for empty files and returning null for unsupported formats:

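import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;
import org.apache.poi.hwpf.extractor.WordExtractor;
import org.apache.poi.xwpf.extractor.XWPFWordExtractor;
import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.springframework.web.multipart.MultipartFile; // assumed: Spring's MultipartFile

// BusinessException, ResultCodeEnum, and log are project-specific and not shown here.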

private String readFileContent(MultipartFile file, String fileExtension) throws IOException {
    byte[] fileBytes = file.getBytes();
    if (fileBytes.length == 0) {
        throw new BusinessException(ResultCodeEnum.FILE_CONTENT_IS_EMPTY);
    }
    switch (fileExtension) {
        case "txt":
            return new String(fileBytes, StandardCharsets.UTF_8);
        case "pdf":
            try (PDDocument doc = PDDocument.load(file.getInputStream())) {
                PDFTextStripper textStripper = new PDFTextStripper();
                return textStripper.getText(doc);
            }
        case "docx":
            // Closing the extractor also closes the underlying document
            try (InputStream stream = file.getInputStream();
                 XWPFWordExtractor extractor = new XWPFWordExtractor(new XWPFDocument(stream))) {
                return extractor.getText();
            }
        case "doc":
            try (InputStream stream = file.getInputStream();
                 WordExtractor extractor = new WordExtractor(stream)) {
                return extractor.getText();
            }
        default:
            log.error("Unsupported file format");
            return null;
    }
}
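
The method above expects the caller to supply fileExtension, but the article does not show how that value is obtained. A minimal sketch of one way to derive it from a file name follows. FileExtensions and its methods are hypothetical names introduced for illustration; with Spring's MultipartFile you would presumably pass file.getOriginalFilename().

```java
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not part of the original article.
final class FileExtensions {
    // Extensions handled by the switch in readFileContent.
    static final Set<String> SUPPORTED = Set.of("txt", "pdf", "docx", "doc");

    private FileExtensions() {}

    /** Returns the lower-cased extension without the dot, or "" if there is none. */
    static String extensionOf(String fileName) {
        if (fileName == null) {
            return "";
        }
        // Drop any directory part so a dot in a folder name is not mistaken for an extension.
        int slash = Math.max(fileName.lastIndexOf('/'), fileName.lastIndexOf('\\'));
        String name = fileName.substring(slash + 1);
        int dot = name.lastIndexOf('.');
        // No dot, a leading dot only (".gitignore"), or a trailing dot means no extension.
        if (dot <= 0 || dot == name.length() - 1) {
            return "";
        }
        return name.substring(dot + 1).toLowerCase(Locale.ROOT);
    }

    /** True when the file name ends in one of the extensions the switch handles. */
    static boolean isSupported(String fileName) {
        return SUPPORTED.contains(extensionOf(fileName));
    }
}
```

Lower-casing matters because the switch compares against lowercase literals, so an upload named REPORT.PDF would otherwise fall through to the default branch. Checking isSupported before calling readFileContent also lets the caller reject unknown formats explicitly instead of handling a null return.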

To compile the above code, add the following Maven dependencies to your pom.xml:

<dependencies>
    <!-- Apache POI for Office documents -->
    <dependency>
        <groupId>org.apache.poi</groupId>
        <artifactId>poi</artifactId>
        <version>5.0.0</version>
    </dependency>
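    <dependency>
        <groupId>org.apache.poi</groupId>
        <artifactId>poi-ooxml</artifactId>
        <version>5.0.0</version>
    </dependency>
    <!-- Required for the HWPF classes (HWPFDocument, WordExtractor) that read legacy .doc files -->
    <dependency>
        <groupId>org.apache.poi</groupId>
        <artifactId>poi-scratchpad</artifactId>
        <version>5.0.0</version>
    </dependency>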
    <!-- Apache PDFBox for PDF files -->
    <dependency>
        <groupId>org.apache.pdfbox</groupId>
        <artifactId>pdfbox</artifactId>
        <version>2.0.24</version>
    </dependency>
    <!-- Apache Tika (optional) -->
    <dependency>
        <groupId>org.apache.tika</groupId>
        <artifactId>tika-core</artifactId>
        <version>2.1.0</version>
    </dependency>
    <!-- iText (optional) -->
    <dependency>
        <groupId>com.itextpdf</groupId>
        <artifactId>itextpdf</artifactId>
        <version>5.5.13</version>
    </dependency>
</dependencies>
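
Plain text is the one format that needs none of these dependencies. As a rough standalone sketch of what the txt branch of the utility method does, the following reads a file from disk with the JDK only and decodes it as UTF-8. TextReaderExample and readText are hypothetical names, and the BOM handling is an addition that the original method does not perform.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical example class, not part of the original article.
final class TextReaderExample {
    private TextReaderExample() {}

    /** Reads the whole file into memory and decodes it as UTF-8. */
    static String readText(Path path) throws IOException {
        byte[] bytes = Files.readAllBytes(path);
        // Same decoding as the txt branch: malformed bytes become U+FFFD instead of failing.
        String content = new String(bytes, StandardCharsets.UTF_8);
        // Some editors prepend a UTF-8 byte order mark; drop it so it does not leak into the text.
        if (!content.isEmpty() && content.charAt(0) == '\uFEFF') {
            content = content.substring(1);
        }
        return content;
    }
}
```

A call such as TextReaderExample.readText(Path.of("path_to_your_text_file.txt")) returns the full content as one string. Because the whole file is held in memory, this approach suits small to medium files; very large files are better processed line by line.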

The following simple example reads a PDF file: it loads the document, extracts text using PDFTextStripper, prints the content, and closes the document. PDDocument.load is the PDFBox 2.x API; PDFBox 3.x replaces it with Loader.loadPDF.

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;
import java.io.File;
import java.io.IOException;

public class PdfReaderExample {
    public static void main(String[] args) {
        File file = new File("path_to_your_pdf_file.pdf");
        // try-with-resources closes the document even if text extraction fails
        try (PDDocument document = PDDocument.load(file)) {
            PDFTextStripper textStripper = new PDFTextStripper();
            String content = textStripper.getText(document);
            System.out.println(content);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

For DOCX files, Apache POI’s XWPF classes are used. The example below loads a DOCX file via FileInputStream, iterates over paragraphs, builds a string, prints it, and closes resources:

import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.apache.poi.xwpf.usermodel.XWPFParagraph;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

public class DocxReaderExample {
    public static void main(String[] args) {
        File file = new File("path_to_your_docx_file.docx");
        // try-with-resources closes the document and the stream even if reading fails
        try (InputStream fis = new FileInputStream(file);
             XWPFDocument document = new XWPFDocument(fis)) {
            StringBuilder content = new StringBuilder();
            for (XWPFParagraph paragraph : document.getParagraphs()) {
                content.append(paragraph.getText()).append("\n");
            }
            System.out.println(content.toString());
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

Reading legacy .doc files uses POI's HWPF module, which ships in the poi-scratchpad artifact. The code creates an HWPFDocument, extracts text with WordExtractor, and returns the result, or null if the file cannot be read:

import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.hwpf.extractor.WordExtractor;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;

public class DocTextExtractor {
    public static String extractTextFromDoc(String filePath) {
        File file = new File(filePath);
        // Resources are closed in reverse order: extractor, document, then stream
        try (FileInputStream fis = new FileInputStream(file);
             HWPFDocument document = new HWPFDocument(fis);
             WordExtractor extractor = new WordExtractor(document)) {
            return extractor.getText();
        } catch (IOException e) {
            e.printStackTrace();
        }
        return null;
    }
    public static void main(String[] args) {
        String filePath = "path_to_your_doc_file.doc";
        String extractedText = extractTextFromDoc(filePath);
        System.out.println(extractedText);
    }
}

Reading Excel files with POI involves opening the workbook, iterating over rows and cells, and printing each cell's value. XSSFWorkbook handles the .xlsx format only; legacy .xls files need HSSFWorkbook instead. The following snippet demonstrates this process:

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import org.apache.poi.ss.usermodel.Cell;
import org.apache.poi.ss.usermodel.Row;
import org.apache.poi.ss.usermodel.Sheet;
import org.apache.poi.xssf.usermodel.XSSFWorkbook;

public class ExcelReader {
    public static void main(String[] args) throws IOException {
        File file = new File("path/to/excel/file");
        // try-with-resources closes both the workbook and the input stream
        try (FileInputStream inputStream = new FileInputStream(file);
             XSSFWorkbook workbook = new XSSFWorkbook(inputStream)) {
            Sheet sheet = workbook.getSheetAt(0); // first sheet only
            for (Row row : sheet) {
                for (Cell cell : row) {
                    System.out.print(cell.toString() + "\t");
                }
                System.out.println();
            }
        }
    }
}

As an alternative, the EasyExcel library provides a more concise API for Excel operations. Add its dependency:

<dependency>
    <groupId>com.alibaba</groupId>
    <artifactId>easyexcel</artifactId>
    <version>2.4.3</version>
</dependency>

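Then read an Excel file by registering a custom ReadListener implementation. YourReadListener below is a placeholder for a class you write yourself, for example by extending EasyExcel's AnalysisEventListener and handling each row in its invoke method: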

import com.alibaba.excel.EasyExcel;
import com.alibaba.excel.read.builder.ExcelReaderBuilder;
import com.alibaba.excel.read.listener.ReadListener;

public class EasyExcelReaderExample {
    public static void main(String[] args) {
        String filePath = "path_to_your_excel_file.xlsx";
        ExcelReaderBuilder readerBuilder = EasyExcel.read(filePath);
        ReadListener<Object> listener = new YourReadListener();
        readerBuilder.registerReadListener(listener);
        readerBuilder.sheet().doRead();
    }
}

These examples cover the most common scenarios for reading text, PDF, Word, and Excel files in Java, allowing developers to integrate file‑parsing capabilities into backend services or data‑processing pipelines.
