How to Extract Text, Links, and Images from PDFs with Apache PDFBox (Java)

This guide shows how to use the open‑source Java library Apache PDFBox to programmatically extract plain text, hyperlinks, and embedded images from PDF documents, complete with step‑by‑step code examples for each task.

21CTO

Oct 2, 2019

Extracting PDF content automatically saves a programmer a lot of manual work. For Java, the open‑source library Apache PDFBox serves this purpose.

What is PDFBox

Apache PDFBox is a Java library for handling PDF documents. It can create new PDFs, update existing ones (e.g., add styles or hyperlinks), and extract content such as text, links, and images.

Extracting Text from a PDF

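Once you can pull the raw text from a PDF, half the problem is solved. The class PDFTextStripper removes all formatting and returns plain text. Its setStartPage() and setEndPage() methods take 1-based, inclusive page numbers, so the example below reads pages 1 through 3.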

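import java.io.File;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

// PDFBox 2.x API; in PDFBox 3.x, open the file with Loader.loadPDF(file) instead.
PDFTextStripper tStripper = new PDFTextStripper();
tStripper.setStartPage(1); // page numbers are 1-based
tStripper.setEndPage(3);   // the end page is inclusive
StringBuilder content = new StringBuilder();
try (PDDocument document = PDDocument.load(new File("youpdfname.pdf"))) {
    if (!document.isEncrypted()) {
        String pdfFileInText = tStripper.getText(document);
        // split on both Windows (\r\n) and Unix (\n) line endings
        String[] lines = pdfFileInText.split("\\r?\\n");
        for (String line : lines) {
            System.out.println(line);
            content.append(line).append('\n');
        }
    }
}
System.out.println(content.toString().trim());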
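The line-splitting step can be tried without a PDF at hand. The sketch below is an illustration only: ExtractedTextLines and toLines are hypothetical names, not part of PDFBox, and the sample string stands in for whatever PDFTextStripper.getText() returns. It assumes you want blank lines dropped and surrounding whitespace trimmed, which may not suit every document.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of PDFBox.
final class ExtractedTextLines {

    // Splits extracted text into trimmed, non-empty lines.
    static List<String> toLines(String extractedText) {
        List<String> lines = new ArrayList<>();
        // "\\r?\\n" matches both Windows (\r\n) and Unix (\n) line endings.
        for (String line : extractedText.split("\\r?\\n")) {
            String trimmed = line.trim();
            if (!trimmed.isEmpty()) {
                lines.add(trimmed);
            }
        }
        return lines;
    }

    public static void main(String[] args) {
        // Stand-in for the String returned by PDFTextStripper.getText(document).
        String sample = "Title\r\n\r\nFirst paragraph\nSecond paragraph\n";
        for (String line : toLines(sample)) {
            System.out.println(line);
        }
    }
}
```

Returning a list rather than one concatenated string keeps line boundaries intact, which tends to make later parsing easier.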

Extracting All Hyperlinks from a PDF

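PDFBox’s PDPage class provides getAnnotations(), which returns the list of annotations on a page. Link annotations are PDAnnotationLink instances; when a link’s action is a PDActionURI, getURI() returns the target URL. Repeating this for each page lists every hyperlink in the document.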

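import java.io.File;
import java.util.ArrayList;
import java.util.List;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.interactive.action.PDAction;
import org.apache.pdfbox.pdmodel.interactive.action.PDActionURI;
import org.apache.pdfbox.pdmodel.interactive.annotation.PDAnnotation;
import org.apache.pdfbox.pdmodel.interactive.annotation.PDAnnotationLink;

List<String> urls = new ArrayList<>();
try (PDDocument document = PDDocument.load(new File("name.pdf"))) {
    // visit every page so that no link is missed
    for (PDPage pdfpage : document.getPages()) {
        for (PDAnnotation annot : pdfpage.getAnnotations()) {
            if (annot instanceof PDAnnotationLink) {
                PDAction action = ((PDAnnotationLink) annot).getAction();
                // links without a URI action (e.g. internal jumps) are skipped
                if (action instanceof PDActionURI) {
                    String uri = ((PDActionURI) action).getURI();
                    urls.add(uri);
                    System.out.println(uri);
                }
            }
        }
    }
}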

Exporting Images from a PDF

Beyond text and links, PDFBox can extract embedded images. Using PDPage.getResources() you can iterate over XObjects and save each image as a PNG file.

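import java.io.File;
import javax.imageio.ImageIO;
import org.apache.pdfbox.cos.COSName;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDResources;
import org.apache.pdfbox.pdmodel.graphics.PDXObject;
import org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject;

try (PDDocument document = PDDocument.load(new File("name.pdf"))) {
    PDPage pdfpage = document.getPage(0); // zero-based index: 0 is the first page
    int i = 1;
    PDResources pdResources = pdfpage.getResources();
    for (COSName c : pdResources.getXObjectNames()) {
        PDXObject o = pdResources.getXObject(c);
        if (o instanceof PDImageXObject) {
            // write each image as 1.png, 2.png, ... in the working directory
            File file = new File(i + ".png");
            i++;
            ImageIO.write(((PDImageXObject) o).getImage(), "png", file);
        }
    }
}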
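One detail of the save step is easy to miss: ImageIO.write() returns false, rather than throwing, when no registered writer accepts the image. The sketch below isolates that step using only the JDK. PngExportSketch and writePng are hypothetical names, and the small blank BufferedImage stands in for the result of PDImageXObject.getImage(); treat it as one possible way to surface the failure, not as the library's recommended pattern.

```java
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import javax.imageio.ImageIO;

// Hypothetical helper, not part of PDFBox.
final class PngExportSketch {

    // Writes the image as PNG and fails loudly instead of silently skipping it.
    static File writePng(BufferedImage image, File target) {
        try {
            // ImageIO.write returns false when no registered writer accepts the image.
            if (!ImageIO.write(image, "png", target)) {
                throw new IOException("No PNG writer available for " + target);
            }
            return target;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Stand-in for the BufferedImage returned by PDImageXObject.getImage().
        BufferedImage image = new BufferedImage(4, 3, BufferedImage.TYPE_INT_RGB);
        File out = writePng(image, new File(System.getProperty("java.io.tmpdir"), "1.png"));
        System.out.println("Wrote " + out.getName() + " (" + out.length() + " bytes)");
    }
}
```

Wrapping the checked exception is a choice made here to keep call sites short; propagating IOException directly would work just as well.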

Getting the Words Inside Hyperlink Annotations

To retrieve the visible text of each hyperlink, use PDFTextStripperByArea with rectangles derived from the link’s PDRectangle. The following example extracts the text and prints it together with the URL.

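import java.awt.geom.Rectangle2D;
import java.io.File;
import java.util.List;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.common.PDRectangle;
import org.apache.pdfbox.pdmodel.interactive.action.PDAction;
import org.apache.pdfbox.pdmodel.interactive.action.PDActionURI;
import org.apache.pdfbox.pdmodel.interactive.annotation.PDAnnotation;
import org.apache.pdfbox.pdmodel.interactive.annotation.PDAnnotationLink;
import org.apache.pdfbox.text.PDFTextStripperByArea;

try (PDDocument document = PDDocument.load(new File("name.pdf"))) {
    int pageNum = 0;
    for (PDPage page : document.getPages()) {
        pageNum++;
        PDFTextStripperByArea stripper = new PDFTextStripperByArea();
        List<PDAnnotation> annotations = page.getAnnotations();
        // first set up one text extraction region per link, keyed by its index
        for (int j = 0; j < annotations.size(); j++) {
            PDAnnotation annot = annotations.get(j);
            if (annot instanceof PDAnnotationLink) {
                PDAnnotationLink link = (PDAnnotationLink) annot;
                PDRectangle rect = link.getRectangle();
                // need to reposition link rectangle to match text space
                float x = rect.getLowerLeftX();
                float y = rect.getUpperRightY();
                float width = rect.getWidth();
                float height = rect.getHeight();
                int rotation = page.getRotation();
                if (rotation == 0) {
                    // PDF y grows upward from the bottom; the stripper's y grows downward from the top
                    PDRectangle pageSize = page.getMediaBox();
                    y = pageSize.getHeight() - y;
                } else if (rotation == 90) {
                    // handle rotation if needed (rotated pages are not handled in this example)
                }
                Rectangle2D.Float awtRect = new Rectangle2D.Float(x, y, width, height);
                stripper.addRegion("" + j, awtRect);
            }
        }
        stripper.extractRegions(page);
        // then pair the text found in each region with the link's URL
        for (int j = 0; j < annotations.size(); j++) {
            PDAnnotation annot = annotations.get(j);
            if (annot instanceof PDAnnotationLink) {
                PDAnnotationLink link = (PDAnnotationLink) annot;
                PDAction action = link.getAction();
                String urlText = stripper.getTextForRegion("" + j);
                if (action instanceof PDActionURI) {
                    PDActionURI uri = (PDActionURI) action;
                    System.out.println("Page " + pageNum + ":'" + urlText.trim() + "'=" + uri.getURI());
                }
            }
        }
    }
}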
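The repositioning of the link rectangle is the least obvious part of this example, so here is the arithmetic on its own. PDF coordinates place the origin at the bottom-left of the page, while the region passed to PDFTextStripperByArea is measured from the top-left, so the region's y is the page height minus the rectangle's upper edge. LinkRegionMath and toTopLeftOrigin are hypothetical names, the numbers are made up, and the sketch assumes an unrotated page whose media box starts at (0, 0).

```java
import java.awt.geom.Rectangle2D;

// Hypothetical helper, not part of PDFBox.
final class LinkRegionMath {

    // Converts a PDF rectangle (bottom-left origin) into a top-left-origin rectangle.
    static Rectangle2D.Float toTopLeftOrigin(float lowerLeftX, float lowerLeftY,
                                             float upperRightX, float upperRightY,
                                             float pageHeight) {
        float width = upperRightX - lowerLeftX;
        float height = upperRightY - lowerLeftY;
        // The rectangle's top edge, measured downward from the top of the page.
        float top = pageHeight - upperRightY;
        return new Rectangle2D.Float(lowerLeftX, top, width, height);
    }

    public static void main(String[] args) {
        // A link spanning (72, 700) to (200, 712) on a page that is 792 units tall.
        Rectangle2D.Float region = toTopLeftOrigin(72f, 700f, 200f, 712f, 792f);
        System.out.println(region.x + ", " + region.y + ", " + region.width + ", " + region.height);
    }
}
```

For rotated pages the mapping differs, which is why the example above leaves the 90-degree branch empty.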

That’s all. Happy coding!
