How to Extract Text, Links, and Images from PDFs with Apache PDFBox (Java)
This guide shows how to use the open‑source Java library Apache PDFBox to programmatically extract plain text, hyperlinks, and embedded images from PDF documents, complete with step‑by‑step code examples for each task.
Extracting PDF content automatically saves a programmer a great deal of manual work. For Java, the open‑source library Apache PDFBox serves this purpose.
What is PDFBox
Apache PDFBox is a Java library for handling PDF documents. It can create new PDFs, update existing ones (e.g., add styles or hyperlinks), and extract content such as text, links, and images.
Extracting Text from a PDF
When you can pull the raw text from a PDF, half the problem is solved. The class PDFTextStripper removes all formatting and returns plain text.
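The extracted text arrives as one String, so a common next step is to break it into lines. The sketch below shows that step on its own with plain JDK classes, so it runs without a PDF. The names `ExtractedTextLines` and `nonBlankLines` and the sample string are made up for illustration and are not part of PDFBox.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of PDFBox: tidies text such as the
// String returned by PDFTextStripper.getText.
final class ExtractedTextLines {

    // Splits on \r\n, \n, or a lone \r, trims each line, and drops blank ones.
    static List<String> nonBlankLines(String text) {
        List<String> result = new ArrayList<>();
        for (String line : text.split("\\r?\\n|\\r")) {
            String trimmed = line.trim();
            if (!trimmed.isEmpty()) {
                result.add(trimmed);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Sample text with mixed line endings and stray whitespace.
        String sample = "Title\r\n\r\n  First paragraph \nSecond paragraph\n";
        for (String line : nonBlankLines(sample)) {
            System.out.println(line);
        }
    }
}
```

In the extraction listing that follows, a helper like this could take the place of the plain split call.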
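import java.io.File;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

// PDFBox 2.x API; PDFBox 3.x replaces PDDocument.load with Loader.loadPDF.
PDFTextStripper tStripper = new PDFTextStripper();
tStripper.setStartPage(1); // page numbers are 1-based here
tStripper.setEndPage(3);
String content = "";
// try-with-resources closes the document when extraction is done
try (PDDocument document = PDDocument.load(new File("yourpdfname.pdf"))) {
    if (!document.isEncrypted()) {
        String pdfFileInText = tStripper.getText(document);
        // split on Windows or Unix line breaks
        String[] lines = pdfFileInText.split("\\r?\\n");
        for (String line : lines) {
            System.out.println(line);
            content += line; // lines are joined without a separator
        }
    }
}
System.out.println(content.trim());
Extracting All Hyperlinks from a PDF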
PDFBox’s PDPage class provides getAnnotations() to retrieve the list of annotations on a page. By inspecting the PDAnnotationLink objects and their PDActionURI actions, you can list every hyperlink on that page.
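import java.io.File;
import java.util.List;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.interactive.action.PDAction;
import org.apache.pdfbox.pdmodel.interactive.action.PDActionURI;
import org.apache.pdfbox.pdmodel.interactive.annotation.PDAnnotation;
import org.apache.pdfbox.pdmodel.interactive.annotation.PDAnnotationLink;

String urls = "";
try (PDDocument document = PDDocument.load(new File("name.pdf"))) {
    // getPage is zero-based, so index 1 is the second page
    PDPage pdfpage = document.getPage(1);
    List<PDAnnotation> annotations = pdfpage.getAnnotations();
    for (PDAnnotation annot : annotations) {
        if (annot instanceof PDAnnotationLink) {
            PDAnnotationLink link = (PDAnnotationLink) annot;
            PDAction action = link.getAction();
            if (action instanceof PDActionURI) {
                PDActionURI uri = (PDActionURI) action;
                urls += uri.getURI() + "\n"; // one URL per line
                System.out.println(uri.getURI());
            }
        }
    }
}
Exporting Images from a PDF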
Beyond text and links, PDFBox can extract embedded images. Using PDPage.getResources() you can iterate over XObjects and save each image as a PNG file.
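The saving step relies on javax.imageio.ImageIO from the JDK, so it can be tried without a PDF. The sketch below writes a blank BufferedImage as a stand-in for the image PDFBox returns; the names `PngSaver` and `savePng` are hypothetical, and wrapping IOException in an unchecked exception is just one possible choice.

```java
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import javax.imageio.ImageIO;

// Hypothetical helper, not part of PDFBox: saves one image as "<index>.png".
final class PngSaver {

    static File savePng(BufferedImage image, File directory, int index) {
        File target = new File(directory, index + ".png");
        try {
            // ImageIO.write returns false when no writer supports the format.
            if (!ImageIO.write(image, "png", target)) {
                throw new IllegalStateException("No PNG writer available");
            }
        } catch (IOException e) {
            throw new UncheckedIOException("Could not write " + target, e);
        }
        return target;
    }

    public static void main(String[] args) {
        // Stand-in for the BufferedImage that PDImageXObject.getImage() returns.
        BufferedImage image = new BufferedImage(4, 3, BufferedImage.TYPE_INT_RGB);
        File tempDir = new File(System.getProperty("java.io.tmpdir"));
        File saved = savePng(image, tempDir, 1);
        System.out.println("Wrote " + saved.getAbsolutePath());
    }
}
```

Checking the boolean returned by ImageIO.write matters because an unsupported format does not raise an exception on its own.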
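import java.io.File;
import javax.imageio.ImageIO;
import org.apache.pdfbox.cos.COSName;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDResources;
import org.apache.pdfbox.pdmodel.graphics.PDXObject;
import org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject;

try (PDDocument document = PDDocument.load(new File("name.pdf"))) {
    PDPage pdfpage = document.getPage(1); // zero-based: the second page
    int i = 1; // counter used for the output file names
    PDResources pdResources = pdfpage.getResources();
    for (COSName c : pdResources.getXObjectNames()) {
        PDXObject o = pdResources.getXObject(c);
        if (o instanceof PDImageXObject) {
            File file = new File(i + ".png");
            i++;
            ImageIO.write(((PDImageXObject) o).getImage(), "png", file);
        }
    }
}
Getting the Words Inside Hyperlink Annotations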
To retrieve the visible text of each hyperlink, use PDFTextStripperByArea with rectangles derived from the link’s PDRectangle. The following example extracts the text and prints it together with the URL.
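The part of this that tends to trip people up is the coordinate change: PDF rectangles are measured from the bottom-left corner of the page, while the java.awt rectangle passed to the stripper is measured from the top-left. The sketch below isolates that arithmetic for an unrotated page using plain floats instead of PDFBox types; `LinkRegion` and `toTopLeftOrigin` are hypothetical names, and the page height of 842 points is an assumed example value.

```java
import java.awt.geom.Rectangle2D;

// Hypothetical helper, not part of PDFBox: maps a rectangle given in PDF
// coordinates (origin bottom-left, y grows upward) to top-left-origin
// coordinates (y grows downward), for an unrotated page.
final class LinkRegion {

    static Rectangle2D.Float toTopLeftOrigin(float lowerLeftX, float upperRightY,
                                             float width, float height,
                                             float pageHeight) {
        // Distance from the top of the page down to the rectangle's top edge.
        float y = pageHeight - upperRightY;
        return new Rectangle2D.Float(lowerLeftX, y, width, height);
    }

    public static void main(String[] args) {
        // Assumed values: a page 842 points tall, and a link that spans
        // y = 700 to y = 712 in PDF coordinates, starting at x = 72.
        Rectangle2D.Float region = toTopLeftOrigin(72f, 712f, 100f, 12f, 842f);
        System.out.println(region.x + ", " + region.y + ", "
                + region.width + ", " + region.height);
    }
}
```

With these numbers the region starts 130 points below the top of the page (842 - 712). Rotated pages need a different mapping, which this sketch does not attempt.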
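import java.awt.geom.Rectangle2D;
import java.io.File;
import java.util.List;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.common.PDRectangle;
import org.apache.pdfbox.pdmodel.interactive.action.PDAction;
import org.apache.pdfbox.pdmodel.interactive.action.PDActionURI;
import org.apache.pdfbox.pdmodel.interactive.annotation.PDAnnotation;
import org.apache.pdfbox.pdmodel.interactive.annotation.PDAnnotationLink;
import org.apache.pdfbox.text.PDFTextStripperByArea;

try (PDDocument document = PDDocument.load(new File("name.pdf"))) {
    int pageNum = 0;
    for (PDPage page : document.getPages()) {
        pageNum++;
        PDFTextStripperByArea stripper = new PDFTextStripperByArea();
        List<PDAnnotation> annotations = page.getAnnotations();
        // first set up one text extraction region per link annotation
        for (int j = 0; j < annotations.size(); j++) {
            PDAnnotation annot = annotations.get(j);
            if (annot instanceof PDAnnotationLink) {
                PDAnnotationLink link = (PDAnnotationLink) annot;
                PDRectangle rect = link.getRectangle();
                // reposition the link rectangle to match text space
                float x = rect.getLowerLeftX();
                float y = rect.getUpperRightY();
                float width = rect.getWidth();
                float height = rect.getHeight();
                int rotation = page.getRotation();
                if (rotation == 0) {
                    // the stripper measures y from the top of the page, PDF from the bottom
                    PDRectangle pageSize = page.getMediaBox();
                    y = pageSize.getHeight() - y;
                } else if (rotation == 90) {
                    // handle rotation if needed
                }
                Rectangle2D.Float awtRect = new Rectangle2D.Float(x, y, width, height);
                stripper.addRegion("" + j, awtRect); // region name is the annotation index
            }
        }
        stripper.extractRegions(page);
        // then pair each region's text with the URL of its link
        for (int j = 0; j < annotations.size(); j++) {
            PDAnnotation annot = annotations.get(j);
            if (annot instanceof PDAnnotationLink) {
                PDAnnotationLink link = (PDAnnotationLink) annot;
                PDAction action = link.getAction();
                String urlText = stripper.getTextForRegion("" + j);
                if (action instanceof PDActionURI) {
                    PDActionURI uri = (PDActionURI) action;
                    System.out.println("Page " + pageNum + ":'" + urlText.trim() + "'=" + uri.getURI());
                }
            }
        }
    }
}
That’s all. Happy coding!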
21CTO
