
Mastering PDF Manipulation in Python with PyPDF2

This article introduces the PDF format, surveys popular Python PDF libraries, and provides a step‑by‑step guide to installing PyPDF2, extracting metadata and text, rotating, merging, splitting, encrypting, and watermarking PDF files using concrete code examples and explanations.

Data STUDIO

Why PDFs Matter and How Python Can Help

PDF (Portable Document Format) is an ISO‑standard file type that preserves layout across platforms, making it the preferred format for document distribution, academic publishing, and business communication.

Popular Python PDF Libraries

PDFMiner: Open-source text-extraction tool.

PDFQuery: Lightweight wrapper around PDFMiner, lxml, and PyQuery.

tabula-py: Python wrapper for tabula-java that reads tables from PDFs into pandas DataFrames.

Xpdf: Open-source PDF toolkit whose pdftotext utility converts PDFs to plain text.

pdflib: Python bindings for the poppler library.

Slate: PDFMiner-based text-extraction package.

PyPDF2: Pure-Python library for extracting information, merging, splitting, adding watermarks, and encrypting PDFs.

Getting Started with PyPDF2

PyPDF2 is a pure-Python library, so it runs on any platform with a Python interpreter and needs no external (non-Python) dependencies. Its core classes are PdfFileReader for parsing existing files, PdfFileWriter for producing new ones, and PdfFileMerger for concatenating documents.

Key Features

Extract text and document metadata from PDFs.

Build new PDFs by assembling pages from existing documents.

Modify existing PDFs by adding, deleting, or reordering pages.

Page-level editing such as rotation, cropping, and watermarking.

Password-based encryption and decryption.

Installation

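pip install "PyPDF2<3.0"

The examples below use the legacy camelCase API (PdfFileReader, getPage, and so on). That API was deprecated in PyPDF2 2.x and removed in 3.0, so a version below 3.0 is needed to run them as written.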

Retrieving Document Metadata

PyPDF2 can read metadata fields such as author, title, creator, and producer.

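from PyPDF2 import PdfFileReader

pdf_path = r"Tesseractexample.pdf"
with open(pdf_path, 'rb') as f:
    pdf = PdfFileReader(f)
    info = pdf.getDocumentInfo()
    # A field is None when the PDF does not define it, so format the
    # values instead of concatenating them onto a string.
    print(f"Author: {info.author}")
    print(f"Title: {info.title}")
    print(f"Creator: {info.creator}")
    print(f"Producer: {info.producer}")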
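
Metadata fields are optional in a PDF, and PyPDF2 returns None for any field the file does not define (getDocumentInfo() itself returns None when the file has no info dictionary at all). The following is a minimal sketch of a formatter that tolerates both cases. The helper name describe_metadata and the sample values are hypothetical; the stand-in object only mimics the attributes of the real result.

```python
from types import SimpleNamespace


def describe_metadata(info, fields=("author", "title", "creator", "producer")):
    """Return one 'Field: value' line per field, tolerating missing values."""
    lines = []
    for name in fields:
        # Missing fields come back as None; a missing info object has no attributes at all
        value = getattr(info, name, None)
        lines.append(f"{name.capitalize()}: {value or '(not set)'}")
    return "\n".join(lines)


# Stand-in for the object returned by getDocumentInfo(); the values are made up.
sample = SimpleNamespace(author="Jane Doe", title=None, creator="Writer", producer=None)
print(describe_metadata(sample))
```

With a real file, the same function can be called as describe_metadata(pdf.getDocumentInfo()) while the file is still open.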

Extracting Text (Limitations)

PyPDF2’s text extraction is limited; output may contain many line breaks and irregular spacing.

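import PyPDF2

pdf_path = r"Tesseractexample.pdf"
# Open the file in binary mode and create a PDF reader object
with open(pdf_path, 'rb') as pdfFileObject:
    pdfReader = PyPDF2.PdfFileReader(pdfFileObject)
    text = ''
    for i in range(pdfReader.numPages):
        pageObj = pdfReader.getPage(i)
        text += pageObj.extractText()
print(text)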
[Figure: Extracted text example]
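
Because the extracted text tends to contain stray line breaks and uneven spacing, a small post-processing step often helps. The sketch below is one possible cleanup using only the standard library. The function name tidy_extracted_text is hypothetical, and the rules are heuristics: rejoining words split by a hyphen at a line end will also join genuine hyphenated compounds that happen to break across lines.

```python
import re


def tidy_extracted_text(raw):
    """Collapse stray whitespace in text extracted from a PDF."""
    # Rejoin words hyphenated across a line break: "exam-\nple" -> "example"
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", raw)
    # Treat blank lines as paragraph breaks
    paragraphs = re.split(r"\n\s*\n", text)
    # Inside a paragraph, turn every run of whitespace into a single space
    cleaned = [re.sub(r"\s+", " ", p).strip() for p in paragraphs]
    return "\n\n".join(p for p in cleaned if p)
```

The result of extractText() can be passed straight in, for example tidy_extracted_text(text).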

Rotating Pages

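from PyPDF2 import PdfFileReader, PdfFileWriter

pdf_read = PdfFileReader(r"C:\Users\Dell\Desktop\story.pdf")
pdf_write = PdfFileWriter()
# Rotate the first page 90 degrees clockwise (the angle must be a multiple of 90)
page1 = pdf_read.getPage(0).rotateClockwise(90)
pdf_write.addPage(page1)
with open(r'C:\Users\Dell\Desktop\rotate_pages.pdf', 'wb') as fh:
    pdf_write.write(fh)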
[Figure: Rotated page example]
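
The example above rotates a single page. As a sketch of how the same call extends to a whole document, the helper below rotates every page and checks the angle first, since PDF page rotation only works in 90-degree steps. The function names normalize_rotation and rotate_all_pages are hypothetical, and the code assumes the legacy PyPDF2 API (a version below 3.0).

```python
def normalize_rotation(angle):
    """Map any multiple of 90 degrees onto 0, 90, 180, or 270."""
    if angle % 90 != 0:
        raise ValueError("PDF pages can only be rotated in 90-degree steps")
    return angle % 360


def rotate_all_pages(src_path, dst_path, angle):
    """Write a copy of src_path with every page rotated clockwise by angle."""
    # Imported here so normalize_rotation stays usable without PyPDF2 installed
    from PyPDF2 import PdfFileReader, PdfFileWriter

    angle = normalize_rotation(angle)
    reader = PdfFileReader(src_path)
    writer = PdfFileWriter()
    for index in range(reader.getNumPages()):
        writer.addPage(reader.getPage(index).rotateClockwise(angle))
    with open(dst_path, "wb") as fh:
        writer.write(fh)
```

A negative angle is treated as a counter-clockwise turn, so rotate_all_pages(src, dst, -90) is equivalent to a 270-degree clockwise rotation.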

Merging PDFs

Combine multiple PDFs into a single document.

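from PyPDF2 import PdfFileMerger

# Hypothetical input and output files; replace them with your own PDFs.
pdf_paths = [r"C:\Users\Dell\Desktop\story.pdf", r"C:\Users\Dell\Desktop\story2.pdf"]
merger = PdfFileMerger()
for path in pdf_paths:
    merger.append(path)  # appends every page of the file, in order
with open(r'C:\Users\Dell\Desktop\merged.pdf', 'wb') as fh:
    merger.write(fh)
merger.close()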
[Figure: Merged PDF example]

Splitting PDFs

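import os
from PyPDF2 import PdfFileReader, PdfFileWriter

pdf_path = r"Tesseractexample.pdf"
pdf = PdfFileReader(pdf_path)  # create the reader before looping over its pages
fname = os.path.splitext(os.path.basename(pdf_path))[0]
for page in range(pdf.getNumPages()):
    # Write each page to its own single-page PDF
    pdfwrite = PdfFileWriter()
    pdfwrite.addPage(pdf.getPage(page))
    outputfilename = f"{fname}_page_{page+1}.pdf"
    with open(outputfilename, 'wb') as out:
        pdfwrite.write(out)
    print('Created: {}'.format(outputfilename))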
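
Splitting does not have to mean one file per page. The sketch below shows one way to split a document into chunks of a fixed number of pages instead. The names chunk_ranges and split_into_chunks and the _part_N output naming are hypothetical, and the code assumes the legacy PyPDF2 API (a version below 3.0).

```python
import os


def chunk_ranges(num_pages, chunk_size):
    """Return zero-based page ranges covering num_pages in chunks of chunk_size."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    return [range(start, min(start + chunk_size, num_pages))
            for start in range(0, num_pages, chunk_size)]


def split_into_chunks(pdf_path, chunk_size):
    """Write consecutive chunks of pdf_path into the current working directory."""
    # Imported here so chunk_ranges stays usable without PyPDF2 installed
    from PyPDF2 import PdfFileReader, PdfFileWriter

    reader = PdfFileReader(pdf_path)
    base = os.path.splitext(os.path.basename(pdf_path))[0]
    created = []
    ranges = chunk_ranges(reader.getNumPages(), chunk_size)
    for part, pages in enumerate(ranges, start=1):
        writer = PdfFileWriter()
        for index in pages:
            writer.addPage(reader.getPage(index))
        out_name = f"{base}_part_{part}.pdf"
        with open(out_name, "wb") as out:
            writer.write(out)
        created.append(out_name)
    return created
```

The last chunk is simply shorter when the page count is not a multiple of chunk_size.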

Encrypting PDFs

Add a password to protect a PDF.

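from PyPDF2 import PdfFileReader, PdfFileWriter
# Placeholder values (not from the original article); substitute your own.
input_pdf, outputpdf, password = "input.pdf", "encrypted.pdf", "change-me"
pdf = PdfFileReader(input_pdf)
pdfwrite = PdfFileWriter()
for page in range(pdf.getNumPages()):
    pdfwrite.addPage(pdf.getPage(page))
pdfwrite.encrypt(user_pwd=password, owner_pwd=None, use_128bit=True)  # encrypt once, after all pages are added
with open(outputpdf, 'wb') as fh:
    pdfwrite.write(fh)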
[Figure: Encryption dialog]
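
Two practical follow-ups to encryption are keeping the password out of the source code and checking that the protected file really opens with it. The sketch below is one way to do both. The environment variable name PDF_PASSWORD and the function names are assumptions made for illustration, and the check relies on the legacy PyPDF2 API (a version below 3.0), where decrypt() returns 0 when the password is rejected.

```python
import os


def password_from_env(var_name="PDF_PASSWORD"):
    """Read the PDF password from an environment variable (the name is hypothetical)."""
    password = os.environ.get(var_name)
    if not password:
        raise RuntimeError(f"Set the {var_name} environment variable first")
    return password


def can_open(pdf_path, password):
    """Return True if pdf_path is unencrypted or accepts the given password."""
    # Imported here so password_from_env stays usable without PyPDF2 installed
    from PyPDF2 import PdfFileReader

    reader = PdfFileReader(pdf_path)
    if not reader.isEncrypted:
        return True
    # decrypt() returns 0 on failure, 1 for the user password, 2 for the owner password
    return reader.decrypt(password) != 0
```

After writing the encrypted file, can_open(outputpdf, password_from_env()) gives a quick confirmation that the password was applied.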

Adding Watermarks

from PyPDF2 import PdfFileReader, PdfFileWriter

originalfile = r"C:\Users\Dell\Desktop\Testing Tesseract\example.pdf"
watermarkfile = r"C:\Users\Dell\Desktop\Testing Tesseract\watermark.pdf"
watermarkedfile = r"C:\Users\Dell\Desktop\Testing Tesseract\watermarkedfile.pdf"
# Use the first page of the watermark PDF as the overlay
watermarkpage = PdfFileReader(watermarkfile).getPage(0)
pdf = PdfFileReader(originalfile)
pdfwrite = PdfFileWriter()
for page in range(pdf.getNumPages()):
    pdfpage = pdf.getPage(page)
    pdfpage.mergePage(watermarkpage)  # draws the watermark on top of the page content
    pdfwrite.addPage(pdfpage)
with open(watermarkedfile, 'wb') as fh:
    pdfwrite.write(fh)
[Figure: Watermarked PDF example]

Conclusion

PyPDF2 is an open-source, BSD-licensed, pure-Python library that runs on any OS and installs with a single pip command. It covers the tasks shown in this article: reading metadata, extracting text, and rotating, merging, splitting, encrypting, and watermarking PDFs. Its lightweight design makes it a practical choice for developers looking to automate PDF workflows.
