
Mastering PDF Manipulation in Python with PyPDF2

This article introduces the PDF format, surveys popular Python PDF libraries, and provides a step‑by‑step guide to installing PyPDF2, extracting metadata and text, rotating, merging, splitting, encrypting, and watermarking PDF files using concrete code examples and explanations.

Data STUDIO

Oct 10, 2025


Why PDFs Matter and How Python Can Help

PDF (Portable Document Format) is an ISO‑standard file type that preserves layout across platforms, making it the preferred format for document distribution, academic publishing, and business communication.

Popular Python PDF Libraries

PDFMiner : Open‑source text‑extraction tool.

PDFQuery : Lightweight wrapper around PDFMiner, lxml and PyQuery.

tabula‑py : Python wrapper for tabula‑java that reads tables from PDFs into pandas DataFrames.

Xpdf : Converts PDFs to plain text.

pdflib : Python bindings for the poppler library.

Slate : PDFMiner‑based text‑extraction package.

PyPDF2 : Pure‑Python library for extracting information, merging, splitting, adding watermarks, and encrypting PDFs.

Getting Started with PyPDF2

PyPDF2 is a pure‑Python library, so it runs on any platform that runs Python and needs no compiled extensions. Its workflow centers on two objects: a reader that parses an existing PDF and exposes its pages and metadata, and a writer that collects pages and saves them to a new file.

Key Features

Extract text and metadata from existing PDFs (rendering pages to PNG/JPEG requires a separate tool).

Create new PDF files by assembling pages from existing documents or adding blank pages.

Modify existing PDFs by adding, deleting, or reordering pages.

Page‑level editing such as rotation, cropping, and watermarking.

Password protection through encryption and decryption.

Installation

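pip install PyPDF2

PyPDF2 3.0 removed the older camelCase names such as PdfFileReader and getPage in favor of PdfReader, PdfWriter, and snake_case methods. Development has since continued under the package name pypdf.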

Retrieving Document Metadata

PyPDF2 can read metadata fields such as author, title, creator, and producer.

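from PyPDF2 import PdfReader

pdf_path = r"Tesseractexample.pdf"
with open(pdf_path, 'rb') as f:
    pdf = PdfReader(f)
    # metadata holds the document-information fields; any of them may be None
    info = pdf.metadata
    # f-strings print "None" for a missing field instead of raising TypeError
    print(f"Author: {info.author}")
    print(f"Creator: {info.creator}")
    print(f"Producer: {info.producer}")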
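Metadata fields are optional, so any of them may be missing from a given file. The following is a minimal sketch, not part of PyPDF2, of a hypothetical helper named summarize_metadata that substitutes a placeholder for absent values. It assumes only that the object passed in exposes the fields as attributes, as the document‑information object returned by the reader does; the SimpleNamespace stand‑in and its values are invented so the snippet runs without a PDF file.

```python
from types import SimpleNamespace
from typing import Any, Dict, Iterable

DEFAULT_FIELDS = ("title", "author", "creator", "producer")


def summarize_metadata(info: Any,
                       fields: Iterable[str] = DEFAULT_FIELDS,
                       missing: str = "(not set)") -> Dict[str, str]:
    """Map each metadata field name to its value, or to `missing` if absent."""
    summary = {}
    for name in fields:
        # `info` itself can be None when a PDF carries no metadata at all
        value = getattr(info, name, None) if info is not None else None
        summary[name] = str(value) if value else missing
    return summary


# Stand-in for a reader's metadata object (hypothetical values)
sample = SimpleNamespace(title="Quarterly report", author=None,
                         creator="Writer", producer="")
for field, value in summarize_metadata(sample).items():
    print(f"{field.capitalize()}: {value}")
```

With a real file, the same helper could be called on the reader's metadata object in place of the stand‑in.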

Extracting Text (Limitations)

PyPDF2’s text extraction is limited; output may contain many line breaks and irregular spacing.

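from PyPDF2 import PdfReader

# Create a PDF reader object (same sample file as in the metadata example)
pdfReader = PdfReader(r"Tesseractexample.pdf")
text = ''
for pageObj in pdfReader.pages:
    # extract_text() returns the text of one page as a single string
    text += pageObj.extract_text()
print(text)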
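Because the extracted text tends to arrive with stray line breaks and uneven spacing, a small cleanup pass is often worthwhile. The sketch below uses only the standard library; normalize_extracted_text is a hypothetical helper, and the hyphen‑rejoining rule is a heuristic that can wrongly merge genuinely hyphenated compounds that happen to break at a line end.

```python
import re


def normalize_extracted_text(raw: str) -> str:
    """Collapse stray whitespace while keeping blank lines as paragraph breaks."""
    # Heuristic: rejoin a word that was split by a hyphen at the end of a line
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", raw)
    # Treat one or more blank lines as a paragraph boundary
    paragraphs = re.split(r"\n\s*\n", text)
    # Inside a paragraph, squeeze every whitespace run down to a single space
    cleaned = (" ".join(p.split()) for p in paragraphs)
    return "\n\n".join(p for p in cleaned if p)


# Invented sample that mimics typical extraction output
raw = "PyPDF2  text extrac-\ntion is\nlimited.\n\n\nSecond   paragraph.\n"
print(normalize_extracted_text(raw))
```

The function takes and returns a plain string, so it can be applied to the accumulated text before printing or saving it.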

Rotating Pages

from PyPDF2 import PdfReader, PdfWriter

pdf_read = PdfReader(r"C:\Users\Dell\Desktop\story.pdf")
pdf_write = PdfWriter()
# Rotate the first page 90 degrees clockwise; the angle must be a multiple of 90
page1 = pdf_read.pages[0]
page1.rotate(90)
pdf_write.add_page(page1)
with open(r'C:\Users\Dell\Desktop\rotate_pages.pdf', 'wb') as fh:
    pdf_write.write(fh)

Merging PDFs

Combine multiple PDFs into a single document.

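from PyPDF2 import PdfReader, PdfWriter

# Inputs in output order; here, the two files from the rotation example
pdf_paths = [r"C:\Users\Dell\Desktop\story.pdf", r"C:\Users\Dell\Desktop\rotate_pages.pdf"]
pdf_write = PdfWriter()
for path in pdf_paths:
    # Append every page of the current input to the writer
    for page in PdfReader(path).pages:
        pdf_write.add_page(page)
# merged.pdf is a placeholder output name
with open(r'C:\Users\Dell\Desktop\merged.pdf', 'wb') as fh:
    pdf_write.write(fh)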

Splitting PDFs

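import os
from PyPDF2 import PdfReader, PdfWriter

# Same sample file as in the metadata example
pdf_path = r"Tesseractexample.pdf"
pdf = PdfReader(pdf_path)
fname = os.path.splitext(os.path.basename(pdf_path))[0]
for page in range(len(pdf.pages)):
    # A fresh writer per iteration produces one single-page file per page
    pdfwrite = PdfWriter()
    pdfwrite.add_page(pdf.pages[page])
    outputfilename = f"{fname}_page_{page+1}.pdf"
    with open(outputfilename, 'wb') as out:
        pdfwrite.write(out)
    print('Created: {}'.format(outputfilename))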
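The loop above writes one file per page. When larger pieces are wanted, the page indexes can be grouped first. This is a minimal standard‑library sketch of that grouping step only; page_chunks is a hypothetical helper, and the 7‑page, 3‑per‑file figures are made up for illustration.

```python
from typing import Iterator


def page_chunks(total_pages: int, chunk_size: int) -> Iterator[range]:
    """Yield ranges of zero-based page indexes, each at most `chunk_size` long."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    for start in range(0, total_pages, chunk_size):
        # The last chunk is shorter when the total is not an exact multiple
        yield range(start, min(start + chunk_size, total_pages))


# A 7-page document split into pieces of up to 3 pages
for part, pages in enumerate(page_chunks(7, 3), start=1):
    print(f"part {part}: pages {pages.start + 1}-{pages.stop}")
```

Each range can drive the same writer loop as the per‑page version: create one writer per range, add the pages at those indexes, and write the result to its own file.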

Encrypting PDFs

Add a password to protect a PDF.

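from PyPDF2 import PdfReader, PdfWriter
pdf, pdfwrite = PdfReader(r"Tesseractexample.pdf"), PdfWriter()
outputpdf, password = "encrypted.pdf", "change-me"  # placeholder values
for page in pdf.pages:
    pdfwrite.add_page(page)
# Encrypt once, after all pages are added (the owner password defaults to the user password)
pdfwrite.encrypt(password, use_128bit=True)
with open(outputpdf, 'wb') as fh:
    pdfwrite.write(fh)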
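The encryption call needs a password from somewhere, and hardcoding it in a script is rarely a good idea. One option, sketched here with the standard library only, is to read it from an environment variable. The helper name password_from_env and the variable name PDF_PASSWORD are assumptions made for this example, not anything PyPDF2 defines.

```python
import os


def password_from_env(var_name: str = "PDF_PASSWORD") -> str:
    """Return the password stored in the environment variable `var_name`."""
    password = os.environ.get(var_name, "")
    if not password:
        # Fail loudly rather than write a file protected by an empty password
        raise RuntimeError(f"environment variable {var_name} is not set")
    return password


# Demonstration only: a real script would expect the variable to be set already
os.environ.setdefault("PDF_PASSWORD", "example-only")
password = password_from_env()
```

The returned string can then be passed to the writer's encrypt call in place of a literal.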

Adding Watermarks

from PyPDF2 import PdfReader, PdfWriter

originalfile = r"C:\Users\Dell\Desktop\Testing Tesseract\example.pdf"
watermark = r"C:\Users\Dell\Desktop\Testing Tesseract\watermark.pdf"
watermarkedfile = r"C:\Users\Dell\Desktop\Testing Tesseract\watermarkedfile.pdf"
# The first page of the watermark PDF is stamped onto every page of the original
watermark = PdfReader(watermark)
watermarkpage = watermark.pages[0]
pdf = PdfReader(originalfile)
pdfwrite = PdfWriter()
for pdfpage in pdf.pages:
    # merge_page draws the watermark's content on top of the page's own content
    pdfpage.merge_page(watermarkpage)
    pdfwrite.add_page(pdfpage)
with open(watermarkedfile, 'wb') as fh:
    pdfwrite.write(fh)

Conclusion

PyPDF2 is an open‑source, BSD‑licensed library that runs on any OS, installs with a single pip command, and covers extracting, modifying, merging, splitting, encrypting, and watermarking PDFs. Its pure‑Python design and online documentation make it a practical choice for developers looking to automate PDF workflows.

