Mastering PDF Manipulation in Python with PyPDF2
This article introduces the PDF format, surveys popular Python PDF libraries, and provides a step‑by‑step guide to installing PyPDF2, extracting metadata and text, rotating, merging, splitting, encrypting, and watermarking PDF files using concrete code examples and explanations.
Why PDFs Matter and How Python Can Help
PDF (Portable Document Format) is an ISO‑standard file type that preserves layout across platforms, making it the preferred format for document distribution, academic publishing, and business communication.
Popular Python PDF Libraries
PDFMiner : Open‑source text‑extraction tool.
PDFQuery : Lightweight wrapper around PDFMiner, lxml and PyQuery.
tabula-py : Python wrapper for tabula-java that reads tables in PDFs into pandas DataFrames.
Xpdf : Converts PDFs to plain text.
pdflib : Python bindings for the poppler library.
Slate : PDFMiner‑based text‑extraction package.
PyPDF2 : Pure‑Python library for extracting information, merging, splitting, adding watermarks, and encrypting PDFs.
Getting Started with PyPDF2
PyPDF2 is a pure-Python library that runs on any platform without external dependencies. It works on existing documents: a reader class (PdfFileReader) parses a PDF into page objects, and a writer class (PdfFileWriter) assembles pages into a new file.
Key Features
Extract text and metadata from existing PDFs.
Build new PDFs by combining pages from existing documents.
Modify existing PDFs by adding, deleting, or reordering pages.
Page-level editing such as rotation, cropping, and watermarking (overlaying one page on another).
Encrypt output with a password and decrypt protected input.
Installation
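The examples in this article use the legacy camelCase API (PdfFileReader, getPage, and so on), which PyPDF2 3.0 removed, so install an earlier release:
pip install "PyPDF2<3.0"
Retrieving Document Metadata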
PyPDF2 can read metadata fields such as author, title, creator, and producer.
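Metadata fields are optional, so any of them can come back as None, and plain string concatenation would then fail. A minimal sketch of a defensive formatter follows; the helper name describe_metadata and the "(not set)" placeholder are illustrative choices rather than part of PyPDF2, and the SimpleNamespace object only stands in for a real document-info object.

```python
from types import SimpleNamespace

FIELDS = ("author", "title", "creator", "producer")

def describe_metadata(info):
    """Return one printable line per metadata field.

    `info` can be any object with author/title/creator/producer attributes
    (for example the result of getDocumentInfo()), or None.
    """
    lines = []
    for name in FIELDS:
        value = getattr(info, name, None)
        # Missing or empty fields get a placeholder instead of raising TypeError.
        lines.append(f"{name.capitalize()}: {value or '(not set)'}")
    return lines

# Stand-in for a real document-info object, used here only for demonstration.
sample = SimpleNamespace(author="Ann", title=None, creator="Writer", producer="")
lines = describe_metadata(sample)
print("\n".join(lines))
```

With a real file, the object returned by getDocumentInfo() would be passed in place of sample.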
from PyPDF2 import PdfFileReader
pdf_path = r"Tesseractexample.pdf"
with open(pdf_path, 'rb') as f:
    pdf = PdfFileReader(f)
    # getDocumentInfo() returns the document information dictionary
    info = pdf.getDocumentInfo()
    # f-strings avoid a TypeError when a field is missing (None)
    print(f"Author: {info.author}")
    print(f"Creator: {info.creator}")
    print(f"Producer: {info.producer}")
Extracting Text (Limitations)
PyPDF2’s text extraction is limited; output may contain many line breaks and irregular spacing.
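One way to cope with the stray line breaks and spacing is to post-process the extracted string. The sketch below is a heuristic, not a PyPDF2 feature: it assumes blank lines mark paragraph boundaries and collapses all other whitespace runs, which may wrongly join lines in tables or lists. The function name tidy_extracted_text and the sample string are made up for illustration.

```python
import re

def tidy_extracted_text(raw):
    """Collapse the stray line breaks and runs of spaces left by extraction."""
    # Treat a blank line (two newlines, optionally with whitespace between)
    # as a paragraph boundary.
    paragraphs = re.split(r"\n\s*\n", raw)
    cleaned = []
    for para in paragraphs:
        # Inside a paragraph, every whitespace run becomes a single space.
        text = re.sub(r"\s+", " ", para).strip()
        if text:
            cleaned.append(text)
    return "\n\n".join(cleaned)

sample = "PyPDF2  returns\ntext with   odd\nspacing.\n\n\nSecond\nparagraph."
tidied = tidy_extracted_text(sample)
print(tidied)
```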
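import PyPDF2
# Open the PDF in binary mode (pdf_path is the file defined earlier)
with open(pdf_path, 'rb') as pdfFileObject:
    # Create a PDF reader object
    pdfReader = PyPDF2.PdfFileReader(pdfFileObject)
    text = ''
    # Concatenate the text of every page
    for i in range(pdfReader.numPages):
        pageObj = pdfReader.getPage(i)
        text += pageObj.extractText()
print(text)
Rotating Pages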
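PDF page rotation works in quarter turns, so the angle passed to rotateClockwise has to be a multiple of 90 degrees. The sketch below normalizes an arbitrary angle before it reaches the library; the helper name normalize_rotation is an assumption made for illustration.

```python
def normalize_rotation(angle):
    """Map an angle in degrees onto 0, 90, 180, or 270.

    Raises ValueError for angles that are not multiples of 90, since PDF
    page rotation only supports quarter turns.
    """
    if angle % 90 != 0:
        raise ValueError(f"rotation must be a multiple of 90, got {angle}")
    # Python's % returns a non-negative result for a positive modulus,
    # so -90 (a quarter turn counter-clockwise) becomes 270.
    return angle % 360

print(normalize_rotation(450))
print(normalize_rotation(-90))
```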
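from PyPDF2 import PdfFileReader, PdfFileWriter
pdf_read = PdfFileReader(r"C:\Users\Dell\Desktop\story.pdf")
pdf_write = PdfFileWriter()
# Rotate the first page 90 degrees clockwise and add it to the output
page1 = pdf_read.getPage(0).rotateClockwise(90)
pdf_write.addPage(page1)
with open(r'C:\Users\Dell\Desktop\rotate_pages.pdf', 'wb') as fh:
    pdf_write.write(fh)
Merging PDFs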
Combine multiple PDFs into a single document.
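When the inputs are numbered files, the order in which they are merged matters, and plain alphabetical sorting puts part10.pdf before part2.pdf. The sketch below orders file names naturally before they are handed to a merger; the function name merge_order and the sample file names are hypothetical.

```python
import re

def merge_order(names):
    """Sort file names so that embedded numbers compare numerically."""
    def natural_key(name):
        # re.split with a capturing group keeps the digit runs, e.g.
        # "part10.pdf" -> ["part", "10", ".pdf"]; odd positions are digits.
        tokens = re.split(r"(\d+)", name.lower())
        return [int(tok) if i % 2 else tok for i, tok in enumerate(tokens)]
    return sorted(names, key=natural_key)

ordered = merge_order(["part10.pdf", "part2.pdf", "part1.pdf"])
print(ordered)
```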
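from PyPDF2 import PdfFileMerger
# Illustrative input list; merged.pdf is a placeholder output name
paths = [r"C:\Users\Dell\Desktop\story.pdf", r"C:\Users\Dell\Desktop\rotate_pages.pdf"]
pdf_merge = PdfFileMerger()
for path in paths:
    # append() adds every page of the file to the end of the output
    pdf_merge.append(path)
with open(r'C:\Users\Dell\Desktop\merged.pdf', 'wb') as fh:
    pdf_merge.write(fh)
pdf_merge.close()
Splitting PDFs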
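Splitting does not have to mean one file per page. As a sketch, a page selection written the way people usually type it ("1-3,5") can be turned into the zero-based indices that getPage expects; the function name parse_page_ranges and the spec format are assumptions for illustration, not part of PyPDF2.

```python
def parse_page_ranges(spec, num_pages):
    """Turn a 1-based spec such as "1-3,5" into sorted zero-based indices."""
    indices = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
        else:
            start = end = int(part)
        if start < 1 or end > num_pages or start > end:
            raise ValueError(f"page range {part!r} is outside 1..{num_pages}")
        # Convert the inclusive 1-based range to zero-based indices.
        indices.update(range(start - 1, end))
    return sorted(indices)

selected = parse_page_ranges("1-3, 5", num_pages=6)
print(selected)
```

Each index in the result could then be passed to getPage and added to a single writer to produce one output file containing only the selected pages.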
import os
from PyPDF2 import PdfFileReader, PdfFileWriter
# pdf_path is the input file defined earlier
pdf = PdfFileReader(pdf_path)
fname = os.path.splitext(os.path.basename(pdf_path))[0]
for page in range(pdf.getNumPages()):
    # Write each page to its own single-page PDF
    pdfwrite = PdfFileWriter()
    pdfwrite.addPage(pdf.getPage(page))
    outputfilename = f"{fname}_page_{page+1}.pdf"
    with open(outputfilename, 'wb') as out:
        pdfwrite.write(out)
    print('Created: {}'.format(outputfilename))
Encrypting PDFs
Add a password to protect a PDF.
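A protected PDF is only as strong as its password. As a sketch, the standard-library secrets module can generate one; the helper name generate_password and the default length of 16 are assumptions, not PyPDF2 requirements.

```python
import secrets
import string

def generate_password(length=16):
    """Return a random password made of ASCII letters and digits."""
    if length < 1:
        raise ValueError("length must be at least 1")
    alphabet = string.ascii_letters + string.digits
    # secrets.choice draws from a cryptographically secure random source.
    return "".join(secrets.choice(alphabet) for _ in range(length))

password = generate_password()
print(len(password))
```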
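from PyPDF2 import PdfFileReader, PdfFileWriter
# pdf_path, outputpdf, and password are supplied by the caller
pdf = PdfFileReader(pdf_path)
pdfwrite = PdfFileWriter()
for page in range(pdf.getNumPages()):
    pdfwrite.addPage(pdf.getPage(page))
pdfwrite.encrypt(user_pwd=password, owner_pwd=None, use_128bit=True)
with open(outputpdf, 'wb') as fh:
    pdfwrite.write(fh)
Adding Watermarks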
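mergePage overlays the watermark at its native size, so a watermark built for one page size may overflow or look small on another. The sketch below computes a uniform scale factor that fits the watermark inside the target page; fit_scale is a hypothetical helper, and all sizes are in PDF points.

```python
def fit_scale(wm_width, wm_height, page_width, page_height):
    """Return the largest uniform scale at which the watermark fits the page."""
    if min(wm_width, wm_height, page_width, page_height) <= 0:
        raise ValueError("all dimensions must be positive")
    # Use the tighter of the two ratios so neither dimension overflows.
    return min(page_width / wm_width, page_height / wm_height)

# US Letter watermark (612 x 792 pt) placed on an A4 page (595 x 842 pt).
scale = fit_scale(612, 792, 595, 842)
print(round(scale, 4))
```

How the factor is applied depends on the installed version; treat any wiring into the library's scaled-merge methods as something to verify against its documentation.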
from PyPDF2 import PdfFileReader, PdfFileWriter
originalfile = r"C:\Users\Dell\Desktop\Testing Tesseract\example.pdf"
watermark = r"C:\Users\Dell\Desktop\Testing Tesseract\watermark.pdf"
watermarkedfile = r"C:\Users\Dell\Desktop\Testing Tesseract\watermarkedfile.pdf"
# The first page of the watermark PDF is stamped onto every page
watermark = PdfFileReader(watermark)
watermarkpage = watermark.getPage(0)
pdf = PdfFileReader(originalfile)
pdfwrite = PdfFileWriter()
for page in range(pdf.getNumPages()):
    pdfpage = pdf.getPage(page)
    # mergePage draws the watermark content on top of the page content
    pdfpage.mergePage(watermarkpage)
    pdfwrite.addPage(pdfpage)
with open(watermarkedfile, 'wb') as fh:
    pdfwrite.write(fh)
Conclusion
PyPDF2 is an open-source, BSD-licensed library that runs on any OS and installs with a single pip command. It covers the tasks shown here: extracting metadata and text, and rotating, merging, splitting, encrypting, and watermarking PDFs. Its lightweight, pure-Python design makes it a practical choice for developers looking to automate PDF workflows.
