Master PDF Extraction and Creation in Python: Text, Tables, Images, and More

This tutorial walks you through using Python libraries such as PyPDF2, Tabula, PyMuPDF, Pillow, and fpdf2 to extract text, tables, and images from PDF files and to write new PDFs, complete with code examples, step‑by‑step explanations, and sample outputs.

Python Programming Learning Circle

Sep 4, 2025

Extract Text from PDFs

Python offers several libraries for reading PDFs; two popular choices are PyPDF2 and pdfminer. Below is a basic example using PyPDF2 to open a PDF, read a page, and extract its text.

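import PyPDF2

# Open the file in binary mode; the with block closes it automatically.
with open('file.pdf', 'rb') as pdf_file:
    reader = PyPDF2.PdfReader(pdf_file)  # PdfFileReader was removed in PyPDF2 3.0
    page = reader.pages[0]               # first page (zero-based index)
    print(page.extract_text())

The code imports the module, opens the file in binary mode, creates a PdfReader object, retrieves the first page, and extracts its text with extract_text(); the with block closes the file when it ends. PyPDF2 3.0 removed the older PdfFileReader, getPage(), and extractText() names, so examples that use them fail on current releases.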

To extract text from all pages, iterate over the reader's pages:

import PyPDF2

with open('file.pdf', 'rb') as pdf_file:
    reader = PyPDF2.PdfReader(pdf_file)
    # reader.pages behaves like a list; len(reader.pages) is the page count
    # (this replaces the numPages attribute removed in PyPDF2 3.0).
    for page in reader.pages:
        print(page.extract_text())

This prints the text of each page sequentially.
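
Printing is fine for a quick look, but extracted text is usually saved for later processing. The following is a minimal sketch, assuming PyPDF2 3.x, that writes every page to a text file with a page marker. The names join_pages and pdf_to_text_file are hypothetical helpers introduced for illustration and are not part of PyPDF2. Pages without a text layer (scanned pages, for example) tend to yield an empty string, so the helper tolerates empty results.

```python
from pathlib import Path


def join_pages(page_texts):
    """Join per-page strings, labelling each page and tolerating empty pages."""
    parts = []
    for number, text in enumerate(page_texts, start=1):
        # Treat None or whitespace-only results as an empty page.
        body = (text or "").strip()
        parts.append(f"--- Page {number} ---\n{body}")
    return "\n\n".join(parts)


def pdf_to_text_file(pdf_path, txt_path):
    """Extract text from every page of pdf_path and write it to txt_path."""
    # Imported here so join_pages can be used without PyPDF2 installed.
    from PyPDF2 import PdfReader

    with open(pdf_path, "rb") as pdf_file:
        reader = PdfReader(pdf_file)
        texts = [page.extract_text() for page in reader.pages]
    Path(txt_path).write_text(join_pages(texts), encoding="utf-8")
```

Calling pdf_to_text_file("file.pdf", "file.txt") would then produce a single text file in which each page starts with its own marker line.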

Extract Tables from PDFs

PyPDF2 can read the text inside a table, but it does not preserve the table's row and column structure. For proper table extraction, use tabula-py, a Python wrapper around the Java-based Tabula tool (a Java runtime must be installed), which detects tables and returns them as pandas DataFrame objects.

import tabula

# read_pdf returns a list of DataFrames, one per detected table.
dfs = tabula.read_pdf("test.pdf", pages='all')

To directly save tables as CSV files, use tabula.convert_into:

import tabula
tabula.convert_into("test.pdf", "output.csv", output_format="csv", pages='all')
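
convert_into writes everything to one CSV file. When a PDF holds several unrelated tables, it can be handier to keep them apart. The sketch below, under the assumption that tabula.read_pdf returns a list of DataFrames (its default behavior in tabula-py 2.x), saves each table to its own file. The names table_csv_path and save_tables_separately are hypothetical helpers, not tabula-py functions.

```python
from pathlib import Path


def table_csv_path(pdf_path, index, out_dir="."):
    """Build an output name such as 'test_table1.csv' for the index-th table."""
    stem = Path(pdf_path).stem
    return Path(out_dir) / f"{stem}_table{index}.csv"


def save_tables_separately(pdf_path, out_dir="."):
    """Write each table found in pdf_path to its own CSV; return the paths."""
    # Imported lazily; tabula-py also needs a Java runtime at call time.
    import tabula

    tables = tabula.read_pdf(pdf_path, pages="all")  # list of DataFrames
    written = []
    for index, table in enumerate(tables, start=1):
        target = table_csv_path(pdf_path, index, out_dir)
        table.to_csv(target, index=False)  # drop the DataFrame's row index
        written.append(target)
    return written
```

For test.pdf with two detected tables, save_tables_separately("test.pdf") would write test_table1.csv and test_table2.csv to the current directory.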

Extract Images from PDFs

Images require a different approach. Install PyMuPDF (the fitz module) and Pillow for image handling:

pip install PyMuPDF Pillow

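import fitz  # PyMuPDF
import io
from PIL import Image

pdf_file = fitz.open("test2.pdf")
for page_index in range(len(pdf_file)):
    page = pdf_file[page_index]
    # get_images() replaces the older getImageList() name.
    for image_index, img in enumerate(page.get_images(), start=1):
        xref = img[0]  # cross-reference number of the image object
        # extract_image() replaces the older extractImage() name.
        base_image = pdf_file.extract_image(xref)
        image_bytes = base_image["image"]
        image_ext = base_image["ext"]
        image = Image.open(io.BytesIO(image_bytes))
        # Pass a path so Pillow opens and closes the output file itself.
        image.save(f"image{page_index+1}_{image_index}.{image_ext}")
pdf_file.close()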

This extracts each image from every page and saves it with an appropriate file extension.
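
Two refinements are worth considering. The bytes returned by extract_image are already encoded in the format named by "ext", so they can be written to disk directly without going through Pillow. Also, a PDF may reference the same image object from several pages (a logo, for instance), so tracking xref values avoids saving duplicates. The following is a sketch under those assumptions; image_filename and save_unique_images are hypothetical names, not PyMuPDF functions.

```python
def image_filename(page_number, image_number, ext):
    """Build a name such as 'image1_2.png' (1-based page and image numbers)."""
    return f"image{page_number}_{image_number}.{ext}"


def save_unique_images(pdf_path):
    """Save each distinct embedded image once; return the written file names."""
    # Imported lazily so image_filename works without PyMuPDF installed.
    import fitz  # PyMuPDF

    seen_xrefs = set()
    saved = []
    with fitz.open(pdf_path) as doc:
        for page_number, page in enumerate(doc, start=1):
            for image_number, img in enumerate(page.get_images(), start=1):
                xref = img[0]
                if xref in seen_xrefs:
                    continue  # already saved from an earlier page
                seen_xrefs.add(xref)
                info = doc.extract_image(xref)
                name = image_filename(page_number, image_number, info["ext"])
                # The bytes are already encoded, so write them unchanged.
                with open(name, "wb") as out:
                    out.write(info["image"])
                saved.append(name)
    return saved
```

Pillow remains useful when the images need converting or resizing before they are saved.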

Write PDFs with Python

To create new PDFs, use the fpdf2 library.

pip install fpdf2

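from fpdf import FPDF

pdf = FPDF()
pdf.add_page()
# fpdf2 treats "Arial" as a deprecated alias for its built-in Helvetica font.
pdf.set_font("Helvetica", size=15)
# new_x/new_y replace the deprecated ln argument: continue at the left margin of the next line.
pdf.cell(200, 10, "Medium Article", new_x="LMARGIN", new_y="NEXT", align='C')
pdf.cell(200, 10, "How To Read and Write PDF files in Python", new_x="LMARGIN", new_y="NEXT", align='C')
pdf.output("medium.pdf")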

The code creates a PDF object, adds a page, sets the font, writes two centered text cells, and saves the file as medium.pdf.
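
A single cell does not wrap, so longer text needs multi_cell, which breaks lines at the cell width. The sketch below assumes fpdf2 2.5.1 or later (for the new_x and new_y arguments) and text limited to Latin-1 characters, which is what the built-in core fonts support. split_paragraphs and write_text_pdf are hypothetical helper names used for illustration.

```python
def split_paragraphs(text):
    """Split text on blank lines and collapse line breaks inside paragraphs."""
    blocks = text.replace("\r\n", "\n").split("\n\n")
    return [" ".join(block.split()) for block in blocks if block.strip()]


def write_text_pdf(text, out_path):
    """Write text to a PDF, one wrapped block per paragraph."""
    # Imported lazily so split_paragraphs works without fpdf2 installed.
    from fpdf import FPDF

    pdf = FPDF()
    pdf.add_page()
    pdf.set_font("Helvetica", size=12)
    for paragraph in split_paragraphs(text):
        # w=0 extends the cell to the right margin; long text wraps.
        # Returning to the left margin keeps the next paragraph full width.
        pdf.multi_cell(0, 8, paragraph, new_x="LMARGIN", new_y="NEXT")
        pdf.ln(4)  # small gap between paragraphs
    pdf.output(out_path)
```

For example, write_text_pdf("First paragraph.\n\nSecond paragraph.", "notes.pdf") would produce a one-page PDF with two wrapped paragraphs.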

These examples demonstrate how to extract and manipulate PDF content programmatically using Python.
