Master PDF Extraction and Creation in Python: Text, Tables, Images, and More

This tutorial walks you through using Python libraries such as PyPDF2, Tabula, PyMuPDF, Pillow, and fpdf2 to extract text, tables, and images from PDF files and to write new PDFs, complete with code examples, step‑by‑step explanations, and sample outputs.

Python Programming Learning Circle

Extract Text from PDFs

Python offers several libraries for reading PDFs; two of the most popular are PyPDF2 and pdfminer. Below is a basic example using PyPDF2 to open a PDF, read a page, and extract its text.

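import PyPDF2
# Open the file in binary mode; the with block closes it automatically
with open('file.pdf', 'rb') as pdf_file:
    pdf_reader = PyPDF2.PdfReader(pdf_file)
    page = pdf_reader.pages[0]  # first page (zero-based index)
    print(page.extract_text())

The code imports the module, opens the file in binary mode, creates a PdfReader object, retrieves the first page, extracts its text with extract_text(), and closes the file when the with block exits. These names apply to PyPDF2 3.0 and later; older releases used PdfFileReader, getPage(), and extractText(), which version 3.0 removed.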

To extract text from all pages, iterate over the reader's pages:

import PyPDF2
with open('file.pdf', 'rb') as pdf_file:
    pdf_reader = PyPDF2.PdfReader(pdf_file)
    # pdf_reader.pages behaves like a list of page objects
    for page in pdf_reader.pages:
        print(page.extract_text())

This prints the text of each page sequentially.
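Printing page by page is fine for a quick look, but it is often handier to collect the text into a single string. The following is a minimal sketch rather than the article's own code: join_page_text and read_pdf_text are hypothetical helper names, and the sketch assumes PyPDF2 3.x, where the reader's pages expose an extract_text() method. Pages that yield no text (scanned pages, for example) contribute an empty string.

```python
def join_page_text(pages, separator="\n"):
    """Concatenate the text of page-like objects that expose extract_text()."""
    # `or ""` guards against pages that return None or an empty result.
    return separator.join((page.extract_text() or "") for page in pages)


def read_pdf_text(path):
    """Return the text of every page in the PDF at `path` (assumes PyPDF2 3.x)."""
    # Imported lazily so join_page_text itself has no third-party dependency.
    from PyPDF2 import PdfReader

    with open(path, "rb") as pdf_file:
        # The pages are read while the file is still open.
        return join_page_text(PdfReader(pdf_file).pages)
```

Calling read_pdf_text('file.pdf') would then return the whole document as one string, ready to be searched or written to a text file.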

Extract Tables from PDFs

PyPDF2 can extract the text inside a table, but it does not preserve the table's rows and columns. For proper table extraction, use the Tabula library (the tabula-py package, a wrapper around tabula-java that requires a Java runtime), which detects tables on the page and converts them to pandas DataFrame objects.

import tabula
# read_pdf returns a list of DataFrames, one per detected table
dfs = tabula.read_pdf("test.pdf", pages='all')

To directly save tables as CSV files, use tabula.convert_into:

import tabula
tabula.convert_into("test.pdf", "output.csv", output_format="csv", pages='all')
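If each detected table should end up in its own file instead of one combined CSV, the list returned by tabula.read_pdf can be written out table by table. This is a small sketch under assumptions: save_tables is a hypothetical helper name, the table_N.csv naming scheme is made up for illustration, and the only thing required of each table is a pandas-style to_csv(path, index=False) method.

```python
from pathlib import Path


def save_tables(tables, out_dir, stem="table"):
    """Write each table-like object to its own CSV file and return the paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for number, table in enumerate(tables, start=1):
        path = out_dir / f"{stem}_{number}.csv"
        # DataFrame.to_csv; index=False leaves out the row-index column
        table.to_csv(path, index=False)
        paths.append(path)
    return paths
```

A call such as save_tables(tabula.read_pdf("test.pdf", pages='all'), "tables") would then produce tables/table_1.csv, tables/table_2.csv, and so on.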

Extract Images from PDFs

Images require a different approach. Install PyMuPDF (the fitz module) and Pillow for image handling:

pip install PyMuPDF Pillow

import fitz
import io
from PIL import Image
pdf_file = fitz.open("test2.pdf")
for page_index in range(len(pdf_file)):
    page = pdf_file[page_index]
    # get_images() lists the images referenced by the page; item 0 is the xref
    for image_index, img in enumerate(page.get_images(), start=1):
        xref = img[0]
        base_image = pdf_file.extract_image(xref)
        image_bytes = base_image["image"]
        image_ext = base_image["ext"]
        image = Image.open(io.BytesIO(image_bytes))
        # Pass the path so Pillow opens and closes the output file itself
        image.save(f"image{page_index+1}_{image_index}.{image_ext}")

This extracts each image from every page and saves it with an appropriate file extension.
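Because the extracted dictionary already holds the encoded image bytes together with their file extension, the bytes can also be written to disk directly, without decoding them through Pillow first. The sketch below illustrates that variation; save_extracted_image is a hypothetical helper name, and it only assumes a dictionary with the "image" and "ext" keys used above.

```python
from pathlib import Path


def save_extracted_image(base_image, page_number, image_number, out_dir="."):
    """Write the encoded bytes of an extracted-image dict to disk and return the path."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / f"image{page_number}_{image_number}.{base_image['ext']}"
    # The bytes are already in the format named by "ext", so no re-encoding is needed.
    path.write_bytes(base_image["image"])
    return path
```

Writing the bytes unchanged avoids a decode and re-encode step; going through Pillow remains useful when the image needs to be converted or resized before saving.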

Write PDFs with Python

To create new PDFs, use the fpdf2 library.

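pip install fpdf2

from fpdf import FPDF
pdf = FPDF()
pdf.add_page()
# Helvetica is a built-in core font; fpdf2 treats "Arial" as a deprecated alias for it
pdf.set_font("Helvetica", size=15)
# new_x/new_y replace the deprecated ln argument: continue at the left margin on the next line
pdf.cell(200, 10, "Medium Article", new_x="LMARGIN", new_y="NEXT", align='C')
pdf.cell(200, 10, "How To Read and Write PDF files in Python", new_x="LMARGIN", new_y="NEXT", align='C')
pdf.output("medium.pdf")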

The code creates a PDF object, adds a page, sets the font, writes two centered text cells, and saves the file as medium.pdf.
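The same pattern extends from two hard-coded cells to any list of lines. What follows is a rough sketch, not part of the original article: write_centered_lines and build_pdf are hypothetical helper names, and the sketch assumes a recent fpdf2 release in which cell() accepts the new_x and new_y arguments.

```python
def write_centered_lines(pdf, lines, line_height=10):
    """Add one centered, full-width cell per line to an FPDF-like object."""
    for line in lines:
        # w=0 stretches the cell to the right margin;
        # new_x/new_y move the cursor to the left margin of the next line.
        pdf.cell(0, line_height, line, align="C", new_x="LMARGIN", new_y="NEXT")


def build_pdf(lines, path):
    """Create a one-page PDF at `path` containing the given lines."""
    # Imported lazily; requires `pip install fpdf2`.
    from fpdf import FPDF

    pdf = FPDF()
    pdf.add_page()
    pdf.set_font("Helvetica", size=15)
    write_centered_lines(pdf, lines)
    pdf.output(path)
```

For example, build_pdf(["Medium Article", "How To Read and Write PDF files in Python"], "medium.pdf") would write the same two lines as the code above.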

Sample output of the generated PDF:

[Image: generated PDF preview]

These examples demonstrate how to extract and manipulate PDF content programmatically using Python.

Written by

Python Programming Learning Circle
