Introduction to PyMuPDF: Features, Installation, and Usage Guide
This article provides a comprehensive overview of PyMuPDF, the Python binding for MuPDF, covering its core features, supported document formats, installation methods, and detailed code examples for opening, rendering, extracting, and manipulating PDF and other documents.
PyMuPDF Overview
PyMuPDF is the Python binding for the MuPDF library, offering a lightweight, high‑performance engine for viewing and manipulating PDF, XPS, OpenXPS, CBZ, EPUB, FictionBook 2 and several image formats.
Key Features
Decrypt files and access metadata, links, and bookmarks.
Render pages as raster images (PNG) or vector graphics (SVG).
Search for text, extract text in various formats (plain, HTML, XML, JSON, etc.) and extract images.
Convert documents to other formats such as HTML, SVG, PDF, XML, JSON, and plain text.
Fully support embedded files, password protection, annotations, and form fields.
Command‑line utilities for encryption/decryption, optimization, sub‑document creation, document concatenation, and layout‑preserving text extraction.
Installation
PyMuPDF can be installed from source or from pre-built wheels on PyPI. It runs on Windows, Linux, and macOS with 64-bit Python 3.6 to 3.9. Optional packages such as Pillow, fontTools, and pymupdf-fonts extend its functionality.
<code>pip install PyMuPDF</code>
Basic Usage
Import the library (the import name is fitz for historical reasons) and open a document:
<code>import fitz

doc = fitz.open("sample.pdf")  # creates a Document object</code>
You can iterate over pages, load a specific page, or use the document as a context manager.
<code>for page in doc:
    # process each page
    pass

page = doc.load_page(0)  # or doc[0]
</code>
Page Operations
Render a page to a pixmap with pix = page.get_pixmap() and save it as PNG with pix.save("page-%i.png" % page.number).
Render to SVG: svg = page.get_svg_image().
Extract text: text = page.get_text("text"), or use other options such as "html", "xml", "json", "blocks", etc.
Search for a string: areas = page.search_for("mupdf") returns a list of rectangles.
Access links, annotations, and form fields via page.get_links(), page.annots(), and page.widgets().
PDF‑Specific Operations
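Only PDF documents can be modified (e.g., insert, delete, move, or rotate pages). Use methods such as Document.delete_page(), Document.insert_page(), and Document.move_page() to change the page structure, then write the result with Document.save(). Passing incremental=True appends the changes to the original file instead of rewriting it, which is faster but only works when saving back to the file the document was opened from. Document.close() releases the document; it does not save, so any changes not yet written are lost.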
Document Concatenation and Splitting
Combine PDFs with Document.insert_pdf() or extract subsets by creating a new empty document and inserting selected pages.
<code># Append doc2 to doc1
doc1.insert_pdf(doc2)
# Create a new PDF with first 10 and last 10 pages of doc1
new_doc = fitz.open()
new_doc.insert_pdf(doc1, to_page=9)
new_doc.insert_pdf(doc1, from_page=len(doc1)-10)
new_doc.save("first-and-last-10.pdf")
</code>
The library provides a rich API for low‑level PDF structure manipulation, metadata access, and conversion to other formats, making it suitable for a wide range of document‑processing tasks.