Introduction to PyMuPDF: Features, Installation, and Usage Guide
This article provides a comprehensive overview of PyMuPDF, the Python binding for MuPDF, covering its core features, supported document formats, installation methods, essential API usage, and practical examples for manipulating PDFs and other document types with code snippets.
1. PyMuPDF Overview
Before introducing PyMuPDF, it is useful to understand MuPDF, a lightweight PDF, XPS, and e‑book viewer library. MuPDF consists of a core library, command‑line tools, and platform‑specific viewers, offering high‑quality anti‑aliased rendering and precise layout metrics.
MuPDF supports many document formats such as PDF, XPS, OpenXPS, CBZ, EPUB, and FictionBook 2. It allows annotation and form filling on PDF documents, and its command‑line tools can convert documents to HTML, SVG, PDF, CBZ, etc., as well as run JavaScript scripts for document manipulation.
PyMuPDF (version 1.18.17 at the time of writing) is the Python binding for MuPDF (version 1.18.*). It opens files with the extensions ".pdf", ".xps", ".oxps", ".cbz", ".fb2", and ".epub", and can also handle about ten popular image formats such as ".png", ".jpg", ".bmp", and ".tiff" as if they were documents.
2. Features
Decrypt files
Access metadata, links, and bookmarks
Render pages as raster images (e.g., PNG) or vector formats (SVG)
Search text
Extract text and images
Convert to other formats: PDF, (X)HTML, XML, JSON, plain text, etc.
Create, merge, or split PDF pages; insert, delete, rearrange, or modify pages, annotations, and form fields
Extract or insert images and fonts
Full support for embedded files
Re‑format PDFs for duplex printing, spot colors, watermarks, or logos
Comprehensive password protection (encrypt, decrypt, set permissions, user/owner passwords)
Support optional content for images, text, and drawings
Access and modify low‑level PDF structure
Command‑line utility python -m fitz … with features such as encryption/decryption, sub‑document creation, document concatenation, image/font extraction, and layout‑preserving text extraction
New: layout‑preserving text extraction! The script fitzcli.py with the sub‑command gettext can output text that closely matches the original physical layout, including surrounding images and multi‑column tables.
3. Installation
PyMuPDF can be installed from source or via pre‑built wheels. Wheels are available on PyPI for Windows, Linux, and macOS, supporting Python 3.6‑3.9 (64‑bit) and, more recently, manylinux2014_aarch64 for ARM platforms.
Optional dependencies for enhanced functionality include:
Pillow – required for Pixmap.pil_save() and Pixmap.pil_tobytes()
fontTools – required for Document.subset_fonts()
pymupdf-fonts – provides a collection of fonts for text output
Install with:
pip install PyMuPDF
Import the library using the historic name fitz:
import fitz
4. Basic Usage
4.1 Import and Check Version
import fitz
print(fitz.__doc__)
# Example output:
# PyMuPDF 1.18.16: Python bindings for the MuPDF 1.18.0 library.
# Version date: 2021-08-05 00:00:01.
# Built for Python 3.8 on linux (64‑bit).
4.2 Open a Document
doc = fitz.open(filename)  # filename must be an existing file path
The call returns a Document object. Documents can also be opened from memory or created as empty PDFs, and can be used as context managers.
4.3 Document Methods and Properties
Method/Property        Description
Document.page_count    Number of pages (int)
Document.metadata      Metadata dictionary
Document.get_toc()     Retrieve table of contents (list)
Document.load_page()   Load a specific page
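To illustrate the list that Document.get_toc() returns: in its default form each entry is a list [level, title, page], where level starts at 1 and page numbers are 1-based. The sketch below prints such a list as an indented outline. The sample data and the helper name format_toc are made up for this example; with a real file you would pass doc.get_toc() instead.

```python
def format_toc(toc, indent="  "):
    """Turn [level, title, page] entries into indented outline lines."""
    lines = []
    for level, title, page in toc:
        # Level 1 is the top of the hierarchy, so it gets no indentation.
        lines.append(f"{indent * (level - 1)}{title} (page {page})")
    return lines

# Hypothetical sample in the shape returned by doc.get_toc()
sample_toc = [
    [1, "Introduction", 1],
    [2, "Background", 2],
    [1, "Usage", 5],
]

for line in format_toc(sample_toc):
    print(line)
```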
Example:
>>> doc.page_count
1
>>> doc.metadata
{'format': 'PDF 1.7', 'title': '', 'author': '', 'subject': '', 'keywords': '', 'creator': '', 'producer': 'Foxit Reader PDF Printer 10.0.130.3456', 'creationDate': "D:20210810173328+08'00'", 'modDate': "D:20210810173328+08'00'", 'trapped': '', 'encryption': None}
4.4 Retrieve Metadata
Document.metadata returns a Python dict with keys such as producer, format, author, creationDate, etc. Not all fields are guaranteed to contain data for every document.
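Because fields may be empty, code that reads the dictionary usually needs fallbacks. The following sketch also converts a date string of the form shown earlier (D:YYYYMMDDHHmmSS plus a UTC offset) into a datetime. The helper parse_pdf_date is not part of PyMuPDF; it is a hypothetical function written for this example, it handles only the full date form, and the metadata dict is a hand-written stand-in for doc.metadata.

```python
import re
from datetime import datetime, timedelta, timezone

def parse_pdf_date(value):
    """Parse a date like "D:20210810173328+08'00'"; return None if it does not match."""
    match = re.match(
        r"D:(\d{4})(\d{2})(\d{2})(\d{2})(\d{2})(\d{2})"
        r"(?:([+-])(\d{2})'?(\d{2})'?|Z)?",
        value or "",
    )
    if not match:
        return None
    year, month, day, hour, minute, second = (int(g) for g in match.groups()[:6])
    sign, tz_hour, tz_minute = match.groups()[6:]
    if sign:
        offset = timedelta(hours=int(tz_hour), minutes=int(tz_minute))
        tz = timezone(offset if sign == "+" else -offset)
    else:
        tz = timezone.utc  # assumption: treat a missing offset as UTC
    return datetime(year, month, day, hour, minute, second, tzinfo=tz)

# Stand-in for doc.metadata, using values from the example output above
metadata = {"creationDate": "D:20210810173328+08'00'", "title": ""}

created = parse_pdf_date(metadata.get("creationDate"))
title = metadata.get("title") or "(untitled)"  # fallback for an empty field
print(title, created.isoformat())
```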
4.5 Get Table of Contents
toc = doc.get_toc()
4.6 Working with Pages
Pages are the core of MuPDF functionality. Load one with Document.load_page(), for example page = doc.load_page(0) for the first page, then render it as a raster image:
pix = page.get_pixmap()
pix is a Pixmap object containing the RGB (or RGBA) image data. Various options allow control over resolution, colorspace, alpha channel, rotation, cropping, etc.
Save the raster image:
pix.save("page-%i.png" % page.number)
Or obtain a vector SVG image:
svg = page.get_svg_image()
Extract Text and Images
text = page.get_text(opt)
The opt argument can be one of:
"text" – plain text with line breaks
"blocks" – list of text blocks (paragraphs)
"words" – list of words without spaces
"html" – full visual HTML representation
"dict" / "json" – same information as HTML but as Python dict or JSON string
"rawdict" / "rawjson" – superset including XML‑style character details
"xhtml" – text version with embedded images
"xml" – text with full character position and font information
Search Text
areas = page.search_for("mupdf")
This returns a list of rectangles where the string "mupdf" (case‑insensitive) occurs, useful for highlighting or cross‑referencing.
5. PDF‑Specific Operations
Only PDF documents can be modified with PyMuPDF. Other formats are read‑only, but any document can be converted to PDF using Document.convert_to_pdf().
Save changes with Document.save(). You can choose incremental saving (fast, appends changes to the original file) or create a new file.
5.1 Page Manipulation
Document.delete_page() / Document.delete_pages() – remove pages
Document.copy_page() / Document.fullcopy_page() / Document.move_page() – copy or move pages within the same document
Document.select() – keep only selected page numbers, effectively creating a new PDF
Document.insert_page() / Document.new_page() – insert new pages
5.2 Concatenating and Splitting PDFs
Append another PDF:
# Append complete doc2 to the end of doc1
doc1.insert_pdf(doc2)
Split a document (first 10 pages and last 10 pages):
doc2 = fitz.open() # new empty PDF
doc2.insert_pdf(doc1, to_page=9) # first 10 pages
doc2.insert_pdf(doc1, from_page=len(doc1)-10) # last 10 pages
doc2.save("first-and-last-10.pdf")
5.3 Closing Documents
When finished, close the document to release file handles and buffers:
doc.close()
