Introduction to PyMuPDF: Installation, Features, and Usage Guide

This article provides a comprehensive overview of PyMuPDF, the Python binding for MuPDF, covering its installation, core features, document and page manipulation methods, text and image extraction, PDF editing capabilities, and essential code examples for practical use.

Python Programming Learning Circle

Sep 28, 2023

PyMuPDF is the Python interface to MuPDF, a lightweight viewer and library for PDF, XPS, EPUB, CBZ, and other document formats, offering high‑quality anti‑aliased rendering and fast performance.

The library supports a wide range of functions: decryption, metadata access, raster (PNG) and vector (SVG) rendering, text search, extraction of text and images, conversion to formats such as HTML, XML, and JSON, and full PDF manipulation such as creating, merging, splitting, rotating, and annotating pages.

Installation is straightforward via pip install PyMuPDF. Optional dependencies extend it: Pillow for image saving, and fontTools and pymupdf‑fonts for font handling.

Basic usage starts with importing the library (import fitz) and checking the version (print(fitz.__doc__)). A document is opened with doc = fitz.open(filename), after which Document attributes and methods such as doc.page_count, doc.metadata, doc.get_toc(), and doc.load_page(pno) provide access to the page count, metadata, table of contents, and individual pages.

Pages can be accessed by index (page = doc[pno]) or loaded explicitly (page = doc.load_page(pno)). Iterating over pages is supported (for page in doc:), and each page offers methods to retrieve links (links = page.get_links()), annotations (for annot in page.annots():), and form-field widgets (for field in page.widgets():).

Rendering a page to a raster image is done with pix = page.get_pixmap(), and the resulting pixmap can be saved (pix.save("page-%i.png" % page.number)). A vector rendering of the page is available as an SVG string via page.get_svg_image(). The pixmap object also exposes its width, height, and color space.

Text and image extraction uses page.get_text(opt) where opt can be "text", "blocks", "words", "html", "json", "xml", etc., allowing fine‑grained control over the output format. Searching for a string on a page is performed with areas = page.search_for("mupdf"), returning rectangles that locate each occurrence.

PDF‑specific operations include deleting, copying, moving, and inserting pages via methods such as Document.delete_page(), Document.copy_page(), Document.move_page(), Document.insert_page(), and Document.new_page(). Documents can be merged with Document.insert_pdf(), and split by creating a new empty PDF with fitz.open() and copying a page range into it with insert_pdf(). Changes are saved with Document.save(); passing incremental=True appends only the changes to the existing file, which is faster but requires saving back to the file the document was opened from.

Finally, resources are released by calling Document.close(), which closes the underlying file and frees associated buffers.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact us and we will review it promptly.
