
Python Script for Extracting Text from PDF Files Using PyPDF2

This article introduces a Python utility built with PyPDF2 that extracts text from PDF files, saves it as a TXT file, and provides an interactive command‑line interface with error handling, usage instructions, and code examples for easy document processing.

Python Programming Learning Circle

In the digital age, PDF files are ubiquitous, and extracting their text programmatically can save time compared to manual copying. This guide presents a Python tool based on the PyPDF2 library that reads PDF files, extracts all page text, and writes the output to a similarly named TXT file.

Background and Requirements

Common scenarios for converting PDFs to plain text include importing e‑book content into note‑taking apps, extracting report data for analysis, and preparing data for natural language processing tasks.

Feature Overview

Text Extraction: Reads each page of a PDF and extracts its text.

File Handling: Saves the extracted text to a TXT file using UTF‑8 encoding.

Error Management: Handles missing files, files without a .pdf extension, and other exceptions with clear messages.

Interactive Interface: Prompts the user for a file path and allows repeated processing or graceful exit.

Technical Implementation

Dependencies

os: For file path operations (part of the standard library).

PyPDF2: For reading PDF files and extracting text.

Installation

Install PyPDF2 via:

<code>pip install PyPDF2</code>
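To confirm the install before running the script, a short check along these lines can help. It is a minimal sketch that uses only the standard library (importlib.metadata, available from Python 3.8); the helper name installed_version is illustrative and not part of PyPDF2.

```python
from importlib import metadata


def installed_version(package_name):
    """Return the installed version string of a package, or None if it is missing."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_version("PyPDF2")
    if version is None:
        print("PyPDF2 is not installed; run: pip install PyPDF2")
    else:
        print(f"PyPDF2 {version} is installed")
```

The check reads package metadata rather than importing the library, so it reports a missing install without raising ImportError.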

Core Function: pdf_to_txt(pdf_path)

Function: Extracts text from the specified PDF and saves it as a TXT file.

Logic: Verify the file exists and has a .pdf extension. Open the PDF with PdfReader and determine the number of pages. Iterate over each page, calling extract_text() and concatenating results. Write the combined text to a TXT file with the same base name. Return a boolean indicating success.

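Error Handling: Raises FileNotFoundError for missing files and ValueError for paths that do not end in .pdf (the check looks at the extension only, not the file contents). A general Exception handler catches anything else, such as a corrupt PDF. In every case the function prints a message and returns False.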
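The validation and concatenation steps described above can be sketched as two small helpers. This is an illustration under stated assumptions, not the article's exact code: the names resolve_txt_path and pages_to_text are hypothetical, and pages_to_text accepts any objects that expose an extract_text() method, so it can be tried without a real PDF.

```python
import os


def resolve_txt_path(pdf_path):
    """Validate a PDF path and return the path of the TXT file to write.

    Raises FileNotFoundError if the path does not exist and ValueError if
    it does not end in .pdf (an extension check only, not a content check).
    """
    if not os.path.exists(pdf_path):
        raise FileNotFoundError(f"PDF file not found: {pdf_path}")
    if not pdf_path.lower().endswith(".pdf"):
        raise ValueError(f"Not a .pdf file: {pdf_path}")
    base, _ext = os.path.splitext(pdf_path)
    return f"{base}.txt"


def pages_to_text(pages):
    """Join the text of every page, ending each page with a newline.

    `pages` is any iterable of objects with an extract_text() method, such
    as the pages of a PdfReader. A None result is treated as an empty page.
    """
    return "".join((page.extract_text() or "") + "\n" for page in pages)
```

Keeping the checks separate from the file I/O makes each step easy to exercise on its own, for example with stub page objects in place of a parsed document.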

Entry Point: main()

Displays a welcome message and prompts the user for a PDF path (or 'q' to quit).

Calls pdf_to_txt. On success, asks whether to process another file; on failure, asks the user to check the path and prompts again.

Repeats the y/n question until the answer is valid, then continues or exits.
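The y/n confirmation step can be isolated into a helper like the one below. It is a sketch rather than the script's own code: ask_yes_no is a hypothetical name, and the ask parameter (defaulting to the built-in input) is an assumption added so the loop can be driven by canned replies in a test.

```python
def ask_yes_no(prompt, ask=input):
    """Repeat the prompt until the reply is 'y' or 'n'; return True for 'y'."""
    while True:
        choice = ask(prompt).strip().lower()
        if choice in ("y", "n"):
            return choice == "y"
        print("Please enter 'y' or 'n'")
```

In the interactive script, a call such as ask_yes_no("Process another file? (y/n): ") would stand in for the inner while loop of main().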

Usage Instructions

Ensure Python and PyPDF2 are installed.

Save the script as pdf_to_txt.py.

Run it from the terminal with python pdf_to_txt.py and follow the prompts.

Important Notes

The extract_text() method returns text only for PDFs that contain an embedded text layer; scanned, image‑only PDFs yield little or no text and require OCR tools such as Tesseract.

UTF‑8 encoding is used to support multilingual content.

Existing TXT files with the same name will be overwritten.
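The notes above suggest two small guards, sketched here under stated assumptions. looks_scanned and write_text are hypothetical helpers, not part of PyPDF2 or of the script. The first is only a heuristic: empty output may indicate a scanned PDF, but it may also mean the document is simply blank.

```python
def looks_scanned(page_texts):
    """Return True when no page yielded any non-whitespace text.

    This is a heuristic only: an empty result suggests an image-only
    (scanned) PDF that needs OCR, but it can also mean a blank document.
    """
    return not any((text or "").strip() for text in page_texts)


def write_text(txt_path, text, overwrite=False):
    """Write text as UTF-8; refuse to replace an existing file unless asked.

    Mode "x" raises FileExistsError if the file already exists, whereas
    mode "w" (used by the script) silently truncates it.
    """
    mode = "w" if overwrite else "x"
    with open(txt_path, mode, encoding="utf-8") as txt_file:
        txt_file.write(text)
```

Passing overwrite=True reproduces the script's current behavior; the default protects an existing TXT file from being replaced by accident.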

Full Code

<code>import os

import PyPDF2

def pdf_to_txt(pdf_path):
    """Extract all text from a PDF and save it beside it as a .txt file."""
    try:
        # Check that the file exists
        if not os.path.exists(pdf_path):
            raise FileNotFoundError("The specified PDF file was not found")
        # Check the file extension (this does not inspect the file contents)
        if not pdf_path.lower().endswith('.pdf'):
            raise ValueError("The file must be in PDF format")
        file_name = os.path.splitext(pdf_path)[0]
        txt_path = f"{file_name}.txt"
        # Open the PDF in binary mode and read it page by page
        with open(pdf_path, 'rb') as pdf_file:
            pdf_reader = PyPDF2.PdfReader(pdf_file)
            num_pages = len(pdf_reader.pages)
            text = ""
            for page in pdf_reader.pages:
                # Guard against a page that yields no text
                text += (page.extract_text() or "") + "\n"
        # Write the combined text to the TXT file (overwrites an existing file)
        with open(txt_path, 'w', encoding='utf-8') as txt_file:
            txt_file.write(text)
        print(f"\nSuccessfully extracted {num_pages} page(s)!")
        print(f"Text saved to: {txt_path}")
        return True
    except (FileNotFoundError, ValueError) as e:
        # Problems with the path supplied by the user
        print(f"\nError: {e}")
        return False
    except Exception as e:
        # Anything else, such as a corrupt or encrypted PDF
        print(f"\nAn error occurred: {e}")
        return False

def main():
    print("Welcome to the PDF text extraction tool!")
    print("Enter the full path to a PDF file (or 'q' to quit)")
    while True:
        pdf_path = input("\nPDF file path: ").strip()
        if pdf_path.lower() == 'q':
            print("Program exited")
            break
        success = pdf_to_txt(pdf_path)
        if success:
            # Keep asking until the user answers 'y' or 'n'
            while True:
                choice = input("\nProcess another file? (y/n): ").lower().strip()
                if choice in ['y', 'n']:
                    break
                print("Please enter 'y' or 'n'")
            if choice == 'n':
                print("Program exited")
                break
        else:
            print("Please check the file path and try again")

if __name__ == "__main__":
    main()
</code>

Conclusion

This simple tool demonstrates Python's practicality in document processing. By leveraging PyPDF2, users can quickly extract text from PDFs and handle the results in a user‑friendly way. For large‑scale tasks, the script can be extended to support batch processing or integrated with OCR for scanned documents.
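As one possible shape for the batch extension mentioned above, the sketch below walks a folder and hands each PDF to a converter. convert_folder is a hypothetical helper; it assumes the converter is a callable that takes a path string and returns True on success, which matches the pdf_to_txt function in this article.

```python
from pathlib import Path


def convert_folder(folder, convert):
    """Run `convert` on every .pdf file directly inside `folder`.

    `convert` is any callable that takes a path string and returns True on
    success. Returns two lists of file names: (succeeded, failed).
    """
    succeeded, failed = [], []
    for pdf_file in sorted(Path(folder).iterdir()):
        # Skip subfolders and anything without a .pdf extension
        if not pdf_file.is_file() or pdf_file.suffix.lower() != ".pdf":
            continue
        if convert(str(pdf_file)):
            succeeded.append(pdf_file.name)
        else:
            failed.append(pdf_file.name)
    return succeeded, failed
```

With the script saved as pdf_to_txt.py, a call such as convert_folder("reports", pdf_to_txt) would process a folder named reports (an example name) and report which files failed, since pdf_to_txt already returns False instead of raising.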
