Convert PDF Text, Tables, and Images to Word Using Python

This guide explains how to use Python libraries such as pdf2docx, pdfplumber, python-docx, and Pillow to extract text, tables, and images from a PDF and reconstruct them in a Word document, including installation steps and a complete example script.

Test Development Learning Exchange

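Converting the text, tables, and images of a PDF file into a Word document is harder than it looks. A PDF typically stores positioned text and drawing instructions rather than paragraphs, tables, and pictures, so the content has to be parsed precisely and then reconstructed in Word.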

Required tools: Install the Python libraries pdf2docx (PDF-to-Word conversion), pdfplumber (text and table extraction), python-docx (creating and editing Word files), and Pillow (image handling), for example with pip install pdf2docx pdfplumber python-docx Pillow
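
Two of these packages are imported under a different name than the one passed to pip (python-docx as docx, Pillow as PIL), which is an easy source of confusion. The following is a minimal sketch for checking that all four are importable before running the script; the helper name missing_packages and the REQUIRED_MODULES mapping are illustrative, not part of any of the libraries.

```python
from importlib.util import find_spec

# Maps the import name to the pip package name.
# python-docx is imported as "docx" and Pillow as "PIL".
REQUIRED_MODULES = {
    "pdf2docx": "pdf2docx",
    "pdfplumber": "pdfplumber",
    "docx": "python-docx",
    "PIL": "Pillow",
}


def missing_packages(required=REQUIRED_MODULES):
    """Return the pip names of the required packages that cannot be imported."""
    return [
        pip_name
        for module, pip_name in required.items()
        if find_spec(module) is None
    ]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing packages; try: pip install " + " ".join(missing))
    else:
        print("All required packages are importable.")
```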

Sample code: The following script (1) converts a PDF to a Word file with pdf2docx, (2) extracts tables from the PDF with pdfplumber and appends them to the Word document, (3) adds images to the document (the extraction step itself is only a placeholder), and (4) saves the final .docx file.

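from pdf2docx import Converter
import pdfplumber
from docx import Document
from docx.shared import Inches
import os

# Convert the PDF to a Word document
def convert_pdf_to_word(pdf_path, word_path):
    cv = Converter(pdf_path)
    try:
        cv.convert(word_path, start=0, end=None)  # start=0, end=None: all pages
    finally:
        cv.close()  # release the PDF even if the conversion fails

# Extract tables from the PDF and append them to the Word document.
# The tables are appended at the end of the document; pdf2docx may already
# have reproduced the same tables in place, so check the output for duplicates.
def add_tables_from_pdf(pdf_path, doc):
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            for table_data in page.extract_tables():
                # Rows can differ in length, so size the table by the widest row
                cols = max((len(row) for row in table_data), default=0)
                if cols == 0:
                    continue  # skip empty tables
                table = doc.add_table(rows=len(table_data), cols=cols)
                for i, row in enumerate(table_data):
                    for j, cell in enumerate(row):
                        # Empty cells come back as None; write them as ""
                        table.cell(i, j).text = str(cell or "")

# Add images extracted from the PDF to the Word document
def add_images_from_pdf(pdf_path, doc):
    temp_image_folder = "temp_images"
    os.makedirs(temp_image_folder, exist_ok=True)
    # This is only a placeholder: in practice, how images are extracted
    # depends on the specific PDF, and most files need more complex logic
    # to pull out embedded images. The implementation is omitted here because
    # pdf2docx already handles images in most cases. We assume the images
    # have been extracted and saved to the temporary folder by some other means.
    for img_name in sorted(os.listdir(temp_image_folder)):
        doc.add_picture(os.path.join(temp_image_folder, img_name), width=Inches(4))

# Main function
def main(pdf_file, output_docx):
    # Convert the PDF directly to a Word document
    convert_pdf_to_word(pdf_file, output_docx)
    # Open the converted Word document for modification
    doc = Document(output_docx)
    # Append the tables from the PDF to the Word document
    add_tables_from_pdf(pdf_file, doc)
    # Append the images from the PDF to the Word document
    add_images_from_pdf(pdf_file, doc)
    # Save the final Word document
    doc.save(output_docx)

# Run the conversion
if __name__ == "__main__":
    main("example.pdf", "output.docx")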
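
The image step in add_images_from_pdf only inserts whatever files already sit in temp_images. One possible way to fill that folder, sketched here under assumptions, is to use the image bounding boxes that pdfplumber reports in page.images and render each region to a PNG. This renders the page area where the image appears rather than recovering the original embedded image data, so quality depends on the chosen resolution. The helper names clamp_bbox and extract_images_to_folder are hypothetical, and the clamping is there because pdfplumber's crop, by default, rejects boxes that extend past the page.

```python
import os


def clamp_bbox(bbox, page_bbox):
    """Clip (x0, top, x1, bottom) to the page; return None if nothing is left."""
    x0 = max(bbox[0], page_bbox[0])
    top = max(bbox[1], page_bbox[1])
    x1 = min(bbox[2], page_bbox[2])
    bottom = min(bbox[3], page_bbox[3])
    if x1 <= x0 or bottom <= top:
        return None
    return (x0, top, x1, bottom)


def extract_images_to_folder(pdf_path, folder="temp_images", resolution=150):
    """Render each image region of the PDF to a PNG file; return the file paths."""
    import pdfplumber  # imported here so clamp_bbox stays usable without it

    os.makedirs(folder, exist_ok=True)
    saved = []
    with pdfplumber.open(pdf_path) as pdf:
        for page_no, page in enumerate(pdf.pages, start=1):
            for img_no, img in enumerate(page.images, start=1):
                bbox = clamp_bbox(
                    (img["x0"], img["top"], img["x1"], img["bottom"]),
                    page.bbox,
                )
                if bbox is None:
                    continue  # image lies entirely outside the page area
                # Zero-padded names sort in page and image order
                out_path = os.path.join(
                    folder, f"page{page_no:03d}_img{img_no:03d}.png"
                )
                page.crop(bbox).to_image(resolution=resolution).save(out_path)
                saved.append(out_path)
    return saved
```

Calling extract_images_to_folder(pdf_file) in main before add_images_from_pdf would give the placeholder loop something to insert. Because pdf2docx already carries over images in most cases, the result may contain each picture twice.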

Notes: The script is a basic framework rather than a general solution. Complex tables and embedded images usually need custom logic tailored to the specific PDF. Some formatting may be lost during conversion, and highly complex documents may call for more specialized tools.
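
As one example of the custom table logic mentioned above, rows returned by pdfplumber's extract_tables can have different lengths and contain None for empty cells. The sketch below pads every row to the same width before writing it with python-docx. The helper names normalize_table and add_table are hypothetical, and the optional style argument assumes the named style exists in the target document ("Table Grid" is present in python-docx's default template).

```python
def normalize_table(table_data):
    """Return a rectangular list of string rows; None cells become ""."""
    rows = [row for row in (table_data or []) if row]
    if not rows:
        return []
    width = max(len(row) for row in rows)
    return [
        ["" if cell is None else str(cell).strip() for cell in row]
        + [""] * (width - len(row))
        for row in rows
    ]


def add_table(doc, table_data, style=None):
    """Append one normalized table to a python-docx Document.

    Returns the new table, or None if the input had no usable rows.
    """
    rows = normalize_table(table_data)
    if not rows:
        return None
    table = doc.add_table(rows=len(rows), cols=len(rows[0]))
    if style is not None:
        table.style = style  # e.g. "Table Grid" for visible borders
    for i, row in enumerate(rows):
        for j, text in enumerate(row):
            table.cell(i, j).text = text
    return table
```

Inside add_tables_from_pdf, each table_data could then be passed to add_table(doc, table_data) instead of being written cell by cell.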
