Batch Generate Word Docs from Excel with Python: zipfile & python-docx Solutions

This article explains how to fill a Word template with the data from each row of an Excel file using Python. It offers two approaches, one using python-docx (with win32com for .doc conversion) and another using zipfile to edit the underlying XML, complete with code examples and troubleshooting tips.

Python Crawling & Data Mining

Task Requirement

Given an Excel file containing target data, each row's values need to replace specific placeholders in a Word document and be saved individually.

Task Breakdown

The Word template is a .doc file; the placeholders are English strings highlighted in red within the main body.

Word document preview

The Excel file is well-structured: each row holds the data for one output document, and each column name must match a placeholder in the Word document.

Excel data preview
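
Since every column name doubles as a placeholder, it can help to check up front that each one really occurs in the template. The following is a minimal sketch, assuming the template text has already been extracted into a plain string; `missing_placeholders`, the column names and the sample text are hypothetical and not part of the original scripts.

```python
def missing_placeholders(columns, template_text):
    """Return the column names that never occur in the template text."""
    return [col for col in columns if str(col) not in template_text]


# Hypothetical column names and template text, for illustration only
columns = ["Company", "Contact", "Phone"]
template_text = "Company: Company, contact person: Contact"
print(missing_placeholders(columns, template_text))
```

A non-empty result points at columns whose values would silently never appear in the generated documents.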

Solution 1: Using python-docx to Read Word

Problems Encountered

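The first attempt used the wrong installation command: the package is installed as python-docx (pip install python-docx), even though it is imported as docx. In addition, python-docx cannot open .doc files; the file must first be converted to .docx, which is done here via win32com.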

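Text replacement can also fail because Word splits a paragraph into runs wherever the formatting or input language changes, so a placeholder may end up spread across several runs. Replacing text at the paragraph level instead discards run formatting, which makes underlines disappear. The workaround is to replace text at the run level and to make placeholders unique (e.g., #ColumnName#).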
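
When a placeholder still ends up split across runs, one possible approach is to search the joined text and map the match back onto the individual runs. The sketch below is not the author's method: it works on a plain list of strings standing in for `[run.text for run in paragraph.runs]`, and `replace_across_runs` is a hypothetical helper name.

```python
def replace_across_runs(run_texts, placeholder, value):
    """Replace every occurrence of placeholder, even when it spans several runs.

    The value is written into the run where the match starts; the matched
    characters are removed from any following runs.
    """
    texts = list(run_texts)
    if not placeholder:
        return texts
    search_from = 0
    while True:
        joined = "".join(texts)
        start = joined.find(placeholder, search_from)
        if start == -1:
            return texts
        end = start + len(placeholder)
        pos = 0
        for i, text in enumerate(texts):
            run_start, run_end = pos, pos + len(text)
            pos = run_end
            if run_end <= start or run_start >= end:
                continue  # this run does not overlap the match
            cut_from = max(start, run_start) - run_start
            cut_to = min(end, run_end) - run_start
            # Only the run that contains the match start receives the value
            insert = value if run_start <= start < run_end else ""
            texts[i] = text[:cut_from] + insert + text[cut_to:]
        search_from = start + len(value)  # continue after the inserted value


runs = ["Dear #Com", "pany", "#, hello"]
print(replace_across_runs(runs, "#Company#", "ACME"))
```

With python-docx, the returned strings could then be assigned back to the corresponding runs one by one, so each run keeps its own formatting; that step is left out here because it needs a real document.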

After solving these issues, the following script performs the conversion and replacement:

from copy import deepcopy
from pathlib import Path
from win32com import client as wc  # pip install pywin32
from docx import Document  # pip install python-docx
import pandas as pd

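# Convert .doc to .docx if needed (requires Windows with Word installed)
def doctransform2docx(doc_path):
    doc_path = Path(doc_path).resolve()  # pass Word an absolute path
    suffix = doc_path.suffix.lower()
    assert suffix in ('.doc', '.docx'), 'Input is not a Word document!'
    if suffix == '.docx':
        return Document(str(doc_path))
    docx_path = doc_path.with_suffix('.docx')
    word = wc.Dispatch('Word.Application')
    try:
        doc = word.Documents.Open(str(doc_path))
        doc.SaveAs2(str(docx_path), 16)  # 16 = wdFormatDocumentDefault (.docx)
        doc.Close()
    finally:
        word.Quit()  # always release the Word process
    return Document(str(docx_path))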

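# Fill one copy of the template with the values of one Excel row
def replace_docx(name, values, wordfile, path_name='Company'):
    wordfile_copy = deepcopy(wordfile)  # keep the loaded template untouched
    for col_name, value in zip(name, values):
        if col_name == 'Company':
            path_name = str(value)  # the output file is named after the company
        for paragraph in wordfile_copy.paragraphs:
            for run in paragraph.runs:
                # Replace at the run level, not the paragraph level
                run.text = run.text.replace(col_name, str(value))
    # save_folder is the module-level Path defined in the __main__ block
    wordfile_copy.save(f'{save_folder}/{path_name}.docx')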

if __name__ == '__main__':
    doc_path = r"D:\solve_path\单位.doc"
    excel_path = r"D:\solve_path\信息.xls"
    save_folder = Path('D:/docx_save')
    save_folder.mkdir(parents=True, exist_ok=True)
    data = pd.read_excel(excel_path)
    wordfile = doctransform2docx(doc_path)
    data.apply(lambda x: replace_docx(x.index, x.values, wordfile), axis=1)

Even after successful replacement, the original checkboxes disappeared, indicating formatting loss.

Missing checkboxes after replacement

Solution 2: Using zipfile to Edit the Underlying XML

Since a .docx file is essentially a ZIP archive of XML files, the workflow is:

Convert .doc to .docx via win32com (same as before).

Unzip the .docx and read word/document.xml.

Replace placeholders directly in the XML string.

Re‑zip the folder back into a .docx file.
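
The steps above can also be sketched entirely in memory with the standard library. This is a variation for illustration, not the article's script: it copies the archive member by member instead of extracting to a folder, and it XML-escapes each value so that characters such as & cannot break the document. `fill_docx_bytes` is a hypothetical helper, and the tiny archive built here is a stand-in rather than a real Word file.

```python
import io
import zipfile
from xml.sax.saxutils import escape


def fill_docx_bytes(docx_bytes, mapping, member="word/document.xml"):
    """Return a copy of the archive with placeholders replaced in one member."""
    out = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as src, \
            zipfile.ZipFile(out, "w", zipfile.ZIP_DEFLATED) as dst:
        for info in src.infolist():
            data = src.read(info.filename)
            if info.filename == member:
                xml = data.decode("utf-8")
                for placeholder, value in mapping.items():
                    # Escape &, < and > so the value cannot break the XML
                    xml = xml.replace(placeholder, escape(str(value)))
                data = xml.encode("utf-8")
            # Reusing the member name keeps the internal folder layout
            dst.writestr(info.filename, data)
    return out.getvalue()


# A tiny stand-in archive (not a real Word file) to exercise the function
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as z:
    z.writestr("word/document.xml", "<w:t>#Company#</w:t>")
    z.writestr("[Content_Types].xml", "<Types/>")
result = fill_docx_bytes(buf.getvalue(), {"#Company#": "A&B Ltd"})
with zipfile.ZipFile(io.BytesIO(result)) as z:
    filled_xml = z.read("word/document.xml").decode("utf-8")
print(filled_xml)  # <w:t>A&amp;B Ltd</w:t>
```

Working in memory avoids the temporary folder and its cleanup, at the cost of holding the whole archive in RAM, which is usually acceptable for ordinary Word documents.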

Key pitfalls:

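Create the archive with DEFLATE compression. The standard constant is zipfile.ZIP_DEFLATED; zipfile.zlib.DEFLATED has the same value. Compression by itself does not preserve the folder hierarchy.

When writing files into the archive, give each one a path relative to the unzipped folder (no absolute paths). This is what preserves the original folder hierarchy inside the .docx.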
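
The relative-path pitfall can be seen on a throwaway folder. The sketch below assumes nothing beyond the standard library; `zip_folder` and the two sample files are hypothetical and only mimic the layout of an unzipped .docx.

```python
import tempfile
import zipfile
from pathlib import Path


def zip_folder(folder, archive_path):
    """Zip a folder so that member names are relative to the folder itself."""
    folder = Path(folder)
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for file in sorted(folder.rglob("*")):
            if file.is_file():
                # arcname is the path inside the archive, e.g. "word/document.xml"
                zf.write(file, arcname=file.relative_to(folder).as_posix())


# Demonstration on a temporary folder that mimics an unzipped .docx
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "unzipped"
    (root / "word").mkdir(parents=True)
    (root / "word" / "document.xml").write_text("<w:document/>", encoding="utf-8")
    (root / "[Content_Types].xml").write_text("<Types/>", encoding="utf-8")
    archive = Path(tmp) / "out.docx"
    zip_folder(root, archive)
    with zipfile.ZipFile(archive) as zf:
        names = sorted(zf.namelist())
print(names)  # ['[Content_Types].xml', 'word/document.xml']
```

If the arcname were omitted, the member names would be derived from the full path on disk, so word/document.xml would no longer sit where Word expects it.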

from shutil import rmtree
import zipfile
from copy import deepcopy
from pathlib import Path
from win32com import client as wc  # pip install pywin32
import pandas as pd

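# Convert .doc to .docx (same as before, but returns the path)
def doctransform2docx(doc_path):
    doc_path = Path(doc_path).resolve()  # pass Word an absolute path
    suffix = doc_path.suffix.lower()
    assert suffix in ('.doc', '.docx'), 'Input is not a Word document!'
    if suffix == '.docx':
        return doc_path
    docx_path = doc_path.with_suffix('.docx')
    word = wc.Dispatch('Word.Application')
    try:
        doc = word.Documents.Open(str(doc_path))
        doc.SaveAs2(str(docx_path), 16)  # 16 = wdFormatDocumentDefault (.docx)
        doc.Close()
    finally:
        word.Quit()  # always release the Word process
    return docx_path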

# Unzip .docx and read document.xml
def docx_unzip(docx_path):
    docx_path = Path(docx_path)
    unzip_path = docx_path.with_name(docx_path.stem)  # folder next to the .docx
    with zipfile.ZipFile(docx_path, 'r') as f:
        f.extractall(unzip_path)
    xml_path = unzip_path.joinpath('word/document.xml')  # main document body
    xml_file = xml_path.read_text(encoding='utf-8')
    return unzip_path, xml_path, xml_file

# Re-zip the folder into a .docx
def docx_zipped(docx_path, zipped_path):
    docx_path = Path(docx_path)
    with zipfile.ZipFile(zipped_path, 'w', zipfile.ZIP_DEFLATED) as f:
        for file in docx_path.rglob('*'):
            if file.is_file():
                # Store each file under its path relative to the unzipped folder
                f.write(file, file.relative_to(docx_path).as_posix())

# Remove temporary folder
def remove_folder(path):
    path = Path(path)
    if path.exists():
        rmtree(path)
    else:
        # Raising a plain string is a TypeError in Python 3
        raise FileNotFoundError(f'System cannot find the specified folder: {path}')

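# Replace placeholders in the XML string and save one .docx per Excel row
def replace_docx(name, values, xml_file, xml_path, unzip_path, path_name='Company'):
    xml_path = Path(xml_path)
    xml_file_copy = xml_file  # str is immutable, so no copy is needed
    for col_name, value in zip(name, values):
        if col_name == 'Company':
            path_name = str(value)  # the output file is named after the company
        # Note: values containing &, < or > would need XML-escaping first
        xml_file_copy = xml_file_copy.replace(col_name, str(value))
    # Overwrite document.xml in the unzipped folder, then zip the folder again
    xml_path.write_text(xml_file_copy, encoding='utf-8')
    # save_folder is the module-level Path defined in the __main__ block
    docx_zipped(unzip_path, f'{save_folder}/{path_name}.docx')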

if __name__ == '__main__':
    doc_path = r"D:\solve_path\单位.doc"
    excel_path = r"D:\solve_path\信息.xls"
    save_folder = Path('D:/docx_save')
    save_folder.mkdir(parents=True, exist_ok=True)
    data = pd.read_excel(excel_path)
    docx_path = doctransform2docx(doc_path)
    unzip_path, xml_path, xml_file = docx_unzip(docx_path)
    data.apply(lambda x: replace_docx(x.index, x.values, xml_file, xml_path, unzip_path), axis=1)
    remove_folder(unzip_path)

After re‑zipping, the checkboxes and underlines are preserved.

Final document with preserved formatting

Conclusion

Through trial and error, a method was found that replaces strings in a Word document without altering its original formatting. While this solution works, there may be more elegant approaches.
