Batch Replacing DOCX Watermark Text with Python: DOC vs. DOCX and Shape Traversal Pitfalls
This article walks through a fan‑requested Python automation that replaces artistic watermark text in shipping‑bill DOCX files, detailing the initial python‑docx attempt, the shortcomings of win32com shape traversal, and the final reliable solution using zipfile‑based XML replacement.
A follower asked for a way to replace the artistic text inside a shipping‑bill document while keeping all font, color, and style attributes unchanged. The task requires batch processing of .doc/.docx files, extracting the watermark text from floating shapes in the header, and substituting it with a new string.
The first solution converts the source .doc to .docx, installs python-docx, and runs a function that iterates over each document section and each shape found via XPath .//a:blip/../..//a:txBody. It collects all text runs, matches the exact old string, clears the original nodes, writes the new text into the first run, and removes any extra empty runs. The code snippet is:
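from docx import Document

# ============ Configuration: change the file names as needed ============
source_file = "Original.docx"  # source .docx file name
output_file = "Modified_BillOfLading_Original.docx"  # output file name
old_text = "Non-Negotiable"
new_text = "xxx Original"
# =========================================================================

# DrawingML namespace map; find()/findall() need it to resolve the "a:" prefix
A_NS = {"a": "http://schemas.openxmlformats.org/drawingml/2006/main"}


def replace_shape_text(doc_path, out_path, target_old, target_new):
    doc = Document(doc_path)
    # Loop over every section of the document
    for section in doc.sections:
        # 1. Walk all floating shapes in the body (the large watermark
        #    text on a bill of lading is a floating shape)
        for shape in doc.element.xpath('.//a:blip/../..//a:txBody'):
            # Get the text paragraph
            text_body = shape.find('.//a:p', namespaces=A_NS)
            if text_body is None:
                continue
            full_text = ""
            run_list = []
            # Collect all text runs
            for r in text_body.findall('.//a:r', namespaces=A_NS):
                t_node = r.find('.//a:t', namespaces=A_NS)
                if t_node is not None and t_node.text:
                    full_text += t_node.text
                    run_list.append((r, t_node))
            # Match the old target text
            if full_text.strip() == target_old.strip():
                # Replace only the text, reusing the original font, size,
                # color, position and outline
                # Clear the existing text nodes
                for r, t in run_list:
                    t.text = ""
                # Write the new text into the first text node so that it
                # inherits all style attributes
                if run_list:
                    first_t = run_list[0][1]
                    first_t.text = target_new
                # Remove the leftover empty runs to avoid blank placeholders
                for idx in range(1, len(run_list)):
                    run_list[idx][0].getparent().remove(run_list[idx][0])
    # Save the modified document
    doc.save(out_path)
    print(f"✅ Replacement complete! New file generated: {out_path}")


replace_shape_text(source_file, output_file, old_text, new_text)

Although the script runs without errors, the resulting document still contains the original artistic text; the replacement never takes effect. The author attributes this to python-docx's limited support for complex floating shapes. A likely contributing factor is that doc.element is only the main document body, so the XPath never visits shapes stored in the header parts, which is where this watermark lives.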
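Before choosing a replacement strategy, it can help to check which part of the package actually holds the watermark string. The following is a minimal diagnostic sketch, not part of the original article: the function name find_parts_containing is hypothetical, and it assumes the XML parts are UTF-8 encoded and that the target text is stored as one contiguous string.

```python
import zipfile


def find_parts_containing(docx_path, needle):
    """Return the names of the XML parts whose raw text contains `needle`.

    Hypothetical helper for illustration; assumes UTF-8 encoded parts.
    """
    hits = []
    with zipfile.ZipFile(docx_path) as zf:
        for name in zf.namelist():
            if not name.endswith(".xml"):
                continue  # skip images, relationship files and other non-XML members
            text = zf.read(name).decode("utf-8", errors="replace")
            if needle in text:
                hits.append(name)
    return hits
```

Calling find_parts_containing("Original.docx", "Non-Negotiable") returns a list of part names. If the hits are header parts rather than the main document part, a search that starts from doc.element cannot reach them.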
Following AI advice, the author switches to the pywin32 library, which interacts directly with the Word COM interface and can manipulate header shapes and text boxes more reliably. After installing pywin32 and executing the revised script, the program again finishes without exceptions, yet the watermark remains unchanged, indicating that even the COM‑based approach fails for this particular document structure.
Finally, the author adopts a low‑level method: treating the .docx file as a zip archive, extracting all XML parts, performing a plain string replacement of the target phrase, and repackaging the archive. The complete implementation is:
import zipfile
import os
import shutil

# Configuration
INPUT_FILE = "Original.docx"
OUTPUT_FILE = "BillOfLading_replaced.docx"
OLD_STR = "xxx Non-Negotiable"
NEW_STR = "xxx Original"
TEMP_FOLDER = "temp_unzip_cache"

# Remove any stale temporary folder
if os.path.exists(TEMP_FOLDER):
    shutil.rmtree(TEMP_FOLDER)
os.mkdir(TEMP_FOLDER)

# 1. Unzip the docx (the superfluous encoding argument has been removed)
with zipfile.ZipFile(INPUT_FILE, "r") as zip_ref:
    zip_ref.extractall(TEMP_FOLDER)

# 2. Walk every XML file and replace the text globally
for root, _, files in os.walk(TEMP_FOLDER):
    for filename in files:
        if filename.endswith(".xml"):
            full_path = os.path.join(root, filename)
            with open(full_path, "r", encoding="utf-8") as f:
                xml_content = f.read()
            if OLD_STR in xml_content:
                xml_content = xml_content.replace(OLD_STR, NEW_STR)
                with open(full_path, "w", encoding="utf-8") as f:
                    f.write(xml_content)

# 3. Repack everything into a new docx
with zipfile.ZipFile(OUTPUT_FILE, "w", zipfile.ZIP_DEFLATED) as new_zip:
    for root, _, files in os.walk(TEMP_FOLDER):
        for filename in files:
            full_path = os.path.join(root, filename)
            relative_path = os.path.relpath(full_path, TEMP_FOLDER)
            new_zip.write(full_path, relative_path)

# Delete the temporary cache folder
shutil.rmtree(TEMP_FOLDER)
print(f"✅ Done. Output file: {OUTPUT_FILE}")

This approach successfully produces a new DOCX in which the watermark text has been replaced, satisfying the original requirement.
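The same idea can be sketched without a temporary folder by copying the archive entry by entry in memory, which also keeps the original entry order. This is an illustrative variant, not the author's script: the names replace_in_docx and batch_replace are hypothetical. It assumes UTF-8 XML parts and that the phrase sits in a single text node (text split across several runs would not match). It escapes &, < and > because that is how XML text nodes store them; attribute values may additionally escape quotes, which this sketch does not handle.

```python
import zipfile
from pathlib import Path
from xml.sax.saxutils import escape


def replace_in_docx(src_path, dst_path, old, new):
    """Copy a .docx to dst_path, replacing `old` with `new` in its XML parts.

    Returns the number of occurrences replaced. src_path and dst_path
    must be different files.
    """
    old_xml, new_xml = escape(old), escape(new)
    count = 0
    with zipfile.ZipFile(src_path) as zin, \
            zipfile.ZipFile(dst_path, "w", zipfile.ZIP_DEFLATED) as zout:
        for info in zin.infolist():
            if info.is_dir():
                continue  # directory entries carry no data
            data = zin.read(info.filename)
            if info.filename.endswith(".xml"):
                text = data.decode("utf-8")
                hits = text.count(old_xml)
                if hits:
                    count += hits
                    data = text.replace(old_xml, new_xml).encode("utf-8")
            # Reusing the ZipInfo keeps each entry's name and timestamp
            zout.writestr(info, data, compress_type=zipfile.ZIP_DEFLATED)
    return count


def batch_replace(in_dir, out_dir, old, new):
    """Apply replace_in_docx to every .docx in in_dir; return {file name: hit count}."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    results = {}
    for src in sorted(Path(in_dir).glob("*.docx")):
        if src.name.startswith("~$"):
            continue  # skip Word's lock files for documents that are currently open
        results[src.name] = replace_in_docx(src, out / src.name, old, new)
    return results
```

A hit count of 0 for a file is a useful signal in a batch run: it means the phrase was not found as one contiguous string, so that document needs a closer look rather than being treated as converted.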
The article concludes by summarizing three major pitfalls encountered: (1) treating a .doc file as a .docx and attempting direct zip extraction leads to initial errors; (2) using win32com to enumerate header shapes often returns zero matches for artistic text; (3) WPS compatibility issues cause .Shapes attribute errors. The author promises a follow‑up article that will address further client‑driven adjustments.
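Pitfall (1) can be caught early by checking a file's content rather than trusting its extension. The sketch below is an assumption-laden illustration, not code from the article: sniff_word_format is a hypothetical name, the 8-byte signature is the one used by legacy compound files such as .doc, and the check for word/document.xml relies on the conventional location of the main part in a .docx package.

```python
import zipfile

# Signature at the start of legacy compound files such as .doc
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")


def sniff_word_format(path):
    """Classify a file as 'doc', 'docx' or 'unknown' from its content."""
    with open(path, "rb") as f:
        head = f.read(8)
    if head == OLE2_MAGIC:
        return "doc"  # binary format: cannot be opened as a zip archive
    if zipfile.is_zipfile(path):
        with zipfile.ZipFile(path) as zf:
            # Assumption: the main part lives at its conventional path
            if "word/document.xml" in zf.namelist():
                return "docx"
    return "unknown"
```

Running this check first lets a batch job set aside files that are really .doc for conversion instead of failing partway through zip extraction.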
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact us and we will review it promptly.
Python Crawling & Data Mining
