How to Scan Thousands of Word Files for a Keyword Using Python
This article explains how to use Python's os module and python-docx library to recursively traverse a folder hierarchy, read .docx files, search for a specific keyword such as "nan", and output the names of matching documents, with optional performance tips.
1. Introduction
A user in a Python community needed a script to scan a folder named "省份" ("Provinces"), which contains many subfolders and Word documents, and to list the files that contain the keyword "nan". The solution below is a practical example of Python automation.
2. Implementation
The task can be accomplished by using Python's os module to walk through directories and the python-docx library to read Word file contents. The script searches for the keyword and prints the matching file names.
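import os
from docx import Document
# Set the keyword to search for
keyword = 'nan'
# Set the root directory
root_dir = '省份'
# Walk through the directory tree
for foldername, subfolders, filenames in os.walk(root_dir):
    for filename in filenames:
        # Process only .docx files, skipping Word's temporary "~$" lock files
        if filename.endswith('.docx') and not filename.startswith('~$'):
            filepath = os.path.join(foldername, filename)
            doc = Document(filepath)
            # python-docx has no full-text attribute, so join the paragraph texts
            full_text = '\n'.join(paragraph.text for paragraph in doc.paragraphs)
            # Check if the keyword is in the document text
            if keyword in full_text:
                print(f'Found keyword in {filename}')
Note that the script only handles files with the .docx extension and prints the file name when the keyword is found. It searches body paragraphs only, so text inside tables, headers, and footers is not checked, and the match is a case-sensitive substring test, so 'nan' also matches words such as "finance". Ensure the python-docx library is installed, for example with pip install python-docx. For large numbers of files, consider using multithreading or multiprocessing to improve performance.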
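The performance tip above can be sketched roughly as follows, using the standard library's concurrent.futures module. This is only an illustration under stated assumptions, not the original author's code: the function names (read_docx_text, find_docx_files, file_contains, scan), the reader parameter, and the default of 8 workers are all hypothetical choices made for this example.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def read_docx_text(path):
    """Return the body-paragraph text of one .docx file (requires python-docx)."""
    from docx import Document  # imported lazily so the other helpers work without it
    return "\n".join(p.text for p in Document(path).paragraphs)


def find_docx_files(root_dir):
    """Yield paths of .docx files under root_dir, skipping Word's ~$ lock files."""
    for foldername, _subfolders, filenames in os.walk(root_dir):
        for filename in filenames:
            if filename.lower().endswith(".docx") and not filename.startswith("~$"):
                yield os.path.join(foldername, filename)


def file_contains(path, keyword, reader=read_docx_text):
    """Return (path, matched); files that cannot be read count as non-matching."""
    try:
        return path, keyword in reader(path)
    except Exception:  # e.g. a corrupt or password-protected file
        return path, False


def scan(root_dir, keyword, reader=read_docx_text, max_workers=8):
    """Check files concurrently and return a sorted list of matching paths."""
    paths = list(find_docx_files(root_dir))
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = pool.map(lambda p: file_contains(p, keyword, reader), paths)
        return sorted(path for path, matched in results if matched)


if __name__ == "__main__":
    for match in scan("省份", "nan"):
        print(f"Found keyword in {match}")
```

Threads are used here for simplicity. Because parsing a document is partly CPU-bound, a ProcessPoolExecutor may give a larger speedup; that variant would need a module-level worker function in place of the lambda, since lambdas cannot be pickled for worker processes. Whether either approach helps depends on the machine and the files, so it is worth timing both on a sample folder.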
3. Conclusion
The provided solution scans every .docx file under the specified folder hierarchy for the given keyword and prints the names of the matching files, a practical example of Python automation.
Python Crawling & Data Mining