Fundamentals 5 min read

How to Scan Thousands of Word Files for a Keyword Using Python

This article explains how to use Python's os module and python-docx library to recursively traverse a folder hierarchy, read .docx files, search for a specific keyword such as "nan", and output the names of matching documents, with optional performance tips.

Python Crawling & Data Mining
Python Crawling & Data Mining
Python Crawling & Data Mining
How to Scan Thousands of Word Files for a Keyword Using Python

1. Introduction

A user in a Python community needed a script to scan a folder named "省份" containing many subfolders and Word documents, and to list the files that contain the keyword nan. The solution demonstrates a practical Python automation approach.

2. Implementation

The task can be accomplished by using Python's os module to walk through directories and the python-docx library to read Word file contents. The script searches for the keyword and prints the matching file names.

import os
from docx import Document
# Set the keyword to search for
keyword = 'nan'
# Set the root directory
root_dir = '省份'
# Walk through the directory tree
for foldername, subfolders, filenames in os.walk(root_dir):
    for filename in filenames:
        # Process only .docx files
        if filename.endswith('.docx'):
            filepath = os.path.join(foldername, filename)
            doc = Document(filepath)
            # Check if the keyword is in the document text
            if keyword in doc.full_text:
                print(f'Found keyword in {filename}')

Note that the script only handles files with the .docx extension and prints the file name when the keyword is found. Ensure the python-docx library is installed, for example with pip install python-docx. For large numbers of files, consider using multithreading or multiprocessing to improve performance.

3. Conclusion

The provided solution successfully scans all Word documents under the specified folder hierarchy for the given keyword and outputs the matching file names, illustrating an effective Python automation technique.

PythonOS modulepython-docxkeyword-searchFile AutomationWord Documents
Python Crawling & Data Mining
Written by

Python Crawling & Data Mining

Life's short, I code in Python. This channel shares Python web crawling, data mining, analysis, processing, visualization, automated testing, DevOps, big data, AI, cloud computing, machine learning tools, resources, news, technical articles, tutorial videos and learning materials. Join us!

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.