Extract Word Document Keywords, Frequencies, and POS with Python
This guide shows how to use the Python libraries python-docx (imported as docx), jieba, NLTK, and openpyxl to read a Word file, tokenize its text, count word frequencies, assign part‑of‑speech tags, and export the results to an Excel spreadsheet. It also includes troubleshooting tips for common errors.
In response to a fan’s request to extract keywords, their frequencies, and part‑of‑speech tags from a Word document and output them to an Excel file, the author first tried using the win32com library but encountered an AttributeError: 'str' object has no attribute 'tag'. A subsequent solution uses the python-docx, jieba, NLTK, and openpyxl libraries.
Step‑by‑step solution
Read all text from the Word document with docx.Document.
Tokenize the text using jieba.cut and filter out irrelevant tokens.
Count word frequencies with collections.Counter.
Tag each word’s part of speech using nltk.pos_tag.
Create an Excel workbook with openpyxl.Workbook and add a worksheet.
Write the keywords, frequencies, and POS tags into separate columns.
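The filtering and counting steps can be tried in isolation before touching a Word file. The following is a minimal sketch that assumes the text has already been tokenized; the token list and the helper name keep are made up for illustration and stand in for the output of jieba.cut.

```python
from collections import Counter

# Hypothetical, already-tokenized input standing in for the output of jieba.cut.
tokens = ["数据", "的", "分析", "数据", "2023", "挖掘", "和", "数据", "分析"]


def keep(token):
    """Keep tokens longer than one character that are not purely numeric."""
    return len(token) > 1 and not token.isnumeric()


# Count only the tokens that pass the filter.
word_counts = Counter(t for t in tokens if keep(t))

# most_common() yields (word, count) pairs sorted by descending count.
for word, count in word_counts.most_common():
    print(word, count)
```

Iterating over most_common() rather than items() is one way to get the rows sorted by frequency before they are written to the spreadsheet.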
Full implementation code:
import docx
import jieba
from collections import Counter
from openpyxl import Workbook
from nltk import pos_tag
# Read Word document
doc = docx.Document('test.docx')
# Join paragraphs with newlines so text from adjacent paragraphs does not run together
text = "\n".join(para.text for para in doc.paragraphs)
# Tokenize and filter
words = [w for w in jieba.cut(text) if len(w) > 1 and not w.isnumeric()]
# Frequency count
word_counts = Counter(words)
# POS tagging
pos_dict = dict(pos_tag(list(word_counts)))
# Prepare data
keywords = []
for word, count in word_counts.items():
    pos = pos_dict.get(word, '')
    keywords.append([word, count, pos])
# Write to Excel
wb = Workbook()
sheet = wb.active
sheet['A1'] = '关键词'
sheet['B1'] = '词频'
sheet['C1'] = '词性'
for i, row in enumerate(keywords, start=2):
    sheet[f'A{i}'] = row[0]
    sheet[f'B{i}'] = row[1]
    sheet[f'C{i}'] = row[2]
wb.save('keywords.xlsx')
Before running the script, install the required packages:
pip install jieba nltk openpyxl python-docx
If NLTK reports “Resource averaged_perceptron_tagger not found”, download it via the NLTK Downloader, for example by running nltk.download('averaged_perceptron_tagger') in a Python session. If the error message names a different resource, pass that name instead.
After execution, the generated Excel file contains three columns: keyword, frequency, and part‑of‑speech tag.
The tags returned by nltk.pos_tag follow the Penn Treebank tag set used by NLTK's default English tagger. Their meanings can be looked up in the NLTK documentation.
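To make the tag column easier to read, the tags could be mapped to short descriptions before the rows are written. The sketch below is only an illustration: the dictionary covers a small hand-picked subset of Penn Treebank tags (the tag set NLTK's default English tagger emits), and the names TAG_DESCRIPTIONS, describe_tag, and the sample rows are hypothetical, not part of the original script.

```python
# A small, hand-picked subset of Penn Treebank tags; the full tag set is larger.
TAG_DESCRIPTIONS = {
    "NN": "noun, singular or mass",
    "NNS": "noun, plural",
    "NNP": "proper noun, singular",
    "VB": "verb, base form",
    "VBD": "verb, past tense",
    "JJ": "adjective",
    "RB": "adverb",
    "CD": "cardinal number",
}


def describe_tag(tag):
    """Return a readable description, or the raw tag if it is not in the subset."""
    return TAG_DESCRIPTIONS.get(tag, tag)


# Hypothetical rows in the same [word, count, pos] shape the script builds.
rows = [["document", 4, "NN"], ["extract", 2, "VB"], ["quickly", 1, "RB"]]
for word, count, pos in rows:
    print(word, count, describe_tag(pos))
```

Falling back to the raw tag keeps unknown tags visible in the output instead of silently dropping them.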
Python Crawling & Data Mining
