How to Automate Word Question Extraction with Python: Regex, Openpyxl & Pandas
This article walks through extracting quiz questions, options, and answers from a Word document using Python, demonstrating why the original regex failed and providing three refined solutions with regular expressions, openpyxl, and pandas for reliable automation.
1. Introduction
In a Python community chat, a user asked how to automate the extraction of multiple‑choice questions, options, and answers from a Word document. The user's first script, built on regular expressions, did not print the options and answers as expected.
The original script is reproduced below. The user expected each question to be printed together with its options and answer.
import re
from docx import Document

doc = Document("题库.docx")  # file name taken from the later examples

black_char = re.compile(r"[\s\u3000\xa0]+")
chinese_nums_rule = re.compile(r"[一二三四]、(.+?)\(")
title_rule = re.compile(r"\d+.")
option_rule = re.compile(r"\([ABCDEF]\)")
option_rule_search = re.compile(r"\([ABCDEF]\)[^(]+")
answer_rule = re.compile(r"\([ABCDEF]\)")

# Printed labels: 题目 = question, 选项 = options, 答案 = answer
# Start iterating from the "一、单项选择题" (Part 1: single-choice questions) heading
for paragraph in doc.paragraphs[1:100]:
    # Remove whitespace, convert full-width punctuation to half-width, normalize empty brackets
    line = (
        black_char.sub("", paragraph.text)
        .replace("（", "(").replace("）", ")").replace("．", ".")
        .replace("()", "( )").replace("【", "").replace("】", "")
    )
    # Skip empty lines
    if not line:
        continue
    if title_rule.search(line):
        print("题目", line)
    elif option_rule.search(line):
        print("选项", option_rule_search.findall(line))
    elif answer_rule.search(line):
        # Never reached: answer_rule is identical to option_rule
        print("答案", answer_rule.findall(line))
    else:
        chinese_nums_match = chinese_nums_rule.search(line)
        if chinese_nums_match:
            print("题目", chinese_nums_match.group(1))

2. Implementation
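The script failed because its regular expressions did not match the layout of the document. Both option_rule and answer_rule look for parenthesized labels such as "(A)", while the corrected patterns indicate that the document writes options as "A." and marks answers with "【答案】". The original script also stripped the 【 and 】 brackets and used the same pattern for options and answers, so the answer branch could never run. A corrected version was provided that adjusts the option regex and handles answer detection.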
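The mismatch can be reproduced without a Word file. The sketch below runs the old and the corrected patterns against a single made-up line; the sample text is an assumption about the document's layout, inferred from the corrected patterns rather than taken from the user's file.

```python
import re

# Hypothetical line, assumed to mirror the "A." option layout; not from the real document.
option_line = "A.Python B.Java C.Go D.Rust 【答案】A "

old_option_rule = re.compile(r"\([ABCDEF]\)")    # original pattern: expects "(A)"
new_option_rule = re.compile(r"([A-F]\..+?)\s")  # corrected pattern: expects "A.text "
answer_rule = re.compile(r"【答案】([A-F])")       # corrected answer pattern

old_hits = old_option_rule.findall(option_line)
new_hits = new_option_rule.findall(option_line)
answers = answer_rule.findall(option_line)

print(old_hits)  # the original pattern finds nothing on this line
print(new_hits)  # one entry per option
print(answers)   # the answer letter
```

The trailing space in the sample line matters: the corrected option pattern ends with \s, so the last option is only captured when whitespace follows it. The full corrected script follows.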
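import re
from docx import Document

doc = Document("题库.docx")  # file name taken from the later examples

black_char = re.compile(r"[\s\u3000\xa0]+")
chinese_nums_rule = re.compile(r"[一二三四]、(.+?)\(")
title_rule = re.compile(r"\d+\.")  # escaped dot: a number followed by a literal "."
option_rule = re.compile(r"([A-F]\..+?)\s")
answer_rule = re.compile(r"【答案】([A-F])")

# Printed labels: 题目 = question, 选项 = options, 答案 = answer
for paragraph in doc.paragraphs[1:100]:
    # Collapse whitespace runs to one space and convert full-width punctuation to half-width
    line = (
        black_char.sub(" ", paragraph.text).strip()
        .replace("（", "(").replace("）", ")").replace("．", ".")
        .replace("()", "( )")
    )
    # Skip empty lines (checked before the trailing space is appended)
    if not line:
        continue
    line += " "  # the trailing space lets option_rule capture the last option
    if title_rule.match(line):
        print("题目", line)
    elif option_rule.match(line):
        print("选项", option_rule.findall(line))
        # The answer may sit on the same line as the options
        if "【答案】" in line and answer_rule.search(line):
            print("答案", answer_rule.findall(line))
    elif answer_rule.match(line):
        print("答案", answer_rule.findall(line))
    else:
        chinese_nums_match = chinese_nums_rule.match(line)
        if chinese_nums_match:
            print("题目", chinese_nums_match.group(1))

Using openpyxl, the extracted data can be written directly to an Excel file: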
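from docx import Document
import openpyxl

wb = openpyxl.Workbook()
ws = wb.active
# Header row: question, options 1 to 4, answer
ws.append(['题目', '选项1', '选项2', '选项3', '选项4', '答案'])

doc = Document("题库.docx")
rows = []
# Skip the first paragraph, then collect text until a paragraph contains the answer marker
for paragraph in doc.paragraphs[1:]:
    print([paragraph.text])  # debug output: shows the raw text, including line breaks
    has_answer = '【答案】' in paragraph.text
    # Line breaks inside a paragraph are turned into the separator used for splitting
    text_list = paragraph.text.replace('\n', '\t\t').replace('【答案】', '').split('\t\t')
    rows += text_list
    if has_answer:
        # The answer closes the question: write one spreadsheet row and start over
        ws.append(rows)
        rows = []
wb.save('1.xlsx')

Finally, a pandas solution parses the whole document text with a single regular expression and exports the result to Excel: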
import re
import pandas as pd
from docx import Document

doc = Document("题库.docx")
# Serialize the main document part to XML and strip every tag, leaving only the text
text = re.sub(r'<.*?>', '', doc.part.blob.decode('utf-8'), flags=re.S)
# One match per question: number and stem, options A to D, then the answer letter
df = pd.DataFrame(
    re.findall(r'(\d+\..*?)(A\..*?)(B\..*?)(C\..*?)(D\..*?)【答案】([A-Z])', text),
    columns=['题目', '选项一', '选项二', '选项三', '选项四', '答案']
)
# Trim leading and trailing whitespace in every cell
df = df.replace([r'^\s+', r'\s+$'], '', regex=True)
df.to_excel('题库.xlsx', index=False)  # writing .xlsx requires openpyxl to be installed

All three approaches extract the questions, options, and answers, which confirms that the core issue was the regular‑expression pattern.
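The single‑expression approach can be tried on an in‑memory string before pointing it at a real file. The following is a minimal sketch using only the standard library; sample_text is a hypothetical stand‑in for the tag‑stripped document text, and the names question_rule and records are illustrative.

```python
import re

# Hypothetical text standing in for the tag-stripped document body.
sample_text = (
    "1.Which language does this article use? "
    "A.Python B.Java C.Go D.Rust 【答案】A"
    "2.Which library writes Excel files? "
    "A.re B.openpyxl C.os D.sys 【答案】B"
)

# Same pattern as the pandas example: stem, options A to D, answer letter.
question_rule = re.compile(
    r"(\d+\..*?)(A\..*?)(B\..*?)(C\..*?)(D\..*?)【答案】([A-Z])", re.S
)

# Strip surrounding whitespace from every captured field.
records = [
    tuple(field.strip() for field in match)
    for match in question_rule.findall(sample_text)
]

for record in records:
    print(record)
```

The resulting list of tuples can be passed to pd.DataFrame(records, columns=[...]) as in the example above. The pattern assumes exactly four options labelled A to D and a question stem that contains no "A." sequence of its own; questions that break either assumption would be split incorrectly or skipped.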
3. Conclusion
The article presented a step‑by‑step resolution for a Python automation problem involving Word document parsing. By correcting the regex and leveraging openpyxl or pandas, users can reliably convert quiz content into structured Excel files.
Python Crawling & Data Mining