
How to Automate Word Question Extraction with Python: Regex, Openpyxl & Pandas

This article walks through extracting quiz questions, options, and answers from a Word document using Python, demonstrating why the original regex failed and providing three refined solutions with regular expressions, openpyxl, and pandas for reliable automation.

Python Crawling & Data Mining

Oct 19, 2022

1. Introduction

In a Python community chat, a user asked how to automate the extraction of multiple-choice questions, options, and answers from a Word document. Their first attempt, shown below, used regular expressions on the paragraphs read with python-docx, but its output did not include the options and answers.

The user expected each question to be printed together with its options and answer.

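import re

from docx import Document

doc = Document("题库.docx")  # same file name as in the later examples

# whitespace to remove: ASCII whitespace, ideographic space, no-break space
black_char = re.compile(r"[\s\u3000\xa0]+")

# section headings such as "一、单项选择题(" (Part 1: single-choice questions)
chinese_nums_rule = re.compile(r"[一二三四]、(.+?)\(")
# note: the dot is unescaped, so this matches digits followed by any character
title_rule = re.compile(r"\d+.")
# these three patterns all expect a parenthesized letter such as "(A)"
option_rule = re.compile(r"\([ABCDEF]\)")
option_rule_search = re.compile(r"\([ABCDEF]\)[^(]+")
answer_rule = re.compile(r"\([ABCDEF]\)")

# start iterating from the "一、单项选择题" heading of the Word document
for paragraph in doc.paragraphs[1:100]:
    # remove whitespace, convert full-width punctuation to half-width,
    # pad empty brackets, and drop the 【 】 brackets
    line = (
        black_char.sub("", paragraph.text)
        .replace("（", "(")
        .replace("）", ")")
        .replace("．", ".")
        .replace("()", "(  )")
        .replace("【", "")
        .replace("】", "")
    )
    # skip empty lines
    if not line:
        continue
    if title_rule.search(line):
        print("题目", line)  # 题目 = question
    elif option_rule.search(line):
        print("选项", option_rule_search.findall(line))  # 选项 = options
    elif answer_rule.search(line):
        print("答案", answer_rule.findall(line))  # 答案 = answer
    else:
        chinese_nums_match = chinese_nums_rule.search(line)
        if chinese_nums_match:
            print("题目", chinese_nums_match.group(1))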
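The normalization step at the top of the loop can be tried on its own. The sketch below is a minimal illustration, assuming a made-up sample line; the helper name normalize_line is hypothetical, and it covers only the whitespace removal and the full-width to half-width replacements, not the bracket padding or the removal of 【 】.

```python
import re

# Same whitespace pattern as the script above: ASCII whitespace,
# the ideographic space (U+3000) and the no-break space (U+00A0).
black_char = re.compile(r"[\s\u3000\xa0]+")


def normalize_line(text: str) -> str:
    """Hypothetical helper: remove whitespace and map full-width punctuation to ASCII."""
    line = black_char.sub("", text)
    for full, half in (("（", "("), ("）", ")"), ("．", ".")):
        line = line.replace(full, half)
    return line


# Made-up sample line, not taken from the original document.
sample = "1．\u3000下列选项中 正确的是（\xa0）"
print(normalize_line(sample))  # 1.下列选项中正确的是()
```

After this step every pattern can be written against ASCII brackets and dots only, which is why the script normalizes before matching.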

2. Implementation

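The script failed because its patterns did not match the layout of the document. The original option_rule and answer_rule both look for a parenthesized letter such as (A), whereas the corrected patterns below indicate that options are written as "A." followed by the option text and that answers follow a 【答案】 ("answer") marker. The original script also deleted the 【 and 】 characters before matching, so the marker could not be used to locate the answer. The corrected version adjusts the option regex and detects answers through the marker: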

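import re

from docx import Document

doc = Document("题库.docx")  # same file name as in the later examples

black_char = re.compile(r"[\s\u3000\xa0]+")

chinese_nums_rule = re.compile(r"[一二三四]、(.+?)\(")
title_rule = re.compile(r"\d+\.")            # question number, e.g. "12."
option_rule = re.compile(r"([A-F]\..+?)\s")  # one option, "A.text" up to the next whitespace
answer_rule = re.compile(r"【答案】([A-F])")   # answer marker followed by the letter

for paragraph in doc.paragraphs[1:100]:
    # collapse whitespace runs to a single space and normalize full-width punctuation
    line = (
        black_char.sub(" ", paragraph.text)
        .replace("（", "(")
        .replace("）", ")")
        .replace("．", ".")
        .replace("()", "(  )")
        .strip()
    )
    # skip empty paragraphs
    if not line:
        continue
    # trailing space so that the last option on a line also ends in whitespace
    line += " "
    if title_rule.match(line):
        print("题目", line)  # 题目 = question
    elif option_rule.match(line):
        print("选项", option_rule.findall(line))  # 选项 = options
        # the answer may sit on the same line as the options
        answers = answer_rule.findall(line)
        if answers:
            print("答案", answers)  # 答案 = answer
    elif answer_rule.match(line):
        print("答案", answer_rule.findall(line))
    else:
        chinese_nums_match = chinese_nums_rule.match(line)
        if chinese_nums_match:
            print("题目", chinese_nums_match.group(1))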
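The difference between the two option patterns is easiest to see on a single line. The following sketch assumes made-up sample lines in the "A.text" and 【答案】 layout that the corrected patterns target; the variable names old_option_rule and new_option_rule are introduced here for illustration only.

```python
import re

# Option patterns from the two scripts above.
old_option_rule = re.compile(r"\([ABCDEF]\)[^(]+")  # expects "(A)text"
new_option_rule = re.compile(r"([A-F]\..+?)\s")     # expects "A.text" followed by whitespace
answer_rule = re.compile(r"【答案】([A-F])")

# Made-up sample lines in the assumed layout, each with the trailing
# space that the corrected script appends.
option_line = "A.北京 B.上海 C.广州 D.深圳 "
answer_line = "【答案】A "

old_options = old_option_rule.findall(option_line)
new_options = new_option_rule.findall(option_line)
answers = answer_rule.findall(answer_line)

print(old_options)  # []
print(new_options)  # ['A.北京', 'B.上海', 'C.广州', 'D.深圳']
print(answers)      # ['A']
```

The old pattern finds nothing because the line contains no parentheses, while the new one returns one entry per option.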

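Using openpyxl, the extracted data can be written directly to an Excel file. This version uses no regular expressions: it splits each paragraph on line breaks and double tabs, and starts a new row after the paragraph that contains the 【答案】 marker: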

from docx import Document
import openpyxl

wb = openpyxl.Workbook()
ws = wb.active
# header row: question, options 1-4, answer
ws.append(['题目', '选项1', '选项2', '选项3', '选项4', '答案'])

doc = Document("题库.docx")
row = []
# skip the first paragraph
for paragraph in doc.paragraphs[1:]:
    # treat line breaks like the double-tab separator, then split into cells
    cells = paragraph.text.replace('\n', '\t\t').replace('【答案】', '').split('\t\t')
    row += cells
    # the paragraph carrying the answer marker completes the current question
    if '【答案】' in paragraph.text:
        ws.append(row)
        row = []
wb.save('1.xlsx')
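The row-grouping logic can be checked without a Word file or openpyxl. The sketch below is a minimal illustration under assumptions: the function name paragraphs_to_rows and the sample paragraphs are made up, and unlike the script above it also drops empty fragments left over by the split.

```python
def paragraphs_to_rows(texts):
    """Hypothetical helper: group paragraph strings into one row per question.

    A row is closed by the paragraph that contains the 【答案】 marker.
    """
    rows, current = [], []
    for text in texts:
        # Line breaks become the same separator as the double tabs.
        parts = text.replace("\n", "\t\t").replace("【答案】", "").split("\t\t")
        # Drop empty fragments left over by the split (an addition to the script above).
        current += [part.strip() for part in parts if part.strip()]
        if "【答案】" in text:
            rows.append(current)
            current = []
    return rows


# Made-up paragraphs in the assumed layout.
sample = [
    "1.中国的首都是哪座城市?",
    "A.北京\t\tB.上海\nC.广州\t\tD.深圳",
    "【答案】A",
]
rows = paragraphs_to_rows(sample)
print(rows)
```

Each returned list has the shape that ws.append expects, so the rows could be written to a worksheet one by one.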

Finally, a pandas solution parses the whole document text with a single regular expression and exports the result to Excel:

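import re
import pandas as pd
from docx import Document

doc = Document("题库.docx")
# serialize the main document part to XML and strip every tag, leaving only the text
text = re.sub(r'<.*?>', '', doc.part.blob.decode('utf-8'), flags=re.S)

# one match per question: number and text, options A-D, then the answer letter
df = pd.DataFrame(
    re.findall(r'(\d+\..*?)(A\..*?)(B\..*?)(C\..*?)(D\..*?)【答案】([A-Z])', text),
    columns=['题目', '选项一', '选项二', '选项三', '选项四', '答案']  # question, options 1-4, answer
)
# trim leading and trailing whitespace in every cell
df = df.replace([r'^\s+', r'\s+$'], '', regex=True)
df.to_excel('题库.xlsx', index=False)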
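To see what the single pattern captures, it can be run on a small string with only the re module. This is a sketch under assumptions: the XML fragment is made up and far simpler than real document XML, and the names xml, pattern, and records are introduced here for illustration.

```python
import re

# Made-up XML fragment standing in for the serialized document part.
xml = (
    "<w:p><w:r><w:t>1.中国的首都是哪座城市?</w:t></w:r></w:p>"
    "<w:p><w:r><w:t>A.北京 B.上海 C.广州 D.深圳</w:t></w:r></w:p>"
    "<w:p><w:r><w:t>【答案】A</w:t></w:r></w:p>"
)

# Strip every tag so that only the text content remains.
text = re.sub(r"<.*?>", "", xml, flags=re.S)

# Same pattern as the pandas script: question, options A-D, answer letter.
pattern = re.compile(r"(\d+\..*?)(A\..*?)(B\..*?)(C\..*?)(D\..*?)【答案】([A-Z])")

# Trim each captured field, mirroring the whitespace cleanup done on the DataFrame.
records = [tuple(field.strip() for field in match) for match in pattern.findall(text)]
print(records)
```

The pattern assumes exactly four options labelled A to D, and its lazy groups split at the first "A." they meet, so a question whose own text contains "A." would be cut early. The list of tuples can be passed to pd.DataFrame together with the column names.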

All three approaches extract the question, options, and answer for each item, which confirms that the core issue in the original script was its regular-expression patterns.

3. Conclusion

The article presented a step‑by‑step resolution for a Python automation problem involving Word document parsing. By correcting the regex and leveraging openpyxl or pandas, users can reliably convert quiz content into structured Excel files.
