How to Convert Complex Word Exam Tables into Structured Excel with Python
This article demonstrates how to read a Word document containing exam tables, parse question types, titles, and options using Python, and then transform the data into a structured Excel file with pandas, providing complete code and handling special whitespace characters.
Requirement
There is a Word table in the following format that needs to be converted into an Excel table with a different layout.
The desired Excel format is shown below.
Test Word Document Reading
First, read the first page of the Word document and print each paragraph.
from docx import Document
doc = Document("编号02 质检员高级技师(一级)理论试卷.docx")
for i, paragraph in enumerate(doc.paragraphs[:55]):
print(i, paragraph.text) 0 职业能力评价理论试卷
1 (一级. 质检员)
2 注意事项:1.答卷前将密封线内的项目填写清楚。
3 2.填写答案必须用蓝色(或黑色)钢笔、圆珠笔,不许用铅笔或红笔。
4 3.本份试卷共四道大题,满分100分,考试时间120分钟。
5 一、单项选择题(第1~40题。选择正确的答案,将相应的字母填入题内的括号中。每题1分,满分40分。)
6 1. 关于道德的叙述,正确的是( )。
7 (A)道德中的“应该”与“不应该”因人而异,没有共同道德标准
8 (B)道德是处理人与人之间、人与社会之间关系的特殊行为规范
9 (C)道德是现代文明的产物
10 (D)道德从来没有阶级性
... (remaining lines omitted for brevity)The output shows that each line of text can be retrieved successfully.
Match Question Types, Titles, and Options
We need to match question types, titles, and options based on the following patterns:
Question type starts with a Chinese numeral followed by "、".
Title starts with a regular number followed by a period.
Option starts with a parenthesis and a letter.
Additional notes: The first few lines may also start with a regular number and period and should be excluded. Some questions contain special whitespace characters that need to be removed. Both half‑width and full‑width parentheses and periods appear and must be normalized.
Code to perform the matching:
import re
from docx import Document
black_char = re.compile("[\s\u3000\xa0]+")
chinese_nums_rule = re.compile("[一二三四]、(.+?)\(")
title_rule = re.compile("\d+.")
option_rule = re.compile("\([ABCDEF]\)")
option_rule_search = re.compile("\([ABCDEF]\)[^(]+")
for paragraph in doc.paragraphs[5:]:
line = black_char.sub("", paragraph.text).replace("(", "(").replace(")", ")").replace(".", ".").replace("()", "( )")
if not line:
continue
if title_rule.match(line):
print("题目", line)
elif option_rule.match(line):
print("选项", option_rule_search.findall(line))
else:
chinese_nums_match = chinese_nums_rule.match(line)
if chinese_nums_match:
print("题型", chinese_nums_match.group(1))The test output confirms correct matching.
Save Matched Data to a Structured Dictionary
Design a dictionary to store the extracted data as shown below.
Code to populate the dictionary:
import re
from docx import Document
from collections import OrderedDict
question_type2data = OrderedDict()
for paragraph in doc.paragraphs[5:]:
line = black_char.sub("", paragraph.text).replace("(", "(").replace(")", ")").replace(".", ".").replace("()", "( )")
if not line:
continue
if title_rule.match(line):
options = title2options.setdefault(line, [])
elif option_rule.match(line):
options.extend(option_rule_search.findall(line))
else:
chinese_nums_match = chinese_nums_rule.match(line)
if chinese_nums_match:
question_type = chinese_nums_match.group(1)
title2options = question_type2data.setdefault(question_type, OrderedDict())Traverse the Structured Dictionary and Store to Pandas
Convert the dictionary into a pandas DataFrame and export to Excel.
import pandas as pd
result = []
max_options_len = 0
for question_type, title2options in question_type2data.items():
for title, options in title2options.items():
result.append([question_type, title, *options])
options_len = len(options)
if options_len > max_options_len:
max_options_len = options_len
df = pd.DataFrame(result, columns=["题型", "题目"] + [f"选项{i}" for i in range(1, max_options_len+1)])
# Simplify question type names
df['题型'] = df['题型'].str.replace("选择", "")
print(df.head())Resulting table:
Finally, save the DataFrame to an Excel file:
df.to_excel("result.xlsx", index=False)Complete Code
import pandas as pd
import re
from docx import Document
from collections import OrderedDict
doc = Document("编号02 质检员高级技师(一级)理论试卷.docx")
black_char = re.compile("[\s\u3000\xa0]+")
chinese_nums_rule = re.compile("[一二三四]、(.+?)\(")
title_rule = re.compile("\d+.")
option_rule = re.compile("\([ABCDEF]\)")
option_rule_search = re.compile("\([ABCDEF]\)[^(]+")
question_type2data = OrderedDict()
for paragraph in doc.paragraphs[5:]:
line = black_char.sub("", paragraph.text).replace("(", "(").replace(")", ")").replace(".", ".").replace("()", "( )")
if not line:
continue
if title_rule.match(line):
options = title2options.setdefault(line, [])
elif option_rule.match(line):
options.extend(option_rule_search.findall(line))
else:
chinese_nums_match = chinese_nums_rule.match(line)
if chinese_nums_match:
question_type = chinese_nums_match.group(1)
title2options = question_type2data.setdefault(question_type, OrderedDict())
result = []
max_options_len = 0
for question_type, title2options in question_type2data.items():
for title, options in title2options.items():
result.append([question_type, title, *options])
options_len = len(options)
if options_len > max_options_len:
max_options_len = options_len
df = pd.DataFrame(result, columns=["题型", "题目"] + [f"选项{i}" for i in range(1, max_options_len+1)])
df['题型'] = df['题型'].str.replace("选择", "")
df.to_excel("result.xlsx", index=False)Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
Python Crawling & Data Mining
Life's short, I code in Python. This channel shares Python web crawling, data mining, analysis, processing, visualization, automated testing, DevOps, big data, AI, cloud computing, machine learning tools, resources, news, technical articles, tutorial videos and learning materials. Join us!
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
