Fundamentals 14 min read

How to Convert Complex Word Exam Tables into Structured Excel with Python

This article demonstrates how to read a Word document containing exam tables, parse question types, titles, and options using Python, and then transform the data into a structured Excel file with pandas, providing complete code and handling special whitespace characters.

Python Crawling & Data Mining
Python Crawling & Data Mining
Python Crawling & Data Mining
How to Convert Complex Word Exam Tables into Structured Excel with Python

Requirement

There is a Word table in the following format that needs to be converted into an Excel table with a different layout.

The desired Excel format is shown below.

Test Word Document Reading

First, read the first page of the Word document and print each paragraph.

from docx import Document

doc = Document("编号02 质检员高级技师(一级)理论试卷.docx")
for i, paragraph in enumerate(doc.paragraphs[:55]):
    print(i, paragraph.text)
0 职业能力评价理论试卷
1 (一级. 质检员)
2 注意事项:1.答卷前将密封线内的项目填写清楚。
3 2.填写答案必须用蓝色(或黑色)钢笔、圆珠笔,不许用铅笔或红笔。
4 3.本份试卷共四道大题,满分100分,考试时间120分钟。
5 一、单项选择题(第1~40题。选择正确的答案,将相应的字母填入题内的括号中。每题1分,满分40分。)
6 1. 关于道德的叙述,正确的是(   )。
7 (A)道德中的“应该”与“不应该”因人而异,没有共同道德标准
8 (B)道德是处理人与人之间、人与社会之间关系的特殊行为规范
9 (C)道德是现代文明的产物        
10 (D)道德从来没有阶级性
... (remaining lines omitted for brevity)

The output shows that each line of text can be retrieved successfully.

Match Question Types, Titles, and Options

We need to match question types, titles, and options based on the following patterns:

Question type starts with a Chinese numeral followed by "、".

Title starts with a regular number followed by a period.

Option starts with a parenthesis and a letter.

Additional notes: The first few lines may also start with a regular number and period and should be excluded. Some questions contain special whitespace characters that need to be removed. Both half‑width and full‑width parentheses and periods appear and must be normalized.

Code to perform the matching:

import re
from docx import Document

black_char = re.compile("[\s\u3000\xa0]+")
chinese_nums_rule = re.compile("[一二三四]、(.+?)\(")
title_rule = re.compile("\d+.")
option_rule = re.compile("\([ABCDEF]\)")
option_rule_search = re.compile("\([ABCDEF]\)[^(]+")

for paragraph in doc.paragraphs[5:]:
    line = black_char.sub("", paragraph.text).replace("(", "(").replace(")", ")").replace(".", ".").replace("()", "(  )")
    if not line:
        continue
    if title_rule.match(line):
        print("题目", line)
    elif option_rule.match(line):
        print("选项", option_rule_search.findall(line))
    else:
        chinese_nums_match = chinese_nums_rule.match(line)
        if chinese_nums_match:
            print("题型", chinese_nums_match.group(1))

The test output confirms correct matching.

Save Matched Data to a Structured Dictionary

Design a dictionary to store the extracted data as shown below.

Code to populate the dictionary:

import re
from docx import Document
from collections import OrderedDict

question_type2data = OrderedDict()
for paragraph in doc.paragraphs[5:]:
    line = black_char.sub("", paragraph.text).replace("(", "(").replace(")", ")").replace(".", ".").replace("()", "(  )")
    if not line:
        continue
    if title_rule.match(line):
        options = title2options.setdefault(line, [])
    elif option_rule.match(line):
        options.extend(option_rule_search.findall(line))
    else:
        chinese_nums_match = chinese_nums_rule.match(line)
        if chinese_nums_match:
            question_type = chinese_nums_match.group(1)
            title2options = question_type2data.setdefault(question_type, OrderedDict())

Traverse the Structured Dictionary and Store to Pandas

Convert the dictionary into a pandas DataFrame and export to Excel.

import pandas as pd

result = []
max_options_len = 0
for question_type, title2options in question_type2data.items():
    for title, options in title2options.items():
        result.append([question_type, title, *options])
        options_len = len(options)
        if options_len > max_options_len:
            max_options_len = options_len

df = pd.DataFrame(result, columns=["题型", "题目"] + [f"选项{i}" for i in range(1, max_options_len+1)])
# Simplify question type names
df['题型'] = df['题型'].str.replace("选择", "")
print(df.head())

Resulting table:

Finally, save the DataFrame to an Excel file:

df.to_excel("result.xlsx", index=False)

Complete Code

import pandas as pd
import re
from docx import Document
from collections import OrderedDict

doc = Document("编号02 质检员高级技师(一级)理论试卷.docx")

black_char = re.compile("[\s\u3000\xa0]+")
chinese_nums_rule = re.compile("[一二三四]、(.+?)\(")
title_rule = re.compile("\d+.")
option_rule = re.compile("\([ABCDEF]\)")
option_rule_search = re.compile("\([ABCDEF]\)[^(]+")

question_type2data = OrderedDict()
for paragraph in doc.paragraphs[5:]:
    line = black_char.sub("", paragraph.text).replace("(", "(").replace(")", ")").replace(".", ".").replace("()", "(  )")
    if not line:
        continue
    if title_rule.match(line):
        options = title2options.setdefault(line, [])
    elif option_rule.match(line):
        options.extend(option_rule_search.findall(line))
    else:
        chinese_nums_match = chinese_nums_rule.match(line)
        if chinese_nums_match:
            question_type = chinese_nums_match.group(1)
            title2options = question_type2data.setdefault(question_type, OrderedDict())

result = []
max_options_len = 0
for question_type, title2options in question_type2data.items():
    for title, options in title2options.items():
        result.append([question_type, title, *options])
        options_len = len(options)
        if options_len > max_options_len:
            max_options_len = options_len

df = pd.DataFrame(result, columns=["题型", "题目"] + [f"选项{i}" for i in range(1, max_options_len+1)])
df['题型'] = df['题型'].str.replace("选择", "")

df.to_excel("result.xlsx", index=False)
Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Sign in to view source
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactadmin@besthub.devand we will review it promptly.

Pythondocx
Python Crawling & Data Mining
Written by

Python Crawling & Data Mining

Life's short, I code in Python. This channel shares Python web crawling, data mining, analysis, processing, visualization, automated testing, DevOps, big data, AI, cloud computing, machine learning tools, resources, news, technical articles, tutorial videos and learning materials. Join us!

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.