
How to Automate Extraction of Checkmarks from Word Docs with Python

Learn how to automate the extraction of answer symbols (√ and ×) from Word documents using Python, with step-by-step code examples employing docx, pandas, regex, and openpyxl to generate Excel files, and explore multiple community-contributed solutions.

Python Crawling & Data Mining

1. Introduction

The author received a request in a Python community group to automatically extract answer symbols (√ and ×) from a Word document and output them to an Excel file. The desired result is a table where each question number is paired with its corresponding symbol.

2. Implementation Process

Several community members shared workable solutions.

Solution A – Basic docx + pandas

import re
from docx import Document
import pandas as pd

document = Document("判断(括号处理)(1).docx")
all_paragraphs = document.paragraphs

# Keep only the paragraphs that contain an answer symbol
data = [paragraph.text for paragraph in all_paragraphs if '√' in paragraph.text or '×' in paragraph.text]
data = ''.join(data)
# Collect every symbol in document order, then number them from 1
res = re.findall('[√×]', data)
res = [f'{k}.{v}' for k, v in enumerate(res, start=1)]
df = pd.DataFrame(res)
# Write only the data column, with no index and no header row
df.to_excel('test9-13.xlsx', index=False, header=False)

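This script reads the Word file, keeps the paragraphs that contain a symbol, concatenates them, numbers each symbol by its order of appearance, and writes the result to an Excel file. The numbers come from that order, not from the question numbers printed in the document.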
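The numbering step can be tried without a Word file. The following is a minimal sketch, assuming a hypothetical list of paragraph strings in place of document.paragraphs; the function name number_symbols and the sample text are illustrative and not part of the original solution.

```python
import re

def number_symbols(paragraphs):
    """Number every √/× by order of appearance, mirroring Solution A."""
    # Keep only paragraphs containing a symbol, then merge them into one string
    joined = ''.join(p for p in paragraphs if '√' in p or '×' in p)
    symbols = re.findall('[√×]', joined)
    return [f'{i}.{s}' for i, s in enumerate(symbols, start=1)]

# Hypothetical paragraph texts standing in for document.paragraphs
sample = [
    '1. Water boils at 100 degrees Celsius at sea level. (√)',
    'Answer the following:',
    '2. The Earth is flat. (×)',
    '3. Python is a compiled-only language. (×)',
]
numbered = number_symbols(sample)
print(numbered)  # ['1.√', '2.×', '3.×']
```

Because the numbers are generated by enumerate, a question whose answer is missing would shift every later number down by one, so this approach fits documents where every question has exactly one symbol.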

Solution B – Simplified version

import re
from docx import Document
import pandas as pd

document = Document(r"判断(括号处理)(1).docx")
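# Serialise the main document part to XML and decode it to a string
text = document.part.blob.decode('utf-8')

# Strip the XML tags, leaving only the text content
text = re.sub(r'<.*?>', '', text)
# Remove whitespace after a period so "1. √" becomes "1.√"
text = re.sub(r'\.\s+', '.', text)
df = pd.DataFrame(re.findall(r'\d+\.[√×]', text))
df.to_excel('result.xlsx', header=False, index=False)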

This approach takes the XML of the main document part, strips the XML tags, removes whitespace after periods, then extracts patterns like "1.√" directly.
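To see why the tag stripping matters, here is a minimal sketch that runs the same three regular expressions on a hypothetical XML fragment. The fragment only imitates the shape of Word's markup, with the number and the symbol in separate runs; the helper name extract_numbered is an assumption for illustration.

```python
import re

# Hypothetical fragment imitating Word XML: number and symbol sit in separate runs
xml = (
    '<w:p><w:r><w:t>1. </w:t></w:r><w:r><w:t>√</w:t></w:r></w:p>'
    '<w:p><w:r><w:t>2.</w:t></w:r><w:r><w:t>×</w:t></w:r></w:p>'
)

def extract_numbered(xml_text):
    text = re.sub(r'<.*?>', '', xml_text)  # drop every tag, keep the text nodes
    text = re.sub(r'\.\s+', '.', text)     # "1. √" becomes "1.√"
    return re.findall(r'\d+\.[√×]', text)

print(extract_numbered(xml))  # ['1.√', '2.×']
```

One thing to watch: once the tags are gone, paragraph boundaries disappear too, so a paragraph ending in digits can fuse with the next question number. Checking the extracted numbers against the expected sequence is a cheap safeguard.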

Solution C – Refined regex handling

import re
from docx import Document
import pandas as pd

# Same setup as Solution A
document = Document("判断(括号处理)(1).docx")
all_paragraphs = document.paragraphs

data = [paragraph.text for paragraph in all_paragraphs if '√' in paragraph.text or '×' in paragraph.text]
# Merge into a long string and remove spaces
data = ''.join(data).replace(' ', '')
# Use regex to extract numbered answers such as "1.√"
res = re.findall(r'\d+\.[√×]', data)
df = pd.DataFrame(res)
df.to_excel('test9-13.xlsx', index=False, header=False)

By stripping spaces before applying the regular expression, this version avoids mismatches caused by stray whitespace.
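The effect of removing spaces first can be checked on a plain string. This is a minimal sketch under assumed input: the sample text and the helper parse_answers are hypothetical, and the capture groups are an addition that splits each match into a (number, symbol) pair, which maps naturally onto a two-column table.

```python
import re

def parse_answers(text):
    """Return (question_number, symbol) pairs found in text."""
    compact = text.replace(' ', '')  # "2 .×" becomes "2.×"
    return [(int(n), s) for n, s in re.findall(r'(\d+)\.([√×])', compact)]

# Hypothetical answer line with irregular spacing
sample = '1. √  2 .×   3.  √ 10 . ×'
print(parse_answers(sample))  # [(1, '√'), (2, '×'), (3, '√'), (10, '×')]
```

Unlike numbering by order of appearance, this keeps the question numbers that appear in the text. Note that replace(' ', '') removes only ordinary spaces; tabs or full-width spaces would need separate handling.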

Solution D – Using openpyxl for full control

import re
import docx
import openpyxl

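def str_work(string: str):
    # Drop spaces, newlines and question numbers, then split on '.' to get the symbols
    cleaned = re.sub(r'\d+', '', string.replace(' ', '').replace('\n', ''))
    return [*filter(None, re.split(r'\.', cleaned))]

wb = openpyxl.Workbook()
ws = wb.active
ws.append(['题目', '答案'])  # header row: "Question", "Answer"

doc = docx.Document(r'C:\Users\Administrator\Desktop\判断(括号处理).docx')
# Skip the first three paragraphs and join the rest with newlines
doc_text = '\n'.join(i.text for i in doc.paragraphs[3:])

# Split at the "一、判断题" ("I. True/False questions") heading
doc_list = doc_text.split('\n一、判断题')
# Non-empty lines before the heading become the question column
title_row = [i.strip() for i in doc_list[0].split('\n') if i.strip()]
# Symbols parsed from the text after the heading become the answer column
answer_row = str_work(doc_list[1])
for i in zip(title_row, answer_row):
    ws.append(list(i))

wb.save('1.xlsx')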

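This script pairs each non-empty line before the "一、判断题" heading with an answer symbol parsed from the text after it, and builds the Excel file manually with openpyxl, giving fine‑grained control over column headers and row insertion.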
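The parsing helper can be exercised on its own. The sketch below restates str_work and runs it on a hypothetical answer string, then pairs the result with hypothetical question titles; none of the sample data comes from the original document.

```python
import re

def str_work(string: str):
    # Remove spaces and newlines, then the question numbers, then split on '.'
    cleaned = re.sub(r'\d+', '', string.replace(' ', '').replace('\n', ''))
    return [*filter(None, re.split(r'\.', cleaned))]

# Hypothetical answer-key text and question titles
answer_text = '1.√ 2.×\n3.√'
titles = ['Question one', 'Question two', 'Question three', 'Question four']

answers = str_work(answer_text)
rows = list(zip(titles, answers))  # zip stops at the shorter list
print(answers)  # ['√', '×', '√']
print(rows)
```

Since zip silently truncates to the shorter input, comparing len(titles) with len(answers) before writing the rows would reveal a question that has no matching answer.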

3. Summary

The article demonstrates several Python‑based methods to automate the extraction of answer symbols from Word documents and export them to Excel, showcasing the use of python-docx, pandas, regular expressions, and openpyxl. These solutions help community members quickly solve similar automation tasks.
