
Automate Word Table Extraction to Excel with Python: A Step-by-Step Guide

This guide demonstrates how to programmatically pull the date, title, and document number fields from hundreds of similarly structured Word tables, normalize the dates, and export the data to an Excel spreadsheet using the python-docx package (imported as docx), the standard-library datetime module, and openpyxl.

Python Crawling & Data Mining

In this tutorial we show how to extract specific fields (date, title, document number) from hundreds of similarly formatted tables in a Word document and write them to an Excel spreadsheet.

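First we import the required libraries. The docx module comes from the third-party python-docx package and openpyxl is also installed separately; datetime is part of the standard library: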

# import needed libraries
from docx import Document
import datetime
from openpyxl import Workbook

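We load the Word file, iterate over all of its tables, and treat every three consecutive rows as one entry. From each entry we read the date (first row, second column), the title (second row, second column), and the document number (first row, fourth column). Dates appear in the tables as "day/month" and are converted to "YYYY-MM-DD" using datetime.datetime.strptime and strftime; the tables carry no year, so the code hardcodes 2020. Any date cell that does not contain a "/" (including an empty cell) is replaced with "-".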
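The date conversion can be isolated in a small helper, which makes it easier to check on its own. The sketch below is not from the original article: the name normalize_date and the year parameter are hypothetical, and returning "-" for an unparsable date is a design choice made here. It parses the year together with the day and month, because strptime with '%d/%m' alone assumes the year 1900 and therefore rejects 29/02.

```python
import datetime


def normalize_date(text, year=2020):
    """Convert a 'day/month' string to 'YYYY-MM-DD'.

    Hypothetical helper: returns '-' when the text has no '/' or
    does not describe a real calendar date in the given year.
    """
    text = text.strip()
    if '/' not in text:
        return '-'
    try:
        # Include the year while parsing so 29/2 is accepted in leap years.
        parsed = datetime.datetime.strptime(f'{text}/{year}', '%d/%m/%Y')
    except ValueError:
        return '-'
    return parsed.strftime('%Y-%m-%d')
```

With this helper, the date handling inside the loop reduces to a single call such as date = normalize_date(cell_text), where cell_text stands for the text of the date cell.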

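Example extraction loop:

# example loop
# assumes `doc` is the Document loaded from the Word file and `sheet` is the
# worksheet from the workbook step (it must exist before this loop runs)
tables = doc.tables
n = 0
for table in tables:
    # each entry spans three rows; stop before a trailing partial entry
    for i in range(0, len(table.rows) - 2, 3):
        try:
            date = table.cell(i, 1).text.strip()
            if '/' in date:
                # parse together with the year so that 29/02 is valid (2020 is a leap year)
                date = datetime.datetime.strptime(date + '/2020', '%d/%m/%Y').strftime('%Y-%m-%d')
            else:
                date = '-'
            title = table.cell(i + 1, 1).text.strip()
            dfn = table.cell(i, 3).text.strip()
            n += 1
            # columns: serial number, date, (blank), title, document number, (blank)
            row = [n, date, ' ', title, dfn, ' ']
            sheet.append(row)
        except Exception as error:
            # report the problem entry and move on to the next one
            print(error)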
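The three-rows-per-entry indexing is the part most likely to go wrong, so it can help to try it on plain lists before running it against a real document. The sketch below is an illustration only: extract_entries is a hypothetical name, and grid is a stand-in for a table (a list of rows, each a list of cell strings) rather than a python-docx object. The cell positions follow the loop above.

```python
def extract_entries(grid):
    """Return (date_text, title, doc_number) for each complete three-row entry.

    `grid` is a hypothetical stand-in for a Word table: a list of rows,
    where each row is a list of cell strings.
    """
    entries = []
    # The upper bound len(grid) - 2 skips a trailing group of fewer than 3 rows.
    for i in range(0, len(grid) - 2, 3):
        date_text = grid[i][1].strip()      # first row of the entry, second column
        doc_number = grid[i][3].strip()     # first row of the entry, fourth column
        title = grid[i + 1][1].strip()      # second row of the entry, second column
        entries.append((date_text, title, doc_number))
    return entries


# Made-up sample data: two complete entries followed by one stray row.
sample = [
    ['a', '5/4', 'b', 'No. 12'],
    ['c', ' Notice A ', '', ''],
    ['', '', '', ''],
    ['a', '', 'b', 'No. 13'],
    ['c', 'Notice B', '', ''],
    ['', '', '', ''],
    ['stray', 'row', '', ''],
]
entries = extract_entries(sample)
```

Bounding the range by the row count means an incomplete final group is skipped silently instead of surfacing as an exception inside the loop.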

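The rows are written to an Excel workbook whose first row holds the headers

['序号','收文时间','办文编号','文件标题','文号','备注']

(serial number, date received, processing number, document title, document number, remarks). Because the loop appends to sheet, the workbook and its header row must be created before the loop runs. Once every extracted row has been appended, we save the workbook to a file.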

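# create the workbook and header row (run this before the extraction loop)
wb = Workbook()
sheet = wb.active
# serial number, date received, processing number, document title, document number, remarks
header = ['序号','收文时间','办文编号','文件标题','文号','备注']
sheet.append(header)
# ... (run the extraction loop here so each row is appended to sheet)
# save once all rows have been appended
wb.save(r'C:\Users\20200420.xlsx')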
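The code above writes plain cells. If light formatting is wanted, a minimal sketch with openpyxl might look like the following. The sample row, the bold header, the column width, and the temporary output path are all assumptions made for illustration; the article itself saves to a fixed Windows path. Reading the file back at the end is a quick way to confirm what was written.

```python
import os
import tempfile

from openpyxl import Workbook, load_workbook
from openpyxl.styles import Font

header = ['序号', '收文时间', '办文编号', '文件标题', '文号', '备注']

wb = Workbook()
sheet = wb.active
sheet.append(header)
# Illustrative row only; real rows come from the extraction loop.
sheet.append([1, '2020-04-05', ' ', 'Sample title', 'No. 12', ' '])

# Bold the header row and widen the title column (column D).
for cell in sheet[1]:
    cell.font = Font(bold=True)
sheet.column_dimensions['D'].width = 40

# Save to a temporary location (an assumption made for this example).
out_path = os.path.join(tempfile.mkdtemp(), 'example.xlsx')
wb.save(out_path)

# Load the saved file again and collect its rows as plain tuples.
rows = list(load_workbook(out_path).active.iter_rows(values_only=True))
```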
Written by

Python Crawling & Data Mining
