Automate Excel Data Extraction with Python openpyxl: From Single to 1000 Files
This tutorial shows how to use Python's openpyxl (and optionally glob) to extract rows with purchase counts over 50 from an Excel sheet, create a new workbook, and scale the solution to batch‑process a thousand similar files automatically.
Introduction
Hello everyone, welcome to the Python office automation series. In this article we demonstrate a simple case: using openpyxl to extract specific data from an Excel file, generate a new file, and then automate the process for multiple files.
Learning Points
Using the openpyxl module
Using the glob module to create batch processes
Data Source
The example uses the e‑commerce baby dataset from Alibaba Cloud Tianchi, which can be downloaded as an Excel file.
Requirement Description
Basic difficulty: From 电商婴儿数据.xlsx extract rows where the purchase count (column F, field buy_mount) exceeds 50, and write those rows into a new Excel workbook.
Advanced difficulty: In a folder containing 1000 Excel files named 电商婴儿数据1.xlsx … 电商婴儿数据1000.xlsx, extract all rows meeting the same condition from each file and aggregate them into a single new workbook.
Python Implementation
First, handle the basic case for a single file:
from openpyxl import load_workbook, Workbook
# Path to the folder containing the data file
path = 'C:/Users/xxxxxx'
# Open the workbook
workbook = load_workbook(path + '/' + '电商婴儿数据.xlsx')
sheet = workbook.active
# Select column F (purchase count)
buy_mount = sheet['F']
row_lst = []
for cell in buy_mount:
if isinstance(cell.value, int) and cell.value > 50:
row_lst.append(cell.row)
# Create a new workbook and copy the header
new_workbook = Workbook()
new_sheet = new_workbook.active
header = sheet[1]
header_lst = [cell.value for cell in header]
new_sheet.append(header_lst)
# Copy the qualifying rows
for row in row_lst:
data_lst = [cell.value for cell in sheet[row]]
new_sheet.append(data_lst)
# Save the result
new_workbook.save(path + '/' + '符合筛选条件的新表.xlsx')To scale up to 1000 files, use glob to iterate over all .xlsx files and reuse the above logic, creating the new workbook only once and adding the header a single time:
from openpyxl import load_workbook, Workbook
import glob
path = 'C:/Users/xxxxxx'
new_workbook = Workbook()
new_sheet = new_workbook.active
header_added = False
for file in glob.glob(path + '/*.xlsx'):
workbook = load_workbook(file)
sheet = workbook.active
buy_mount = sheet['F']
row_lst = []
for cell in buy_mount:
if isinstance(cell.value, int) and cell.value > 50:
row_lst.append(cell.row)
if not header_added:
header = sheet[1]
new_sheet.append([cell.value for cell in header])
header_added = True
for row in row_lst:
new_sheet.append([cell.value for cell in sheet[row]])
new_workbook.save(path + '/' + '符合筛选条件的新表.xlsx')Conclusion
The complete process shows how to batch‑extract specific data from Excel files using Python. While pandas could achieve the same with fewer lines, openpyxl offers finer control over Excel features and is used throughout this automation series. Mastering such batch operations frees your hands from repetitive manual work.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
Python Crawling & Data Mining
Life's short, I code in Python. This channel shares Python web crawling, data mining, analysis, processing, visualization, automated testing, DevOps, big data, AI, cloud computing, machine learning tools, resources, news, technical articles, tutorial videos and learning materials. Join us!
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
