Batch Extract Unified Social Credit Codes from PDFs Using Python
This article demonstrates how to use Python's pdfminer library and regular expressions to automatically extract unified social credit codes from multiple PDF files, providing step‑by‑step code for single‑file extraction and scalable batch processing.
Preface
While helping a follower batch‑extract target information from PDF files, I realized the solution could be useful to many others, so I am sharing it here.
Requirement Clarification
The task involves dozens of PDF files, each containing a unified social credit code. The goal is to extract the code (a mix of numbers and letters) from every file.
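As a rough illustration of what "a mix of numbers and letters" means in practice, the sketch below checks whether a string has the shape of a unified social credit code. It assumes the 18‑character format with digits and uppercase letters excluding I, O, S, V and Z; that format is not stated in this article, and the helper name looks_like_credit_code is hypothetical. It checks shape only and does not verify the check digit.

```python
import re

# Assumption: 18 characters drawn from digits and uppercase letters,
# with I, O, S, V and Z excluded. This is a shape check only.
CODE_SHAPE = re.compile(r'[0-9A-HJ-NPQRTUWXY]{18}')

def looks_like_credit_code(value):
    """Return True if value has the assumed 18-character shape."""
    return CODE_SHAPE.fullmatch(value.strip()) is not None
```

A check like this can be applied to whatever the extraction regex captures, to flag files where the match picked up stray text instead of a code.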
Implementation Process
First, use pdfminer (its high_level module ships with the pdfminer.six package) to read the text of a PDF, then use a regular expression to locate the credit code. The single‑file extraction code is:
from pdfminer import high_level
import re
text = high_level.extract_text('1.pdf')  # extract all text from the PDF
# Match the label "统一社会信用代码:" (unified social credit code),
# then lazily capture everything up to the end of that line
regex = r'统一社会信用代码:(.*?)\n'
credit_codes = re.findall(regex, text)
print(credit_codes)
After confirming the single‑file approach works, the batch processing script iterates over all PDF files in a directory, extracts the text, applies the same regex, and prints each code:
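from pdfminer import high_level
import re
import os
# Same pattern as the single-file version: label, then the rest of the line
regex = r'统一社会信用代码:(.*?)\n'
# Walk the current directory and all of its subdirectories
for root, dirs, files in os.walk('./'):
    for f in files:
        file_name = os.path.join(root, f)
        if file_name.endswith('.pdf'):
            text = high_level.extract_text(file_name)
            credit_codes = re.findall(regex, text)
            if credit_codes:  # avoid an IndexError when a file has no match
                print(credit_codes[0])
            else:
                print(file_name, 'no credit code found')
Running the script prints the extracted code for each PDF. The original post showed a screenshot of the output at this point.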
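The batch loop can also be wrapped in a function that returns its results instead of printing them. The following is a minimal sketch, not the author's code: the names find_code, read_pdf_text and extract_codes are hypothetical, and the pattern is a variant that assumes the colon may be ASCII or full‑width and that the code contains no whitespace. The PDF reader is passed in as a parameter so the matching logic can be tried on plain strings.

```python
import re
from pathlib import Path

# Assumption: the label is followed by an ASCII or full-width colon,
# and the code itself is a run of non-whitespace characters.
LABEL_PATTERN = re.compile(r'统一社会信用代码\s*[:：]\s*(\S+)')

def find_code(text):
    """Return the first code found after the label, or None."""
    match = LABEL_PATTERN.search(text)
    return match.group(1) if match else None

def read_pdf_text(path):
    # Imported lazily so find_code can be used without pdfminer installed
    from pdfminer import high_level
    return high_level.extract_text(path)

def extract_codes(folder, read_text=read_pdf_text):
    """Map each PDF path (relative to folder) to its code, or None."""
    results = {}
    for pdf_path in sorted(Path(folder).rglob('*.pdf')):
        relative_name = str(pdf_path.relative_to(folder))
        results[relative_name] = find_code(read_text(str(pdf_path)))
    return results
```

Calling extract_codes('./') would then give a dictionary that can be printed, written to a CSV file, or checked for None entries to see which PDFs need manual review.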
Conclusion
This guide provides a practical example of batch‑extracting specific information from PDF documents using Python, pdfminer, and regular expressions, enabling readers to adapt the approach for similar data‑extraction tasks.
Python Crawling & Data Mining
