Batch Extract Unified Social Credit Codes from PDFs Using Python

This article demonstrates how to use Python's pdfminer library and regular expressions to automatically extract unified social credit codes from multiple PDF files, providing step‑by‑step code for single‑file extraction and scalable batch processing.

Python Crawling & Data Mining

Nov 28, 2022

Preface

A follower recently asked me for help with a simple task: batch-extracting a specific piece of information from a set of PDF files. The solution seemed useful to others as well, so I am sharing it here.

Requirement Clarification

The task involves dozens of PDF files, each containing a unified social credit code. The goal is to extract that code, an 18-character string of digits and uppercase letters, from every file.
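The shape of the code can be pinned down more precisely with a pattern. The sketch below assumes the 18-character layout defined in GB 32100-2015, which uses digits and uppercase letters other than I, O, S, V and Z. The names `USCC_PATTERN` and `looks_like_uscc` are illustrative, and the check covers the format only, not the check digit.

```python
import re

# Assumption: 18 characters drawn from digits and the uppercase letters
# allowed by GB 32100-2015 (I, O, S, V and Z are excluded).
USCC_PATTERN = re.compile(r'[0-9A-HJ-NPQRTUWXY]{18}')


def looks_like_uscc(value: str) -> bool:
    """Return True if value has the shape of a unified social credit code."""
    return USCC_PATTERN.fullmatch(value.strip()) is not None


# Sample strings used only to illustrate the shape check.
print(looks_like_uscc('91350100M000100Y43'))
print(looks_like_uscc('9135-0100'))
```

A shape check like this can help filter out false matches when the text extracted from a PDF is noisy.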

Implementation Process

First, use pdfminer (the high_level module used here ships with the pdfminer.six package) to extract the text of a PDF, then a regular expression to locate the credit code. The single‑file extraction code is:

from pdfminer import high_level
import re

text = high_level.extract_text('1.pdf')  # extract all text from the PDF
# Capture everything between the label "统一社会信用代码：" and the end of that line
regex = r'统一社会信用代码：(.*?)\n'
credit_codes = re.findall(regex, text)
print(credit_codes)
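The pattern above depends on the label being followed by a full-width colon and on the code ending exactly at a newline. A slightly more tolerant variant, sketched below, accepts either colon style and optional whitespace, and returns `None` when nothing matches. The helper name `find_credit_code` and the 18-character assumption are mine, not the original author's.

```python
import re
from typing import Optional

# Label, then a full-width or ASCII colon, optional whitespace, then 18
# digits/uppercase letters (assumed code length; adjust for your documents).
LABELLED_CODE = re.compile(r'统一社会信用代码\s*[：:]\s*([0-9A-Z]{18})')


def find_credit_code(text: str) -> Optional[str]:
    """Return the first labelled credit code in text, or None if absent."""
    match = LABELLED_CODE.search(text)
    return match.group(1) if match else None


# Stand-in for text returned by high_level.extract_text().
sample = 'Company: Example Co.\n统一社会信用代码: 91350100M000100Y43 \n'
print(find_credit_code(sample))  # 91350100M000100Y43
```

Returning `None` rather than an empty list makes the "not found" case explicit for the caller.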

After confirming the single‑file approach works, the batch processing script iterates over all PDF files in a directory, extracts the text, applies the same regex, and prints each code:

from pdfminer import high_level
import re
import os

# Same pattern as the single-file version: capture the text after the label
regex = r'统一社会信用代码：(.*?)\n'

for root, dirs, files in os.walk('./'):
    for f in files:
        file_name = os.path.join(root, f)
        if file_name.endswith('.pdf'):
            text = high_level.extract_text(file_name)  # extract text from this PDF
            credit_codes = re.findall(regex, text)
            if credit_codes:
                print(credit_codes[0])  # first match in the file
            else:
                print(f'{file_name}: no credit code found')

Running the script prints one extracted code per PDF; the original post illustrates this with a screenshot of the console output.
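When there are dozens of files, it is often more convenient to keep the results than to print them. The following is a minimal sketch of that idea, not the author's script: `collect_codes` and `write_csv` are hypothetical helpers, the pattern assumes an 18-character code after the label, and the `extract_text` parameter exists only so the function can be tried with a stand-in extractor.

```python
import csv
import re
from pathlib import Path

# Assumed pattern: label, either colon style, then 18 digits/uppercase letters.
LABELLED_CODE = re.compile(r'统一社会信用代码\s*[：:]\s*([0-9A-Z]{18})')


def collect_codes(folder, extract_text=None):
    """Map each PDF path under folder to its first credit code (or None)."""
    if extract_text is None:
        # Imported lazily so pdfminer is only needed when it is actually used.
        from pdfminer.high_level import extract_text
    results = {}
    for pdf_path in sorted(Path(folder).rglob('*.pdf')):
        match = LABELLED_CODE.search(extract_text(str(pdf_path)))
        results[str(pdf_path)] = match.group(1) if match else None
    return results


def write_csv(results, csv_path):
    """Write file/code pairs to a CSV file, leaving the code blank when missing."""
    with open(csv_path, 'w', newline='', encoding='utf-8') as fh:
        writer = csv.writer(fh)
        writer.writerow(['file', 'credit_code'])
        for file_name, code in results.items():
            writer.writerow([file_name, code or ''])
```

With pdfminer.six installed, a call such as `write_csv(collect_codes('./'), 'credit_codes.csv')` would walk the current directory and save one row per PDF. Files without a match appear with an empty code, which makes them easy to review by hand.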

Conclusion

This guide provides a practical example of batch‑extracting specific information from PDF documents using Python, pdfminer, and regular expressions, enabling readers to adapt the approach for similar data‑extraction tasks.
