Extract Province Information from Chinese Addresses Using Python

This article walks through a Python-based solution for parsing a list of Chinese address strings, extracting the province name from each one, and grouping the records by province. It also shows regular-expression and pandas variants, with full source code for each.

Python Crawling & Data Mining

Mar 14, 2022

Introduction

The author, a Python enthusiast, shares a practical problem encountered in a chat group: extracting province or municipality information from a collection of delivery address strings.

Problem Statement

Given a list where each element contains a name and a full address, the goal is to separate the address into its province component and group the original records by that province.

Approach

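The solution reads the address list, takes the first two characters of each address as the province name, removes duplicates, and then builds a dictionary that maps each province to the list of matching records. A two-character slice is too short for 黑龙江 and 内蒙古, the only province-level names whose short form has three characters; the pandas one-liner further down accounts for them.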

Solution Code

# coding: utf-8

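def sp(s):
    citys = []
    dizhi = []  # filled below but never read afterwards
    dice = {}
    dic = {}
    for i in s:
        a = i[1]  # each record is [name, address]
        city = a[0:2]  # first two characters of the address, e.g. '北京'
        zlib = a[0:2]
        citys.append(city)
        dizhi.append(zlib)
    cityss = set(citys)  # remove duplicates
    citysss = list(cityss)  # convert to list
    d = dice.fromkeys(citysss)  # one key per distinct prefix
    for key in d:
        h = []
        for j in s:
            b = j[1]
            lgezi = b[0:2]
            if lgezi == key:
                h.append(j)
        dic[key] = h  # store the group once all records have been checked
    for key in dic:
        print(key, dic[key])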

if __name__ == '__main__':
    sp([
        ['王*龙', '北京市海淀区苏州街大恒科技大厦南座4层'],
        ['郭*峰', '河南省商丘市高新技术开发区恒宇食品厂'],
        ['赵*生', '河北省唐山市朝阳道与学院路路口融通大厦2408室'],
        # ... (additional records omitted for brevity)
    ])

Improved Version

# coding: utf-8

def sp(text):
    """Group records by the first two characters of the address and print each group."""
    dic = {}
    # The address is the last field of each record.
    address = [info[-1] for info in text]
    # Deduplicate the two-character prefixes.
    cities = list(set(addr[0:2] for addr in address))
    for key in cities:
        h = []
        for info in text:
            if info[-1][0:2] == key:
                h.append(info)
        dic[key] = h  # store the group once it is complete
    for key in dic:
        print(key, dic[key])

if __name__ == '__main__':
    sp([
        ['王*龙', '北京市海淀区苏州街大恒科技大厦南座4层'],
        ['柴*虎', '北京市昌平区北七家镇顺玮阁小区'],
        ['韩*', '辽宁省葫芦岛市小庄子乡宝仓村'],
        # ... (additional records omitted for brevity)
    ])
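
Both versions above rescan the whole list once per province. A single pass is enough if each record is appended to its group as it is read. The sketch below is one way to do that and is not the author's code: `group_by_province` is a hypothetical name, and it keeps the same two-character assumption about the address prefix.

```python
from collections import defaultdict


def group_by_province(records, width=2):
    """Group [name, address] records by the first `width` characters of the address."""
    groups = defaultdict(list)
    for record in records:
        address = record[-1]  # the address is the last field
        groups[address[:width]].append(record)
    return dict(groups)


sample = [
    ['王*龙', '北京市海淀区苏州街大恒科技大厦南座4层'],
    ['柴*虎', '北京市昌平区北七家镇顺玮阁小区'],
    ['韩*', '辽宁省葫芦岛市小庄子乡宝仓村'],
]
grouped = group_by_province(sample)
for province, rows in grouped.items():
    print(province, rows)
```

Returning the dictionary instead of printing inside the function leaves the caller free to sort, count, or export the groups.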

Regex Alternative

import re

# Compile once, outside the loop; each record looks like ['name', 'address'].
pattern = re.compile(r"\['(?P<name>.*?)', '(?P<address>.*?)'\]", re.S)

with open("地址信息.txt", 'r', encoding='utf-8') as f:
    for line in f:
        for match in pattern.finditer(line):
            name = match.group('name')
            address = match.group('address')
            print(name, address)
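
The snippet above depends on a file that is not included in the article. To try the pattern without it, the following sketch runs the same regular expression over an in-memory string. The layout of `text` is an assumption about what 地址信息.txt contains (one `['name', 'address']` record per line), and the two-character slice is the same simplification used earlier.

```python
import re

# Same pattern as in the article, compiled once.
RECORD = re.compile(r"\['(?P<name>.*?)', '(?P<address>.*?)'\]")

# Stand-in for the file contents; the real file layout is assumed, not known.
text = """['王*龙', '北京市海淀区苏州街大恒科技大厦南座4层'],
['韩*', '辽宁省葫芦岛市小庄子乡宝仓村'],
"""

records = [(m.group('name'), m.group('address')) for m in RECORD.finditer(text)]
for name, address in records:
    print(name, address[:2])
```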

Using Pandas for Province Extraction

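# df is assumed to exist already; `s in (...)` is an exact match, so it adds 1 (keeping 3 characters) only when 地区 holds the bare name 黑龙江省 or 内蒙古自治区.
df['地区2'] = df.地区.apply(lambda s: s[:(s in ("黑龙江省", "内蒙古自治区")) + 2])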
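
The one-liner works because `bool` is a subclass of `int` in Python, so `True + 2` evaluates to `3`. Its membership test compares whole strings, though, so a full address such as the ones in the earlier lists would always fall back to two characters. The sketch below shows a prefix-based variant in plain Python. `short_name` and `THREE_CHAR` are hypothetical names introduced here, not part of the original article.

```python
# The only province-level short names longer than two characters.
THREE_CHAR = ("黑龙江", "内蒙古")


def short_name(region):
    """Return the short province name for a bare name or a full address."""
    # str.startswith accepts a tuple; True counts as 1, widening the slice from 2 to 3.
    return region[:region.startswith(THREE_CHAR) + 2]


print(short_name("黑龙江省"))
print(short_name("内蒙古自治区"))
print(short_name("河南省商丘市高新技术开发区恒宇食品厂"))
```

With pandas, the same function could presumably be passed to `Series.apply` in place of the lambda, assuming the column holds strings.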

Conclusion

The article demonstrates how basic Python constructs (lists, dictionaries, loops, and simple string slicing) can be combined to group address records by province. It also shows two alternatives: a regular expression for pulling names and addresses out of a text file, and a pandas one-liner that handles the three-character province names.
