Extract Province Information from Chinese Addresses Using Python
This article walks through a Python solution for parsing a list of Chinese address strings: it extracts a province prefix from each address and groups the records by province. It also shows how regular expressions and pandas can be used for the same data, with source code for each approach.
Introduction
The author, a Python enthusiast, shares a practical problem encountered in a chat group: extracting province or municipality information from a collection of delivery address strings.
Problem Statement
Given a list where each element contains a name and a full address, the goal is to separate the address into its province component and group the original records by that province.
Approach
The solution reads the address list, slices the first two characters of each address (a province prefix such as 北京 or 河南), removes duplicates, and then builds a dictionary that maps each prefix to the list of matching records.
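The grouping step described above can also be sketched in a single pass. This is not the author's code: it swaps the deduplicate-then-rescan approach for `collections.defaultdict`, and the function name `group_by_prefix` and its `width` parameter are hypothetical names chosen for illustration.

```python
from collections import defaultdict


def group_by_prefix(records, width=2):
    """Group [name, address] records by the first `width` characters of the address."""
    groups = defaultdict(list)
    for record in records:
        address = record[-1]  # the address is the last field of each record
        groups[address[:width]].append(record)
    return dict(groups)


if __name__ == '__main__':
    sample = [
        ['王*龙', '北京市海淀区苏州街大恒科技大厦南座4层'],
        ['郭*峰', '河南省商丘市高新技术开发区恒宇食品厂'],
        ['柴*虎', '北京市昌平区北七家镇顺玮阁小区'],
    ]
    for province, rows in group_by_prefix(sample).items():
        print(province, rows)
```

Each record is visited once, so the work grows with the number of records rather than with records times provinces, and the keys appear in the order the provinces are first seen.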
Solution Code
# coding: utf-8
def sp(s):
    citys = []
    dizhi = []
    dice = {}
    dic = {}
    for i in s:
        a = i[1]
        city = a[0:2]
        zlib = a[0:2]
        citys.append(city)
        dizhi.append(zlib)
    cityss = set(citys)  # remove duplicates
    citysss = list(cityss)  # convert to list
    d = dice.fromkeys(citysss)
    for key in d:
        h = []
        for j in s:
            b = j[1]
            lgezi = b[0:2]
            if lgezi == key:
                h.append(j)
        dic[key] = h
    for key in dic:
        print(key, dic[key])

if __name__ == '__main__':
    sp([
        ['王*龙', '北京市海淀区苏州街大恒科技大厦南座4层'],
        ['郭*峰', '河南省商丘市高新技术开发区恒宇食品厂'],
        ['赵*生', '河北省唐山市朝阳道与学院路路口融通大厦2408室'],
        # ... (additional records omitted for brevity)
    ])
Improved Version
# coding: utf-8
def sp(text):
    city = []
    dice = {}
    dic = {}
    address = [info[-1] for info in text]
    for city_info in address:
        city.append(city_info[0:2])
    cities = list(set(city))  # deduplicate and convert to list
    dict_keys = dice.fromkeys(cities)
    for key in dict_keys:
        h = []
        for info in text:
            addr = info[-1]
            city_info = addr[0:2]
            if city_info == key:
                h.append(info)
        dic[key] = h
    for key in dic:
        print(key, dic[key])

if __name__ == '__main__':
    sp([
        ['王*龙', '北京市海淀区苏州街大恒科技大厦南座4层'],
        ['柴*虎', '北京市昌平区北七家镇顺玮阁小区'],
        ['韩*', '辽宁省葫芦岛市小庄子乡宝仓村'],
        # ... (additional records omitted for brevity)
    ])
Regex Alternative
import re

# Matches entries written as ['name', 'address']; compiled once, outside the loop.
pattern = re.compile(r"\['(?P<name>.*?)', '(?P<address>.*?)'\]", re.S)

with open("地址信息.txt", 'r', encoding='utf-8') as f:
    for line in f:
        for match in pattern.finditer(line):
            name = match.group('name')
            address = match.group('address')
            print(name, address)
Using Pandas for Province Extraction
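# Assumes df is an existing pandas DataFrame whose 地区 column holds province-level names.
df['地区2'] = df['地区'].apply(lambda s: s[:3] if s in ("黑龙江省", "内蒙古自治区") else s[:2])
Conclusion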
The article demonstrates how basic Python constructs (lists, dictionaries, loops, and simple string slicing) can be combined to parse address data, and shows alternative approaches using regular expressions and pandas for more flexible data handling.
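One limitation of plain two-character slicing is that it cuts 黑龙江 and 内蒙古 short, which is why the pandas expression special-cases them. The same idea can be applied to full address strings in plain Python. The sketch below is an illustration, not part of the original solution: `LONG_PREFIXES` and `province_prefix` are hypothetical names, and the addresses in the usage lines are made-up samples.

```python
# Province-level names whose short form is three characters, not two.
LONG_PREFIXES = ("黑龙江", "内蒙古")


def province_prefix(address):
    """Return the province prefix of an address, allowing for three-character names."""
    for prefix in LONG_PREFIXES:
        if address.startswith(prefix):
            return prefix
    return address[:2]  # every other province-level name is identified by two characters


if __name__ == '__main__':
    # Illustrative addresses only.
    for addr in ['黑龙江省哈尔滨市', '内蒙古自治区呼和浩特市', '北京市海淀区']:
        print(province_prefix(addr), addr)
```

Using `str.startswith` rather than an equality test lets the check work on a full address, whereas the pandas expression compares the whole cell against the province name.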
Python Crawling & Data Mining