Master Shell Tricks to Analyze Beijing Points‑Based Residency Data in Seconds

This article demonstrates how to use standard shell utilities such as grep, cut, sort, uniq, awk, and join to quickly extract insights—like top companies, common surnames, popular given names, age distribution, and hometown rankings—from a JSON dataset of Beijing points‑based residency applicants.

Liangxu Linux

Nov 2, 2020

Problem Description

The input is a JSON file whose top‑level key rows holds an array of objects. Each object represents a candidate who obtained Beijing residency through the points‑based system and contains fields such as id, idCard (masked ID number), score, unit (employer), name, and other attributes.

"rows": [
  {
    "id": 62981,
    "idCard": "32092219721222****",
    "unit": "北京利德华福电气技术有限公司",
    "name": "杨效丰",
    "score": 122.59,
    ...
  }
]

The dataset (≈6000 records) can be downloaded from https://www.tanglei.name/resources/use-shell-to-analysis-the-first-people-of-getting-residence-of-beijing-by-score/jifenluohu.json.gz.

Analysis Tasks

Find the top‑10 companies that obtained the most residency quotas.

Identify the most frequent surname among the applicants.

Determine the most popular given name (first two characters of the given name).

Calculate the age distribution of the applicants.

List the top‑10 hometowns (based on the first four digits of the ID card).

(Optional) Find the most common zodiac sign or constellation.

Shell Solutions

Top‑10 Companies

Extract the unit field, count occurrences, sort numerically in descending order, and keep the first ten lines:

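# Match the "unit" key itself so other lines that merely contain "unit" are skipped,
# take the value after the first colon, then count and rank identical values.
grep '"unit":' jifenluohu.json \
  | cut -f2 -d: \
  | sort \
  | uniq -c \
  | sort -nr -k1 \
  | head -n 10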

Sample output:

137 "北京华为数字技术有限公司"
 73 "中央电视台"
 57 "北京首钢建设集团有限公司"
 55 "百度在线网络技术（北京）有限公司"
 48 "联想（北京）有限公司"
 40 "北京外企人力资源服务有限公司"
 40 "中国民生银行股份有限公司"
 39 "国际商业机器（中国）投资有限公司"
 29 "中国国际技术智力合作有限公司"
 27 "华为技术有限公司北京研究所"
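The pipeline above keeps only the text between the first and second colon, so an employer name that itself contains a colon would be truncated. A minimal sketch of a sed-based extraction that avoids this, run on a small inline sample; the employer names and the sample lines are hypothetical, not taken from the dataset:

```shell
# Hypothetical sample lines standing in for jifenluohu.json
sample='    "unit": "Acme: Beijing Branch",
    "name": "Community Lead",
    "unit": "Unit Test Labs",
    "unit": "Acme: Beijing Branch",'

# Capture everything between the quotes that follow the "unit" key,
# then count identical values and list the most frequent first.
top=$(printf '%s\n' "$sample" \
  | sed -n 's|.*"unit": "\(.*\)",\{0,1\}[[:space:]]*$|\1|p' \
  | sort | uniq -c | sort -k1,1nr | head -n 10 \
  | sed 's/^ *//')   # drop the padding that uniq -c adds

printf '%s\n' "$top"
```

Against the real file, the printf line would be replaced by reading jifenluohu.json directly; this assumes each field sits on its own line, as in the excerpt shown earlier.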

Most Common Surname

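Strip the JSON wrapper, keep only the first character of each name, then count and sort. This treats the first character as the surname, so compound surnames such as 欧阳 are counted under their first character only. It also relies on cut -c1 selecting a whole character rather than a single byte, which depends on the cut implementation and locale: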

grep '"name":' jifenluohu.json \
  | sed 's|"name": "||g' \
  | sed 's|[[:space:]]||g' \
  | cut -c1 \
  | sort \
  | uniq -c \
  | sort -nr -k1 \
  | head -n 10
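If cut -c turns out to be byte-based on your system (GNU coreutils documents -c as equivalent to -b), the first "character" of a Chinese name comes out as a single byte. One possible workaround, sketched below on hypothetical sample lines, is to select bytes explicitly. It assumes every surname is a single CJK character encoded as three bytes of UTF-8, which is an assumption about the data rather than something the source states:

```shell
# Hypothetical sample lines standing in for jifenluohu.json
sample='    "name": "杨效丰",
    "name": "杨明",
    "name": "王伟",'

# Strip the JSON prefix, keep the first three bytes (one CJK character in UTF-8),
# then count. LC_ALL=C makes every tool operate on raw bytes consistently.
surnames=$(printf '%s\n' "$sample" \
  | sed 's|.*"name": "||' \
  | LC_ALL=C cut -b1-3 \
  | LC_ALL=C sort | LC_ALL=C uniq -c | LC_ALL=C sort -k1,1nr \
  | awk '{ print $2, $1 }')   # print as: surname count

printf '%s\n' "$surnames"
```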

Most Popular Given Name (First Two Characters)

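After removing the JSON prefix, cut characters 2‑4 and count. For a three‑character name this yields the two given‑name characters plus the closing quote; for a two‑character name it yields the single given‑name character followed by the closing quote and comma: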

grep '"name":' jifenluohu.json \
  | sed 's|"name": "||g' \
  | sed 's|[[:space:]]||g' \
  | cut -c2-4 \
  | sort \
  | uniq -c \
  | sort -nr -k1 \
  | head -n 10
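Because characters 2‑4 include trailing JSON punctuation, the counted strings are not clean given names. A hedged variant, shown on hypothetical sample lines, strips the closing quote first and then drops the surname by byte offset. It assumes single-character surnames encoded as three bytes of UTF-8, so compound surnames would be split incorrectly:

```shell
# Hypothetical sample lines standing in for jifenluohu.json
sample='    "name": "杨效丰",
    "name": "王效丰",
    "name": "李伟",'

# Remove the JSON prefix and everything from the closing quote onward,
# then skip the first three bytes (the assumed one-character surname).
given=$(printf '%s\n' "$sample" \
  | sed 's|.*"name": "||; s|".*||' \
  | LC_ALL=C cut -b4- \
  | LC_ALL=C sort | LC_ALL=C uniq -c | LC_ALL=C sort -k1,1nr \
  | awk '{ print $2, $1 }')   # print as: given-name count

printf '%s\n' "$given"
```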

Age Distribution

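Extract the birth year, subtract it from the reference year 2019, and count occurrences. The birth year occupies digits 7‑10 of the ID number; because the field produced by cut -f2 -d: still begins with a space and an opening quote, those digits sit at characters 9‑12 of that field. The result is an approximate age, since only the year is compared. Two approaches are shown: using bc or awk.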

# Using bc
grep '"idCard":' jifenluohu.json \
  | cut -f2 -d: | cut -c9-12 \
  | xargs -n1 echo 2019 - | bc \
  | sort | uniq -c

# Using awk
grep '"idCard":' jifenluohu.json \
  | cut -f2 -d: | cut -c9-12 \
  | awk '{print 2019-$1}' \
  | sort | uniq -c
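The fixed offsets 9‑12 depend on the exact spacing of the JSON line. As a minimal sketch, the ID number can instead be isolated with sed and the counting done inside awk in a single pass; the sample ID values below are hypothetical apart from the one shown in the excerpt, and 2019 is the reference year used above:

```shell
# Hypothetical sample lines standing in for jifenluohu.json
sample='    "idCard": "32092219721222****",
    "idCard": "11010119800101****",
    "idCard": "13010219720505****",'

# Isolate the masked ID number, read the birth year from digits 7-10,
# and tally ages relative to the reference year.
ages=$(printf '%s\n' "$sample" \
  | sed -n 's|.*"idCard": "\([0-9*]*\)".*|\1|p' \
  | awk -v ref=2019 '
      { count[ref - substr($0, 7, 4)]++ }
      END { for (age in count) print age, count[age] }' \
  | sort -n)   # print as: age count, youngest first

printf '%s\n' "$ages"
```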

Top‑10 Hometowns

Obtain the four‑digit region code (the first four digits of the ID number, which are characters 3‑6 of the field once its leading space and opening quote are counted), count occurrences, then join with a city‑code reference file (city.csv) to map codes to city names.

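# Count region codes, most frequent first
grep '"idCard":' jifenluohu.json \
  | cut -f2 -d: | cut -c3-6 \
  | sort | uniq -c | sort -nr -k1 > topcity.code

# Prepare the city-code mapping: keep rows that start with a four-digit code,
# turn commas into spaces, and sort on the code so join can use it
grep -E '^[0-9]{4},' city.csv | sed 's|,| |g' | sort -k1,1 > city.code4

# Join on the region code (both inputs must be sorted on the join field;
# the <( ) process substitution requires bash or a compatible shell)
join -1 1 -2 2 city.code4 <(head -n 10 topcity.code | sort -k2,2)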

Sample joined output (region code → city → count), ordered by region code because join emits rows in key order:

1201 天津市市辖区 197
1301 河北省石家庄市 114
1302 河北省唐山市 156
1324 河北省保定地区 103
1501 内蒙古呼和浩特市 88
2101 辽宁省沈阳市 109
2201 吉林省长春市 113
2301 黑龙江哈尔滨市 123
4201 湖北省武汉市 118
6101 陕西省西安市 100
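The join step needs both inputs sorted on the key. An alternative sketch does the lookup in awk, which has no sorting requirement and can rank by count directly. The file layout assumed for city.csv (code,name per line) and the sample ID values are assumptions for illustration; the two city names are taken from the output above:

```shell
tmp=$(mktemp -d)

# Hypothetical stand-in for city.csv, assumed to hold "code,name" per line
printf '%s\n' '1201,天津市市辖区' '1301,河北省石家庄市' > "$tmp/city.csv"

# Hypothetical stand-in for the idCard lines of jifenluohu.json
printf '%s\n' \
  '    "idCard": "12010119800101****",' \
  '    "idCard": "12010219750202****",' \
  '    "idCard": "13010219720505****",' > "$tmp/ids.txt"

# Extract the first four digits of each ID, then count and label them in awk.
hometowns=$(sed -n 's|.*"idCard": "\([0-9]\{4\}\).*|\1|p' "$tmp/ids.txt" \
  | awk -F, '
      NR == FNR { name[$1] = $2; next }   # first input: code -> city name
      { count[$1]++ }                     # second input: one region code per line
      END {
        for (c in count)
          print count[c], c, (c in name ? name[c] : "unknown")
      }' "$tmp/city.csv" - \
  | sort -k1,1nr -k2,2 | head -n 10)      # print as: count code city

printf '%s\n' "$hometowns"
rm -rf "$tmp"
```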

Additional Tip

For more concise and robust JSON handling, the jq tool can replace many of the grep / cut / sed pipelines, but the examples above show how to solve the tasks with standard command‑line utilities alone (the join example additionally relies on bash process substitution).
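As a rough illustration of the jq route, the employer ranking might look like the sketch below. The miniature JSON document is hypothetical, and wrapping rows in a top-level object is an assumption based on the excerpt shown earlier; the block skips itself when jq is not installed:

```shell
# Hypothetical miniature of the dataset; employer names are placeholders
sample='{"rows":[{"unit":"A Corp","name":"x"},{"unit":"B Ltd","name":"y"},{"unit":"A Corp","name":"z"}]}'

if command -v jq >/dev/null 2>&1; then
  # .rows[].unit emits one employer per line; -r prints raw strings without quotes
  top_units=$(printf '%s\n' "$sample" \
    | jq -r '.rows[].unit' \
    | sort | uniq -c | sort -k1,1nr | head -n 10 \
    | sed 's/^ *//')
  printf '%s\n' "$top_units"
else
  echo 'jq is not installed; skipping the example' >&2
fi
```

Because jq parses the JSON properly, this form does not depend on line layout, spacing after colons, or colons inside values.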
