What Do Gaokao Numbers Reveal? Python-Powered Deep Dive into China’s College Admissions
This article uses Python to scrape and analyze records for more than 2,900 Chinese universities and their majors, revealing trends in Gaokao participation, provincial enrollment, university types, popularity rankings, and public curiosity about majors, all illustrated with charts and code examples.
Overview
The analysis explores Chinese Gaokao (college entrance exam) data from 1977 to 2020, using Python to collect, clean, and visualize information about exam participants, admission rates, university distribution, and major popularity.
Historical Gaokao Participation and Admission Rates
From 1977 to 2019, the numbers of examinees and admitted students generally rose. Examinees peaked at 10.5 million in 2008 and then declined slightly; in 2020 the count reached 10.71 million, surpassing the 2008 peak.
Admission numbers have also increased steadily, surpassing one million in 1997. Admission rates rose in most years, dipped slightly between 2005 and 2008, and then climbed rapidly with university expansion, reaching 82 % in 2017.
2019 Provincial First‑Batch Admission Data
In 2019, Henan led with over 1.03 million examinees and a 12.54 % first‑batch admission rate (about 129.2 k students). Guangdong and Sichuan followed with 760 k and 650 k examinees, and first‑batch rates of 12.87 % and 14.72 % respectively.
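As a quick sanity check on the Henan figures, the first‑batch headcount is just the examinee count multiplied by the admission rate. The sketch below uses only the rounded numbers quoted above, so the result is approximate; the function name is illustrative and not part of the original analysis.

```python
def first_batch_admitted(examinees: int, rate_percent: float) -> int:
    """Approximate first-batch admissions from an examinee count and a rate in percent."""
    return round(examinees * rate_percent / 100)


# Rounded 2019 Henan figures quoted in the text
henan = first_batch_admitted(1_030_000, 12.54)
print(henan)  # roughly 129 thousand students
```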
University Distribution by Province
Jiangsu tops the list with 174 universities, followed by Beijing (167), Shandong (161) and Guangdong (161).
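A per‑province count of this kind can be derived from the scraped table with one `value_counts` call. This is a minimal sketch, assuming a DataFrame with one row per school and a `province_name` column as in the Data Acquisition section; the sample rows are made‑up placeholders, not real data.

```python
import pandas as pd

# Placeholder rows standing in for the scraped table (not real data)
df = pd.DataFrame({
    'school_name': ['School A', 'School B', 'School C', 'School D'],
    'province_name': ['Jiangsu', 'Beijing', 'Jiangsu', 'Shandong'],
})

# Number of schools per province, largest first
province_counts = df['province_name'].value_counts()
print(province_counts.head(10))
```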
University Levels
Beijing hosts the most elite institutions: 27 Project 211 universities and 9 Project 985 universities, the highest counts among all provinces.
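Counts like these could be tallied by grouping on province and summing the 211 and 985 flags. The sketch below assumes the scraped `f211` and `f985` columns use 1 to mean "yes"; that encoding is an assumption and should be checked against the actual API response. The rows are placeholders, not real data.

```python
import pandas as pd

# Placeholder rows; assumes f211/f985 use 1 for "yes" (unverified assumption)
df = pd.DataFrame({
    'province_name': ['Beijing', 'Beijing', 'Shanghai', 'Beijing'],
    'f211': [1, 1, 1, 2],
    'f985': [1, 2, 1, 2],
})

# Convert the flags to booleans, then count the True values per province
elite = (
    df.assign(is_211=df['f211'].eq(1), is_985=df['f985'].eq(1))
      .groupby('province_name')[['is_211', 'is_985']]
      .sum()
      .sort_values('is_211', ascending=False)
)
print(elite)
```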
University Types
Science and engineering (理工) institutions dominate, accounting for 30.93 % of all universities. Comprehensive universities follow at 29.14 %, and teacher‑training schools make up 8.7 %.
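Shares like these are relative frequencies of the school type. A minimal sketch, assuming the type labels live in a `type_name` column as in the scraper; the values here are placeholders chosen only to make the arithmetic easy to follow.

```python
import pandas as pd

# Placeholder type labels (not real data)
type_name = pd.Series(
    ['Engineering', 'Comprehensive', 'Engineering', 'Teacher training']
)

# normalize=True returns fractions; scale to percent and round to two decimals
share = (type_name.value_counts(normalize=True) * 100).round(2)
print(share)
```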
University Popularity Ranking
Based on search‑based popularity scores, Xiamen University ranks first, followed by Wuhan University, Sichuan University, and then Peking and Tsinghua Universities.
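A ranking of this kind amounts to sorting the table by its popularity score. The sketch below assumes the score is the `view_total` field used as the sort key in the scraper, and that it may arrive as text and need converting to numbers; both points are assumptions, and the rows are placeholders.

```python
import pandas as pd

# Placeholder rows; view_total is given as strings to mimic untyped JSON values
df = pd.DataFrame({
    'school_name': ['School A', 'School B', 'School C'],
    'view_total': ['1200', '45000', '300'],
})

# Coerce to numbers (unparseable values become NaN), then take the top entries
df['view_total'] = pd.to_numeric(df['view_total'], errors='coerce')
top = df.nlargest(2, 'view_total')[['school_name', 'view_total']]
print(top)
```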
Major Subject Distribution
Engineering has the most sub‑majors (212), followed by literature (122); philosophy has the fewest (4).
Major Popularity
Clinical medicine searches top the list, followed by business economics and electrical engineering with intelligent control.
Public Curiosity About Majors
On social media, psychology ranks first in public interest, with nursing second and archaeology third.
Data Acquisition
The dataset (2 904 university records and 1 450 major records) was obtained by scraping the China Education Online website using Python.
# Import packages
import json
import time

import numpy as np
import pandas as pd
import requests
from fake_useragent import UserAgent


# Fetch one page of results and return it as a DataFrame
def get_one_page(page_num):
    url = 'https://api.eol.cn/gkcx/api/'
    headers = {
        'User-Agent': UserAgent().random,
        'Origin': 'https://gkcx.eol.cn',
        'Referer': 'https://gkcx.eol.cn/school/search?province=&schoolflag=&recomschprop='
    }
    data = {
        'access_token': "",
        'admissions': "",
        'central': "",
        'department': "",
        'dual_class': "",
        'f211': "",
        'f985': "",
        'is_dual_class': "",
        'keyword': "",
        'page': page_num,
        'province_id': "",
        'request_type': 1,
        'school_type': "",
        'size': 20,
        'sort': "view_total",
        'type': "",
        'uri': "apigkcx/api/school/hotlists"
    }
    try:
        response = requests.post(url=url, data=data, headers=headers)
    except Exception as e:
        # On failure, wait briefly and retry once
        print(e)
        time.sleep(3)
        response = requests.post(url=url, data=data, headers=headers)
    school_data = json.loads(response.text)['data']['item']
    # Extract fields
    school_name = [i.get('name') for i in school_data]
    belong = [i.get('belong') for i in school_data]
    dual_class_name = [i.get('dual_class_name') for i in school_data]
    f985 = [i.get('f985') for i in school_data]
    f211 = [i.get('f211') for i in school_data]
    level_name = [i.get('level_name') for i in school_data]
    type_name = [i.get('type_name') for i in school_data]
    nature_name = [i.get('nature_name') for i in school_data]
    view_total = [i.get('view_total') for i in school_data]
    province_name = [i.get('province_name') for i in school_data]
    city_name = [i.get('city_name') for i in school_data]
    county_name = [i.get('county_name') for i in school_data]
    df_one = pd.DataFrame({
        'school_name': school_name,
        'belong': belong,
        'dual_class_name': dual_class_name,
        'f985': f985,
        'f211': f211,
        'level_name': level_name,
        'type_name': type_name,
        'nature_name': nature_name,
        'view_total': view_total,
        'province_name': province_name,
        'city_name': city_name,
        'county_name': county_name,
    })
    return df_one


# Fetch all pages and combine them into one DataFrame
def get_all_page(all_page_num):
    frames = []
    for i in range(all_page_num):
        print(f'Fetching university info, page {i + 1}')
        frames.append(get_one_page(page_num=i + 1))
        # Random pause of 1 to 2 seconds between requests
        time.sleep(np.random.uniform(1, 2))
    # DataFrame.append was removed in pandas 2.0; concatenate once instead
    return pd.concat(frames, ignore_index=True)


if __name__ == '__main__':
    df = get_all_page(all_page_num=148)

Data Processing Steps
Analyze network requests to locate the real API endpoint and request method.
Use requests to fetch JSON data.
Parse the JSON and extract relevant fields.
Store the cleaned data in pandas DataFrames for further analysis.
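The parsing step above can be tried offline before sending any requests. This is a minimal sketch, assuming the response body has the `data` → `item` shape read by the scraper; the helper name `parse_schools`, the chosen field subset, and the sample payload are all illustrative and hand‑written, not real API output.

```python
import json

# Illustrative subset of the fields used by the scraper
FIELDS = ['name', 'province_name', 'type_name', 'view_total']


def parse_schools(raw_text, fields=FIELDS):
    """Turn one response body into a list of flat records with only the wanted fields."""
    payload = json.loads(raw_text)
    # Tolerate a missing or null 'data' / 'item' entry by falling back to empty values
    items = (payload.get('data') or {}).get('item') or []
    return [{f: item.get(f) for f in fields} for item in items]


# Hand-written sample mimicking the assumed response shape (not real data)
sample = json.dumps({'data': {'item': [
    {'name': 'School A', 'province_name': 'Province X',
     'type_name': 'Comprehensive', 'view_total': 123, 'extra': 'ignored'},
]}})

records = parse_schools(sample)
print(records)
```

The resulting list of dictionaries can be passed directly to `pd.DataFrame(records)` for the storage step.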
Python Crawling & Data Mining
