
What Do Gaokao Numbers Reveal? Python-Powered Deep Dive into China’s College Admissions

This article uses Python to scrape and analyze 2,904 university records and 1,450 major records from China, revealing trends in Gaokao participation, provincial enrollment, university types, popularity rankings, and public curiosity about majors, all illustrated with charts and code examples.

Python Crawling & Data Mining

Overview

The analysis explores Chinese Gaokao (college entrance exam) data from 1977 to 2020, using Python to collect, clean, and visualize information about exam participants, admission rates, university distribution, and major popularity.

Historical Gaokao Participation and Admission Rates

From 1977 to 2019, the numbers of examinees and admitted students generally rose. Examinees peaked at 10.5 million in 2008 and then declined slightly; in 2020 the figure reached 10.71 million, the highest in a decade and above the 2008 peak.

Admission numbers also increased steadily, surpassing one million in 1997. The admission rate rose in most years, dipped slightly between 2005 and 2008, then climbed rapidly as universities expanded enrollment, reaching 82 % in 2017.

2019 Provincial First‑Batch Admission Data

In 2019, Henan had the most examinees, over 1.03 million, with a first‑batch admission rate of 12.54 % (about 129,200 students). Guangdong and Sichuan followed with 760 k and 650 k examinees and first‑batch rates of 12.87 % and 14.72 %, respectively.
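The admitted count can be cross-checked from the two quoted figures, since first-batch admissions are roughly examinees multiplied by the first-batch rate. The sketch below uses only the numbers quoted above; the function name is hypothetical, and because the examinee totals are rounded, the results are estimates rather than official counts.

```python
# Figures quoted in the text for 2019: (examinees, first-batch admission rate).
provinces = {
    "Henan": (1_030_000, 0.1254),
    "Guangdong": (760_000, 0.1287),
    "Sichuan": (650_000, 0.1472),
}


def first_batch_admitted(examinees: int, rate: float) -> int:
    """Estimate first-batch admissions as examinees * rate, rounded to an integer."""
    return round(examinees * rate)


estimates = {
    name: first_batch_admitted(examinees, rate)
    for name, (examinees, rate) in provinces.items()
}

for name, count in estimates.items():
    print(f"{name}: about {count:,} first-batch admissions")
```

For Henan this gives a figure on the order of 129 thousand students.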

University Distribution by Province

Jiangsu tops the list with 174 universities, followed by Beijing (167), Shandong (161) and Guangdong (161).
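A count like this can be derived from the `province_name` column that the scraper collects. The following is a minimal sketch, assuming the scraped data is already in a pandas DataFrame; the sample rows are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical sample rows; the real DataFrame would come from the scraper.
df = pd.DataFrame({
    "school_name": ["A", "B", "C", "D", "E", "F"],
    "province_name": ["Jiangsu", "Jiangsu", "Jiangsu", "Beijing", "Beijing", "Shandong"],
})

# value_counts() counts rows per province and sorts in descending order.
province_counts = df["province_name"].value_counts()
print(province_counts.head(10))
```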

University Levels

Beijing hosts the most elite institutions: 27 Project 211 universities and 9 Project 985 universities, the highest counts among all provinces.
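Per-province counts of elite universities could be computed from the `f985` and `f211` columns collected by the scraper. The sketch below assumes that a value of 1 marks membership and any other value marks non-membership; the article does not document the encoding, so this should be verified against the actual API response. The sample rows are hypothetical.

```python
import pandas as pd

# Hypothetical sample rows; column names match the scraper's output.
df = pd.DataFrame({
    "province_name": ["Beijing", "Beijing", "Beijing", "Jiangsu", "Jiangsu"],
    "f985": [1, 2, 1, 1, 2],
    "f211": [1, 1, 1, 1, 2],
})

MEMBER = 1  # ASSUMPTION: 1 means the school belongs to the project

elite = (
    df.assign(is_985=df["f985"].eq(MEMBER), is_211=df["f211"].eq(MEMBER))
      .groupby("province_name")[["is_985", "is_211"]]
      .sum()  # summing booleans counts the True values
      .sort_values("is_211", ascending=False)
)
print(elite)
```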

University Types

Science and engineering (理工) institutions form the largest group, accounting for 30.93 % of all universities. Comprehensive universities follow at 29.14 %, and teacher‑training schools make up 8.7 %.
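Shares like these are each category's count divided by the total number of universities. Here is a minimal sketch using the `type_name` column from the scraper; the English labels and the ten sample rows are hypothetical, and the scraped column would hold the site's own category labels.

```python
import pandas as pd

# Hypothetical sample: 10 schools with illustrative English type labels.
df = pd.DataFrame({
    "type_name": (
        ["Science and engineering"] * 4
        + ["Comprehensive"] * 3
        + ["Teacher training"] * 1
        + ["Other"] * 2
    )
})

# normalize=True returns fractions of the total; convert to percentages.
type_share = (df["type_name"].value_counts(normalize=True) * 100).round(2)
print(type_share)
```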

University Popularity Ranking

Based on search‑based popularity scores, Xiamen University ranks first, followed by Wuhan University, Sichuan University, and then Peking and Tsinghua Universities.
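A ranking of this kind can be produced by sorting on a popularity column. The sketch below assumes the score is the `view_total` field that the scraper collects and that it may arrive as text, so it is converted to numbers before sorting; the school names and values are invented for illustration.

```python
import pandas as pd

# Hypothetical sample rows; view_total is given as strings on purpose.
df = pd.DataFrame({
    "school_name": ["U1", "U2", "U3", "U4"],
    "view_total": ["1200", "560", "9800", "4300"],
})

# Convert to numeric so sorting is by value, not by string order.
df["view_total"] = pd.to_numeric(df["view_total"], errors="coerce")

top = (
    df.sort_values("view_total", ascending=False)
      .head(3)
      .reset_index(drop=True)
)
print(top)
```

Without the numeric conversion, string sorting would place "9800" above "12000", which is why the conversion step comes first.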

Major Subject Distribution

Engineering has the most sub‑majors (212), followed by literature (122); philosophy has the fewest (4).

Major Popularity

Clinical medicine tops the search ranking, followed by business economics and then electrical engineering and intelligent control.

Public Curiosity About Majors

On social media, psychology ranks first in public interest, with nursing second and archaeology third.

Data Acquisition

The dataset (2,904 university records and 1,450 major records) was obtained by scraping the China Education Online website using Python.

# Import packages
import numpy as np
import pandas as pd
import requests
import json
from fake_useragent import UserAgent
import time

# Get one page
def get_one_page(page_num):
    url = 'https://api.eol.cn/gkcx/api/'
    headers = {
        'User-Agent': UserAgent().random,
        'Origin': 'https://gkcx.eol.cn',
        'Referer': 'https://gkcx.eol.cn/school/search?province=&schoolflag=&recomschprop='
    }
    data = {
        'access_token': "",
        'admissions': "",
        'central': "",
        'department': "",
        'dual_class': "",
        'f211': "",
        'f985': "",
        'is_dual_class': "",
        'keyword': "",
        'page': page_num,
        'province_id': "",
        'request_type': 1,
        'school_type': "",
        'size': 20,
        'sort': "view_total",
        'type': "",
        'uri': "apigkcx/api/school/hotlists"
    }
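    # Send the request; on a network error, wait briefly and retry once
    try:
        response = requests.post(url=url, data=data, headers=headers, timeout=10)
    except requests.RequestException as e:
        print(e)
        time.sleep(3)
        response = requests.post(url=url, data=data, headers=headers, timeout=10)
    response.raise_for_status()
    school_data = json.loads(response.text)['data']['item']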
    # Extract fields
    school_name = [i.get('name') for i in school_data]
    belong = [i.get('belong') for i in school_data]
    dual_class_name = [i.get('dual_class_name') for i in school_data]
    f985 = [i.get('f985') for i in school_data]
    f211 = [i.get('f211') for i in school_data]
    level_name = [i.get('level_name') for i in school_data]
    type_name = [i.get('type_name') for i in school_data]
    nature_name = [i.get('nature_name') for i in school_data]
    view_total = [i.get('view_total') for i in school_data]
    province_name = [i.get('province_name') for i in school_data]
    city_name = [i.get('city_name') for i in school_data]
    county_name = [i.get('county_name') for i in school_data]
    df_one = pd.DataFrame({
        'school_name': school_name,
        'belong': belong,
        'dual_class_name': dual_class_name,
        'f985': f985,
        'f211': f211,
        'level_name': level_name,
        'type_name': type_name,
        'nature_name': nature_name,
        'view_total': view_total,
        'province_name': province_name,
        'city_name': city_name,
        'county_name': county_name,
    })
    return df_one

# Get all pages
def get_all_page(all_page_num):
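    pages = []
    for i in range(all_page_num):
        print(f'Fetching university info: page {i + 1}')
        pages.append(get_one_page(page_num=i + 1))
        # Pause 1-2 seconds between requests to avoid hammering the server
        time.sleep(np.random.uniform(1, 2))
    # DataFrame.append was removed in pandas 2.0; concatenate once instead
    return pd.concat(pages, ignore_index=True)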

if __name__ == '__main__':
    df = get_all_page(all_page_num=148)

Data Processing Steps

Analyze network requests to locate the real API endpoint and request method.

Use requests to fetch JSON data.

Parse the JSON and extract relevant fields.

Store the cleaned data in pandas DataFrames for further analysis.
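The parsing and storage steps can be sketched offline, without touching the network. The example below assumes the response has the `data` → `item` layout used by the scraper; the helper name `parse_items`, the reduced field list, and the sample payload are hypothetical and only illustrate the idea.

```python
import json

import pandas as pd

# A subset of the fields the scraper extracts from each item.
FIELDS = ["name", "province_name", "type_name", "view_total"]


def parse_items(payload_text: str) -> pd.DataFrame:
    """Parse one JSON response body into a DataFrame, one row per school."""
    items = json.loads(payload_text)["data"]["item"]
    # item.get() yields None for missing keys instead of raising KeyError.
    rows = [{field: item.get(field) for field in FIELDS} for item in items]
    return pd.DataFrame(rows, columns=FIELDS).rename(columns={"name": "school_name"})


# Hypothetical payload standing in for a real API response.
sample = json.dumps({
    "data": {
        "item": [
            {"name": "Example University", "province_name": "Beijing",
             "type_name": "Comprehensive", "view_total": "12345"},
            {"name": "Sample Institute", "province_name": "Jiangsu"},
        ]
    }
})

df_page = parse_items(sample)
print(df_page)
```

Keeping the parsing in its own function makes it possible to test the field extraction on saved responses before running the full crawl.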
