How to Scrape and Visualize All Chinese University Data with Python Asyncio

This article demonstrates a complete Python workflow—using aiohttp and asyncio to crawl nationwide university information, processing the JSON data into CSV files with pandas, and visualizing the results in Tableau—providing a practical guide for large‑scale data collection and analysis.

Python Crawling & Data Mining

Jul 22, 2022

The author, Ding Xiaojie, shares a Python‑based workflow to collect, process, and visualize nationwide university information.

Data Crawling

Target URL: https://www.gaokao.cn/school/140. By opening the developer tools (F12) and capturing the JSON request https://static-data.eol.cn/www/2.0/school/140/info.json, the university data can be retrieved directly.

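School IDs (e.g., 140) are not contiguous, so the valid IDs cannot be enumerated in advance. The crawler instead generates a URL for every ID below an estimated upper bound (5000 in the scheduler below) and lets requests for nonexistent IDs fail.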

Crawling Code

Import Modules

import aiohttp
import asyncio
import pandas as pd
from pathlib import Path
from tqdm import tqdm
import time

Module purposes:

aiohttp : Enables asynchronous HTTP requests.

asyncio : Provides the event loop and coroutine management.

pandas : Converts the scraped data into a DataFrame and writes CSV files.

pathlib : Handles filesystem paths in an object‑oriented way.

tqdm : Adds a progress bar to any iterable.

Generate URL List

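def get_url_list(max_id):
    # current_path is not defined in this listing; it is assumed to be set
    # at module level, e.g. current_path = Path.cwd()
    url = 'https://static-data.eol.cn/www/2.0/school/%d/info.json'
    csv_path = Path(current_path, 'college_info.csv')
    # Candidate IDs; IDs that do not exist simply fail when requested
    not_crawled = set(range(max_id))
    if csv_path.exists():
        # Skip IDs saved by a previous run so the crawl can be resumed
        df = pd.read_csv(csv_path)
        not_crawled -= set(df['学校id'].unique())
    return [url % school_id for school_id in not_crawled]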
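The resume behaviour of get_url_list comes down to a set difference between the candidate IDs and the IDs already saved. The following is a minimal offline sketch of that idea. It swaps pandas for the standard-library csv module so it runs without extra dependencies, and the helper name pending_ids and the temporary CSV are illustrative assumptions, not part of the original script.

```python
import csv
import tempfile
from pathlib import Path


def pending_ids(max_id, csv_path, id_column='学校id'):
    """Return the sorted IDs in range(max_id) that are not yet in csv_path."""
    candidates = set(range(max_id))
    csv_path = Path(csv_path)
    if csv_path.exists():
        with csv_path.open(newline='', encoding='utf-8') as f:
            done = {int(row[id_column]) for row in csv.DictReader(f)}
        candidates -= done
    return sorted(candidates)


# Demo with a throwaway CSV that already holds IDs 2 and 4 (toy data).
with tempfile.TemporaryDirectory() as tmp:
    demo_csv = Path(tmp, 'college_info.csv')
    with demo_csv.open('w', newline='', encoding='utf-8') as f:
        writer = csv.DictWriter(f, fieldnames=['学校id', '学校名称'])
        writer.writeheader()
        writer.writerow({'学校id': 2, '学校名称': 'A'})
        writer.writerow({'学校id': 4, '学校名称': 'B'})
    remaining = pending_ids(6, demo_csv)
    fresh = pending_ids(3, Path(tmp, 'missing.csv'))

print(remaining)  # [0, 1, 3, 5]
print(fresh)      # [0, 1, 2]
```

When no CSV exists yet, every candidate ID is returned, which corresponds to the first run of the crawler.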

Fetch JSON Data

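async def get_json_data(url, semaphore):
    # The semaphore caps how many requests are in flight at once
    async with semaphore:
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36',
        }
        timeout = aiohttp.ClientTimeout(total=6)
        async with aiohttp.ClientSession(connector=aiohttp.TCPConnector(ssl=False), trust_env=True) as session:
            try:
                async with session.get(url=url, headers=headers, timeout=timeout) as response:
                    # aiohttp takes the encoding as an argument to json();
                    # assigning response.encoding has no effect
                    json_data = await response.json(encoding='utf-8')
                    if json_data != '':
                        return save_to_csv(json_data['data'])
            except Exception:
                # Nonexistent IDs, timeouts and malformed payloads are skipped;
                # a later run will request these IDs again
                return None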
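The semaphore passed to get_json_data is what bounds the number of simultaneous requests. The sketch below illustrates that effect offline by replacing the HTTP call with asyncio.sleep and recording the peak number of coroutines inside the guarded section. The names fake_fetch and run_demo, the task count, and the limit of 5 are hypothetical and chosen only for the demonstration.

```python
import asyncio


async def fake_fetch(i, semaphore, state):
    """Stand-in for an HTTP request; records how many run at once."""
    async with semaphore:
        state['active'] += 1
        state['peak'] = max(state['peak'], state['active'])
        await asyncio.sleep(0.01)  # simulated network latency
        state['active'] -= 1
        return i


async def run_demo(n_tasks=20, limit=5):
    semaphore = asyncio.Semaphore(limit)
    state = {'active': 0, 'peak': 0}
    # gather returns results in submission order
    results = await asyncio.gather(
        *(fake_fetch(i, semaphore, state) for i in range(n_tasks))
    )
    return results, state['peak']


results, peak = asyncio.run(run_demo())
print(len(results), peak)  # 20 5
```

Although 20 coroutines are scheduled together, no more than 5 are ever inside the semaphore at the same time.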

Save to CSV

def save_to_csv(json_info):
    # Column names are kept in Chinese, as written to college_info.csv
    save_info = {}
    save_info['学校id'] = json_info['school_id']
    save_info['学校名称'] = json_info['name']
    # School tier: 985 schools are labelled "985 211", other 211 schools "211",
    # and everything else falls back to the level name from the API
    if json_info['f985'] == '1' and json_info['f211'] == '1':
        level = "985 211"
    elif json_info['f211'] == '1':
        level = "211"
    else:
        level = json_info['level_name']
    save_info['学校层次'] = level
    save_info['软科排名'] = json_info['rank']['ruanke_rank']
    save_info['校友会排名'] = json_info['rank']['xyh_rank']
    save_info['武书连排名'] = json_info['rank']['wsl_rank']
    save_info['QS世界排名'] = json_info['rank']['qs_world']
    save_info['US世界排名'] = json_info['rank']['us_rank']
    save_info['学校类型'] = json_info['type_name']
    save_info['省份'] = json_info['province_name']
    save_info['城市'] = json_info['city_name']
    save_info['所处地区'] = json_info['town_name']
    save_info['招生办电话'] = json_info['phone']
    save_info['招生办官网'] = json_info['site']
    df = pd.DataFrame(save_info, index=[0])
    csv_path = Path(current_path, 'college_info.csv')
    # Write the header only when the file is first created, then append rows
    header = not csv_path.exists()
    df.to_csv(csv_path, index=False, mode='a', header=header)
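The tier labelling inside save_to_csv can be isolated as a small pure function, which makes the branch order easy to check without touching the network or the CSV. This is only a sketch: the function name classify_level is hypothetical, and the sample records use made-up values ('0' for an unset flag, placeholder level names) rather than values confirmed from the API.

```python
def classify_level(info):
    """Mirror the branch order used in save_to_csv."""
    if info.get('f985') == '1' and info.get('f211') == '1':
        return '985 211'
    if info.get('f211') == '1':
        return '211'
    # Anything that is not flagged falls back to the API's level name
    return info.get('level_name', '')


# Made-up records; only the '1' flag value is taken from the article's code.
samples = [
    {'f985': '1', 'f211': '1', 'level_name': 'level-A'},
    {'f985': '0', 'f211': '1', 'level_name': 'level-A'},
    {'f985': '0', 'f211': '0', 'level_name': 'level-B'},
]
levels = [classify_level(s) for s in samples]
print(levels)  # ['985 211', '211', 'level-B']
```

One consequence of this labelling is that a 985 school is never counted under the plain "211" label, which matters when reading the tier percentages later.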

Scheduler

async def main(loop):
    # Get URL list
    url_list = get_url_list(5000)
    # Limit concurrency
    semaphore = asyncio.Semaphore(500)
    # Create tasks
    tasks = [loop.create_task(get_json_data(url, semaphore)) for url in url_list]
    # Await tasks with progress bar
    for t in tqdm(asyncio.as_completed(tasks), total=len(tasks)):
        await t
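The article does not show how main is started. On current Python versions a coroutine is usually launched with asyncio.run, and tasks can be created with asyncio.create_task, so the loop argument is not needed. The sketch below shows that pattern together with asyncio.as_completed. The worker coroutine is a stub standing in for get_json_data, and tqdm is left out so the example has no third-party dependencies; both choices are assumptions made for illustration.

```python
import asyncio


async def worker(n):
    """Stub for get_json_data: pretend to fetch item n."""
    await asyncio.sleep(0.001 * (5 - n))  # later items sleep for less time
    return n


async def main():
    tasks = [asyncio.create_task(worker(n)) for n in range(5)]
    finished = []
    # as_completed yields awaitables in completion order, not submission order
    for fut in asyncio.as_completed(tasks):
        finished.append(await fut)
    return finished


finished = asyncio.run(main())
print(sorted(finished))  # [0, 1, 2, 3, 4]
```

Because results arrive in completion order, the progress bar in the original scheduler advances as soon as any request finishes, regardless of where its URL sat in the list.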

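The program was run twice: the first run saved 2140 rows and the second added 680, for a total of 2820 university records. Because get_url_list skips IDs already present in college_info.csv, the second run only requested the IDs that had produced no record the first time.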
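Rerunning by hand can also be expressed as a loop that repeats passes over the missing IDs until a pass adds nothing new. The sketch below simulates this synchronously with a deterministic flaky fetch function. The article does not state why some records were missed on the first run, so the failure pattern here (and the names crawl_until_stable and flaky_fetch) is purely an assumption for illustration.

```python
def crawl_until_stable(candidates, fetch, max_passes=5):
    """Repeat passes over the IDs still missing until a pass adds nothing."""
    saved = {}
    passes = 0
    for _ in range(max_passes):
        passes += 1
        before = len(saved)
        for i in [c for c in candidates if c not in saved]:
            record = fetch(i)
            if record is not None:
                saved[i] = record
        if len(saved) == before:
            break  # nothing new: the rest are assumed not to exist
    return saved, passes


attempts = {}


def flaky_fetch(i):
    """Toy fetch: multiples of 5 never exist, even IDs fail on the first try."""
    attempts[i] = attempts.get(i, 0) + 1
    if i % 5 == 0:
        return None
    if i % 2 == 0 and attempts[i] == 1:
        return None
    return {'school_id': i}


saved, passes = crawl_until_stable(range(10), flaky_fetch)
print(sorted(saved), passes)  # [1, 2, 3, 4, 6, 7, 8, 9] 3
```

The third pass finds nothing, which is the signal that the remaining IDs are most likely gaps in the ID space rather than transient failures.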

Tableau Visualization

The collected data is visualized using Tableau Public. Four main charts are created:

University Count Distribution Map

Top three provinces by university count: Jiangsu, Guangdong, Henan.

Ruanke (ShanghaiRanking) Top-10 Rankings

Most top-10 schools are comprehensive universities; the University of Science and Technology of China is the only science and engineering institution among them.

University Level Distribution

Approximately 9.5% of institutions are 211 universities and 3.5% are 985 universities.
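Percentages like these can be cross-checked in Python before building the chart. The sketch below computes the share of each label in a column using the standard-library Counter. The helper name category_shares is hypothetical and the column is toy data with made-up counts, not the scraped dataset.

```python
from collections import Counter


def category_shares(values):
    """Return {category: percentage of rows}, rounded to one decimal."""
    counts = Counter(values)
    total = sum(counts.values())
    return {k: round(100 * v / total, 1) for k, v in counts.items()}


# Toy column standing in for '学校层次'; the counts are made up.
toy_levels = ['985 211'] * 1 + ['211'] * 2 + ['other'] * 17
shares = category_shares(toy_levels)
print(shares)  # {'985 211': 5.0, '211': 10.0, 'other': 85.0}
```

Since save_to_csv labels 985 schools as "985 211", the "211" bucket computed this way excludes them, so the two shares are disjoint.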

University Type Distribution

Science/engineering and comprehensive universities dominate, followed by finance, teacher‑training, and medical schools.

Combined Dashboard

The dashboard links the map to other charts via filter actions, enabling interactive exploration of the dataset.

In summary, the article covers the end‑to‑end process of data acquisition, cleaning, storage, and visualization for Chinese university information using Python asynchronous programming and Tableau.
