How to Scrape and Visualize All Chinese University Data with Python Asyncio
This article demonstrates a complete Python workflow: crawling nationwide university information with aiohttp and asyncio, converting the JSON data to CSV files with pandas, and visualizing the results in Tableau. It serves as a practical guide to large‑scale data collection and analysis.
The author, Ding Xiaojie, shares a Python‑based workflow to collect, process, and visualize nationwide university information.
Data Crawling
Target URL: https://www.gaokao.cn/school/140. By opening the developer tools (F12) and capturing the JSON request https://static-data.eol.cn/www/2.0/school/140/info.json, the university data can be retrieved directly.
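School IDs (e.g., 140) are not contiguous, so the valid IDs cannot be listed in advance. Instead, the crawler generates a URL for every ID below an estimated upper bound (5000 in the scheduler code) and skips the requests that fail.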
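A minimal sketch of this enumeration, assuming candidate IDs are simply every integer below the upper bound and that the IDs already saved are available as a set. The helper names info_url and candidate_urls are illustrative and do not appear in the original code.

```python
INFO_URL = 'https://static-data.eol.cn/www/2.0/school/%d/info.json'


def info_url(school_id):
    """Build the JSON endpoint URL for one school ID."""
    return INFO_URL % school_id


def candidate_urls(max_id, already_saved=frozenset()):
    """Return URLs for every ID in [0, max_id) that is not yet saved.

    IDs are sparse, so many of these URLs will not resolve; the
    crawler is expected to skip the ones that fail.
    """
    remaining = sorted(set(range(max_id)) - set(already_saved))
    return [info_url(i) for i in remaining]
```

For example, info_url(140) gives the JSON request shown above, and candidate_urls(5000, already_saved={140}) returns 4999 URLs because ID 140 is excluded.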
Crawling Code
Import Modules
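import aiohttp
import asyncio
import pandas as pd
from pathlib import Path
from tqdm import tqdm
import time

# Assumption: the original never shows where current_path is defined;
# the current working directory is used here so the later snippets run.
current_path = Path.cwd()

Module purposes: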
aiohttp : Enables asynchronous HTTP requests.
asyncio : Provides the event loop and coroutine management.
pandas : Converts the scraped data into a DataFrame and writes CSV files.
pathlib : Handles filesystem paths in an object‑oriented way.
tqdm : Adds a progress bar to any iterable.
Generate URL List
def get_url_list(max_id):
    url = 'https://static-data.eol.cn/www/2.0/school/%d/info.json'
    # Candidate IDs 0..max_id-1; IDs are sparse, so many will not exist
    not_crawled = set(range(max_id))
    csv_path = Path(current_path, 'college_info.csv')
    # Resume support: skip IDs already saved by a previous run
    if csv_path.exists():
        df = pd.read_csv(csv_path)
        not_crawled -= set(df['学校id'].unique())  # '学校id' = school ID column
    return [url % school_id for school_id in not_crawled]

Fetch JSON Data
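async def get_json_data(url, semaphore):
    # The semaphore caps how many requests are in flight at once
    async with semaphore:
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36',
        }
        # 6-second total budget per request
        timeout = aiohttp.ClientTimeout(total=6)
        # ssl=False disables certificate verification, as in the original
        async with aiohttp.ClientSession(connector=aiohttp.TCPConnector(ssl=False), trust_env=True) as session:
            try:
                async with session.get(url=url, headers=headers, timeout=timeout) as response:
                    # Decode the body as UTF-8 while parsing the JSON
                    json_data = await response.json(encoding='utf-8')
                    if json_data != '':
                        return save_to_csv(json_data['data'])
            except Exception:
                # Missing IDs, timeouts and malformed payloads are skipped;
                # a later run retries whatever is still absent from the CSV
                return None

Save to CSV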
def save_to_csv(json_info):
    save_info = {}
    save_info['学校id'] = json_info['school_id']  # school ID
    save_info['学校名称'] = json_info['name']  # school name
    # Label the school level: "985 211", "211", or the API's own level name
    if json_info['f985'] == '1' and json_info['f211'] == '1':
        level = "985 211"
    elif json_info['f211'] == '1':
        level = "211"
    else:
        level = json_info['level_name']
    save_info['学校层次'] = level  # school level
    save_info['软科排名'] = json_info['rank']['ruanke_rank']  # Ruanke (ShanghaiRanking) rank
    save_info['校友会排名'] = json_info['rank']['xyh_rank']  # Alumni Association rank
    save_info['武书连排名'] = json_info['rank']['wsl_rank']  # Wu Shulian rank
    save_info['QS世界排名'] = json_info['rank']['qs_world']  # QS world rank
    save_info['US世界排名'] = json_info['rank']['us_rank']  # US world rank
    save_info['学校类型'] = json_info['type_name']  # school type
    save_info['省份'] = json_info['province_name']  # province
    save_info['城市'] = json_info['city_name']  # city
    save_info['所处地区'] = json_info['town_name']  # district
    save_info['招生办电话'] = json_info['phone']  # admissions office phone
    save_info['招生办官网'] = json_info['site']  # admissions office website
    df = pd.DataFrame(save_info, index=[0])
    csv_path = Path(current_path, 'college_info.csv')
    # Write the header row only when the file does not exist yet
    header = not csv_path.exists()
    df.to_csv(csv_path, index=False, mode='a', header=header)

Scheduler
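async def main(loop):
    # Get URL list
    url_list = get_url_list(5000)
    # Limit concurrency
    semaphore = asyncio.Semaphore(500)
    # Create tasks
    tasks = [loop.create_task(get_json_data(url, semaphore)) for url in url_list]
    # Await tasks with progress bar
    for t in tqdm(asyncio.as_completed(tasks), total=len(tasks)):
        await t

# Entry point (not shown in the original; one conventional way to start it)
if __name__ == '__main__':
    loop = asyncio.new_event_loop()
    loop.run_until_complete(main(loop))

Running the program twice collected 2820 university records in total: 2140 rows on the first run and 680 on the second. Because get_url_list skips IDs already present in college_info.csv, the second run only retries IDs that were not saved the first time.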
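The scheduling pattern used here (a semaphore to cap concurrency, plus asyncio.as_completed to consume results as they finish) can be tried without any network access. The sketch below is an illustration under stated assumptions, not the article's code: fetch_stub replaces the HTTP request with a short sleep, and the names fetch_stub, run_all and stats are hypothetical.

```python
import asyncio


async def fetch_stub(item, semaphore, stats):
    """Stand-in for a network request; records peak concurrency."""
    async with semaphore:
        stats['active'] += 1
        stats['peak'] = max(stats['peak'], stats['active'])
        try:
            await asyncio.sleep(0.01)  # simulated I/O wait
            return item * 2
        finally:
            stats['active'] -= 1


async def run_all(items, limit):
    """Run fetch_stub over items with at most `limit` in flight."""
    semaphore = asyncio.Semaphore(limit)
    stats = {'active': 0, 'peak': 0}
    tasks = [asyncio.create_task(fetch_stub(i, semaphore, stats)) for i in items]
    results = []
    # as_completed yields awaitables in completion order, not input order
    for finished in asyncio.as_completed(tasks):
        results.append(await finished)
    return results, stats['peak']


if __name__ == '__main__':
    results, peak = asyncio.run(run_all(range(20), limit=5))
    print(len(results), 'results, peak concurrency', peak)
```

All 20 tasks are created up front, but the recorded peak never exceeds the semaphore limit. The same reasoning applies to the crawler: up to 5000 tasks exist at once, while at most 500 requests run concurrently.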
Tableau Visualization
The collected data is visualized using Tableau Public. Four main charts are created:
University Count Distribution Map
Top three provinces by university count: Jiangsu, Guangdong, Henan.
Ruanke (ShanghaiRanking) Top 10
Most of the top 10 schools are comprehensive universities; the University of Science and Technology of China is the only science and engineering university among them.
University Level Distribution
Approximately 9.5% of institutions are 211 universities and 3.5% are 985 universities.
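Shares like these depend on how each school is labelled. The sketch below mirrors the labelling rule from save_to_csv and computes percentage shares with the standard library. It is an illustration only: classify_level and level_shares are hypothetical names, the 'undergraduate' label is a placeholder for whatever level_name the API returns, and the sample rows are made up rather than taken from the scraped data.

```python
from collections import Counter


def classify_level(f985, f211, level_name):
    """Mirror the labelling rule used in save_to_csv."""
    if f985 == '1' and f211 == '1':
        return '985 211'
    if f211 == '1':
        return '211'
    return level_name


def level_shares(levels):
    """Return each label's share of the total, in percent."""
    counts = Counter(levels)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: round(100 * n / total, 1) for label, n in counts.items()}


# Made-up placeholder rows, NOT the scraped data
sample = [
    classify_level('1', '1', 'undergraduate'),
    classify_level('0', '1', 'undergraduate'),
    classify_level('0', '0', 'undergraduate'),
    classify_level('0', '0', 'undergraduate'),
]
shares = level_shares(sample)
```

Under this rule a school flagged as both 985 and 211 is counted only under the "985 211" label, so the "211" label covers 211 schools that are not also 985.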
University Type Distribution
Science/engineering and comprehensive universities dominate, followed by finance, teacher‑training, and medical schools.
Combined Dashboard
The dashboard links the map to other charts via filter actions, enabling interactive exploration of the dataset.
In summary, the article covers the end‑to‑end process of data acquisition, cleaning, storage, and visualization for Chinese university information using Python asynchronous programming and Tableau.
