Python Web Scraping Tutorial: Collecting Listed Company Data and Storing It in MySQL

This article is a step-by-step Python web-scraping tutorial: it fetches financial data for more than 3,000 listed companies from a public website, parses the tables with pandas, and then extends the script with error handling, configurable URLs, MySQL storage, and multiprocessing to speed up the crawl.

Python Programming Learning Circle

Introduction – The author explains that writing a simple web scraper is an easy way to start learning Python, emphasizing that the initial goal is just to retrieve data successfully before worrying about speed, storage, or code organization.

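Basic Environment Configuration – Python 3 on Windows is used. The first version relies on pandas both to read the HTML tables and to write the CSV file; pandas.read_html additionally needs an HTML parser such as lxml to be installed.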

Target Website – The scraper targets the AskCI stock information pages, which list company details for a given reporting date.

Initial Implementation

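import pandas as pd

# Scrape every result page: range(1, 178) yields pages 1 to 177
for i in range(1, 178):
    url = 'http://s.askci.com/stock/a/?reportTime=2017-12-31&pageNum=%s' % i
    tb = pd.read_html(url)[3]  # read_html returns a list of tables; index 3 holds the company data
    # Append every page to one CSV file, writing the header row only for the first page
    tb.to_csv('1.csv', mode='a', encoding='utf_8_sig', header=(i == 1), index=False)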

The first version quickly writes more than 3000 company records into an Excel‑compatible CSV file.
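
The utf_8_sig encoding is what lets the CSV open cleanly in Excel: it prefixes the file with a UTF-8 byte order mark (BOM). The following standard-library sketch uses made-up rows and a temporary file (both are assumptions, not scraped data) to show the BOM being written once and the header row being written only for the first batch when appending page by page.

```python
import csv
import os
import tempfile

def append_rows(path, header, rows):
    """Append rows to a CSV file; write the BOM and header only when the file is new."""
    is_new = not os.path.exists(path) or os.path.getsize(path) == 0
    # 'utf-8-sig' prefixes the output with a byte order mark, so use it for the first write only
    encoding = 'utf-8-sig' if is_new else 'utf-8'
    with open(path, 'a', newline='', encoding=encoding) as f:
        writer = csv.writer(f)
        if is_new:
            writer.writerow(header)
        writer.writerows(rows)

# Made-up rows standing in for two scraped pages (illustrative values only)
header = ['stock_code', 'company_name']
path = os.path.join(tempfile.mkdtemp(), 'demo.csv')
append_rows(path, header, [['000001', '示例公司甲']])
append_rows(path, header, [['000002', '示例公司乙']])

# Inspect the raw bytes to confirm the file starts with the BOM
with open(path, 'rb') as f:
    raw = f.read()
has_bom = raw.startswith(b'\xef\xbb\xbf')

# Reading back with 'utf-8-sig' strips the BOM again
with open(path, newline='', encoding='utf-8-sig') as f:
    rows_read = list(csv.reader(f))
```

Keeping the stock codes as text in this sketch also preserves their leading zeros, which would be lost if they were parsed as integers.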

Improvements

Added exception handling (try/except) to make the crawler robust against network failures.

Made the URL parameters configurable so the script can scrape different dates or pages.

Switched storage from CSV to MySQL to practice database operations and avoid the limitations of flat files.

Implemented multiprocessing to accelerate the crawl across all result pages (pages 1 to 177 in the code).
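
The first two improvements can be sketched with the standard library alone. The helper names build_url and fetch_with_retries are hypothetical, and the retry loop is one possible extension rather than what the article's code does (the article simply catches the exception and moves on). The fetch argument is any callable that takes a URL and returns the page text, so the sketch runs without network access.

```python
import time
from urllib.parse import urlencode

BASE_URL = 'http://s.askci.com/stock/a/?'

def build_url(report_time, page_num):
    """Build the listing URL for one reporting date and page number."""
    return BASE_URL + urlencode({'reportTime': report_time, 'pageNum': page_num})

def fetch_with_retries(fetch, url, attempts=3, delay=1.0):
    """Call fetch(url); on an exception, wait and retry up to `attempts` times.

    Returns the fetched value, or None if every attempt failed.
    """
    for attempt in range(1, attempts + 1):
        try:
            return fetch(url)
        except Exception as exc:  # a real crawler would catch narrower exception types
            print('attempt %d for %s failed: %s' % (attempt, url, exc))
            if attempt < attempts:
                time.sleep(delay)
    return None
```

With requests installed, a call might look like fetch_with_retries(lambda u: requests.get(u, timeout=10).text, build_url('2017-12-31', 1)).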

Enhanced Code (simplified view)

import requests
import pandas as pd
from bs4 import BeautifulSoup
from urllib.parse import urlencode
import time, pymysql
from sqlalchemy import create_engine
from multiprocessing import Pool

start_time = time.time()

def get_one_page(i):
    try:
        headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.181 Safari/537.36'}
        paras = {'reportTime': '2017-12-31', 'pageNum': i}
        url = 'http://s.askci.com/stock/a/?' + urlencode(paras)
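        # A timeout keeps one stalled request from hanging the whole crawl
        response = requests.get(url, headers=headers, timeout=10)
        if response.status_code == 200:
            return response.text
    except requests.RequestException as e:
        print('Request failed:', e)
    return None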

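def parse_one_page(html):
    soup = BeautifulSoup(html, 'lxml')
    # The company table is the element with id "myTable04"; select() returns a list of matches
    content = soup.select('#myTable04')[0]
    tbl = pd.read_html(content.prettify(), header=0)[0]
    # Map the Chinese column headers to the English column names used in the MySQL table
    tbl.rename(columns={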
        '序号':'serial_number','股票代码':'stock_code','股票简称':'stock_abbre','公司名称':'company_name',
        '省份':'province','城市':'city','主营业务收入(201712)':'main_bussiness_income','净利润(201712)':'net_profit',
        '员工人数':'employees','上市日期':'listing_date','招股书':'zhaogushu','公司财报':'financial_report',
        '行业分类':'industry_classification','产品类型':'industry_type','主营业务':'main_business'
    }, inplace=True)
    return tbl

def generate_mysql():
    conn = pymysql.connect(host='localhost', user='root', password='******', port=3306, charset='utf8', db='wade')
    cursor = conn.cursor()
    sql = '''CREATE TABLE IF NOT EXISTS listed_company (
        serial_number INT NOT NULL,
        stock_code INT,
        stock_abbre VARCHAR(20),
        company_name VARCHAR(20),
        province VARCHAR(20),
        city VARCHAR(20),
        main_bussiness_income VARCHAR(20),
        net_profit VARCHAR(20),
        employees INT,
        listing_date DATETIME,
        zhaogushu VARCHAR(20),
        financial_report VARCHAR(20),
        industry_classification VARCHAR(20),
        industry_type VARCHAR(100),
        main_business VARCHAR(200),
        PRIMARY KEY (serial_number)
    )'''
    cursor.execute(sql)
    conn.close()

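def write_to_sql(tbl, db='wade'):
    engine = create_engine(f'mysql+pymysql://root:******@localhost:3306/{db}?charset=utf8')
    try:
        # Note: this appends to 'listed_company2', not the 'listed_company' table
        # created in generate_mysql(); to_sql creates the table if it does not exist
        tbl.to_sql('listed_company2', con=engine, if_exists='append', index=False)
    except Exception as e:
        print(e)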

def main(page):
    generate_mysql()
    for i in range(1, page):
        html = get_one_page(i)
        if html:
            tbl = parse_one_page(html)
            write_to_sql(tbl)

if __name__ == '__main__':
    main(178)  # pages 1 to 177; range() excludes the upper bound
    endtime = time.time() - start_time
    print('Program ran for %.2f seconds' % endtime)

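# Multiprocessing version (replaces the sequential entry point above): pool.map passes one page number per call, so the worker handles a single page instead of looping like main()
def scrape_page(i):
    html = get_one_page(i)
    if html:
        write_to_sql(parse_one_page(html))

if __name__ == '__main__':
    generate_mysql()  # create the table once, before the workers start
    with Pool(4) as pool:
        pool.map(scrape_page, range(1, 178))  # one task per page, pages 1 to 177
    endtime = time.time() - start_time
    print('Program ran for %.2f seconds' % endtime)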
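
Pool.map calls the worker function once for every element of the iterable, so the worker should process exactly one page. The sketch below shows that pattern with a stand-in worker that does no network access. The names fake_scrape_page and crawl and the row count are hypothetical, chosen only for illustration.

```python
from multiprocessing import Pool

def fake_scrape_page(page_num):
    """Stand-in for fetching and parsing one page; returns (page number, row count)."""
    # A real worker would download and parse the page here
    return page_num, 20  # made-up row count, for illustration only

def crawl(pages, workers=4):
    """Run fake_scrape_page once per page number across a pool of worker processes."""
    with Pool(workers) as pool:
        # map() preserves input order and blocks until every page is done
        return pool.map(fake_scrape_page, pages)

if __name__ == '__main__':
    results = crawl(range(1, 9))
    print(sum(rows for _, rows in results), 'rows from', len(results), 'pages')
```

The worker is defined at module level so that it can be pickled and sent to the worker processes, and the entry-point guard is required on Windows, where each worker re-imports the module.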

Conclusion – Iteratively adding small improvements (exception handling, configurable URLs, MySQL storage, and parallel execution) grows the script from a few lines into a more robust data-collection tool and illustrates a practical learning path for Python developers.
