Efficiently Importing Massive CSV Data into MySQL with Python: pymysql vs pandas‑SQLAlchemy
This article demonstrates two approaches to efficiently importing massive CSV data into MySQL with Python: a direct pymysql method using chunked inserts and a concise pandas‑SQLAlchemy method. It compares their performance and code complexity and offers tips for further speed improvements.
Importing large‑scale data from Python into MySQL is a common task in data engineering. This guide covers two typical scenarios and presents three practical methods for handling tens of millions of rows.
Scenario 1: Infrequent writes
When data does not need to be written to MySQL frequently, the Navicat import wizard can be used. It supports multiple file formats, can auto‑create tables from the file schema, and quickly inserts data into existing tables.
Scenario 2: Incremental and frequent writes
To test the incremental, frequent‑write scenario, a CSV file with about 12 million rows is used as test data.
import pandas as pd
data = pd.read_csv('./tianchi_mobile_recommend_train_user.csv')
data.shape

Method 1: pymysql with chunked inserts
Install the pymysql library and use it to connect to MySQL, read the CSV in chunks, and batch‑insert each chunk.
pip install pymysql

import pymysql
# Database connection
conn = pymysql.connect(
    host='127.0.0.1',
    user='root',
    passwd='wangyuqing',
    db='test01',
    port=3306,
    charset='utf8'
)

big_size = 100000
cursor = conn.cursor()
# Build the INSERT statement once, with one placeholder per column
_values = ",".join(["%s"] * 5)
sql = "insert into users(user_id,item_id,behavior_type,item_category,time) values(%s)" % _values
with pd.read_csv('./tianchi_mobile_recommend_train_user.csv', chunksize=big_size) as reader:
    for df in reader:
        # Collect one chunk of rows as tuples for executemany
        datas = [
            (j['user_id'], j['item_id'], j['behavior_type'], j['item_category'], j['time'])
            for _, j in df.iterrows()
        ]
        cursor.executemany(sql, datas)
        conn.commit()
# Close the cursor before the connection
cursor.close()
conn.close()
print('Import complete!')

Method 2: pandas + SQLAlchemy
This approach leverages pandas' to_sql method with an SQLAlchemy engine, requiring no pre‑created table.
from sqlalchemy import create_engine

engine = create_engine('mysql+pymysql://root:wangyuqing@localhost:3306/test01')
data = pd.read_csv('./tianchi_mobile_recommend_train_user.csv')
data.to_sql('user02', engine, chunksize=100000, index=False)
print('Import complete!')

Conclusion
The pymysql method took about 12 minutes and involved more code, while the pandas‑SQLAlchemy method required only a few lines and completed in roughly 4 minutes. The second method also eliminates the need to create tables beforehand, making it the recommended solution for both convenience and performance. For even faster ingestion, consider parallel processing techniques.
IT Services Circle
Delivering cutting-edge internet insights and practical learning resources. We're a passionate and principled IT media platform.