Efficiently Importing Massive CSV Data into MySQL with Python: pymysql vs pandas‑SQLAlchemy
This article demonstrates two approaches to efficiently importing massive CSV data into MySQL with Python: a direct pymysql method using chunked inserts and a concise pandas‑SQLAlchemy method. It compares their performance and code complexity and offers tips for further speed improvements.
Importing large‑scale data from Python into MySQL is a common task in data engineering. This guide covers two typical scenarios and presents three practical methods for handling tens of millions of rows.
Scenario 1: Infrequent writes
When data does not need to be written to MySQL frequently, the Navicat import wizard can be used. It supports multiple file formats, can auto‑create tables from the file schema, and quickly inserts data into existing tables.
Scenario 2: Incremental and frequent writes
To test the incremental, frequent‑write scenario, a CSV file with about 12 million rows is used as test data.
import pandas as pd
data = pd.read_csv('./tianchi_mobile_recommend_train_user.csv')
data.shape

Method 1: pymysql with chunked inserts
Install the pymysql library and use it to connect to MySQL, read the CSV in chunks, and batch‑insert each chunk.
pip install pymysql

import pymysql
# Database connection
conn = pymysql.connect(
    host='127.0.0.1',
    user='root',
    passwd='wangyuqing',
    db='test01',
    port=3306,
    charset='utf8'
)

big_size = 100000
cursor = conn.cursor()
# Build the INSERT statement once, with one placeholder per column
_values = ",".join(["%s"] * 5)
sql = "insert into users(user_id,item_id,behavior_type,item_category,time) values(%s)" % _values
with pd.read_csv('./tianchi_mobile_recommend_train_user.csv', chunksize=big_size) as reader:
    for df in reader:
        # Collect one chunk of rows as tuples for executemany
        datas = [
            (j['user_id'], j['item_id'], j['behavior_type'], j['item_category'], j['time'])
            for _, j in df.iterrows()
        ]
        cursor.executemany(sql, datas)
        conn.commit()
# Close the cursor before the connection
cursor.close()
conn.close()
print('Import complete!')

Method 2: pandas + SQLAlchemy
This approach leverages pandas' to_sql method with an SQLAlchemy engine, requiring no pre‑created table.
from sqlalchemy import create_engine

engine = create_engine('mysql+pymysql://root:wangyuqing@localhost:3306/test01')
data = pd.read_csv('./tianchi_mobile_recommend_train_user.csv')
data.to_sql('user02', engine, chunksize=100000, index=False)
print('Import complete!')

Conclusion
The pymysql method took about 12 minutes and involved more code, while the pandas‑SQLAlchemy method required only a few lines and completed in roughly 4 minutes. The second method also eliminates the need to create tables beforehand, making it the recommended solution for both convenience and performance. For even faster ingestion, consider parallel processing techniques.
IT Services Circle
Delivering cutting-edge internet insights and practical learning resources. We're a passionate and principled IT media platform.