Parallel Processing of Large CSV Files with multiprocessing, joblib, and tqdm in Python
This tutorial demonstrates how to accelerate processing of a multi‑million‑row CSV dataset by dividing the work into parallel tasks using Python's multiprocessing, joblib, and tqdm libraries, comparing serial, multi‑process, batch, and process‑map approaches with detailed timing results.
To achieve parallel processing, the task is split into sub‑units, increasing the number of jobs and reducing overall execution time.
The example uses the US Accidents (2016‑2021) dataset from Kaggle (2.8 million rows, 47 columns) and imports multiprocessing as mp, from joblib import Parallel, delayed, from tqdm.notebook import tqdm, import pandas as pd, import re, from nltk.corpus import stopwords, and import string.
Workers are set by doubling the CPU count: n_workers = 2 * mp.cpu_count() (e.g., 8 workers).
Serial processing uses tqdm.pandas() and
df['Description'] = df['Description'].progress_apply(clean_text), taking about 9 minutes 5 seconds for 2.8 M rows.
Multiprocessing with Pool creates a pool of workers and maps the cleaning function: p = mp.Pool(n_workers) followed by
df['Description'] = p.map(clean_text, tqdm(df['Description'])), reducing time to roughly 3 minutes 51 seconds.
Joblib Parallel defines
def text_parallel_clean(array):
result = Parallel(n_jobs=n_workers, backend="multiprocessing")(delayed(clean_text)(text) for text in tqdm(array))
return resultand applies it to the column, achieving a runtime of about 4 minutes 4 seconds.
Batch processing splits the data into batches with
def batch_file(array, n_workers):
file_len = len(array)
batch_size = round(file_len / n_workers)
batches = [array[ix:ix+batch_size] for ix in tqdm(range(0, file_len, batch_size))]
return batches, then processes each batch in parallel using Parallel and delayed(proc_batch), resulting in a wall‑time of approximately 3 minutes 56 seconds.
tqdm.contrib.concurrent.process_map offers a concise one‑liner:
df['Description'] = process_map(clean_text, df['Description'], max_workers=n_workers, chunksize=batch), delivering the best performance (about 3 minutes 51 seconds).
Conclusion emphasizes selecting the appropriate method—serial, parallel, or batch—based on dataset size and complexity, and suggests alternatives like Dask, datatable, or RAPIDS for further performance gains.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
Python Programming Learning Circle
A global community of Chinese Python developers offering technical articles, columns, original video tutorials, and problem sets. Topics include web full‑stack development, web scraping, data analysis, natural language processing, image processing, machine learning, automated testing, DevOps automation, and big data.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
