Parallel Processing of Large CSV Files in Python Using multiprocessing, joblib, and tqdm
This tutorial shows how to accelerate processing of a multi‑million‑row CSV dataset by splitting the work into sub‑tasks and comparing serial, parallel, and batch approaches with Python's multiprocessing, joblib, and tqdm libraries, with timings and best‑practice code snippets.
To speed up the handling of large files, we divide the work into sub‑units that can run as parallel jobs, reducing overall processing time.
For example, when processing a large CSV file with a single column, we feed the data as an array to a function that processes multiple values in parallel based on the number of CPU cores.
In this article we use the multiprocessing, joblib, and tqdm Python packages to reduce processing time for files, databases, images, video, or audio.
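Before diving into the dataset, here is a minimal, self‑contained sketch of the core idea using only the standard library; the square() function and the sample values are illustrative placeholders, not part of the tutorial's dataset:

```python
# Minimal sketch: feed an array to a worker function, one process per CPU core.
# square() stands in for any per-value operation (e.g., text cleaning).
import multiprocessing as mp

def square(x):
    return x * x

if __name__ == "__main__":
    values = list(range(8))
    with mp.Pool(mp.cpu_count()) as pool:   # one worker per core
        results = pool.map(square, values)  # the array is split across workers
    print(results)  # [0, 1, 4, 9, 16, 25, 36, 49]
```

With a function this cheap, the parallel version is purely illustrative; the payoff comes with heavier per‑row work such as the text cleaning used in this article.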
Setup
We start with the US Accidents (2016‑2021) dataset from Kaggle, which contains 2.8 million records and 47 columns. The dataset is loaded via pandas.read_csv and its shape, column names, and load time are printed.
<code># Parallel Computing
import multiprocessing as mp
from joblib import Parallel, delayed
from tqdm.notebook import tqdm
import pandas as pd
import re
from nltk.corpus import stopwords
import string</code>We set the number of workers to twice the CPU count (e.g., 8 workers on a 4‑core machine):
<code>n_workers = 2 * mp.cpu_count()
print(f"{n_workers} workers are available")</code>Next, we read the CSV file and display its shape and columns:
<code>%%time
file_name = "../input/us-accidents/US_Accidents_Dec21_updated.csv"
df = pd.read_csv(file_name)
print(f"Shape:{df.shape}\n\nColumn Names:\n{df.columns}\n")</code>We define a simple text‑cleaning function that removes stop words, punctuation, and extra spaces using nltk and re:
<code>def clean_text(text):
    # Requires the NLTK stopwords corpus: nltk.download("stopwords")
    stops = stopwords.words("english")
    text = " ".join([word for word in text.split() if word not in stops])
    text = text.translate(str.maketrans('', '', string.punctuation))
    text = re.sub(' +', ' ', text)
    return text</code>Serial processing
Using pandas' .apply() with tqdm.pandas(), we process the 2.8 million rows serially, which takes about 9 minutes.
<code>tqdm.pandas()
df['Description'] = df['Description'].progress_apply(clean_text)</code>Multiprocessing with a pool
We create a pool of 8 workers and map the cleaning function to the column, reducing the wall time to roughly 3 minutes 51 seconds.
<code>%%time
# Use a context manager so the pool is closed and joined automatically
with mp.Pool(n_workers) as p:
    df['Description'] = p.map(clean_text, tqdm(df['Description']))</code>Parallel processing with joblib
Using joblib.Parallel and delayed , we achieve similar speed‑ups while keeping the code concise:
<code>def text_parallel_clean(array):
    result = Parallel(n_jobs=n_workers, backend="multiprocessing")(
        delayed(clean_text)(text) for text in tqdm(array)
    )
    return result

df['Description'] = text_parallel_clean(df['Description'])</code>Batch parallel processing
We split the column into batches based on the number of workers, process each batch in parallel, and then recombine the results:
<code>def proc_batch(batch):
    return [clean_text(text) for text in batch]

def batch_file(array, n_workers):
    file_len = len(array)
    batch_size = round(file_len / n_workers)
    batches = [array[ix:ix+batch_size] for ix in tqdm(range(0, file_len, batch_size))]
    return batches

batches = batch_file(df['Description'], n_workers)

batch_output = Parallel(n_jobs=n_workers, backend="multiprocessing")(
    delayed(proc_batch)(batch) for batch in tqdm(batches)
)

# Flatten the per-batch lists back into a single column
df['Description'] = [j for i in batch_output for j in i]</code>Using tqdm's process_map
With tqdm.contrib.concurrent.process_map we obtain the best performance in a single line:
<code>from tqdm.contrib.concurrent import process_map
batch = round(len(df) / n_workers)
df['Description'] = process_map(clean_text, df['Description'], max_workers=n_workers, chunksize=batch)
</code>All methods show that parallel and batch processing can dramatically reduce processing time compared to pure serial execution, and the choice of technique depends on dataset size and complexity.
Conclusion
Finding the right balance between serial, parallel, and batch processing is essential; for smaller or less complex datasets, parallelism may even hurt performance. For large tabular data, libraries such as Dask, datatable, or RAPIDS are recommended for further speed‑ups.
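The point that parallelism can hurt on small workloads is easy to check empirically. A hedged sketch (tiny_task, the data size, and the worker count are illustrative assumptions): when the per‑item function is cheap, process startup and inter‑process communication dominate, so the pool can lose to a plain loop:

```python
# Sketch: for a trivial per-item function, a process pool's overhead
# (worker startup, pickling, IPC) can outweigh the work itself.
import multiprocessing as mp
import time

def tiny_task(x):
    return x + 1

if __name__ == "__main__":
    data = list(range(10_000))

    t0 = time.perf_counter()
    serial = [tiny_task(x) for x in data]
    serial_s = time.perf_counter() - t0

    t0 = time.perf_counter()
    with mp.Pool(4) as pool:
        parallel = pool.map(tiny_task, data)
    pool_s = time.perf_counter() - t0

    assert serial == parallel  # both approaches give identical results
    print(f"serial: {serial_s:.4f}s  pool: {pool_s:.4f}s")
```

On workloads like this, the serial loop typically wins; the parallel techniques above only pay off once each item carries real computation.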
Python Programming Learning Circle
A global community of Chinese Python developers offering technical articles, columns, original video tutorials, and problem sets. Topics include web full‑stack development, web scraping, data analysis, natural language processing, image processing, machine learning, automated testing, DevOps automation, and big data.