Parallel Processing of Large CSV Files in Python with multiprocessing, joblib, and tqdm
This tutorial demonstrates how to accelerate processing of a 2.8‑million‑row CSV dataset using Python's multiprocessing, joblib, and tqdm libraries. It covers serial, parallel, and batch processing techniques, with timing measurements and best‑practice code examples for efficient large‑scale data handling.
To speed up handling of large files, the tutorial splits the work into sub‑units that run as parallel jobs, reducing overall processing time.
The example uses the Kaggle US Accidents (2016‑2021) dataset, which contains about 2.8 million records and 47 columns.
Required libraries are imported. Note that the NLTK stop‑word corpus must be downloaded once (for example with `nltk.download("stopwords")`) before it can be used:

```python
import multiprocessing as mp
import re
import string

import pandas as pd
from joblib import Parallel, delayed
from nltk.corpus import stopwords
from tqdm.notebook import tqdm
```

The number of workers is set to twice the CPU core count:
```python
n_workers = 2 * mp.cpu_count()
print(f"{n_workers} workers are available")
```

The CSV file is read with `pd.read_csv`, and its shape, column names, and load time are displayed:
```python
%%time
file_name = "../input/us-accidents/US_Accidents_Dec21_updated.csv"
df = pd.read_csv(file_name)
print(f"Shape:{df.shape}\n\nColumn Names:\n{df.columns}\n")
```

A `clean_text` function removes English stop words, punctuation, and extra spaces:
```python
def clean_text(text):
    # Remove stop words
    stops = stopwords.words("english")
    text = " ".join([word for word in text.split() if word not in stops])
    # Remove special characters
    text = text.translate(str.maketrans('', '', string.punctuation))
    # Remove extra spaces
    text = re.sub(' +', ' ', text)
    return text
```

Serial processing uses `tqdm.pandas()` and `.progress_apply` on the "Description" column; it takes about 9 minutes 5 seconds:
```python
%%time
tqdm.pandas()
df['Description'] = df['Description'].progress_apply(clean_text)
```

Multiprocessing with a pool of workers applies the function via `map`, reducing the time to roughly 3 minutes 51 seconds:
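As a quick, self-contained illustration of how `Pool.map` distributes a function over an iterable, here is a toy sketch with a hypothetical `square` function (not part of the notebook) so it runs without the dataset or NLTK:

```python
import multiprocessing as mp

def square(x):
    # Toy stand-in for clean_text: any picklable pure function works with Pool.map
    return x * x

if __name__ == "__main__":
    # The pool splits the iterable among worker processes and
    # returns results in the original input order
    with mp.Pool(processes=4) as pool:
        results = pool.map(square, range(8))
    print(results)  # [0, 1, 4, 9, 16, 25, 36, 49]
```

The `with` block closes and joins the pool automatically once the work is done.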
```python
%%time
# A context manager ensures the pool is closed and joined when done
with mp.Pool(n_workers) as p:
    df['Description'] = p.map(clean_text, tqdm(df['Description']))
```

Joblib's `Parallel` and `delayed` achieve similar speed; the helper `text_parallel_clean` wraps the call:
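A minimal sketch of the `Parallel`/`delayed` pattern on its own, using a hypothetical `add_one` function and joblib's default backend for simplicity:

```python
from joblib import Parallel, delayed

def add_one(x):
    # Toy stand-in for clean_text
    return x + 1

# delayed() captures the function and its arguments without calling it;
# Parallel executes the resulting tasks across n_jobs workers and
# returns the results in input order
results = Parallel(n_jobs=2)(delayed(add_one)(i) for i in range(5))
print(results)  # [1, 2, 3, 4, 5]
```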
```python
def text_parallel_clean(array):
    result = Parallel(n_jobs=n_workers, backend="multiprocessing")(
        delayed(clean_text)(text) for text in tqdm(array)
    )
    return result
```

```python
%%time
df['Description'] = text_parallel_clean(df['Description'])
```

Batch processing splits the data into chunks equal in number to the workers, processes each batch in parallel, and recombines the results:
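The split-and-recombine idea can be sketched in pure Python before applying it to the dataset. `make_batches` below is a hypothetical helper mirroring the notebook's `batch_file`, shown here on toy data with no parallelism:

```python
def make_batches(items, n_batches):
    # Split a sequence into roughly equal consecutive chunks
    batch_size = max(1, round(len(items) / n_batches))
    return [items[ix:ix + batch_size] for ix in range(0, len(items), batch_size)]

data = list(range(10))
batches = make_batches(data, 4)
print(batches)  # [[0, 1], [2, 3], [4, 5], [6, 7], [8, 9]]

# Process each batch (identity here), then flatten back into one list
output = [[x for x in batch] for batch in batches]
flat = [j for i in output for j in i]
print(flat == data)  # True
```

Note that rounding means the number of batches may not exactly equal `n_batches`, which is also true of the notebook's `batch_file`.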
```python
def proc_batch(batch):
    return [clean_text(text) for text in batch]

def batch_file(array, n_workers):
    file_len = len(array)
    batch_size = round(file_len / n_workers)
    batches = [
        array[ix:ix + batch_size]
        for ix in tqdm(range(0, file_len, batch_size))
    ]
    return batches

batches = batch_file(df['Description'], n_workers)
```

```python
%%time
batch_output = Parallel(n_jobs=n_workers, backend="multiprocessing")(
    delayed(proc_batch)(batch) for batch in tqdm(batches)
)
# Flatten the list of per-batch results back into a single list
df['Description'] = [j for i in batch_output for j in i]
```

The `tqdm.contrib.concurrent.process_map` function provides a concise one-liner that yields the best timing (about 3 minutes 51 seconds):
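A toy sketch of the same call, using a hypothetical `shout` function so it runs standalone (tqdm must be installed, which the notebook already assumes):

```python
from tqdm.contrib.concurrent import process_map

def shout(word):
    # Toy stand-in for clean_text
    return word.upper()

if __name__ == "__main__":
    words = ["pandas", "joblib", "tqdm", "dask"]
    # chunksize controls how many items each worker receives at a time;
    # process_map also renders a progress bar automatically
    out = process_map(shout, words, max_workers=2, chunksize=2)
    print(out)  # ['PANDAS', 'JOBLIB', 'TQDM', 'DASK']
```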
```python
%%time
from tqdm.contrib.concurrent import process_map
batch = round(len(df) / n_workers)
df['Description'] = process_map(
    clean_text, df['Description'], max_workers=n_workers, chunksize=batch
)
```

Conclusion: choose between serial, parallel, and batch processing based on dataset size and complexity. For smaller or less complex data, parallelism can be counter‑productive because of its process‑management overhead, and tools like Dask, datatable, or RAPIDS can be considered for further acceleration.
Python Programming Learning Circle