Parallel Processing of Large CSV Files in Python Using multiprocessing, joblib, and tqdm
This tutorial shows how to accelerate processing of a multi‑million‑row CSV dataset by splitting the work into sub‑tasks and comparing serial, parallel, and batch approaches with Python's multiprocessing, joblib, and tqdm libraries, with timings and best‑practice code snippets.
To speed up the handling of large files, we divide the work into sub‑units that can run as parallel jobs, reducing overall processing time.
For example, when processing a large CSV file with a single column, we feed the data as an array to a function that processes multiple values in parallel based on the number of CPU cores.
In this article we use the multiprocessing, joblib, and tqdm Python packages to reduce processing time for files, databases, images, video, or audio.
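Before diving into the dataset, here is a minimal, self‑contained sketch of the core idea using only the standard library; the square() function and the sample values are illustrative placeholders, not part of the tutorial's dataset:

```python
# Minimal sketch: feed an array to a worker function, one process per CPU core.
# square() stands in for any per-value operation (e.g., text cleaning).
import multiprocessing as mp

def square(x):
    return x * x

if __name__ == "__main__":
    values = list(range(8))
    with mp.Pool(mp.cpu_count()) as pool:   # one worker per core
        results = pool.map(square, values)  # the array is split across workers
    print(results)  # [0, 1, 4, 9, 16, 25, 36, 49]
```

With a function this cheap, the parallel version is purely illustrative; the payoff comes with heavier per‑row work such as the text cleaning used in this article.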
Setup
We start with the US Accidents (2016‑2021) dataset from Kaggle, which contains 2.8 million records and 47 columns. The dataset is loaded via pandas.read_csv and its shape, column names, and load time are printed.
<code># Parallel Computing
import multiprocessing as mp
from joblib import Parallel, delayed
from tqdm.notebook import tqdm
import pandas as pd
import re
from nltk.corpus import stopwords
import string</code>We set the number of workers to twice the CPU count (e.g., 8 workers on a 4‑core machine):
<code>n_workers = 2 * mp.cpu_count()
print(f"{n_workers} workers are available")</code>Next, we read the CSV file and display its shape and columns:
<code>%%time
file_name = "../input/us-accidents/US_Accidents_Dec21_updated.csv"
df = pd.read_csv(file_name)
print(f"Shape:{df.shape}\n\nColumn Names:\n{df.columns}\n")</code>We define a simple text‑cleaning function that removes stop words, punctuation, and extra spaces using nltk and re:
<code>def clean_text(text):
    # Requires the NLTK stopwords corpus: nltk.download("stopwords")
    stops = stopwords.words("english")
    text = " ".join([word for word in text.split() if word not in stops])
    text = text.translate(str.maketrans('', '', string.punctuation))
    text = re.sub(' +', ' ', text)
    return text</code>Serial processing
Using pandas' .apply() with tqdm.pandas(), we process the 2.8 million rows serially, which takes about 9 minutes.
<code>tqdm.pandas()
df['Description'] = df['Description'].progress_apply(clean_text)</code>Multiprocessing with a pool
We create a pool of 8 workers and map the cleaning function to the column, reducing the wall time to roughly 3 minutes 51 seconds.
<code>%%time
# Use a context manager so the pool is closed and joined automatically
with mp.Pool(n_workers) as p:
    df['Description'] = p.map(clean_text, tqdm(df['Description']))</code>Parallel processing with joblib
Using joblib.Parallel and delayed , we achieve similar speed‑ups while keeping the code concise:
<code>def text_parallel_clean(array):
    result = Parallel(n_jobs=n_workers, backend="multiprocessing")(
        delayed(clean_text)(text) for text in tqdm(array)
    )
    return result

df['Description'] = text_parallel_clean(df['Description'])</code>Batch parallel processing
We split the column into batches based on the number of workers, process each batch in parallel, and then recombine the results:
<code>def proc_batch(batch):
    return [clean_text(text) for text in batch]

def batch_file(array, n_workers):
    file_len = len(array)
    batch_size = round(file_len / n_workers)
    batches = [array[ix:ix+batch_size] for ix in tqdm(range(0, file_len, batch_size))]
    return batches

batches = batch_file(df['Description'], n_workers)

batch_output = Parallel(n_jobs=n_workers, backend="multiprocessing")(
    delayed(proc_batch)(batch) for batch in tqdm(batches)
)

# Flatten the per-batch lists back into a single column
df['Description'] = [j for i in batch_output for j in i]</code>Using tqdm's process_map
With tqdm.contrib.concurrent.process_map we obtain the best performance in a single line:
<code>from tqdm.contrib.concurrent import process_map
batch = round(len(df) / n_workers)
df['Description'] = process_map(clean_text, df['Description'], max_workers=n_workers, chunksize=batch)
</code>All methods show that parallel and batch processing can dramatically reduce processing time compared to pure serial execution, and the choice of technique depends on dataset size and complexity.
Conclusion
Finding the right balance between serial, parallel, and batch processing is essential; for smaller or less complex datasets, parallelism may even hurt performance. For large tabular data, libraries such as Dask, datatable, or RAPIDS are recommended for further speed‑ups.
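The point that parallelism can hurt on small workloads is easy to check empirically. A hedged sketch (tiny_task, the data size, and the worker count are illustrative assumptions): when the per‑item function is cheap, process startup and inter‑process communication dominate, so the pool can lose to a plain loop:

```python
# Sketch: for a trivial per-item function, a process pool's overhead
# (worker startup, pickling, IPC) can outweigh the work itself.
import multiprocessing as mp
import time

def tiny_task(x):
    return x + 1

if __name__ == "__main__":
    data = list(range(10_000))

    t0 = time.perf_counter()
    serial = [tiny_task(x) for x in data]
    serial_s = time.perf_counter() - t0

    t0 = time.perf_counter()
    with mp.Pool(4) as pool:
        parallel = pool.map(tiny_task, data)
    pool_s = time.perf_counter() - t0

    assert serial == parallel  # both approaches give identical results
    print(f"serial: {serial_s:.4f}s  pool: {pool_s:.4f}s")
```

On workloads like this, the serial loop typically wins; the parallel techniques above only pay off once each item carries real computation.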
Python Programming Learning Circle
A global community of Chinese Python developers offering technical articles, columns, original video tutorials, and problem sets. Topics include web full‑stack development, web scraping, data analysis, natural language processing, image processing, machine learning, automated testing, DevOps automation, and big data.