Parallel Processing of Large CSV Files in Python with multiprocessing, joblib, and tqdm
This tutorial demonstrates how to accelerate processing of a 2.8‑million‑row CSV dataset using Python's multiprocessing, joblib, and tqdm libraries. It covers serial, parallel, and batch processing techniques, with timing measurements and best‑practice code examples for efficient large‑scale data handling.
To speed up handling of large files, the tutorial splits the work into sub‑units that run as parallel jobs, reducing overall processing time.
The example uses the Kaggle US Accidents (2016‑2021) dataset, which contains about 2.8 million records and 47 columns.
Required libraries are imported. Note that the NLTK stop‑word corpus must be downloaded once (for example with `nltk.download("stopwords")`) before it can be used:

```python
import multiprocessing as mp
import re
import string

import pandas as pd
from joblib import Parallel, delayed
from nltk.corpus import stopwords
from tqdm.notebook import tqdm
```

The number of workers is set to twice the CPU core count:
```python
n_workers = 2 * mp.cpu_count()
print(f"{n_workers} workers are available")
```

The CSV file is read with `pd.read_csv`, and its shape, column names, and load time are displayed:
```python
%%time
file_name = "../input/us-accidents/US_Accidents_Dec21_updated.csv"
df = pd.read_csv(file_name)
print(f"Shape:{df.shape}\n\nColumn Names:\n{df.columns}\n")
```

A `clean_text` function removes English stop words, punctuation, and extra spaces:
```python
def clean_text(text):
    # Remove stop words
    stops = stopwords.words("english")
    text = " ".join([word for word in text.split() if word not in stops])
    # Remove special characters
    text = text.translate(str.maketrans('', '', string.punctuation))
    # Remove extra spaces
    text = re.sub(' +', ' ', text)
    return text
```

Serial processing uses `tqdm.pandas()` and `.progress_apply` on the "Description" column; it takes about 9 minutes 5 seconds:
```python
%%time
tqdm.pandas()
df['Description'] = df['Description'].progress_apply(clean_text)
```

Multiprocessing with a pool of workers applies the function via `map`, reducing the time to roughly 3 minutes 51 seconds:
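As a quick, self-contained illustration of how `Pool.map` distributes a function over an iterable, here is a toy sketch with a hypothetical `square` function (not part of the notebook) so it runs without the dataset or NLTK:

```python
import multiprocessing as mp

def square(x):
    # Toy stand-in for clean_text: any picklable pure function works with Pool.map
    return x * x

if __name__ == "__main__":
    # The pool splits the iterable among worker processes and
    # returns results in the original input order
    with mp.Pool(processes=4) as pool:
        results = pool.map(square, range(8))
    print(results)  # [0, 1, 4, 9, 16, 25, 36, 49]
```

The `with` block closes and joins the pool automatically once the work is done.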
```python
%%time
# A context manager ensures the pool is closed and joined when done
with mp.Pool(n_workers) as p:
    df['Description'] = p.map(clean_text, tqdm(df['Description']))
```

Joblib's `Parallel` and `delayed` achieve similar speed; the helper `text_parallel_clean` wraps the call:
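A minimal sketch of the `Parallel`/`delayed` pattern on its own, using a hypothetical `add_one` function and joblib's default backend for simplicity:

```python
from joblib import Parallel, delayed

def add_one(x):
    # Toy stand-in for clean_text
    return x + 1

# delayed() captures the function and its arguments without calling it;
# Parallel executes the resulting tasks across n_jobs workers and
# returns the results in input order
results = Parallel(n_jobs=2)(delayed(add_one)(i) for i in range(5))
print(results)  # [1, 2, 3, 4, 5]
```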
```python
def text_parallel_clean(array):
    result = Parallel(n_jobs=n_workers, backend="multiprocessing")(
        delayed(clean_text)(text) for text in tqdm(array)
    )
    return result
```

```python
%%time
df['Description'] = text_parallel_clean(df['Description'])
```

Batch processing splits the data into chunks equal in number to the workers, processes each batch in parallel, and recombines the results:
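The split-and-recombine idea can be sketched in pure Python before applying it to the dataset. `make_batches` below is a hypothetical helper mirroring the notebook's `batch_file`, shown here on toy data with no parallelism:

```python
def make_batches(items, n_batches):
    # Split a sequence into roughly equal consecutive chunks
    batch_size = max(1, round(len(items) / n_batches))
    return [items[ix:ix + batch_size] for ix in range(0, len(items), batch_size)]

data = list(range(10))
batches = make_batches(data, 4)
print(batches)  # [[0, 1], [2, 3], [4, 5], [6, 7], [8, 9]]

# Process each batch (identity here), then flatten back into one list
output = [[x for x in batch] for batch in batches]
flat = [j for i in output for j in i]
print(flat == data)  # True
```

Note that rounding means the number of batches may not exactly equal `n_batches`, which is also true of the notebook's `batch_file`.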
```python
def proc_batch(batch):
    return [clean_text(text) for text in batch]

def batch_file(array, n_workers):
    file_len = len(array)
    batch_size = round(file_len / n_workers)
    batches = [
        array[ix:ix + batch_size]
        for ix in tqdm(range(0, file_len, batch_size))
    ]
    return batches

batches = batch_file(df['Description'], n_workers)
```

```python
%%time
batch_output = Parallel(n_jobs=n_workers, backend="multiprocessing")(
    delayed(proc_batch)(batch) for batch in tqdm(batches)
)
# Flatten the list of per-batch results back into a single list
df['Description'] = [j for i in batch_output for j in i]
```

The `tqdm.contrib.concurrent.process_map` function provides a concise one-liner that yields the best timing (about 3 minutes 51 seconds):
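A toy sketch of the same call, using a hypothetical `shout` function so it runs standalone (tqdm must be installed, which the notebook already assumes):

```python
from tqdm.contrib.concurrent import process_map

def shout(word):
    # Toy stand-in for clean_text
    return word.upper()

if __name__ == "__main__":
    words = ["pandas", "joblib", "tqdm", "dask"]
    # chunksize controls how many items each worker receives at a time;
    # process_map also renders a progress bar automatically
    out = process_map(shout, words, max_workers=2, chunksize=2)
    print(out)  # ['PANDAS', 'JOBLIB', 'TQDM', 'DASK']
```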
```python
%%time
from tqdm.contrib.concurrent import process_map
batch = round(len(df) / n_workers)
df['Description'] = process_map(
    clean_text, df['Description'], max_workers=n_workers, chunksize=batch
)
```

Conclusion: choose between serial, parallel, and batch processing based on dataset size and complexity. For smaller or less complex data, parallelism can be counter‑productive because of its process‑management overhead, and tools like Dask, datatable, or RAPIDS can be considered for further acceleration.
Python Programming Learning Circle