Master Memory‑Efficient Techniques for Processing Massive Files in Python
This guide explains how to read and process files that exceed available memory by using line‑by‑line iteration, chunked reads, memory‑mapped files, generators, streaming decompression, parallel execution, and specialized libraries such as Dask and PyTables, while providing practical code examples and performance tips.
Basic method: line‑by‑line reading
The simplest approach is to iterate over the file object; Python handles buffering automatically.
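Because only one line is held in memory at a time, counts and running totals can be computed over a file of any size. The following is a minimal sketch of that idea; the helper name `count_matching_lines` and the plain substring match are assumptions made for illustration, not something from the original article.

```python
def count_matching_lines(path, needle, encoding="utf-8"):
    """Count total lines and lines containing `needle`.

    Illustrative helper (hypothetical name): only the current line is
    held in memory, so the file size does not matter.
    """
    total = 0
    matches = 0
    with open(path, "r", encoding=encoding) as f:
        for line in f:  # the file object yields one line per iteration
            total += 1
            if needle in line:
                matches += 1
    return total, matches
```

The underlying loop is just iteration over the file object: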
with open('large_file.txt', 'r', encoding='utf-8') as f:
    for line in f:  # line-by-line, memory-friendly
        process_line(line)  # handle each line

Alternatively, use readline() in a loop.
with open('large_file.txt', 'r') as f:
    while True:
        line = f.readline()
        if not line:  # end of file
            break
        process_line(line)

Chunked reading
For binary or non‑text files, read fixed‑size blocks.
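One common use of fixed‑size reads is computing a checksum without ever holding the whole file in memory. This is a hedged sketch of that use case; the helper name `sha256_of_file` is hypothetical, and the 1 MB default is only a starting point.

```python
import hashlib

def sha256_of_file(path, chunk_size=1024 * 1024):
    """Return the hex SHA-256 digest of a file, reading it in chunks.

    Illustrative helper (hypothetical name), not part of hashlib.
    """
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:  # an empty bytes object signals end of file
                break
            digest.update(chunk)  # only the current chunk is held in memory
    return digest.hexdigest()
```

Stripped of the hashing, the read loop looks like this: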
BUFFER_SIZE = 1024 * 1024  # 1 MB buffer
with open('large_file.bin', 'rb') as f:
    while True:
        chunk = f.read(BUFFER_SIZE)
        if not chunk:  # end of file
            break
        process_chunk(chunk)

A more Pythonic version uses iter with functools.partial:
from functools import partial
chunk_size = 1024 * 1024  # 1 MB
with open('large_file.bin', 'rb') as f:
    # iter() keeps calling f.read(chunk_size) until it returns b''
    for chunk in iter(partial(f.read, chunk_size), b''):
        process_chunk(chunk)

Memory‑mapped files (mmap)
When random access is needed, map the file into memory.
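As a sketch of what random access buys you, the helper below collects the byte offset of every occurrence of a pattern while letting the operating system page the file in on demand. The name `find_all` is hypothetical, and the sketch assumes a non‑empty file, since mmap cannot map a zero‑length file.

```python
import mmap

def find_all(path, pattern):
    """Return byte offsets of every non-overlapping occurrence of `pattern`.

    Illustrative helper (hypothetical name). Assumes the file is not empty.
    """
    if not pattern:
        raise ValueError("pattern must not be empty")
    offsets = []
    with open(path, "rb") as f:
        # Length 0 maps the whole file; ACCESS_READ keeps the mapping read-only
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            pos = mm.find(pattern)
            while pos != -1:
                offsets.append(pos)
                # resume the search just past the previous match
                pos = mm.find(pattern, pos + len(pattern))
    return offsets
```

A basic mapping of a whole file looks like this: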
import mmap
with open('large_file.bin', 'rb') as f:
    # Length 0 maps the whole file; ACCESS_READ means no write permission is needed
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        print(mm[:100])  # first 100 bytes
        index = mm.find(b'some_pattern')
        if index != -1:
            print(f"Found at position {index}")

Generator‑based processing
Encapsulate reading logic in a generator to keep memory usage low.
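Generators also compose: each stage pulls one item at a time from the stage before it, so an entire pipeline holds only a single record in memory. A minimal sketch, assuming a text file with one integer per line; the stage names `read_lines`, `non_empty`, `parse_ints` and `sum_file` are hypothetical.

```python
def read_lines(path, encoding="utf-8"):
    """Yield each line of the file without its trailing newline."""
    with open(path, "r", encoding=encoding) as f:
        for line in f:
            yield line.rstrip("\n")

def non_empty(lines):
    """Skip lines that are blank or whitespace only."""
    for line in lines:
        if line.strip():
            yield line

def parse_ints(lines):
    """Convert each remaining line to an int."""
    for line in lines:
        yield int(line)

def sum_file(path):
    """Chain the stages; nothing is read until sum() starts pulling values."""
    return sum(parse_ints(non_empty(read_lines(path))))
```

The same idea applies to fixed‑size binary chunks: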
def read_large_file(file_path, chunk_size=1024*1024):
    """Yield successive chunks from a large file"""
    with open(file_path, 'rb') as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            yield chunk

for chunk in read_large_file('huge_file.bin'):
    process_chunk(chunk)

Processing compressed files
Stream‑decompress gzip files:
import gzip, shutil
with gzip.open('large_file.gz', 'rb') as f_in:
    with open('large_file_extracted', 'wb') as f_out:
        shutil.copyfileobj(f_in, f_out)  # stream copy

And zip files:
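import zipfile
with zipfile.ZipFile('large_file.zip', 'r') as z:
    # The argument is the name of a member inside the archive
    with z.open('file_inside.zip') as f:
        for line in f:  # members open in binary mode, so each line is bytes
            process_line(line)

Multithreading / Multiprocessing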
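Parallelize chunk processing with a thread pool. Threads help mainly when the work is I/O‑bound; for CPU‑bound processing in CPython, a process pool (ProcessPoolExecutor) is usually the better fit.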
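Splitting a file by raw byte offsets can cut a line in half. One way around this, sketched here under the assumption that records are newline‑delimited, is to move each boundary forward to the next newline before handing the ranges to the pool. The helper names `aligned_ranges`, `count_matches_in_range` and `parallel_match_count` are hypothetical.

```python
import os
from concurrent.futures import ThreadPoolExecutor

def aligned_ranges(path, parts):
    """Split a file into at most `parts` byte ranges that end on line boundaries."""
    size = os.path.getsize(path)
    if size == 0:
        return []
    step = max(1, size // parts)
    bounds = [0]
    with open(path, "rb") as f:
        for i in range(1, parts):
            target = i * step
            if target >= size:
                break
            if target <= bounds[-1]:
                continue  # an earlier boundary already moved past this point
            f.seek(target)
            f.readline()  # advance to just past the next newline
            pos = f.tell()
            if bounds[-1] < pos < size:
                bounds.append(pos)
    bounds.append(size)
    return list(zip(bounds[:-1], bounds[1:]))

def count_matches_in_range(path, start, end, needle):
    """Count lines containing `needle` (bytes) within bytes [start, end)."""
    count = 0
    with open(path, "rb") as f:
        f.seek(start)
        while f.tell() < end:
            line = f.readline()
            if not line:
                break
            if needle in line:
                count += 1
    return count

def parallel_match_count(path, needle, workers=4):
    """Process each line-aligned range in its own task and add up the results."""
    ranges = aligned_ranges(path, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [
            pool.submit(count_matches_in_range, path, start, end, needle)
            for start, end in ranges
        ]
        return sum(future.result() for future in futures)
```

The simpler version below splits by raw byte offsets instead: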
from concurrent.futures import ThreadPoolExecutor, as_completed
import os

def process_chunk(start, end, file_path):
    """Process a specific file segment"""
    with open(file_path, 'rb') as f:
        f.seek(start)
        chunk = f.read(end - start)
        # handle chunk (byte offsets may split a line or record at the edges)

def parallel_file_processing(file_path, num_threads=4):
    file_size = os.path.getsize(file_path)
    chunk_size = file_size // num_threads
    with ThreadPoolExecutor(max_workers=num_threads) as executor:
        futures = []
        for i in range(num_threads):
            start = i * chunk_size
            # the last segment absorbs any remainder
            end = start + chunk_size if i != num_threads - 1 else file_size
            futures.append(executor.submit(process_chunk, start, end, file_path))
        for future in as_completed(futures):
            future.result()  # re-raises any exception from the worker

Third‑party libraries
Use Dask for out‑of‑core DataFrames:
import dask.dataframe as dd
df = dd.read_csv('very_large_file.csv', blocksize=25e6)  # 25 MB per partition
result = df.groupby('column').mean().compute()

Or PyTables for HDF5 files:
import tables
h5file = tables.open_file('large_data.h5', mode='r')
for row in h5file.root.data.table.iterrows():
    process_row(row)
h5file.close()

Database alternatives
Load large CSV data into SQLite for efficient querying.
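import csv
import sqlite3
# Use an on-disk database (file name is illustrative);
# ':memory:' would hold the entire dataset in RAM
conn = sqlite3.connect('large_data.db')
cursor = conn.cursor()
cursor.execute('CREATE TABLE data (col1, col2, col3)')
with open('large_file.csv', newline='') as f:
    # csv.reader handles quoted fields and streams rows lazily into executemany
    cursor.executemany('INSERT INTO data VALUES (?, ?, ?)', csv.reader(f))
conn.commit()
conn.close()

Performance optimisation tips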
Buffer sizes between 8 KB and 1 MB usually give the best trade‑off; experiment to find the optimal value.
Binary mode ('rb') is generally faster than text mode because it skips decoding and newline translation; use it when you do not need str lines.
Operating systems cache frequently accessed file regions, making subsequent reads faster.
Filter unnecessary data early and use generators to stay memory‑efficient.
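To experiment with buffer sizes on your own hardware, a small timing harness is enough. This is a rough sketch rather than a rigorous benchmark: the helper names `time_read` and `compare_buffer_sizes` are hypothetical, and because the operating system caches file data, the first pass over a file is often slower than later ones, so run it several times before comparing.

```python
import time

def time_read(path, buffer_size):
    """Return the seconds taken to read the whole file in `buffer_size` reads."""
    start = time.perf_counter()
    # buffering=0 disables Python's own buffer so buffer_size is what is measured
    with open(path, "rb", buffering=0) as f:
        while f.read(buffer_size):
            pass
    return time.perf_counter() - start

def compare_buffer_sizes(path, sizes=(8 * 1024, 64 * 1024, 1024 * 1024)):
    """Time one full read per candidate size and return {size: seconds}."""
    return {size: time_read(path, size) for size in sizes}
```

For example, `compare_buffer_sizes('large_file.bin')` returns a dictionary mapping each candidate size to its elapsed time in seconds.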
Complete example: processing a huge CSV file
import csv
from collections import namedtuple
from itertools import islice

def process_large_csv(file_path, batch_size=10000):
    """Process a large CSV in batches"""
    CSVRow = namedtuple('CSVRow', ['id', 'name', 'value'])
    # newline='' lets the csv module handle line endings inside quoted fields
    with open(file_path, 'r', encoding='utf-8', newline='') as f:
        reader = csv.reader(f)
        next(reader, None)  # skip header (no error if the file is empty)
        while True:
            batch = list(islice(reader, batch_size))  # at most batch_size rows
            if not batch:
                break
            rows = [CSVRow(*row) for row in batch]  # expects exactly three columns
            process_batch(rows)
            print(f"Processed {len(batch)} rows")

def process_batch(rows):
    """Placeholder for batch processing logic"""
    pass

process_large_csv('huge_dataset.csv')

Conclusion
The key principles for handling large files are to avoid loading the entire file into memory, choose an appropriate I/O strategy (line‑by‑line, chunked, or memory‑mapped), consider parallel execution when CPU‑bound, leverage generators for memory efficiency, and employ specialized tools such as Dask or PyTables when the data size exceeds what simple streaming can handle.
Python Programming Learning Circle
A global community of Chinese Python developers offering technical articles, columns, original video tutorials, and problem sets. Topics include web full‑stack development, web scraping, data analysis, natural language processing, image processing, machine learning, automated testing, DevOps automation, and big data.
