Python Data Parsing and Large‑Scale Data Processing Techniques
This article introduces Python's built‑in modules and popular libraries for parsing CSV, JSON, and XML files, demonstrates advanced data manipulation with pandas, and presents multiple strategies—including chunked reading, Dask, PySpark, HDF5, databases, Vaex, and NumPy memory‑mapping—for efficiently handling very large datasets.
In Python, data parsing typically involves reading, cleaning, transforming, and analyzing data, and the language provides several standard libraries for handling common formats such as CSV, JSON, XML, and databases.
1. Using the csv module to parse CSV files
The CSV format is widely used for tabular data. Python's standard library includes the csv module for reading and writing CSV files.
import csv
# Read CSV file
with open('example.csv', mode='r', newline='', encoding='utf-8') as file:
    reader = csv.reader(file)
    for row in reader:
        print(row)
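The csv module also provides csv.DictReader, which maps each row to a dictionary keyed by the header row. A minimal self-contained sketch (the file name people.csv is illustrative):

```python
import csv

# Create a small sample file so the example is self-contained
with open('people.csv', 'w', newline='', encoding='utf-8') as f:
    csv.writer(f).writerows([['Name', 'Age'], ['Alice', '24'], ['Bob', '19']])

# DictReader uses the first row as field names; all values stay strings
with open('people.csv', newline='', encoding='utf-8') as f:
    records = [dict(row) for row in csv.DictReader(f)]

print(records)  # → [{'Name': 'Alice', 'Age': '24'}, {'Name': 'Bob', 'Age': '19'}]
```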
# Write CSV file
data = [
    ['Name', 'Age'],
    ['Alice', 24],
    ['Bob', 19]
]
with open('output.csv', mode='w', newline='', encoding='utf-8') as file:
    writer = csv.writer(file)
    writer.writerows(data)

2. Using the json module to parse JSON data
JSON is a lightweight data‑exchange format. The standard json module can convert between JSON strings and Python objects.
import json
# Parse JSON string
json_string = '{"name": "John", "age": 30, "city": "New York"}'
data = json.loads(json_string)
print(data) # Output: {'name': 'John', 'age': 30, 'city': 'New York'}
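Besides strings, the module reads and writes file objects directly with json.load and json.dump; a minimal sketch (person.json is an illustrative file name):

```python
import json

person = {"name": "Jane", "age": 25, "city": "London"}

# Serialize a Python object to a JSON file
with open('person.json', 'w', encoding='utf-8') as f:
    json.dump(person, f, indent=2)

# Deserialize it back into a Python object
with open('person.json', encoding='utf-8') as f:
    loaded = json.load(f)

print(loaded == person)  # → True
```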
# Convert Python object to JSON string
person = {"name": "Jane", "age": 25, "city": "London"}
json_data = json.dumps(person)
print(json_data)  # Output: {"name": "Jane", "age": 25, "city": "London"}

3. Using xml.etree.ElementTree to parse XML
XML is a markup language used for configuration files and data exchange. The xml.etree.ElementTree module provides simple XML parsing capabilities.
import xml.etree.ElementTree as ET
# Parse XML string
xml_string = '''
<root>
    <child name="first">Text</child>
    <child name="second">Another text</child>
</root>
'''
root = ET.fromstring(xml_string)
# Traverse XML nodes
for child in root:
    print(f"Child: {child.attrib['name']}, Text: {child.text}")
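ElementTree also supports targeted lookups: find returns the first matching element and findall returns all matches, including simple attribute predicates. A minimal sketch with an inline document:

```python
import xml.etree.ElementTree as ET

doc = '<root><child name="first">Text</child><child name="second">Other text</child></root>'
root = ET.fromstring(doc)

# All direct children named 'child'
names = [child.get('name') for child in root.findall('child')]
print(names)  # → ['first', 'second']

# First child whose name attribute is 'second'
second = root.find("child[@name='second']")
print(second.text)  # → Other text
```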
# Create XML and write to file
new_root = ET.Element("root")
ET.SubElement(new_root, "child", name="third").text = "More text"
tree = ET.ElementTree(new_root)
tree.write("output.xml")

4. Using pandas for more complex data operations
pandas is a powerful data‑analysis library that offers rich data structures and tools for handling tabular data.
pip install pandas

import pandas as pd
# Read CSV into DataFrame
df = pd.read_csv('example.csv')
print(df.head())
# Select a column
ages = df['Age']
print(ages)
# Filter rows
young_people = df[df['Age'] < 20]
print(young_people)
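Beyond selection and filtering, pandas supports grouped aggregation with groupby; a minimal self-contained sketch (the Team column is illustrative):

```python
import pandas as pd

teams = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Carol'],
    'Team': ['A', 'B', 'A'],
    'Age': [24, 19, 31],
})

# Mean age per team
mean_age = teams.groupby('Team')['Age'].mean()
print(mean_age)
```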
# Write to a new CSV file
df.to_csv('filtered_data.csv', index=False)

How to handle large datasets?
1. Pandas chunked reading: use the chunksize parameter of read_csv or read_json to process files piece by piece.
import pandas as pd
chunk_size = 10000
for chunk in pd.read_csv('large_dataset.csv', chunksize=chunk_size):
    # Process each chunk
    print(chunk.head())

2. Dask: a parallel-computing library with a Pandas-like API that can operate on datasets larger than memory.
pip install "dask[complete]"

import dask.dataframe as dd
# Read large CSV file
df = dd.read_csv('large_dataset.csv')
# Perform operations
result = df.groupby('column_name').sum().compute()
print(result)

3. PySpark: the Python API for Apache Spark, suitable for distributed processing of massive data.
pip install pyspark

from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("LargeDataProcessing").getOrCreate()
# Read large CSV file
df = spark.read.csv('large_dataset.csv', header=True, inferSchema=True)
# Perform operations
result = df.groupBy('column_name').sum('another_column')
result.show()
# Stop SparkSession
spark.stop()

4. HDF5 format with h5py for efficient I/O of scientific data.
pip install h5py

import h5py
# Write HDF5 file
with h5py.File('data.h5', 'w') as f:
    f.create_dataset('dataset_1', data=[1, 2, 3, 4, 5])
# Read HDF5 file
with h5py.File('data.h5', 'r') as f:
    dataset = f['dataset_1']
    print(dataset[:])

5. Databases (e.g., SQLite, PostgreSQL, MongoDB) for very large datasets.
import sqlite3
import pandas as pd
# Connect to SQLite database
conn = sqlite3.connect('large_dataset.db')
# Write an existing pandas DataFrame (df) to the database
df.to_sql('table_name', conn, if_exists='replace', index=False)
# Query data
query = "SELECT * FROM table_name WHERE column_name > 100"
result = pd.read_sql(query, conn)
conn.close()
print(result)

6. Vaex: a high-performance library for out-of-core DataFrame operations.
pip install vaex

import vaex
# Read large CSV file
df = vaex.from_csv('large_dataset.csv', convert=True, chunk_size=10_000_000)
# Group by and aggregate
result = df.groupby(by='column_name', agg={'count': vaex.agg.count()}).sort('count', ascending=False)
print(result)

7. NumPy memory-mapping for large arrays.
import numpy as np
# Create memory‑mapped array
shape = (10000, 10000)
dtype = np.float32
mmapped_array = np.memmap('my_array.memmap', dtype=dtype, mode='w+', shape=shape)
mmapped_array[:] = np.random.randn(*shape)
mmapped_array.flush()  # ensure the data is written to disk
# Read memory‑mapped array
mmapped_array = np.memmap('my_array.memmap', dtype=dtype, mode='r', shape=shape)
print(mmapped_array[:10, :10])

8. Dask-ML for scalable machine-learning tasks.
import dask_ml.datasets
import dask_ml.cluster
# Generate a large synthetic dataset
X, y = dask_ml.datasets.make_blobs(n_samples=1000000, chunks=100000, random_state=0, centers=3)
# KMeans clustering
kmeans = dask_ml.cluster.KMeans(n_clusters=3, init_max_iter=2, oversampling_factor=10)
kmeans.fit(X)
print(kmeans.labels_)

How to choose the right data-processing method?
1. Determine data scale: use Pandas for small‑to‑medium data, Dask/Vaex/PySpark for very large data.
2. Consider data type and structure: tabular data fits Pandas/Dask/Vaex; unstructured data may need specialized libraries.
3. Assess computing resources: single‑machine vs. distributed clusters.
4. Clarify specific needs: cleaning, analysis, visualization, machine learning, or real‑time processing.
5. Optimize performance: memory‑mapping, chunked reads, efficient I/O formats (HDF5, databases).
6. Prototype and benchmark on representative data to validate the chosen approach.
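For step 6, a rough benchmarking sketch: generate a representative sample, time one candidate approach, and compare against alternatives. The column names and sizes below are illustrative:

```python
import time

import numpy as np
import pandas as pd

# Build a small representative sample (in practice, sample rows
# from the real dataset instead of generating random data)
rng = np.random.default_rng(0)
pd.DataFrame({
    'key': rng.integers(0, 10, 100_000),
    'value': rng.standard_normal(100_000),
}).to_csv('sample.csv', index=False)

# Time one candidate approach: chunked read plus groupby aggregation
start = time.perf_counter()
totals = None
for chunk in pd.read_csv('sample.csv', chunksize=20_000):
    partial = chunk.groupby('key')['value'].sum()
    totals = partial if totals is None else totals.add(partial, fill_value=0)
elapsed = time.perf_counter() - start

print(f"chunked groupby over 100,000 rows: {elapsed:.3f}s, {len(totals)} groups")
```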
Concrete example
Suppose you have a 10 GB CSV file that requires cleaning, analysis, and machine‑learning modeling. A possible workflow is:
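The clean_data helper used in the chunked-cleaning step of this workflow is not defined in the original; one plausible, purely hypothetical sketch drops duplicate rows and fills missing values:

```python
import pandas as pd

def clean_data(chunk: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical cleaning step: drop duplicate rows, fill missing values."""
    chunk = chunk.drop_duplicates()
    return chunk.fillna(0)

# Quick check on a tiny frame with one duplicate row and one missing value
sample = pd.DataFrame({'a': [1.0, 1.0, None], 'b': [2, 2, 3]})
print(clean_data(sample))
```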
import pandas as pd
# Quick exploration
df = pd.read_csv('large_dataset.csv', nrows=1000)
print(df.head())

Chunked reading and cleaning:
chunk_size = 100000
for chunk in pd.read_csv('large_dataset.csv', chunksize=chunk_size):
    cleaned_chunk = clean_data(chunk)  # clean_data: user-defined cleaning function
    cleaned_chunk.to_csv('cleaned_data.csv', mode='a', header=False, index=False)

Further processing with Dask:
import dask.dataframe as dd
ddf = dd.read_csv('cleaned_data.csv')
result = ddf.groupby('column_name').sum().compute()
print(result)

Machine-learning modeling using Dask-ML:
from dask_ml.model_selection import train_test_split
from dask_ml.linear_model import LogisticRegression
X = ddf.drop('target_column', axis=1)
y = ddf['target_column']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
model = LogisticRegression()
model.fit(X_train, y_train)
print(model.score(X_test, y_test))