Python Data Parsing and Large‑Scale Data Processing Techniques
This article introduces Python's built‑in modules and popular libraries for parsing CSV, JSON, and XML files, demonstrates advanced data manipulation with pandas, and presents multiple strategies—including chunked reading, Dask, PySpark, HDF5, databases, Vaex, and NumPy memory‑mapping—for efficiently handling very large datasets.
In Python, data parsing typically involves reading, cleaning, transforming, and analyzing data, and the language provides several standard libraries for handling common formats such as CSV, JSON, XML, and databases.
1. Using the csv module to parse CSV files
The CSV format is widely used for tabular data. Python's standard library includes the csv module for reading and writing CSV files.
import csv
# Read CSV file
with open('example.csv', mode='r', newline='', encoding='utf-8') as file:
    reader = csv.reader(file)
    for row in reader:
        print(row)
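The csv module also provides csv.DictReader, which maps each row to a dictionary keyed by the header row. A minimal self-contained sketch (the file name people.csv is illustrative):

```python
import csv

# Create a small sample file so the example is self-contained
with open('people.csv', 'w', newline='', encoding='utf-8') as f:
    csv.writer(f).writerows([['Name', 'Age'], ['Alice', '24'], ['Bob', '19']])

# DictReader uses the first row as field names; all values stay strings
with open('people.csv', newline='', encoding='utf-8') as f:
    records = [dict(row) for row in csv.DictReader(f)]

print(records)  # → [{'Name': 'Alice', 'Age': '24'}, {'Name': 'Bob', 'Age': '19'}]
```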
# Write CSV file
data = [
    ['Name', 'Age'],
    ['Alice', 24],
    ['Bob', 19]
]
with open('output.csv', mode='w', newline='', encoding='utf-8') as file:
    writer = csv.writer(file)
    writer.writerows(data)

2. Using the json module to parse JSON data
JSON is a lightweight data‑exchange format. The standard json module can convert between JSON strings and Python objects.
import json
# Parse JSON string
json_string = '{"name": "John", "age": 30, "city": "New York"}'
data = json.loads(json_string)
print(data) # Output: {'name': 'John', 'age': 30, 'city': 'New York'}
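Besides strings, the module reads and writes file objects directly with json.load and json.dump; a minimal sketch (person.json is an illustrative file name):

```python
import json

person = {"name": "Jane", "age": 25, "city": "London"}

# Serialize a Python object to a JSON file
with open('person.json', 'w', encoding='utf-8') as f:
    json.dump(person, f, indent=2)

# Deserialize it back into a Python object
with open('person.json', encoding='utf-8') as f:
    loaded = json.load(f)

print(loaded == person)  # → True
```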
# Convert Python object to JSON string
person = {"name": "Jane", "age": 25, "city": "London"}
json_data = json.dumps(person)
print(json_data)  # Output: {"name": "Jane", "age": 25, "city": "London"}

3. Using xml.etree.ElementTree to parse XML
XML is a markup language used for configuration files and data exchange. The xml.etree.ElementTree module provides simple XML parsing capabilities.
import xml.etree.ElementTree as ET
# Parse XML string
xml_string = '''
<root>
    <child name="first">Text</child>
    <child name="second">Another text</child>
</root>
'''
root = ET.fromstring(xml_string)
# Traverse XML nodes
for child in root:
    print(f"Child: {child.attrib['name']}, Text: {child.text}")
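ElementTree also supports targeted lookups: find returns the first matching element and findall returns all matches, including simple attribute predicates. A minimal sketch with an inline document:

```python
import xml.etree.ElementTree as ET

doc = '<root><child name="first">Text</child><child name="second">Other text</child></root>'
root = ET.fromstring(doc)

# All direct children named 'child'
names = [child.get('name') for child in root.findall('child')]
print(names)  # → ['first', 'second']

# First child whose name attribute is 'second'
second = root.find("child[@name='second']")
print(second.text)  # → Other text
```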
# Create XML and write to file
new_root = ET.Element("root")
ET.SubElement(new_root, "child", name="third").text = "More text"
tree = ET.ElementTree(new_root)
tree.write("output.xml")

4. Using pandas for more complex data operations
pandas is a powerful data‑analysis library that offers rich data structures and tools for handling tabular data.
pip install pandas

import pandas as pd
# Read CSV into DataFrame
df = pd.read_csv('example.csv')
print(df.head())
# Select a column
ages = df['Age']
print(ages)
# Filter rows
young_people = df[df['Age'] < 20]
print(young_people)
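Beyond selection and filtering, pandas supports grouped aggregation with groupby; a minimal self-contained sketch (the Team column is illustrative):

```python
import pandas as pd

teams = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Carol'],
    'Team': ['A', 'B', 'A'],
    'Age': [24, 19, 31],
})

# Mean age per team
mean_age = teams.groupby('Team')['Age'].mean()
print(mean_age)
```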
# Write to a new CSV file
df.to_csv('filtered_data.csv', index=False)

How to handle large datasets?
1. Pandas chunked reading: use the chunksize parameter of read_csv or read_json to process files piece by piece.
import pandas as pd
chunk_size = 10000
for chunk in pd.read_csv('large_dataset.csv', chunksize=chunk_size):
    # Process each chunk
    print(chunk.head())

2. Dask: a parallel-computing library with a Pandas-like API that can operate on datasets larger than memory.
pip install "dask[complete]"

import dask.dataframe as dd
# Read large CSV file
df = dd.read_csv('large_dataset.csv')
# Perform operations
result = df.groupby('column_name').sum().compute()
print(result)

3. PySpark: the Python API for Apache Spark, suitable for distributed processing of massive data.
pip install pyspark

from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("LargeDataProcessing").getOrCreate()
# Read large CSV file
df = spark.read.csv('large_dataset.csv', header=True, inferSchema=True)
# Perform operations
result = df.groupBy('column_name').sum('another_column')
result.show()
# Stop SparkSession
spark.stop()

4. HDF5 format with h5py for efficient I/O of scientific data.
pip install h5py

import h5py
# Write HDF5 file
with h5py.File('data.h5', 'w') as f:
    f.create_dataset('dataset_1', data=[1, 2, 3, 4, 5])
# Read HDF5 file
with h5py.File('data.h5', 'r') as f:
    dataset = f['dataset_1']
    print(dataset[:])

5. Databases (e.g., SQLite, PostgreSQL, MongoDB) for very large datasets.
import sqlite3
import pandas as pd
# Connect to SQLite database
conn = sqlite3.connect('large_dataset.db')
# Write an existing pandas DataFrame (df) to the database
df.to_sql('table_name', conn, if_exists='replace', index=False)
# Query data
query = "SELECT * FROM table_name WHERE column_name > 100"
result = pd.read_sql(query, conn)
conn.close()
print(result)

6. Vaex: a high-performance library for out-of-core DataFrame operations.
pip install vaex

import vaex
# Read large CSV file
df = vaex.from_csv('large_dataset.csv', convert=True, chunk_size=10_000_000)
# Group by and aggregate
result = df.groupby(by='column_name', agg={'count': vaex.agg.count()}).sort('count', ascending=False)
print(result)

7. NumPy memory-mapping for large arrays.
import numpy as np
# Create memory‑mapped array
shape = (10000, 10000)
dtype = np.float32
mmapped_array = np.memmap('my_array.memmap', dtype=dtype, mode='w+', shape=shape)
mmapped_array[:] = np.random.randn(*shape)
mmapped_array.flush()  # ensure the data is written to disk
# Read memory‑mapped array
mmapped_array = np.memmap('my_array.memmap', dtype=dtype, mode='r', shape=shape)
print(mmapped_array[:10, :10])

8. Dask-ML for scalable machine-learning tasks.
import dask_ml.datasets
import dask_ml.cluster
# Generate a large synthetic dataset
X, y = dask_ml.datasets.make_blobs(n_samples=1000000, chunks=100000, random_state=0, centers=3)
# KMeans clustering
kmeans = dask_ml.cluster.KMeans(n_clusters=3, init_max_iter=2, oversampling_factor=10)
kmeans.fit(X)
print(kmeans.labels_)

How to choose the right data-processing method?
1. Determine data scale: use Pandas for small‑to‑medium data, Dask/Vaex/PySpark for very large data.
2. Consider data type and structure: tabular data fits Pandas/Dask/Vaex; unstructured data may need specialized libraries.
3. Assess computing resources: single‑machine vs. distributed clusters.
4. Clarify specific needs: cleaning, analysis, visualization, machine learning, or real‑time processing.
5. Optimize performance: memory‑mapping, chunked reads, efficient I/O formats (HDF5, databases).
6. Prototype and benchmark on representative data to validate the chosen approach.
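For step 6, a rough benchmarking sketch: generate a representative sample, time one candidate approach, and compare against alternatives. The column names and sizes below are illustrative:

```python
import time

import numpy as np
import pandas as pd

# Build a small representative sample (in practice, sample rows
# from the real dataset instead of generating random data)
rng = np.random.default_rng(0)
pd.DataFrame({
    'key': rng.integers(0, 10, 100_000),
    'value': rng.standard_normal(100_000),
}).to_csv('sample.csv', index=False)

# Time one candidate approach: chunked read plus groupby aggregation
start = time.perf_counter()
totals = None
for chunk in pd.read_csv('sample.csv', chunksize=20_000):
    partial = chunk.groupby('key')['value'].sum()
    totals = partial if totals is None else totals.add(partial, fill_value=0)
elapsed = time.perf_counter() - start

print(f"chunked groupby over 100,000 rows: {elapsed:.3f}s, {len(totals)} groups")
```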
Concrete example
Suppose you have a 10 GB CSV file that requires cleaning, analysis, and machine‑learning modeling. A possible workflow is:
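The clean_data helper used in the chunked-cleaning step of this workflow is not defined in the original; one plausible, purely hypothetical sketch drops duplicate rows and fills missing values:

```python
import pandas as pd

def clean_data(chunk: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical cleaning step: drop duplicate rows, fill missing values."""
    chunk = chunk.drop_duplicates()
    return chunk.fillna(0)

# Quick check on a tiny frame with one duplicate row and one missing value
sample = pd.DataFrame({'a': [1.0, 1.0, None], 'b': [2, 2, 3]})
print(clean_data(sample))
```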
import pandas as pd
# Quick exploration
df = pd.read_csv('large_dataset.csv', nrows=1000)
print(df.head())

Chunked reading and cleaning:
chunk_size = 100000
for chunk in pd.read_csv('large_dataset.csv', chunksize=chunk_size):
    cleaned_chunk = clean_data(chunk)  # clean_data: user-defined cleaning function
    cleaned_chunk.to_csv('cleaned_data.csv', mode='a', header=False, index=False)

Further processing with Dask:
import dask.dataframe as dd
ddf = dd.read_csv('cleaned_data.csv')
result = ddf.groupby('column_name').sum().compute()
print(result)

Machine-learning modeling using Dask-ML:
from dask_ml.model_selection import train_test_split
from dask_ml.linear_model import LogisticRegression
X = ddf.drop('target_column', axis=1)
y = ddf['target_column']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
model = LogisticRegression()
model.fit(X_train, y_train)
print(model.score(X_test, y_test))