Cloud Computing · 6 min read

Using AWS S3 for Data Storage and Processing with Python

This tutorial walks through creating an AWS account, configuring the AWS CLI, uploading a CSV file to S3, and using Python with pandas to clean, deduplicate, and aggregate the data before re-uploading the results: an end-to-end example of cloud data processing.

Test Development Learning Exchange

Goal: Learn how to store data in AWS S3, download it, and process it.

Learning Content: Basic concepts of AWS S3 and a brief overview of AWS EC2.

Practice Steps:

Create an AWS account and configure the AWS CLI.

Upload a CSV file to an S3 bucket.

Download the CSV file from S3 and process it with Python.

Upload the processed data back to S3.

1. Install and configure AWS CLI

# Install the AWS CLI plus the Python libraries used later (boto3, pandas)
pip install awscli boto3 pandas
aws configure
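`aws configure` prompts for an access key ID, secret access key, default region, and output format, and writes them to `~/.aws/credentials` and `~/.aws/config` in plain INI format. As a minimal sketch of that layout (the key values below are hypothetical placeholders, parsed from an in-memory string rather than the real files):

```python
import configparser

# In-memory copy of the INI layout `aws configure` writes to
# ~/.aws/credentials; the key values are placeholders.
credentials_ini = """\
[default]
aws_access_key_id = AKIAEXAMPLEKEY
aws_secret_access_key = exampleSecretKey123
"""

config = configparser.ConfigParser()
config.read_string(credentials_ini)

# boto3 and the AWS CLI both read the [default] profile unless told otherwise
profile = config["default"]
print(profile["aws_access_key_id"])
```

If the later boto3 calls fail with a credentials error, inspecting these files (or re-running `aws configure`) is the first thing to check.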

2. Upload a CSV file to S3

# Create S3 bucket
aws s3 mb s3://your-bucket-name
# Upload file to S3
aws s3 cp sales_data.csv s3://your-bucket-name/
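If you don't have a `sales_data.csv` on hand, a minimal sketch to generate hypothetical sample data with the two columns the processing step expects, `部门` (department) and `总价` (total price), including a missing value and a duplicate row for the cleaning steps to find:

```python
import pandas as pd

# Hypothetical sample data; column names match the tutorial's dataset:
# '部门' = department, '总价' = total price.
sample = pd.DataFrame({
    "部门": ["Sales", "Sales", "IT", "IT", "HR"],
    "总价": [100.0, 100.0, 150.0, None, 300.0],
})
sample.to_csv("sales_data.csv", index=False, encoding="utf-8-sig")
```

After generating the file, upload it with the `aws s3 cp` command above.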

3. Download and process the CSV file with Python

import boto3
import pandas as pd
# Configure S3 client
s3_client = boto3.client('s3')
bucket_name = 'your-bucket-name'
file_name = 'sales_data.csv'
local_file_path = 'downloaded_sales_data.csv'
# Download file
s3_client.download_file(bucket_name, file_name, local_file_path)
# Read CSV
df = pd.read_csv(local_file_path, encoding='utf-8-sig')
print(f"Original dataset:\n{df.head()}")
# Data cleaning
missing_values = df.isnull().sum()
print(f"Missing values per column:\n{missing_values}")
df_cleaned = df.dropna()
print(f"After dropping missing values:\n{df_cleaned.head()}")
duplicates = df_cleaned.duplicated()
print(f"Number of duplicate rows: {duplicates.sum()}")
df_no_duplicates = df_cleaned.drop_duplicates()
print(f"After dropping duplicates:\n{df_no_duplicates.head()}")
# Group by department ('部门') and compute the mean total price ('总价')
grouped_by_department = df_no_duplicates.groupby('部门')
mean_sales_by_department = grouped_by_department['总价'].mean()
print(f"Mean sales by department:\n{mean_sales_by_department}")
# Save result
result_file_path = 'mean_sales_by_department.csv'
mean_sales_by_department.to_csv(result_file_path, encoding='utf-8-sig')
print(f"Result saved to {result_file_path}")
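The cleaning and aggregation steps can be rehearsed offline on in-memory data, with no AWS account needed, to confirm what each step does. The column names `部门` (department) and `总价` (total price) are assumptions carried over from the tutorial's dataset:

```python
import pandas as pd

# Small in-memory dataset with one missing value and one duplicate row
df = pd.DataFrame({
    "部门": ["Sales", "Sales", "IT", "IT", "HR"],
    "总价": [100.0, 100.0, 150.0, None, 300.0],
})

cleaned = df.dropna()                # drops the IT row with a missing price
deduped = cleaned.drop_duplicates()  # drops the repeated Sales row
means = deduped.groupby("部门")["总价"].mean()
print(means)
```

With this input, the means come out as HR 300.0, IT 150.0, Sales 100.0, which is a quick sanity check before running the same logic against the file downloaded from S3.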

4. Upload the processed file back to S3

s3_client.upload_file(result_file_path, bucket_name, 'mean_sales_by_department.csv')
print(f"Processed data uploaded to s3://{bucket_name}/mean_sales_by_department.csv")

Summary : After completing the steps, you should be able to create an AWS account, configure the CLI, upload a CSV to S3, download and process it with Python (handling missing values, duplicates, and aggregating sales by department), and finally upload the cleaned results back to S3.

Tags: CLI · cloud computing · Python · Data Processing · AWS · S3