Step-by-Step Data Analysis and Machine Learning Workflow with Pandas, Matplotlib, and Scikit-learn

This guide walks through loading CSV data with pandas, cleaning missing values, filtering, grouping, visualizing, performing correlation and time‑series analysis, detecting outliers, and applying linear and logistic regression models using scikit‑learn, all illustrated with complete Python code snippets.

Test Development Learning Exchange

1. Data Loading and Preview

Scenario: Load data from a CSV file and view the first few rows.

import pandas as pd
# Load data
df = pd.read_csv('data.csv')
# View first 5 rows
print(df.head())
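
Beyond head(), it often helps to check column types and missing-value counts before cleaning. The sketch below is illustrative only: it reads a small made-up CSV from memory instead of data.csv so that it runs on its own, and the names csv_text and df_sample are hypothetical.

```python
import io

import pandas as pd

# Hypothetical sample rows standing in for data.csv
csv_text = """Name,Age,Gender
Ann,34,F
Bob,,M
Cara,28,F
"""
df_sample = pd.read_csv(io.StringIO(csv_text))

print(df_sample.shape)    # (rows, columns)
print(df_sample.dtypes)   # data type of each column

# Count missing values per column; Bob's age is blank in the sample
missing_counts = df_sample.isna().sum()
print(missing_counts)
```

A non-zero count in missing_counts is the signal that a column needs the kind of handling shown in the next section.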

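2. Data Cleaning: Missing Value Handling

Scenario: Fill missing values in the 'Age' column with the column mean.

mean_age = df['Age'].mean()
# Assign the result back: fillna(inplace=True) on a selected column
# may not update df in recent pandas versions
df['Age'] = df['Age'].fillna(mean_age)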
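
To see what the mean fill does, here is a minimal sketch on a three-row frame with made-up ages. The median variant at the end is an assumption added for comparison, not part of the original workflow; it can be a reasonable choice when a column is skewed.

```python
import pandas as pd

# Made-up ages with one missing value
ages = pd.DataFrame({'Age': [20.0, None, 40.0]})

# mean() skips missing values, so this is (20 + 40) / 2
mean_age = ages['Age'].mean()
mean_filled = ages['Age'].fillna(mean_age)

# Alternative (assumption): fill with the median instead of the mean
median_filled = ages['Age'].fillna(ages['Age'].median())

print(mean_filled.tolist())
print(median_filled.tolist())
```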

3. Data Filtering

Scenario: Select rows where age is greater than 30.

filtered_df = df[df['Age'] > 30]

4. Grouping and Aggregation

Scenario: Compute the average age for each gender.

grouped = df.groupby('Gender')['Age'].mean()
print(grouped)
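
Filtering and grouping are often combined. The following sketch, on a small made-up frame named people (hypothetical), shows a filter with two conditions and a grouped summary with several statistics at once.

```python
import pandas as pd

# Made-up data using the same column names as the guide
people = pd.DataFrame({
    'Age': [25, 35, 45, 52, 31],
    'Gender': ['F', 'M', 'F', 'M', 'F'],
})

# Combine conditions with & (and) or | (or); each condition needs parentheses
women_over_30 = people[(people['Age'] > 30) & (people['Gender'] == 'F')]

# Several aggregations per group in a single call
summary = people.groupby('Gender')['Age'].agg(['mean', 'min', 'max', 'count'])

print(women_over_30)
print(summary)
```

Using Python's plain `and` / `or` between the conditions raises an error, because pandas needs the element-wise operators.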

5. Data Visualization: Bar Chart

Scenario: Plot a bar chart of user counts by gender.

import matplotlib.pyplot as plt
gender_counts = df['Gender'].value_counts()
gender_counts.plot(kind='bar')
plt.title('User Count by Gender')
plt.xlabel('Gender')
plt.ylabel('Count')
plt.show()
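
The same chart can also be built with Matplotlib's object-oriented interface, which makes it easier to label bars. This is a sketch on made-up data; the Agg backend is selected only so the snippet runs without a display, and the variable names are hypothetical.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend, chosen so this runs headless
import matplotlib.pyplot as plt
import pandas as pd

# Made-up gender column
genders = pd.Series(['F', 'M', 'F', 'F', 'M'], name='Gender')
gender_counts = genders.value_counts()

fig, ax = plt.subplots()
bars = ax.bar(gender_counts.index, gender_counts.values)
ax.bar_label(bars)  # write each count above its bar
ax.set_title('User Count by Gender')
ax.set_xlabel('Gender')
ax.set_ylabel('Count')
```

In an interactive session, dropping the matplotlib.use line and calling plt.show() displays the figure as in the snippet above.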

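6. Correlation Analysis

Scenario: Compute Pearson correlation coefficients between numeric variables.

# numeric_only=True skips text columns such as 'Gender';
# without it, pandas 2.x raises an error on non-numeric data
correlation_matrix = df.corr(numeric_only=True)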
print(correlation_matrix)
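
A full matrix can be hard to read, so one common follow-up is to rank the other variables by their correlation with a single column. The sketch below uses made-up numbers constructed so that the expected correlations are obvious; the Returns and Region columns are hypothetical.

```python
import pandas as pd

data = pd.DataFrame({
    'Advertising_Spend': [10, 20, 30, 40, 50],
    'Sales': [15, 25, 35, 45, 55],        # rises exactly with spend
    'Returns': [5, 4, 3, 2, 1],           # falls exactly as spend rises
    'Region': ['N', 'S', 'N', 'S', 'N'],  # non-numeric, left out below
})

# Keep numeric columns only, then compute Pearson correlations (the default)
corr = data.select_dtypes(include='number').corr()

# Correlation of every other numeric column with Sales, strongest first
sales_corr = corr['Sales'].drop('Sales').sort_values(ascending=False)
print(sales_corr)
```

Values near 1 or -1 indicate a strong linear relationship; they do not by themselves show that one variable causes the other.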

7. Time Series Analysis

Scenario: Plot monthly sales from a date column.

df['Date'] = pd.to_datetime(df['Date'])  # assume a date column exists
df.set_index('Date', inplace=True)
# 'ME' means month end (pandas 2.2+); on older versions use 'M'
monthly_sales = df['Sales'].resample('ME').sum()
monthly_sales.plot()
plt.title('Monthly Sales')
plt.xlabel('Month')
plt.ylabel('Sales')
plt.show()
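
To check that resampling produces the totals you expect, it can help to try it on data where the answer is known. In this sketch the daily figures are made up (one unit per day), and the 'MS' alias is used as an assumption of convenience: it labels each bucket by the first day of the month.

```python
import numpy as np
import pandas as pd

# Made-up daily sales: exactly 1 unit per day across January and February 2023
dates = pd.date_range('2023-01-01', '2023-02-28', freq='D')
daily = pd.DataFrame({'Date': dates, 'Sales': np.ones(len(dates))})

# A DatetimeIndex is required before resampling
daily = daily.set_index('Date')

# Sum per calendar month; each total should equal the number of days
monthly = daily['Sales'].resample('MS').sum()
print(monthly)
```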

8. Outlier Detection

Scenario: Identify outliers in the 'Age' column using the IQR method.

Q1 = df['Age'].quantile(0.25)
Q3 = df['Age'].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
outliers = df[(df['Age'] < lower_bound) | (df['Age'] > upper_bound)]
print(outliers)
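
The IQR rule above can be wrapped in a small helper so it is reusable across columns. This is a minimal sketch: the function name iqr_outlier_mask, the parameter k, and the sample ages are all hypothetical.

```python
import pandas as pd

def iqr_outlier_mask(series, k=1.5):
    """Return a boolean mask that is True where a value lies outside
    [Q1 - k * IQR, Q3 + k * IQR]."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)

# Made-up ages with one implausible entry
ages = pd.Series([22, 25, 27, 30, 31, 33, 35, 120])
mask = iqr_outlier_mask(ages)

outlier_values = ages[mask].tolist()   # values flagged by the rule
cleaned = ages[~mask]                  # everything else
print(outlier_values)
```

For this sample Q1 is 26.5 and Q3 is 33.5, so the bounds are 16 and 44 and only the value 120 is flagged.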

9. Simple Linear Regression

Scenario: Analyze the relationship between advertising spend and sales.

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
X = df[['Advertising_Spend']]
y = df['Sales']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LinearRegression()
model.fit(X_train, y_train)
# Predict
predictions = model.predict(X_test)
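
Predictions are more useful with a measure of fit. The sketch below generates made-up data from a known line (slope 3, intercept 20, plus noise) so that the fitted coefficients can be compared with the truth; the data, the frame name demo, and the chosen metrics are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic data: Sales = 3 * spend + 20 + noise
rng = np.random.default_rng(0)
spend = rng.uniform(0, 100, size=200)
sales = 3.0 * spend + 20.0 + rng.normal(0, 5, size=200)
demo = pd.DataFrame({'Advertising_Spend': spend, 'Sales': sales})

X = demo[['Advertising_Spend']]
y = demo['Sales']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LinearRegression().fit(X_train, y_train)
predictions = model.predict(X_test)

slope = model.coef_[0]          # estimated change in sales per unit of spend
intercept = model.intercept_    # estimated sales at zero spend
r2 = r2_score(y_test, predictions)              # share of variance explained
mse = mean_squared_error(y_test, predictions)   # average squared error

print(slope, intercept, r2, mse)
```

Evaluating on the held-out test split, rather than on the training data, gives a fairer picture of how the model behaves on unseen rows.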

10. Classification Task: Logistic Regression

Scenario: Predict whether a user will purchase a product based on features.

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
# One-hot encode 'Gender' so that every feature is numeric
X = pd.get_dummies(df[['Age', 'Income', 'Gender']], columns=['Gender'], drop_first=True, dtype=int)
y = df['Will_Purchase']
# Split first so the scaler is fitted on the training data only
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
logreg = LogisticRegression()
logreg.fit(X_train, y_train)
# Predict class probabilities; columns follow the order of logreg.classes_
purchase_probabilities = logreg.predict_proba(X_test)
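
One way to keep encoding, scaling, and the model together is a scikit-learn Pipeline with a ColumnTransformer, so the preprocessing is learned from the training split only. The sketch below is a possible arrangement, not the original author's method; the user data is synthetic and the purchase rule (income above 50,000) is invented purely so the example has a learnable signal.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic users; the purchase rule is made up for illustration
rng = np.random.default_rng(0)
n = 400
users = pd.DataFrame({
    'Age': rng.integers(18, 70, size=n),
    'Income': rng.normal(50_000, 15_000, size=n),
    'Gender': rng.choice(['F', 'M'], size=n),
})
users['Will_Purchase'] = (users['Income'] > 50_000).astype(int)

X = users[['Age', 'Income', 'Gender']]
y = users['Will_Purchase']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Scale numeric columns and one-hot encode the categorical one
preprocess = ColumnTransformer([
    ('num', StandardScaler(), ['Age', 'Income']),
    ('cat', OneHotEncoder(handle_unknown='ignore'), ['Gender']),
])
clf = Pipeline([('prep', preprocess), ('model', LogisticRegression())])
clf.fit(X_train, y_train)

accuracy = accuracy_score(y_test, clf.predict(X_test))
probabilities = clf.predict_proba(X_test)   # one column per class
print(accuracy)
```

Because the pipeline applies the fitted transformers automatically, clf.predict can be called on new rows with the same raw columns.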