Step-by-Step Data Analysis and Machine Learning Workflow with Pandas, Matplotlib, and Scikit-learn
This guide walks through loading CSV data with pandas, cleaning missing values, filtering, grouping, visualizing, performing correlation and time‑series analysis, detecting outliers, and applying linear and logistic regression models using scikit‑learn, all illustrated with complete Python code snippets.
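The snippets in this guide read a file named data.csv and refer to the columns Date, Age, Gender, Income, Advertising_Spend, Sales, and Will_Purchase. As a minimal sketch for trying them without real data, the block below writes a synthetic file with those columns. The helper name make_sample_data, the row count, and every generated value are assumptions made for illustration, not part of the original dataset.

```python
import numpy as np
import pandas as pd


def make_sample_data(n_rows=200, seed=0):
    """Build a synthetic frame with the columns used by the snippets in this guide."""
    rng = np.random.default_rng(seed)

    # Ages between 18 and 69, with 10 entries blanked out to mimic missing values
    age = rng.integers(18, 70, size=n_rows).astype(float)
    age[rng.choice(n_rows, size=10, replace=False)] = np.nan

    # Sales are made roughly proportional to spend so the regression has a signal
    spend = rng.uniform(100, 1000, size=n_rows)
    sales = 3.0 * spend + rng.normal(0, 50, size=n_rows)

    return pd.DataFrame({
        'Date': pd.date_range('2023-01-01', periods=n_rows, freq='D'),
        'Age': age,
        'Gender': rng.choice(['F', 'M'], size=n_rows),
        'Income': rng.normal(50_000, 12_000, size=n_rows).round(2),
        'Advertising_Spend': spend.round(2),
        'Sales': sales.round(2),
        'Will_Purchase': rng.integers(0, 2, size=n_rows),
    })


sample = make_sample_data()
sample.to_csv('data.csv', index=False)  # hypothetical file consumed by the snippets
```

Because the values are random, any numbers produced from this file only demonstrate the mechanics of each step.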
1. Data Loading and Preview
Scenario: Load data from a CSV file and view the first few rows.
import pandas as pd

# Load data
df = pd.read_csv('data.csv')

# View first 5 rows
print(df.head())

2. Data Cleaning: Missing Value Handling
Scenario: Fill missing values in the 'Age' column with the column mean.
mean_age = df['Age'].mean()
# Assign the result back; fillna(inplace=True) on a column selection is unreliable in recent pandas versions
df['Age'] = df['Age'].fillna(mean_age)

3. Data Filtering
Scenario: Select rows where age is greater than 30.

filtered_df = df[df['Age'] > 30]

4. Grouping and Aggregation
Scenario: Compute the average age for each gender.
grouped = df.groupby('Gender')['Age'].mean()
print(grouped)

5. Data Visualization: Bar Chart
Scenario: Plot a bar chart of user counts by gender.
import matplotlib.pyplot as plt

# Count rows per gender and draw one bar per category
gender_counts = df['Gender'].value_counts()
gender_counts.plot(kind='bar')
plt.title('User Count by Gender')
plt.xlabel('Gender')
plt.ylabel('Count')
plt.show()

6. Correlation Analysis
Scenario: Compute Pearson correlation coefficients between numeric variables.
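For reference, the Pearson coefficient between two numeric columns x and y with n paired observations is the covariance of the pair divided by the product of their standard deviations. The symbols below are introduced here for illustration:

```latex
r_{xy} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
              {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\;\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
```

The value lies between -1 and 1, and pandas uses this Pearson method by default when computing a correlation matrix.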
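# numeric_only=True skips text columns such as 'Gender', which otherwise raise an error in pandas 2.x
correlation_matrix = df.corr(numeric_only=True)
print(correlation_matrix)

7. Time Series Analysis
Scenario: Plot monthly sales from a date column.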
df['Date'] = pd.to_datetime(df['Date'])  # assume a date column exists
df.set_index('Date', inplace=True)

# Sum sales per calendar month ('M' is the month-end alias; pandas 2.2+ prefers 'ME')
monthly_sales = df['Sales'].resample('M').sum()
monthly_sales.plot()
plt.title('Monthly Sales')
plt.xlabel('Month')
plt.ylabel('Sales')
plt.show()

8. Outlier Detection
Scenario: Identify outliers in the 'Age' column using the IQR method.
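The IQR rule flags any value that falls more than 1.5 times the interquartile range below the first quartile or above the third quartile. As a small worked sketch of that rule, the block below applies it to a handful of made-up ages. The helper name iqr_outliers and the sample values are assumptions for illustration only.

```python
import pandas as pd


def iqr_outliers(values, k=1.5):
    """Return the entries of a Series lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    return values[(values < lower) | (values > upper)]


# Made-up ages: Q1 = 27, Q3 = 35, IQR = 8, so the bounds are 15 and 47
ages = pd.Series([22, 25, 27, 29, 31, 33, 35, 38, 95])
flagged = iqr_outliers(ages)
print(flagged.tolist())  # [95]
```

The multiplier 1.5 is the conventional default; a larger k flags fewer points.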
Q1 = df['Age'].quantile(0.25)
Q3 = df['Age'].quantile(0.75)
IQR = Q3 - Q1

# Values beyond 1.5 * IQR from the quartiles count as outliers
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
outliers = df[(df['Age'] < lower_bound) | (df['Age'] > upper_bound)]
print(outliers)

9. Simple Linear Regression
Scenario: Analyze the relationship between advertising spend and sales.
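With a single predictor, scikit-learn's LinearRegression fits an ordinary least squares line. Writing x for advertising spend and y for sales (notation introduced here for illustration), the fitted line and its coefficients are:

```latex
\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x,
\qquad
\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2},
\qquad
\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}
```

After fitting, the slope and intercept are available as model.coef_ and model.intercept_.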
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# Double brackets keep X two-dimensional, as scikit-learn expects
X = df[['Advertising_Spend']]
y = df['Sales']

# Hold out 20% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LinearRegression()
model.fit(X_train, y_train)

# Predict
predictions = model.predict(X_test)

10. Classification Task: Logistic Regression
Scenario: Predict whether a user will purchase a product based on features.
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Assume df holds these feature columns and a binary target
# 'Gender' is assumed to contain text labels, so one-hot encode it before scaling
X = pd.get_dummies(df[['Age', 'Income', 'Gender']], columns=['Gender'], drop_first=True, dtype=float)
y = df['Will_Purchase']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit the scaler on the training split only, then apply it to both splits
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

logreg = LogisticRegression()
logreg.fit(X_train_scaled, y_train)

# Predict probabilities
purchase_probabilities = logreg.predict_proba(X_test_scaled)
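Both models stop at producing predictions, so a natural next step is to score them on the held-out split. The sketch below does this on synthetic arrays so that it runs on its own. The generated data, the variable names, and the choice of R² and accuracy as metrics are assumptions for illustration, not results from the article's dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.metrics import accuracy_score, r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Regression: synthetic spend vs. sales with a known linear trend plus noise
spend = rng.uniform(100, 1000, size=(200, 1))
sales = 3.0 * spend[:, 0] + rng.normal(0, 50, size=200)
Xr_train, Xr_test, yr_train, yr_test = train_test_split(
    spend, sales, test_size=0.2, random_state=42
)
reg = LinearRegression().fit(Xr_train, yr_train)
r2 = r2_score(yr_test, reg.predict(Xr_test))  # share of variance explained

# Classification: two synthetic features and a noisy binary label
features = rng.normal(size=(200, 2))
noise = rng.normal(0, 0.5, size=200)
labels = (features[:, 0] + features[:, 1] + noise > 0).astype(int)
Xc_train, Xc_test, yc_train, yc_test = train_test_split(
    features, labels, test_size=0.2, random_state=42
)
clf = LogisticRegression().fit(Xc_train, yc_train)
acc = accuracy_score(yc_test, clf.predict(Xc_test))  # fraction predicted correctly

print(f"R^2 on test split: {r2:.3f}")
print(f"Accuracy on test split: {acc:.3f}")
```

With the guide's own variables, the same calls would be r2_score(y_test, predictions) for the regression and accuracy_score(y_test, logreg.predict(...)) for the classifier, using whichever test features were passed to the model.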