
Master Pandas: From Import to Data Cleaning in One Comprehensive Guide

This tutorial walks through essential pandas operations: importing modules, building a sample shopping dataset, reading and writing CSV files, and inspecting data structures. It then covers data cleaning in detail, including handling missing values, trimming spaces, case conversion, replacements, deletions, duplicate removal, type casting, and column renaming, with code snippets for each step.

Python Crawling & Data Mining

May 19, 2020


To deepen data‑analysis skills, this article summarizes the most commonly used pandas functions, providing clear explanations, official documentation links, and a mind‑map for quick reference.

1. Import Modules

import pandas as pd  # pandas
import numpy as np   # numpy

2. Create Dataset and Read/Write

2.1 Create Dataset

A sample supermarket shopping dataset is constructed with columns: id, date, money, product, department, origin.

# Both lists and dicts can be passed to DataFrame; a dict is used here:

data = pd.DataFrame({
    "id": np.arange(101, 111),
    "date": pd.date_range(start="20200310", periods=10),
    "money": [5, 4, 65, -10, 15, 20, 35, 16, 6, 20],
    "product": ['苏打水','可乐','牛肉干','老干妈','菠萝','冰激凌','洗面奶','洋葱','牙膏','薯片'],
    "department": ['饮料','饮料','零食','调味品','水果',np.nan,'日用品','蔬菜','日用品','零食'],
    "origin": ['China',' China','America','China','Thailand','China','america','China','China','Japan']
})

data  # display the dataset
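
The comment above notes that a list can be passed as well as a dict. A minimal sketch of the list-of-rows form follows; the three rows and the name `small` are a hypothetical subset chosen for illustration, not part of the original dataset code.

```python
import pandas as pd

# Hypothetical subset of the shopping data, written as a list of rows.
rows = [
    [101, 5, "China"],
    [102, 4, "China"],
    [103, 65, "America"],
]

# With a list of rows, the column names are supplied separately.
small = pd.DataFrame(rows, columns=["id", "money", "origin"])
print(small)
```

The dict form keeps each column's values together, which tends to be easier to read when columns are generated by helpers such as `np.arange` or `pd.date_range`.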


2.2 Write and Read CSV

data.to_csv("shopping.csv", index=False)  # Do not write index

data = pd.read_csv("shopping.csv")  # the date column is read back as text unless parse_dates=["date"] is passed
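
One thing worth checking after a CSV round trip is whether column types survive it. The sketch below assumes a two-row stand-in for the dataset and uses an in-memory buffer instead of `shopping.csv`, so the names `original`, `plain`, and `parsed` are illustrative only.

```python
import io
import pandas as pd

original = pd.DataFrame({
    "id": [101, 102],
    "date": pd.date_range(start="20200310", periods=2),
})

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
original.to_csv(buffer, index=False)

# Plain read: the date column comes back as text, not datetime64.
buffer.seek(0)
plain = pd.read_csv(buffer)

# parse_dates asks read_csv to convert the named columns back to datetimes.
buffer.seek(0)
parsed = pd.read_csv(buffer, parse_dates=["date"])

print(plain.dtypes)
print(parsed.dtypes)
```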

3. Data Inspection

3.1 Basic Information

data.shape            # rows, columns
data.dtypes           # data types of all columns
data['id'].dtype      # data type of a specific column
data.ndim             # number of dimensions
data.index            # row index
data.columns          # column index
data.values           # underlying numpy array

3.2 Overall View

data.head()    # first 5 rows
data.tail()    # last 5 rows
data.info()    # summary of index, dtypes, non‑null counts, memory usage
data.describe()  # statistical summary
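
By default `describe()` summarizes only numeric columns when a frame contains any. A small sketch, using a hypothetical three-row frame, shows how `include="all"` brings text columns into the summary as well:

```python
import pandas as pd

df = pd.DataFrame({
    "money": [5, 4, 65],
    "origin": ["China", "China", "America"],
})

# Default: only the numeric column is summarized.
numeric_only = df.describe()

# include="all" adds count / unique / top / freq rows for text columns.
everything = df.describe(include="all")

print(numeric_only)
print(everything)
```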

4. Data Cleaning

4.1 Detect Anomalies

Iterate through columns to list unique values and spot issues such as negative money, missing department, and inconsistent case in origin.

for col in data:
    print(col + ": " + str(data[col].unique()))  # show unique values

Result shows a negative value in money, a NaN in department, and case mismatches in origin.
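
Scanning unique values by eye works for ten rows; on larger data the same three checks can be expressed as filters. This is a sketch on a hypothetical three-row frame with English placeholder values, not the article's dataset:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "money": [5, -10, 20],
    "department": ["drinks", np.nan, "snacks"],
    "origin": ["China", "america", "China"],
})

# Rows with a negative amount.
negative_rows = df[df["money"] < 0]

# Rows with a missing department.
missing_rows = df[df["department"].isna()]

# Frequency of each origin spelling; rare variants stand out.
origin_counts = df["origin"].value_counts()

print(negative_rows)
print(missing_rows)
print(origin_counts)
```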

4.2 Missing‑Value Handling

4.2.1 Detection

data.isnull()                     # whole DataFrame
data['department'].isnull()      # specific column

4.2.2 Summarize Missing Values

data.isnull().sum().sort_values(ascending=False)
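
Alongside the count, the share of missing values per column is often useful. Because `True` counts as 1, taking the mean of the boolean mask gives that share directly. The frame below is a hypothetical example:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "department": ["drinks", np.nan, "snacks", np.nan],
    "money": [5, 4, 65, 20],
})

missing_count = df.isnull().sum()   # number of missing cells per column
missing_share = df.isnull().mean()  # fraction of missing cells per column

print(missing_count)
print(missing_share)
```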

4.2.3 Fill Missing Values

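# The three options below are alternatives; pick one.
# Assigning the result back avoids inplace calls on a column selection.
# Forward fill for department
data['department'] = data['department'].ffill()
# Backward fill for department
data['department'] = data['department'].bfill()
# Fill with a specific value (冷冻食品 means "frozen food")
data['department'] = data['department'].fillna("冷冻食品")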
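
The three fill strategies give different results for the same gap, so they are alternatives rather than steps to run in sequence. A minimal sketch on a hypothetical three-element Series:

```python
import numpy as np
import pandas as pd

s = pd.Series(["drinks", np.nan, "snacks"])

forward = s.ffill()            # copy the previous valid value downward
backward = s.bfill()           # copy the next valid value upward
constant = s.fillna("unknown") # use a fixed placeholder

print(forward.tolist())
print(backward.tolist())
print(constant.tolist())
```

Once one strategy has filled the gap, applying another to the same column has nothing left to change.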

4.3 Trim Spaces

for col in data:
    if pd.api.types.is_object_dtype(data[col]):
        data[col] = data[col].str.strip()

# Verify
data['origin'].unique()

Result: array(['China', 'America', 'Thailand', 'america', 'Japan'], dtype=object)

4.4 Case Conversion

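# Each call returns a new Series; assign the result back to keep it.
data['origin'].str.title()       # capitalize the first letter of every word
data['origin'].str.capitalize()  # capitalize only the first letter of the string
data['origin'].str.upper()       # upper case
data['origin'].str.lower()       # lower case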
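
The difference between `str.title()` and `str.capitalize()` only shows up on multi-word values. A short sketch with hypothetical country names:

```python
import pandas as pd

s = pd.Series(["america", "new zealand", "CHINA"])

# title(): first letter of every word is upper case, the rest lower case.
titled = s.str.title()

# capitalize(): only the first letter of the whole string is upper case.
capitalized = s.str.capitalize()

print(titled.tolist())
print(capitalized.tolist())
```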

4.5 Replace Values

# Correct case in origin
data['origin'] = data['origin'].replace("america", "America")
# Replace negative money with NaN, then fill with the mean of the remaining values
data['money'] = data['money'].replace(-10, np.nan)
data['money'] = data['money'].fillna(data['money'].mean())
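
Replacing the literal `-10` works when the bad value is known in advance. As an alternative sketch, not the article's method, `Series.mask` can blank out every negative value by condition before the mean is computed. The four-value Series is hypothetical:

```python
import pandas as pd

money = pd.Series([5, 4, -10, 15])

# mask() replaces values where the condition is True with NaN,
# so the column becomes float.
cleaned = money.mask(money < 0)

# The mean skips NaN, so it is computed from the valid values only.
filled = cleaned.fillna(cleaned.mean())

print(filled.tolist())
```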

4.6 Delete Rows

Method 1 – filter rows:

data1 = data[data.origin != 'America']        # drop rows whose origin is America
data2 = data[(data != 'Japan').all(axis=1)]   # drop rows containing 'Japan' in any column
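
The two filters differ in scope: the first tests one column, the second tests every cell in a row. The sketch below uses a hypothetical three-row frame and adds `isin` as one possible way to exclude several values at once:

```python
import pandas as pd

df = pd.DataFrame({
    "product": ["cola", "chips", "onion"],
    "origin": ["China", "Japan", "America"],
})

# Single-column test: keep rows whose origin is not America.
no_america = df[df["origin"] != "America"]

# Whole-row test: keep rows where no cell equals 'Japan'.
no_japan_anywhere = df[(df != "Japan").all(axis=1)]

# Exclude several origins in one step with isin and negation.
neither = df[~df["origin"].isin(["America", "Japan"])]

print(no_america)
print(no_japan_anywhere)
print(neither)
```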

Method 2 – drop duplicates:

# drop_duplicates returns a new object; data itself is unchanged.
# Keep first occurrence of each origin
data['origin'].drop_duplicates()
# Keep last occurrence of each origin
data['origin'].drop_duplicates(keep='last')
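
Calling `drop_duplicates` on a single column returns only that column's distinct values. To delete whole rows based on one column, the DataFrame method with `subset` is one option. A sketch on a hypothetical three-row frame:

```python
import pandas as pd

df = pd.DataFrame({
    "id": [101, 102, 103],
    "origin": ["China", "China", "Japan"],
})

# Keep the first row seen for each origin.
first = df.drop_duplicates(subset="origin")

# Keep the last row seen for each origin.
last = df.drop_duplicates(subset="origin", keep="last")

print(first)
print(last)
```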

4.7 Type Conversion

data['id'] = data['id'].astype('str')  # convert id column to string; astype returns a new Series, so assign it back
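
Because `astype` does not modify the column in place, the conversion only sticks once the result is assigned. A minimal sketch with a hypothetical frame, converting in both directions:

```python
import pandas as pd

df = pd.DataFrame({"id": [101, 102], "money": ["5", "4"]})

# Integer to string.
df["id"] = df["id"].astype(str)

# Numeric text to integer.
df["money"] = df["money"].astype(int)

print(df.dtypes)
```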

4.8 Rename Columns

data.rename(columns={'id':'ID', 'origin':'产地'}, inplace=True)  # 产地 means "place of origin"
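
Without `inplace=True`, `rename` leaves the original frame untouched and returns a renamed copy. It also accepts a function in place of a mapping. A sketch on a hypothetical one-row frame:

```python
import pandas as pd

df = pd.DataFrame({"id": [101], "origin": ["China"]})

# Mapping: only the listed columns are renamed.
renamed = df.rename(columns={"id": "ID"})

# Function: applied to every column name.
upper = df.rename(columns=str.upper)

print(list(df.columns))
print(list(renamed.columns))
print(list(upper.columns))
```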

Mind‑Map Overview

