Databases 9 min read

Master Data Cleaning: 5 Essential SQL & Python Techniques for Real-World Datasets

This article walks through five common data‑cleaning scenarios—dropping/renaming columns, handling duplicates and nulls, string manipulation, merging tables, and window‑function ranking—showing practical SQL and Python code that can be applied to large‑scale data warehouses.

Python Crawling & Data Mining
Python Crawling & Data Mining
Python Crawling & Data Mining
Master Data Cleaning: 5 Essential SQL & Python Techniques for Real-World Datasets

In daily analytics work, raw tables are often messy and contain redundant or noisy features, so cleaning them into tidy feature tables is crucial. While Pandas is handy for small datasets, enterprise data warehouses usually require HiveSQL or MySQL, prompting the need for SQL‑based cleaning solutions.

01 Delete Columns & Rename Columns

Often only a subset of columns is useful for analysis; unnecessary columns should be removed, and long column names can be shortened for readability.

删除列Python版:
df.drop(col_names, axis=1, inplace=True)

删除列SQL版:
1、select col_names from Table_Name
2、alter table tableName drop column columnName

重命名列Python版:
df.rename(index={'row1':'A'}, columns={'col1':'B'})

重命名列SQL版:
select col_names as col_name_B from Table_Name

02 Duplicate & Missing Value Handling

Duplicate records inflate counts, and NULLs propagate through calculations, so both need to be addressed.

重复值处理Python版:
df.drop_duplicates()

重复值处理SQL版:
1、select distinct col_name from Table_Name
2、select col_name from Table_Name group by col_name

缺失值处理Python版:
df.fillna(value = 0)

df1.combine_first(df2)

缺失值处理SQL版:
1、select ifnull(col_name,0) value from Table_Name
2、select coalesce(col_name,col_name_A,0) as value from Table_Name
3、select case when col_name is null then 0 else col_name end from Table_Name

03 String Cleaning, Splitting & Concatenation

Textual feedback often contains extra spaces, garbage characters, and needs to be split or concatenated for analysis.

字符串处理Python版:
## 1、空格处理
df[col_name] = df[col_name].str.lstrip()

## 2、*%d等垃圾符处理
df[col_name].replace(' &#.*', '', regex=True, inplace=True)

## 3、字符串分割
df[col_name].str.split('分割符')

## 4、字符串拼接
df[col_name].str.cat()

字符串处理SQL版:
## 1、空格处理
select ltrim(col_name) from Table_name

## 2、*%d等垃圾符处理
select regexp_replace(col_name, 正则表达式) from Table_name

## 3、字符串分割
select split(col_name,'分割符') from Table_name

## 4、字符串拼接
select concat_ws(col_name,'拼接符') from Table_name

04 Merge Operations

When features are stored across multiple tables, they need to be combined via joins or unions.

合并处理Python版:

左右合并
1、pd.merge(left, right, how='inner', on=None, left_on=None, right_on=None,
          left_index=False, right_index=False, sort=True,
          suffixes=('_x', '_y'), copy=True, indicator=False,
          validate=None)
2、pd.concat([df1,df2])

上下合并
df1.append(df2, ignore_index=True, sort=False)

合并处理SQL版:

左右合并
select A.*,B.* from Table_a A join Table_b B on A.id = B.id

select A.* from Table_a A left join Table_b B on A.id = B.id

上下合并
-- Union:对两个结果集进行并集操作,不包括重复行,同时进行默认规则的排序;
-- Union All:对两个结果集进行并集操作,包括重复行,不进行排序;

select A.* from Table_a A
union
select B.* from Table_b B

# Union 因为会将各查询子集的记录做比较,故比起 Union All 通常速度会慢。如果 Union All 能满足需求,建议使用 Union All。

05 Window Functions for Ranking

To identify top‑selling categories per store, use window functions to rank sales within each store.

窗口分组Python版:
df['Rank'] = df.groupby(by=['Sale_store'])['Sale_Num'].transform(lambda x: x.rank(ascending=False))

窗口分组SQL版:
select
  *
from
  (
    Select
      *,
      row_number() over(partition by Sale_store order by Sale_Num desc) rk
    from
      table_name
  ) b where b.rk = 1

The examples demonstrate that mastering these five cleaning patterns covers roughly 90% of pre‑modeling data preparation tasks. Whether using Python or SQL, being proficient with SQL enables fast, scalable feature cleaning in large data warehouses.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Sign in to view source
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactadmin@besthub.devand we will review it promptly.

PythonSQLdata analysisdata preprocessing
Python Crawling & Data Mining
Written by

Python Crawling & Data Mining

Life's short, I code in Python. This channel shares Python web crawling, data mining, analysis, processing, visualization, automated testing, DevOps, big data, AI, cloud computing, machine learning tools, resources, news, technical articles, tutorial videos and learning materials. Join us!

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.