Master Data Cleaning: 5 Essential SQL & Python Techniques for Real-World Datasets
This article walks through five common data‑cleaning scenarios—dropping/renaming columns, handling duplicates and nulls, string manipulation, merging tables, and window‑function ranking—showing practical SQL and Python code that can be applied to large‑scale data warehouses.
In daily analytics work, raw tables are often messy and contain redundant or noisy features, so cleaning them into tidy feature tables is crucial. While Pandas is handy for small datasets, enterprise data warehouses usually require HiveSQL or MySQL, prompting the need for SQL‑based cleaning solutions.
01 Delete Columns & Rename Columns
Often only a subset of columns is useful for analysis; unnecessary columns should be removed, and long column names can be shortened for readability.
删除列Python版:
df.drop(col_names, axis=1, inplace=True)
删除列SQL版:
1、select col_names from Table_Name
2、alter table tableName drop column columnName
重命名列Python版:
df.rename(index={'row1':'A'}, columns={'col1':'B'})
重命名列SQL版:
select col_names as col_name_B from Table_Name02 Duplicate & Missing Value Handling
Duplicate records inflate counts, and NULLs propagate through calculations, so both need to be addressed.
重复值处理Python版:
df.drop_duplicates()
重复值处理SQL版:
1、select distinct col_name from Table_Name
2、select col_name from Table_Name group by col_name
缺失值处理Python版:
df.fillna(value = 0)
df1.combine_first(df2)
缺失值处理SQL版:
1、select ifnull(col_name,0) value from Table_Name
2、select coalesce(col_name,col_name_A,0) as value from Table_Name
3、select case when col_name is null then 0 else col_name end from Table_Name03 String Cleaning, Splitting & Concatenation
Textual feedback often contains extra spaces, garbage characters, and needs to be split or concatenated for analysis.
字符串处理Python版:
## 1、空格处理
df[col_name] = df[col_name].str.lstrip()
## 2、*%d等垃圾符处理
df[col_name].replace(' &#.*', '', regex=True, inplace=True)
## 3、字符串分割
df[col_name].str.split('分割符')
## 4、字符串拼接
df[col_name].str.cat()
字符串处理SQL版:
## 1、空格处理
select ltrim(col_name) from Table_name
## 2、*%d等垃圾符处理
select regexp_replace(col_name, 正则表达式) from Table_name
## 3、字符串分割
select split(col_name,'分割符') from Table_name
## 4、字符串拼接
select concat_ws(col_name,'拼接符') from Table_name04 Merge Operations
When features are stored across multiple tables, they need to be combined via joins or unions.
合并处理Python版:
左右合并
1、pd.merge(left, right, how='inner', on=None, left_on=None, right_on=None,
left_index=False, right_index=False, sort=True,
suffixes=('_x', '_y'), copy=True, indicator=False,
validate=None)
2、pd.concat([df1,df2])
上下合并
df1.append(df2, ignore_index=True, sort=False)
合并处理SQL版:
左右合并
select A.*,B.* from Table_a A join Table_b B on A.id = B.id
select A.* from Table_a A left join Table_b B on A.id = B.id
上下合并
-- Union:对两个结果集进行并集操作,不包括重复行,同时进行默认规则的排序;
-- Union All:对两个结果集进行并集操作,包括重复行,不进行排序;
select A.* from Table_a A
union
select B.* from Table_b B
# Union 因为会将各查询子集的记录做比较,故比起 Union All 通常速度会慢。如果 Union All 能满足需求,建议使用 Union All。05 Window Functions for Ranking
To identify top‑selling categories per store, use window functions to rank sales within each store.
窗口分组Python版:
df['Rank'] = df.groupby(by=['Sale_store'])['Sale_Num'].transform(lambda x: x.rank(ascending=False))
窗口分组SQL版:
select
*
from
(
Select
*,
row_number() over(partition by Sale_store order by Sale_Num desc) rk
from
table_name
) b where b.rk = 1The examples demonstrate that mastering these five cleaning patterns covers roughly 90% of pre‑modeling data preparation tasks. Whether using Python or SQL, being proficient with SQL enables fast, scalable feature cleaning in large data warehouses.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
Python Crawling & Data Mining
Life's short, I code in Python. This channel shares Python web crawling, data mining, analysis, processing, visualization, automated testing, DevOps, big data, AI, cloud computing, machine learning tools, resources, news, technical articles, tutorial videos and learning materials. Join us!
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
