Data Preprocessing and Statistical Analysis Techniques in Python
The article reviews essential Python tools for data preprocessing and statistical analysis: missing-value imputation, outlier trimming, scaling, binning, knee-point detection, correlation, chi-square testing, linear regression, Wilson scoring, PCA weighting, text tokenization and sentiment analysis, visualization with matplotlib/seaborn, and big-data access via pyodps.
This article introduces common data‑preprocessing and statistical‑analysis methods used in data‑science projects, focusing on practical Python implementations.
Missing-value handling: mean/median/mode imputation, fixed-value fill, nearest-neighbor imputation, regression imputation, and row deletion. Example:
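A minimal sketch of the simpler imputation options listed above, using pandas. The toy DataFrame, the column names `age` and `city`, and the fill value `'unknown'` are illustrative assumptions, not taken from the original article.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with gaps in a numeric and a categorical column
data = pd.DataFrame({
    'age': [25.0, np.nan, 31.0, 40.0, np.nan],
    'city': ['A', 'B', None, 'B', 'B'],
})

mean_filled = data['age'].fillna(data['age'].mean())        # mean imputation
median_filled = data['age'].fillna(data['age'].median())    # median imputation
mode_filled = data['city'].fillna(data['city'].mode()[0])   # mode imputation
fixed_filled = data['city'].fillna('unknown')               # fixed-value fill

print(mean_filled.tolist())
print(mode_filled.tolist())
```

Mean and median suit numeric columns, while the mode or a fixed placeholder is usually the more natural choice for categorical ones.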
# Check for missing values (info() prints its summary directly and returns None)
data.info()
print(data.isna().sum())
# Drop rows with any missing value
data2 = data.dropna(axis=0)
Outlier handling: IQR-based trimming and capping.
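Trimming and capping treat the same out-of-range values differently: trimming drops them, capping pulls them back to the fence. The sketch below shows both on a made-up pandas Series (the sample values are an assumption for illustration).

```python
import pandas as pd

# Hypothetical sample with one obvious outlier
s = pd.Series([10, 12, 11, 13, 12, 11, 100])

q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr

trimmed = s[(s >= low) & (s <= high)]    # trimming: drop values outside the fences
capped = s.clip(lower=low, upper=high)   # capping: pull them back to the fences

print(len(trimmed), capped.max())
```

Capping keeps the row count unchanged, which matters when the column must stay aligned with other columns.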
import numpy as np

data = np.array(data)
q1 = np.quantile(data, 0.25)
q3 = np.quantile(data, 0.75)
# Fences at 1.5 * IQR beyond the quartiles
low = q1 - 1.5 * (q3 - q1)
high = q3 + 1.5 * (q3 - q1)
clean = []
for i in data:
    # Cap values outside the fences instead of dropping them
    if i > high:
        i = high
    elif i < low:
        i = low
    clean.append(i)
Normalization: min-max, Z-score, and decimal scaling.
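# data is assumed to be a pandas Series or DataFrame
(data - data.min()) / (data.max() - data.min())  # Min-max: rescales to [0, 1]
(data - data.mean()) / data.std()  # Z-score: zero mean, unit standard deviation
data / 10**np.ceil(np.log10(data.abs().max()))  # Decimal scaling: divide by a power of 10
Continuous-variable binning: equal-width and equal-frequency binning using pd.cut.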
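As a complementary sketch, pandas can also derive the bin edges itself: passing an integer to `pd.cut` gives bins of equal width, and `pd.qcut` gives bins of roughly equal frequency. The sample values and the column names `width_bin` and `freq_bin` are assumptions for illustration.

```python
import pandas as pd

# Hypothetical values; any numeric column would do
df = pd.DataFrame({'data': [1, 2, 3, 4, 5, 6, 7, 8, 100]})

# Equal-width: an integer tells pd.cut to split the value range into k equal-width bins
df['width_bin'] = pd.cut(df['data'], bins=4, labels=range(1, 5))
# Equal-frequency: pd.qcut places the edges at quantiles
df['freq_bin'] = pd.qcut(df['data'], q=4, labels=range(1, 5))

print(df)
```

With a skewed column like this one, equal-width binning puts almost every row in the first bin, while equal-frequency binning spreads the rows evenly.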
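# Equal-width binning (the edges here are hand-picked, so the widths are not strictly equal)
bins = [0, 100, 200, 300, 500, 700, 900, 1100, 1300, max(df['data'])]
df['col'] = pd.cut(df['data'], bins, right=True, labels=range(1, 10))
# Equal-frequency binning: use the quantiles as bin edges
k = 4
w = df['data'].quantile(np.arange(0, 1 + 1/k, 1/k))
# include_lowest=True keeps the minimum value in the first bin instead of turning it into NaN
df['col'] = pd.cut(df['data'], w, right=True, include_lowest=True, labels=range(1, 5))
Knee-point detection (e.g., the elbow method for K-means) using the kneed package.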
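import numpy as np
from kneed import KneeLocator

x = np.arange(1, 31)
y = [...]  # metric values, one per x
# For a decreasing curve such as K-means inertia, curve='convex' and direction='decreasing' would be the matching settings
kneedle = KneeLocator(x, y, S=1.0, curve='concave', direction='increasing')
print(f'Knee point at x = {kneedle.elbow}')
Correlation coefficients: calculation with DataFrame.corr() and interpretation of significance.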
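DataFrame.corr() returns only the coefficient. One way to also judge significance is `scipy.stats.pearsonr`, which returns the coefficient together with a two-sided p-value. The sketch below runs it on synthetic data; the variables `cnt` and `amt` are made-up stand-ins for two metric columns.

```python
import numpy as np
from scipy.stats import pearsonr

# Hypothetical paired samples, e.g. an order count and an order amount
rng = np.random.default_rng(0)
cnt = rng.normal(size=200)
amt = 2.0 * cnt + rng.normal(size=200)

r, p_value = pearsonr(cnt, amt)
print(f'r={r:.3f}, p={p_value:.3g}')
# A small p-value suggests the correlation is unlikely to be zero in the population
```

The coefficient measures the strength of the linear relationship, while the p-value only says how surprising that coefficient would be if the true correlation were zero.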
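# Pairwise Pearson correlations between the numeric columns (numeric_only requires pandas 1.5+)
corr_matrix = df.corr(numeric_only=True)
print(corr_matrix['pay_ord_cnt'])
# Correlation between two specific columns
print(df['pay_ord_cnt'].corr(df['pay_ord_amt']))
Chi-square test for categorical variables using scipy.stats.chi2_contingency.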
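import numpy as np
from scipy.stats import chi2_contingency

# Observed contingency table: rows and columns are the categories of the two variables
obs = np.array([[50, 49, 35], [150, 100, 90], [60, 80, 100]])
chi2, p, dof, expected = chi2_contingency(obs)
print(f'chi2={chi2:.4f}, p={p:.4f}, dof={dof}')
# If p < 0.01, reject H0 (the two variables are independent) at the 1% level
Linear regression via numpy.polyfit (other options, such as sklearn.linear_model.LinearRegression, are mentioned).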
import numpy as np

# y is the observed series; x is its 1-based position
x = np.arange(1, len(y) + 1)
coeff = np.polyfit(x, y, deg=1)  # [slope, intercept]
print(coeff[0])  # slope
Wilson score for ranking items with small sample sizes.
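To see why the Wilson score helps with small samples, the lower bound of the Wilson interval can be wrapped in a function and compared across sample sizes. This is a sketch: the function name `wilson_lower_bound` is hypothetical, and returning 0.0 when there are no ratings is an assumption, not something the original article specifies.

```python
import math

def wilson_lower_bound(pos, total, z=1.96):
    """Lower bound of the Wilson score interval (z=1.96 for ~95% confidence)."""
    if total == 0:
        return 0.0  # assumption: items without ratings rank last
    p_hat = pos / total
    centre = p_hat + z**2 / (2 * total)
    margin = z * math.sqrt((p_hat * (1 - p_hat) + z**2 / (4 * total)) / total)
    return (centre - margin) / (1 + z**2 / total)

# One positive rating out of one vs. ninety out of a hundred
print(wilson_lower_bound(1, 1))
print(wilson_lower_bound(90, 100))
```

Although 1/1 has the higher raw positive rate, its lower bound is far below that of 90/100, so the item with more evidence ranks first.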
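import numpy as np

# input_data is a string of the form 'positive_count,total_count'
pos = float(input_data.split(',')[0])
total = float(input_data.split(',')[1])
z = 1.96  # z-value for a 95% confidence level
p_hat = pos / total
# Lower bound of the Wilson score interval
score = (p_hat + z**2/(2*total) - z*np.sqrt((p_hat*(1-p_hat)+z**2/(4*total))/total)) / (1 + z**2/total)
print(score)
PCA-based weight calculation for multi-metric scoring.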
import numpy as np
import pandas as pd
from sklearn import preprocessing
from sklearn.decomposition import PCA

# Rescale every metric to [0, 1] before fitting PCA
scaler = preprocessing.MinMaxScaler().fit(df)
X = pd.DataFrame(scaler.transform(df))
pca = PCA()
pca.fit(X)
# Divide each component (row) by the square root of its explained variance
components = pca.components_ / np.sqrt(pca.explained_variance_.reshape(-1, 1))
# compute weighted scores …
Text processing: tokenization with jieba, frequency counting, TF-IDF / TextRank keyword extraction, and sentiment analysis using SnowNLP (Chinese) and TextBlob (English).
import collections
import jieba
# Tokenize and count word frequencies (jieba.cut returns a generator)
words = jieba.cut(text)
freq = collections.Counter(words)
# TF-IDF keyword extraction
from jieba import analyse
keywords = analyse.extract_tags(text, topK=20)
# Sentiment (Chinese): a score between 0 (negative) and 1 (positive)
from snownlp import SnowNLP
s = SnowNLP('颜色很好看')  # "The color looks great"
print(s.sentiments)
# Sentiment (English): polarity and subjectivity
from textblob import TextBlob
print(TextBlob('The product is great').sentiment)
Visualization libraries: brief overview of matplotlib and seaborn with reference links.
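As a small illustration of the kind of plot these libraries produce, the sketch below draws a histogram and a box plot with matplotlib alone (seaborn builds on matplotlib and offers higher-level equivalents). The synthetic data and the output filename `distribution.png` are assumptions for illustration.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend so the script runs without a display
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
values = rng.normal(loc=50, scale=10, size=500)  # hypothetical metric

fig, (ax_hist, ax_box) = plt.subplots(1, 2, figsize=(8, 3))
ax_hist.hist(values, bins=20)
ax_hist.set_title('Distribution')
ax_box.boxplot(values)  # whiskers default to 1.5 * IQR; points beyond them are drawn individually
ax_box.set_title('Box plot')
fig.tight_layout()
fig.savefig('distribution.png')
```

The box plot uses the same 1.5 × IQR rule as the outlier-handling snippet, so it is a quick visual check before trimming or capping.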
Big-data integration: reading MaxCompute (ODPS) tables with pyodps and developing Python UDFs for the platform.
from odps import ODPS
odps = ODPS('AccessId', 'AccessKey', 'project', endpoint='http://service-corp.odps.aliyun-inc.com/api')
sql = 'SELECT * FROM project.table;'
df = odps.execute_sql(sql).open_reader(tunnel=True).to_pandas()
# UDF example
from odps.udf import annotate

@annotate('*->float')
class PolyfitUDF(object):
    def __init__(self):
        # include_package_path is a helper that is not shown in this excerpt
        include_package_path('numpy.zip')
    def evaluate(self, y):
        from numpy import polyfit
        x = list(range(1, len(y) + 1))
        # np.float was removed in NumPy 1.24; the built-in float does the same job
        return float(polyfit(x, [int(v) for v in y], 1)[0])
The article concludes with a brief team introduction and links to further reading.
