Data Preprocessing and Statistical Analysis Techniques in Python
This article reviews common data-preprocessing and statistical-analysis methods used in data-science projects, with practical Python implementations: missing-value imputation, outlier trimming, normalization, binning, knee-point detection, correlation, chi-square testing, linear regression, Wilson scoring, PCA-based weighting, text tokenization and sentiment analysis, plus visualization with matplotlib/seaborn and big-data access via pyodps.
Missing-value handling: mean/median/mode imputation, fixed-value fill, nearest-neighbor imputation, regression-based imputation, and row deletion. Example:
```python
# Check for missing values
print(data.info(), '\n')

# Drop rows with any missing value
data2 = data.dropna(axis=0)
```
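The snippet above covers only row deletion. The fill-based strategies listed earlier can be sketched as follows, using a small made-up DataFrame (the column names and values are illustrative, not from the article):

```python
import numpy as np
import pandas as pd

# Hypothetical DataFrame with missing values
df = pd.DataFrame({'amount': [10.0, np.nan, 30.0, 40.0],
                   'city': ['A', 'B', None, 'B']})

# Mean imputation for a numeric column
df['amount_mean'] = df['amount'].fillna(df['amount'].mean())
# Median imputation is more robust to outliers
df['amount_median'] = df['amount'].fillna(df['amount'].median())
# Mode imputation suits categorical columns
df['city_mode'] = df['city'].fillna(df['city'].mode()[0])
# Fixed-value fill
df['amount_zero'] = df['amount'].fillna(0)
```

Mean/median fills preserve the column's scale; mode fill is the categorical analogue.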
Outlier handling: IQR-based trimming and capping.
```python
import numpy as np

data = np.array(data)
q1 = np.quantile(data, 0.25)
q3 = np.quantile(data, 0.75)
iqr = q3 - q1
low = q1 - 1.5 * iqr    # lower fence
high = q3 + 1.5 * iqr   # upper fence

# Cap values outside the fences
clean = []
for i in data:
    if i > high:
        i = high
    elif i < low:
        i = low
    clean.append(i)
```
Normalization: min-max, Z-score, and decimal scaling.
```python
# Min-max scaling
(data - data.min()) / (data.max() - data.min())
# Z-score standardization
(data - data.mean()) / data.std()
# Decimal scaling
data / 10 ** np.ceil(np.log10(data.abs().max()))
```
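Applied to a concrete pandas Series (the values below are illustrative), the three formulas behave as expected: min-max lands in [0, 1], Z-score yields zero mean and unit standard deviation, and decimal scaling pushes all magnitudes below 1:

```python
import numpy as np
import pandas as pd

data = pd.Series([120.0, 250.0, 370.0, 480.0, 990.0])

# Min-max scaling: maps values into [0, 1]
minmax = (data - data.min()) / (data.max() - data.min())

# Z-score: zero mean, unit standard deviation (pandas std uses ddof=1)
zscore = (data - data.mean()) / data.std()

# Decimal scaling: divide by 10^j so all |values| fall below 1
decimal = data / 10 ** np.ceil(np.log10(data.abs().max()))
```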
Continuous-variable binning: equal-width and equal-frequency binning using pd.cut.
```python
# Equal-width binning: 9 hand-picked intervals
bins = [0, 100, 200, 300, 500, 700, 900, 1100, 1300, max(df['data'])]
df['col'] = pd.cut(df['data'], bins, right=True, labels=range(1, 10))

# Equal-frequency binning: edges taken at quantiles
k = 4
w = df['data'].quantile(np.arange(0, 1 + 1 / k, 1 / k))
df['col'] = pd.cut(df['data'], w, right=True, labels=range(1, k + 1))
```
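Equal-frequency binning can also be done in one step with pd.qcut, which computes the quantile edges itself; a minimal sketch on made-up data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'data': np.arange(1, 101)})  # 100 evenly spread values

# pd.qcut splits into k bins with (roughly) equal counts
k = 4
df['bin'] = pd.qcut(df['data'], q=k, labels=range(1, k + 1))
```

With evenly spread data, each of the four bins ends up with exactly 25 rows.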
Knee point detection (e.g., elbow method for K‑means) using the kneed package.
```python
import numpy as np
from kneed import KneeLocator

x = np.arange(1, 31)
y = [...]  # metric values for each candidate x

kneedle = KneeLocator(x, y, S=1.0, curve='concave', direction='increasing')
print(f'Knee point at x = {kneedle.elbow}')
```
Correlation coefficient calculation with DataFrame.corr() and interpretation of significance.
```python
corr_matrix = df.corr()
print(corr_matrix['pay_ord_cnt'])
print(df['pay_ord_cnt'].corr(df['pay_ord_amt']))
```
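A self-contained version of the same idea on synthetic data (the column names here simply mirror the snippet above; the values are made up). Two strongly linked columns should yield a Pearson coefficient near 1, while an unrelated column should not:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
cnt = rng.integers(1, 100, size=50)
df = pd.DataFrame({
    'pay_ord_cnt': cnt,
    'pay_ord_amt': cnt * 35.0 + rng.normal(0, 5, size=50),  # strongly linked
    'noise': rng.normal(size=50),                           # unrelated
})

corr_matrix = df.corr()  # Pearson by default; method='spearman' is also available
r = df['pay_ord_cnt'].corr(df['pay_ord_amt'])
```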
Chi-square test for categorical variables using scipy.stats.chi2_contingency.
```python
import numpy as np
from scipy.stats import chi2_contingency

obs = np.array([[50, 49, 35],
                [150, 100, 90],
                [60, 80, 100]])
chi2, p, dof, expected = chi2_contingency(obs)
print(f'chi2={chi2:.4f}, p={p:.4f}, dof={dof}')
# p < 0.01 → reject H0 (the variables are not independent)
```
Linear regression via numpy.polyfit (alternatives include sklearn.linear_model.LinearRegression).
```python
import numpy as np

x = np.arange(1, len(y) + 1)     # y: sequence of observed values
coeff = np.polyfit(x, y, deg=1)  # fit a degree-1 polynomial
print(coeff[0])                  # slope
```
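The sklearn alternative mentioned above looks like this on illustrative data (note that sklearn expects a 2-D feature matrix, unlike polyfit):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

y = np.array([2.0, 4.1, 5.9, 8.2, 10.0])     # illustrative trend values
x = np.arange(1, len(y) + 1).reshape(-1, 1)  # 2-D feature matrix

model = LinearRegression().fit(x, y)
slope, intercept = model.coef_[0], model.intercept_
```

Both approaches solve the same least-squares problem, so the slope matches polyfit's first coefficient.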
Wilson score for ranking items with small sample sizes.
```python
import numpy as np

pos = float(input_data.split(',')[0])
total = float(input_data.split(',')[1])
z = 1.96  # 95% confidence
p_hat = pos / total
score = ((p_hat + z**2 / (2 * total)
          - z * np.sqrt((p_hat * (1 - p_hat) + z**2 / (4 * total)) / total))
         / (1 + z**2 / total))
print(score)
```
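Wrapping the formula in a function (a hypothetical helper name, not from the article) makes the small-sample penalty easy to see: two items with the same 90% positive rate score very differently when one has far more evidence behind it:

```python
import math

def wilson_score(pos, total, z=1.96):
    """Lower bound of the Wilson score interval for a positive rate."""
    if total == 0:
        return 0.0
    p_hat = pos / total
    denom = 1 + z**2 / total
    centre = p_hat + z**2 / (2 * total)
    spread = z * math.sqrt((p_hat * (1 - p_hat) + z**2 / (4 * total)) / total)
    return (centre - spread) / denom

# Same raw rate (0.90), very different sample sizes
small = wilson_score(9, 10)
large = wilson_score(180, 200)
```

Ranking by this lower bound keeps a lucky 9-out-of-10 item from outranking a consistently good 180-out-of-200 item.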
PCA‑based weight calculation for multi‑metric scoring.
```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn import preprocessing

scaler = preprocessing.MinMaxScaler().fit(df)
X = pd.DataFrame(scaler.transform(df))

pca = PCA()
pca.fit(X)
components = pca.components_ / np.sqrt(pca.explained_variance_.reshape(-1, 1))
# compute weighted scores …
```
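One common way to turn PCA output into per-metric weights is to average the absolute loadings with each component weighted by its explained-variance ratio. This is a sketch of that idea on made-up metric columns, not necessarily the exact scheme the article's elided step uses:

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(1)
df = pd.DataFrame(rng.random((100, 3)), columns=['uv', 'gmv', 'ctr'])  # made-up metrics

X = MinMaxScaler().fit_transform(df)
pca = PCA().fit(X)

# Weight each metric by |loading|, averaging components by their
# explained-variance ratio, then normalize to a weight vector
loadings = np.abs(pca.components_)              # (n_components, n_features)
raw = pca.explained_variance_ratio_ @ loadings  # one value per metric
weights = raw / raw.sum()                       # weights sum to 1
```

A single composite score is then the weighted sum of each row's scaled metrics.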
Text processing: tokenization with jieba, frequency counting, TF-IDF / TextRank keyword extraction, and sentiment analysis using SnowNLP (Chinese) and TextBlob (English).
```python
import collections
import jieba
from jieba import analyse

# Tokenization and frequency counting
words = jieba.cut(text)
freq = collections.Counter(words)

# TF-IDF keyword extraction
keywords = analyse.extract_tags(text, topK=20)

# Sentiment analysis (Chinese)
from snownlp import SnowNLP
s = SnowNLP('颜色很好看')  # "The color looks great"
print(s.sentiments)

# Sentiment analysis (English)
from textblob import TextBlob
print(TextBlob('The product is great').sentiment)
```
Visualization libraries: brief overview of matplotlib and seaborn with reference links.
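A minimal matplotlib sketch (using the Agg backend so it renders off-screen); seaborn builds higher-level statistical plots on top of the same objects:

```python
import matplotlib
matplotlib.use('Agg')  # render off-screen, no display needed
import matplotlib.pyplot as plt
import numpy as np

x = np.arange(1, 11)
fig, ax = plt.subplots(figsize=(6, 4))
ax.plot(x, x ** 2, marker='o', label='x^2')
ax.set_xlabel('x')
ax.set_ylabel('y')
ax.set_title('Minimal matplotlib example')
ax.legend()
fig.savefig('example.png')
```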
Big-data integration: reading MaxCompute (ODPS) tables with pyodps and developing Python UDFs for the platform.
```python
from odps import ODPS
from odps.udf import annotate

odps = ODPS('AccessId', 'AccessKey', 'project',
            endpoint='http://service-corp.odps.aliyun-inc.com/api')

sql = 'SELECT * FROM project.table;'
df = odps.execute_sql(sql).open_reader(tunnel=True).to_pandas()

# UDF example: fit a slope over a sequence of values.
# include_package_path is assumed to be a helper that adds the uploaded
# numpy archive resource ('numpy.zip') to the UDF's sys.path.
@annotate('*->float')
class PolyfitUDF(object):
    def __init__(self):
        include_package_path('numpy.zip')

    def evaluate(self, y):
        from numpy import polyfit
        x = list(range(1, len(y) + 1))
        # np.float is removed in recent numpy; use the builtin float
        return float(polyfit(x, [int(v) for v in y], 1)[0])
```
The article concludes with a brief team introduction and links to further reading.
DaTaobao Tech
Official account of DaTaobao Technology