
Data Preprocessing and Statistical Analysis Techniques in Python

The article reviews essential Python data‑preprocessing and statistical‑analysis tools—including missing‑value imputation, outlier trimming, scaling, binning, knee‑point detection, correlation, chi‑square testing, linear regression, Wilson scoring, PCA weighting, text tokenization and sentiment analysis, plus visualization with matplotlib/seaborn and big‑data access via pyodps.

DaTaobao Tech

This article introduces common data‑preprocessing and statistical‑analysis methods used in data‑science projects, focusing on practical Python implementations.

Missing‑value handling: mean/median/mode imputation, fixed‑value fill, nearest‑neighbor, regression, and row deletion. Example:

# Check for missing values
print(data.info(), '\n')

# Drop rows with any missing value
data2 = data.dropna(axis=0)
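Beyond row deletion, the imputation strategies listed above are one-liners in pandas. A minimal sketch (the column names and values are illustrative):

```python
import pandas as pd

data = pd.DataFrame({'amt': [10.0, None, 30.0, 40.0],
                     'city': ['a', 'b', None, 'b']})

# Mean (or median) imputation for a numeric column
data['amt'] = data['amt'].fillna(data['amt'].mean())

# Mode imputation for a categorical column
data['city'] = data['city'].fillna(data['city'].mode()[0])

print(data.isna().sum().sum())  # 0 missing values remain
```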

Outlier handling: IQR‑based trimming and capping.

data = np.array(data)
q1 = np.quantile(data, 0.25)
q3 = np.quantile(data, 0.75)
low = q1 - 1.5 * (q3 - q1)
high = q3 + 1.5 * (q3 - q1)
clean = []
for i in data:
    if i > high:
        i = high
    elif i < low:
        i = low
    clean.append(i)
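The same IQR capping can be written without an explicit loop using np.clip, a vectorized equivalent of the snippet above (the sample data is illustrative):

```python
import numpy as np

data = np.array([1, 2, 3, 4, 5, 100])  # 100 is an outlier
q1, q3 = np.quantile(data, [0.25, 0.75])
iqr = q3 - q1

# Cap every value to the [q1 - 1.5*IQR, q3 + 1.5*IQR] range at once
clean = np.clip(data, q1 - 1.5 * iqr, q3 + 1.5 * iqr)
print(clean.max())  # 8.5 — the outlier is capped at the upper fence
```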

Normalization: Min‑max, Z‑score, and Decimal scaling.

(data - data.min()) / (data.max() - data.min())  # Min‑max
(data - data.mean()) / data.std()                # Z‑score
data / 10**np.ceil(np.log10(data.abs().max()))   # Decimal scaling

Continuous variable binning: equal‑width and equal‑frequency binning using pd.cut.

# Equal‑width binning
bins = [0, 100, 200, 300, 500, 700, 900, 1100, 1300, max(df['data'])]
df['col'] = pd.cut(df['data'], bins, right=True, labels=range(1, 10))

# Equal‑frequency binning
k = 4
w = df['data'].quantile(np.arange(0, 1 + 1/k, 1/k))
df['col'] = pd.cut(df['data'], w, right=True, labels=range(1, 5))
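Equal‑frequency binning can also be done directly with pd.qcut, which computes the quantile edges itself and is equivalent to the quantile-plus-pd.cut combination above (the sample data is illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'data': np.arange(1, 101)})  # 100 values

# qcut splits on quantiles internally, so every bin gets the same count
df['col'] = pd.qcut(df['data'], q=4, labels=range(1, 5))
print(df['col'].value_counts().to_dict())  # 25 rows per bin
```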

Knee point detection (e.g., elbow method for K‑means) using the kneed package.

from kneed import KneeLocator

x = np.arange(1, 31)
y = [...]  # metric values
kneedle = KneeLocator(x, y, S=1.0, curve='concave', direction='increasing')
print(f'Knee point at x = {kneedle.elbow}')

Correlation coefficient calculation with DataFrame.corr() and interpretation of significance.

corr_matrix = df.corr()
print(corr_matrix['pay_ord_cnt'])
print(df['pay_ord_cnt'].corr(df['pay_ord_amt']))
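DataFrame.corr() reports only the coefficient itself; to judge significance, scipy.stats.pearsonr also returns a p‑value. A sketch on synthetic data (the variables here are illustrative):

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 2 * x + rng.normal(size=200)  # strongly correlated by construction

r, p = pearsonr(x, y)
print(f'r={r:.3f}, p={p:.3g}')  # small p → the correlation is significant
```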

Chi‑square test for categorical variables using scipy.stats.chi2_contingency.

from scipy.stats import chi2_contingency

obs = np.array([[50, 49, 35], [150, 100, 90], [60, 80, 100]])
chi2, p, dof, expected = chi2_contingency(obs)
print(f'chi2={chi2:.4f}, p={p:.4f}, dof={dof}')
# p < 0.01 → reject H0

Linear regression via numpy.polyfit (alternatives such as sklearn.linear_model.LinearRegression work as well).

import numpy as np

x = np.arange(1, len(y) + 1)
coeff = np.polyfit(x, y, deg=1)
print(coeff[0])  # slope
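The sklearn.linear_model.LinearRegression alternative yields the same slope; a minimal sketch on made-up data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

y = np.array([2.0, 4.1, 5.9, 8.2, 10.0])
x = np.arange(1, len(y) + 1).reshape(-1, 1)  # sklearn expects a 2-D X

model = LinearRegression().fit(x, y)
print(model.coef_[0], model.intercept_)  # slope ≈ 2.01, intercept ≈ 0.01
```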

Wilson score for ranking items with small sample sizes.

pos = float(input_data.split(',')[0])
total = float(input_data.split(',')[1])
z = 1.96
p_hat = pos / total
score = (p_hat + z**2 / (2 * total)
         - z * np.sqrt((p_hat * (1 - p_hat) + z**2 / (4 * total)) / total)) / (1 + z**2 / total)
print(score)
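Wrapped in a function, the same formula shows why the Wilson lower bound is preferred for small samples: an item rated 2‑of‑2 ranks below one rated 90‑of‑100, even though its raw positive ratio is higher. A sketch:

```python
import numpy as np

def wilson_score(pos, total, z=1.96):
    # Lower bound of the Wilson score interval for a Bernoulli proportion
    p_hat = pos / total
    return (p_hat + z**2 / (2 * total)
            - z * np.sqrt((p_hat * (1 - p_hat) + z**2 / (4 * total)) / total)) / (1 + z**2 / total)

print(wilson_score(2, 2))     # perfect ratio, tiny sample → ~0.34
print(wilson_score(90, 100))  # lower ratio, larger sample → ~0.83
```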

PCA‑based weight calculation for multi‑metric scoring.

from sklearn.decomposition import PCA
from sklearn import preprocessing

scaler = preprocessing.MinMaxScaler().fit(df)
X = pd.DataFrame(scaler.transform(df))
pca = PCA()
pca.fit(X)
components = pca.components_ / np.sqrt(pca.explained_variance_.reshape(-1, 1))
# compute weighted scores …
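One common way to turn the PCA output into per-metric weights is to combine each component's loadings with its explained-variance ratio. A sketch under that assumption, on synthetic metrics (the convention and column names are illustrative, not the article's exact scheme):

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn import preprocessing

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((50, 3)), columns=['m1', 'm2', 'm3'])

X = preprocessing.MinMaxScaler().fit_transform(df)
pca = PCA().fit(X)

# Weight each metric's loadings by the explained-variance ratio of each
# component, then normalize so the metric weights sum to 1
raw = np.abs(pca.components_.T @ pca.explained_variance_ratio_)
weights = raw / raw.sum()
print(dict(zip(df.columns, weights.round(3))))
```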

Text processing: tokenization with jieba, frequency counting, TF‑IDF / TextRank keyword extraction, and sentiment analysis using SnowNLP (Chinese) and TextBlob (English).

import jieba, collections

words = jieba.cut(text)
freq = collections.Counter(words)

# TF‑IDF keyword extraction
from jieba import analyse
keywords = analyse.extract_tags(text, topK=20)

# Sentiment (Chinese)
from snownlp import SnowNLP
s = SnowNLP('颜色很好看')
print(s.sentiments)

# Sentiment (English)
from textblob import TextBlob
print(TextBlob('The product is great').sentiment)

Visualization libraries : brief overview of matplotlib and seaborn with reference links.

Big‑data integration : reading MaxCompute (ODPS) tables with pyodps and developing Python UDFs for the platform.

from odps import ODPS

odps = ODPS('AccessId', 'AccessKey', 'project',
            endpoint='http://service-corp.odps.aliyun-inc.com/api')
sql = 'SELECT * FROM project.table;'
df = odps.execute_sql(sql).open_reader(tunnel=True).to_pandas()

# UDF example
@annotate('*->float')
class PolyfitUDF(object):
    def __init__(self):
        include_package_path('numpy.zip')

    def evaluate(self, y):
        from numpy import polyfit
        x = list(range(1, len(y) + 1))
        return float(polyfit(x, [int(v) for v in y], 1)[0])

The article concludes with a brief team introduction and links to further reading.

big data, Machine Learning, Python, data preprocessing, statistical analysis, visualization
Written by DaTaobao Tech

Official account of DaTaobao Technology