Data Preprocessing and Statistical Analysis Techniques in Python
This article reviews common data-preprocessing and statistical-analysis methods used in data-science projects, with practical Python implementations: missing-value imputation, outlier trimming, normalization, binning, knee-point detection, correlation, chi-square testing, linear regression, Wilson scoring, PCA-based weighting, text tokenization and sentiment analysis, plus visualization with matplotlib/seaborn and big-data access via pyodps.
Missing-value handling: mean/median/mode imputation, fixed-value fill, nearest-neighbor imputation, regression-based imputation, and row deletion. Example:
```python
# Check for missing values
print(data.info(), '\n')

# Drop rows with any missing value
data2 = data.dropna(axis=0)
```
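The snippet above covers only row deletion. The fill-based strategies listed earlier can be sketched as follows, using a small made-up DataFrame (the column names and values are illustrative, not from the article):

```python
import numpy as np
import pandas as pd

# Hypothetical DataFrame with missing values
df = pd.DataFrame({'amount': [10.0, np.nan, 30.0, 40.0],
                   'city': ['A', 'B', None, 'B']})

# Mean imputation for a numeric column
df['amount_mean'] = df['amount'].fillna(df['amount'].mean())
# Median imputation is more robust to outliers
df['amount_median'] = df['amount'].fillna(df['amount'].median())
# Mode imputation suits categorical columns
df['city_mode'] = df['city'].fillna(df['city'].mode()[0])
# Fixed-value fill
df['amount_zero'] = df['amount'].fillna(0)
```

Mean/median fills preserve the column's scale; mode fill is the categorical analogue.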
Outlier handling: IQR-based trimming and capping.
```python
import numpy as np

data = np.array(data)
q1 = np.quantile(data, 0.25)
q3 = np.quantile(data, 0.75)
iqr = q3 - q1
low = q1 - 1.5 * iqr    # lower fence
high = q3 + 1.5 * iqr   # upper fence

# Cap values outside the fences
clean = []
for i in data:
    if i > high:
        i = high
    elif i < low:
        i = low
    clean.append(i)
```
Normalization: min-max, Z-score, and decimal scaling.
```python
# Min-max scaling
(data - data.min()) / (data.max() - data.min())
# Z-score standardization
(data - data.mean()) / data.std()
# Decimal scaling
data / 10 ** np.ceil(np.log10(data.abs().max()))
```
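Applied to a concrete pandas Series (the values below are illustrative), the three formulas behave as expected: min-max lands in [0, 1], Z-score yields zero mean and unit standard deviation, and decimal scaling pushes all magnitudes below 1:

```python
import numpy as np
import pandas as pd

data = pd.Series([120.0, 250.0, 370.0, 480.0, 990.0])

# Min-max scaling: maps values into [0, 1]
minmax = (data - data.min()) / (data.max() - data.min())

# Z-score: zero mean, unit standard deviation (pandas std uses ddof=1)
zscore = (data - data.mean()) / data.std()

# Decimal scaling: divide by 10^j so all |values| fall below 1
decimal = data / 10 ** np.ceil(np.log10(data.abs().max()))
```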
Continuous-variable binning: equal-width and equal-frequency binning using pd.cut.
```python
# Equal-width binning: 9 hand-picked intervals
bins = [0, 100, 200, 300, 500, 700, 900, 1100, 1300, max(df['data'])]
df['col'] = pd.cut(df['data'], bins, right=True, labels=range(1, 10))

# Equal-frequency binning: edges taken at quantiles
k = 4
w = df['data'].quantile(np.arange(0, 1 + 1 / k, 1 / k))
df['col'] = pd.cut(df['data'], w, right=True, labels=range(1, k + 1))
```
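Equal-frequency binning can also be done in one step with pd.qcut, which computes the quantile edges itself; a minimal sketch on made-up data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'data': np.arange(1, 101)})  # 100 evenly spread values

# pd.qcut splits into k bins with (roughly) equal counts
k = 4
df['bin'] = pd.qcut(df['data'], q=k, labels=range(1, k + 1))
```

With evenly spread data, each of the four bins ends up with exactly 25 rows.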
Knee point detection (e.g., elbow method for K‑means) using the kneed package.
```python
import numpy as np
from kneed import KneeLocator

x = np.arange(1, 31)
y = [...]  # metric values for each candidate x

kneedle = KneeLocator(x, y, S=1.0, curve='concave', direction='increasing')
print(f'Knee point at x = {kneedle.elbow}')
```
Correlation coefficient calculation with DataFrame.corr() and interpretation of significance.
```python
corr_matrix = df.corr()
print(corr_matrix['pay_ord_cnt'])
print(df['pay_ord_cnt'].corr(df['pay_ord_amt']))
```
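A self-contained version of the same idea on synthetic data (the column names here simply mirror the snippet above; the values are made up). Two strongly linked columns should yield a Pearson coefficient near 1, while an unrelated column should not:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
cnt = rng.integers(1, 100, size=50)
df = pd.DataFrame({
    'pay_ord_cnt': cnt,
    'pay_ord_amt': cnt * 35.0 + rng.normal(0, 5, size=50),  # strongly linked
    'noise': rng.normal(size=50),                           # unrelated
})

corr_matrix = df.corr()  # Pearson by default; method='spearman' is also available
r = df['pay_ord_cnt'].corr(df['pay_ord_amt'])
```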
Chi-square test for categorical variables using scipy.stats.chi2_contingency.
```python
import numpy as np
from scipy.stats import chi2_contingency

obs = np.array([[50, 49, 35],
                [150, 100, 90],
                [60, 80, 100]])
chi2, p, dof, expected = chi2_contingency(obs)
print(f'chi2={chi2:.4f}, p={p:.4f}, dof={dof}')
# p < 0.01 → reject H0 (the variables are not independent)
```
Linear regression via numpy.polyfit (alternatives include sklearn.linear_model.LinearRegression).
```python
import numpy as np

x = np.arange(1, len(y) + 1)     # y: sequence of observed values
coeff = np.polyfit(x, y, deg=1)  # fit a degree-1 polynomial
print(coeff[0])                  # slope
```
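The sklearn alternative mentioned above looks like this on illustrative data (note that sklearn expects a 2-D feature matrix, unlike polyfit):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

y = np.array([2.0, 4.1, 5.9, 8.2, 10.0])     # illustrative trend values
x = np.arange(1, len(y) + 1).reshape(-1, 1)  # 2-D feature matrix

model = LinearRegression().fit(x, y)
slope, intercept = model.coef_[0], model.intercept_
```

Both approaches solve the same least-squares problem, so the slope matches polyfit's first coefficient.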
Wilson score for ranking items with small sample sizes.
```python
import numpy as np

pos = float(input_data.split(',')[0])
total = float(input_data.split(',')[1])
z = 1.96  # 95% confidence
p_hat = pos / total
score = ((p_hat + z**2 / (2 * total)
          - z * np.sqrt((p_hat * (1 - p_hat) + z**2 / (4 * total)) / total))
         / (1 + z**2 / total))
print(score)
```
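Wrapping the formula in a function (a hypothetical helper name, not from the article) makes the small-sample penalty easy to see: two items with the same 90% positive rate score very differently when one has far more evidence behind it:

```python
import math

def wilson_score(pos, total, z=1.96):
    """Lower bound of the Wilson score interval for a positive rate."""
    if total == 0:
        return 0.0
    p_hat = pos / total
    denom = 1 + z**2 / total
    centre = p_hat + z**2 / (2 * total)
    spread = z * math.sqrt((p_hat * (1 - p_hat) + z**2 / (4 * total)) / total)
    return (centre - spread) / denom

# Same raw rate (0.90), very different sample sizes
small = wilson_score(9, 10)
large = wilson_score(180, 200)
```

Ranking by this lower bound keeps a lucky 9-out-of-10 item from outranking a consistently good 180-out-of-200 item.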
PCA‑based weight calculation for multi‑metric scoring.
```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn import preprocessing

scaler = preprocessing.MinMaxScaler().fit(df)
X = pd.DataFrame(scaler.transform(df))

pca = PCA()
pca.fit(X)
components = pca.components_ / np.sqrt(pca.explained_variance_.reshape(-1, 1))
# compute weighted scores …
```
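One common way to turn PCA output into per-metric weights is to average the absolute loadings with each component weighted by its explained-variance ratio. This is a sketch of that idea on made-up metric columns, not necessarily the exact scheme the article's elided step uses:

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(1)
df = pd.DataFrame(rng.random((100, 3)), columns=['uv', 'gmv', 'ctr'])  # made-up metrics

X = MinMaxScaler().fit_transform(df)
pca = PCA().fit(X)

# Weight each metric by |loading|, averaging components by their
# explained-variance ratio, then normalize to a weight vector
loadings = np.abs(pca.components_)              # (n_components, n_features)
raw = pca.explained_variance_ratio_ @ loadings  # one value per metric
weights = raw / raw.sum()                       # weights sum to 1
```

A single composite score is then the weighted sum of each row's scaled metrics.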
Text processing: tokenization with jieba, frequency counting, TF-IDF / TextRank keyword extraction, and sentiment analysis using SnowNLP (Chinese) and TextBlob (English).
```python
import collections
import jieba
from jieba import analyse

# Tokenization and frequency counting
words = jieba.cut(text)
freq = collections.Counter(words)

# TF-IDF keyword extraction
keywords = analyse.extract_tags(text, topK=20)

# Sentiment analysis (Chinese)
from snownlp import SnowNLP
s = SnowNLP('颜色很好看')  # "The color looks great"
print(s.sentiments)

# Sentiment analysis (English)
from textblob import TextBlob
print(TextBlob('The product is great').sentiment)
```
Visualization libraries: brief overview of matplotlib and seaborn with reference links.
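A minimal matplotlib sketch (using the Agg backend so it renders off-screen); seaborn builds higher-level statistical plots on top of the same objects:

```python
import matplotlib
matplotlib.use('Agg')  # render off-screen, no display needed
import matplotlib.pyplot as plt
import numpy as np

x = np.arange(1, 11)
fig, ax = plt.subplots(figsize=(6, 4))
ax.plot(x, x ** 2, marker='o', label='x^2')
ax.set_xlabel('x')
ax.set_ylabel('y')
ax.set_title('Minimal matplotlib example')
ax.legend()
fig.savefig('example.png')
```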
Big-data integration: reading MaxCompute (ODPS) tables with pyodps and developing Python UDFs for the platform.
```python
from odps import ODPS
from odps.udf import annotate

odps = ODPS('AccessId', 'AccessKey', 'project',
            endpoint='http://service-corp.odps.aliyun-inc.com/api')

sql = 'SELECT * FROM project.table;'
df = odps.execute_sql(sql).open_reader(tunnel=True).to_pandas()

# UDF example: fit a slope over a sequence of values.
# include_package_path is assumed to be a helper that adds the uploaded
# numpy archive resource ('numpy.zip') to the UDF's sys.path.
@annotate('*->float')
class PolyfitUDF(object):
    def __init__(self):
        include_package_path('numpy.zip')

    def evaluate(self, y):
        from numpy import polyfit
        x = list(range(1, len(y) + 1))
        # np.float is removed in recent numpy; use the builtin float
        return float(polyfit(x, [int(v) for v in y], 1)[0])
```
The article concludes with a brief team introduction and links to further reading.
DaTaobao Tech
Official account of DaTaobao Technology