Statistical and Machine Learning Metrics for Data Analysis
The article presents a practical toolbox of statistical and machine‑learning metrics—including short‑term growth rates, CAGR, Excel forecasting functions, Wilson score adjustment, sigmoid decay weighting, correlation coefficients, KL divergence, elbow detection with KneeLocator, entropy‑based weighting, PCA, and TF‑IDF—offering concise formulas and code snippets for data analysis without deep theory.
Statistics and machine learning provide the theoretical foundation for data analysis. This article shares practical “magic” metrics and methods without deep derivations.
Short‑term growth rate
Two complementary views: the general growth rate of a metric itself versus growth in its relative ranking.
Mixed growth = GMV growth + ranking growth.
Weighted mixed growth = indicator growth × log(1 + indicator), which keeps small‑base items from dominating purely through inflated percentage growth.
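The two blended scores above can be sketched in a few lines (the function names and the use of `math.log1p` are illustrative choices, not from the article):

```python
import math

def mixed_growth(gmv_growth, rank_growth):
    # Simple additive blend of GMV growth and ranking growth.
    return gmv_growth + rank_growth

def weighted_mixed_growth(indicator_growth, indicator):
    # Scale growth by log(1 + indicator) so tiny bases do not dominate.
    return indicator_growth * math.log1p(indicator)

print(mixed_growth(0.30, 0.10))
print(weighted_mixed_growth(0.50, 100))
```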
Long‑term trend: CAGR
CAGR (Compound Annual Growth Rate) measures average annual growth over a period: CAGR = (end/start)^(1/years) − 1. Example: start = 5, end = 20, years = 2 → CAGR = (20/5)^(1/2) − 1 = 100%.
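The worked example translates directly into code, a minimal sketch:

```python
def cagr(start, end, years):
    """Compound annual growth rate over the given number of years."""
    return (end / start) ** (1 / years) - 1

print(cagr(5, 20, 2))  # 1.0, i.e. 100%
```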
Time‑series forecasting
Four common Excel forecasting functions:
FORECAST.LINEAR() – prediction based on linear regression.
FORECAST.ETS() – triple exponential smoothing, suited to seasonal data.
FORECAST.ETS.SEASONALITY() – detects the length of the seasonal period.
FORECAST.ETS.CONFINT() – confidence interval around an ETS forecast.
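Outside Excel, the behavior of FORECAST.LINEAR can be reproduced with an ordinary least-squares fit. A sketch using `numpy.polyfit` (the NumPy approach is my assumption, not from the article):

```python
import numpy as np

def forecast_linear(target_x, known_ys, known_xs):
    # Fit a least-squares line through (known_xs, known_ys) and evaluate
    # it at target_x, mirroring Excel's FORECAST.LINEAR(x, known_ys, known_xs).
    slope, intercept = np.polyfit(known_xs, known_ys, 1)
    return slope * target_x + intercept

xs = [1, 2, 3, 4]
ys = [10, 20, 30, 40]
print(forecast_linear(5, ys, xs))  # ≈ 50.0
```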
Sample‑size imbalance: Wilson Score
Adjusts click‑through or conversion rates when sample sizes differ widely: the lower bound of the Wilson score interval serves as the corrected rate, penalizing estimates backed by few samples.
```python
from odps.udf import annotate
import numpy as np

@annotate('string->string')
class wilsonScore(object):
    # Lower bound of the Wilson score interval, z = 1.96 (95% confidence)
    def evaluate(self, input_data):
        pos, total = (float(v) for v in input_data.split(','))
        p_z = 1.96
        pos_rat = pos / total
        score = (pos_rat + (np.square(p_z) / (2. * total))
                 - ((p_z / (2. * total)) * np.sqrt(4. * total * (1. - pos_rat) * pos_rat + np.square(p_z)))) / \
                (1. + np.square(p_z) / total)
        return str(score)
```

Sigmoid decay function
Maps any real number to (0,1) to weight historical performance.
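A minimal sketch of such a decay weight; the midpoint and steepness parameters are illustrative assumptions, not values from the article:

```python
import math

def sigmoid_decay(age_days, midpoint=30.0, steepness=0.2):
    # Weight in (0, 1) that falls off smoothly as an observation ages;
    # midpoint is the age at which the weight equals 0.5 (illustrative values).
    return 1.0 / (1.0 + math.exp(steepness * (age_days - midpoint)))

print(round(sigmoid_decay(30), 3))  # 0.5
print(round(sigmoid_decay(0), 3))   # close to 1: recent data weighs more
```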
Correlation coefficients
Pearson (numeric, normal), Spearman (rank‑based), Kendall (concordance). Example calculations and interpretation are provided.
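All three coefficients are available in `scipy.stats` (the use of SciPy and the sample data here are my assumptions, not the article's example):

```python
from scipy import stats

x = [1, 2, 3, 4, 5]
y = [2, 1, 4, 3, 5]

pearson_r, _ = stats.pearsonr(x, y)    # linear correlation (numeric, ~normal data)
spearman_r, _ = stats.spearmanr(x, y)  # rank-based correlation
kendall_t, _ = stats.kendalltau(x, y)  # pairwise concordance
print(pearson_r, spearman_r, kendall_t)
```

Because both series here are already permutations of ranks, Pearson and Spearman coincide (0.8), while Kendall's tau counts concordant minus discordant pairs (0.6).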
Kullback‑Leibler divergence
Quantifies difference between two probability distributions. Example distributions A and B are given.
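A minimal sketch of discrete KL divergence; the distributions A and B below are illustrative placeholders (the article's original examples are not reproduced here):

```python
import math

def kl_divergence(p, q):
    # D_KL(P || Q) = sum(p_i * log(p_i / q_i)); asymmetric, >= 0,
    # and exactly 0 only when the distributions are identical.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

A = [0.4, 0.4, 0.2]
B = [0.5, 0.3, 0.2]
print(kl_divergence(A, B))
print(kl_divergence(A, A))  # 0.0
```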
KneeLocator (elbow detection)
```python
import numpy as np
from kneed import KneeLocator

x = np.arange(1, 31)
y = [0.492, 0.615, 0.625, 0.665, 0.718, 0.762, 0.800, 0.832, 0.859, 0.880,
     0.899, 0.914, 0.927, 0.939, 0.949, 0.957, 0.964, 0.970, 0.976, 0.980,
     0.984, 0.987, 0.990, 0.993, 0.994, 0.996, 0.997, 0.998, 0.999, 0.999]
kneedle = KneeLocator(x, y, S=1.0, curve='concave', direction='increasing')
print(f'Elbow at x = {kneedle.elbow}')
```

Entropy method for weight determination
Steps: data standardization → compute information entropy → assign weights.
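The three steps can be sketched as follows (a minimal NumPy implementation, assuming larger-is-better indicator values; the sample matrix is illustrative):

```python
import numpy as np

def entropy_weights(X):
    # X: samples x indicators matrix, larger-is-better values assumed.
    X = np.asarray(X, dtype=float)
    # 1. Min-max standardization per indicator column.
    norm = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
    # 2. Column-wise proportions, then information entropy.
    p = norm / norm.sum(axis=0)
    with np.errstate(divide='ignore', invalid='ignore'):
        logs = np.where(p > 0, np.log(p), 0.0)
    k = 1.0 / np.log(X.shape[0])
    entropy = -k * (p * logs).sum(axis=0)
    # 3. Higher entropy means less information, hence a lower weight.
    d = 1.0 - entropy
    return d / d.sum()

X = [[0.8, 120], [0.6, 300], [0.9, 150], [0.4, 500]]
w = entropy_weights(X)
print(w, w.sum())  # weights sum to 1
```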
Principal Component Analysis (PCA)
Derives uncorrelated components that retain most variance. Shows how to obtain feature‑value tables and compute composite scores.
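One common way to carry this out, sketched with scikit-learn (the library choice, the synthetic data, and the variance-weighted composite score are my assumptions, not the article's exact procedure):

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data: four features, two of them strongly correlated.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
X[:, 1] = X[:, 0] * 0.9 + rng.normal(scale=0.1, size=100)

pca = PCA(n_components=2)
scores = pca.fit_transform(X)          # uncorrelated component scores
print(pca.explained_variance_ratio_)   # variance retained by each component

# Composite score: weight each component by its explained-variance share.
composite = scores @ pca.explained_variance_ratio_
print(composite[:5])
```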
Other metrics
CRN (category‑rank‑N), consumption concentration, TF‑IDF for keyword scoring.
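TF‑IDF keyword scoring can be sketched from scratch; the smoothed idf below is one common variant (similar to scikit-learn's), and the documents are illustrative:

```python
import math
from collections import Counter

docs = [
    "growth rate of gmv",
    "wilson score adjusts rate",
    "tf idf scores keywords",
]

def tf_idf(term, doc, docs):
    words = doc.split()
    tf = Counter(words)[term] / len(words)            # term frequency in doc
    df = sum(1 for d in docs if term in d.split())    # document frequency
    idf = math.log(len(docs) / (1 + df)) + 1          # smoothed idf variant
    return tf * idf

print(tf_idf("rate", docs[0], docs))  # 0.25
```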
Conclusion: The “Data Storytelling” series documents the author’s growth in data science and invites readers to share and learn together.
DaTaobao Tech
Official account of DaTaobao Technology