Statistical and Machine Learning Metrics for Data Analysis
The article presents a practical toolbox of statistical and machine‑learning metrics—including short‑term growth rates, CAGR, Excel forecasting functions, Wilson score adjustment, sigmoid decay weighting, correlation coefficients, KL divergence, elbow detection with KneeLocator, entropy‑based weighting, PCA, and TF‑IDF—offering concise formulas and code snippets for data analysis without deep theory.
Statistics and machine learning provide the theoretical foundation for data analysis. This article shares practical “magic” metrics and methods without deep derivations.
Short‑term growth rate
Two complementary views: the general growth rate of a metric itself versus growth in its relative ranking.
Mixed growth = GMV growth + ranking growth.
Weighted mixed growth = indicator growth × log(1 + indicator), which keeps small‑base items from dominating purely through inflated percentage growth.
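The two blended scores above can be sketched in a few lines (the function names and the use of `math.log1p` are illustrative choices, not from the article):

```python
import math

def mixed_growth(gmv_growth, rank_growth):
    # Simple additive blend of GMV growth and ranking growth.
    return gmv_growth + rank_growth

def weighted_mixed_growth(indicator_growth, indicator):
    # Scale growth by log(1 + indicator) so tiny bases do not dominate.
    return indicator_growth * math.log1p(indicator)

print(mixed_growth(0.30, 0.10))
print(weighted_mixed_growth(0.50, 100))
```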
Long‑term trend: CAGR
CAGR (Compound Annual Growth Rate) measures average annual growth over a period: CAGR = (end/start)^(1/years) − 1. Example: start = 5, end = 20, years = 2 → CAGR = (20/5)^(1/2) − 1 = 100%.
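The worked example translates directly into code, a minimal sketch:

```python
def cagr(start, end, years):
    """Compound annual growth rate over the given number of years."""
    return (end / start) ** (1 / years) - 1

print(cagr(5, 20, 2))  # 1.0, i.e. 100%
```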
Time‑series forecasting
Four common Excel forecasting functions:
FORECAST.LINEAR() – prediction based on linear regression.
FORECAST.ETS() – triple exponential smoothing, suited to seasonal data.
FORECAST.ETS.SEASONALITY() – detects the length of the seasonal period.
FORECAST.ETS.CONFINT() – confidence interval around an ETS forecast.
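Outside Excel, the behavior of FORECAST.LINEAR can be reproduced with an ordinary least-squares fit. A sketch using `numpy.polyfit` (the NumPy approach is my assumption, not from the article):

```python
import numpy as np

def forecast_linear(target_x, known_ys, known_xs):
    # Fit a least-squares line through (known_xs, known_ys) and evaluate
    # it at target_x, mirroring Excel's FORECAST.LINEAR(x, known_ys, known_xs).
    slope, intercept = np.polyfit(known_xs, known_ys, 1)
    return slope * target_x + intercept

xs = [1, 2, 3, 4]
ys = [10, 20, 30, 40]
print(forecast_linear(5, ys, xs))  # ≈ 50.0
```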
Sample‑size imbalance: Wilson Score
Adjusts click‑through or conversion rates when sample sizes differ widely: the lower bound of the Wilson score interval serves as the corrected rate, penalizing estimates backed by few samples.
```python
from odps.udf import annotate
import numpy as np

@annotate('string->string')
class wilsonScore(object):
    # Lower bound of the Wilson score interval, z = 1.96 (95% confidence)
    def evaluate(self, input_data):
        pos, total = (float(v) for v in input_data.split(','))
        p_z = 1.96
        pos_rat = pos / total
        score = (pos_rat + (np.square(p_z) / (2. * total))
                 - ((p_z / (2. * total)) * np.sqrt(4. * total * (1. - pos_rat) * pos_rat + np.square(p_z)))) / \
                (1. + np.square(p_z) / total)
        return str(score)
```

Sigmoid decay function
Maps any real number to (0,1) to weight historical performance.
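A minimal sketch of such a decay weight; the midpoint and steepness parameters are illustrative assumptions, not values from the article:

```python
import math

def sigmoid_decay(age_days, midpoint=30.0, steepness=0.2):
    # Weight in (0, 1) that falls off smoothly as an observation ages;
    # midpoint is the age at which the weight equals 0.5 (illustrative values).
    return 1.0 / (1.0 + math.exp(steepness * (age_days - midpoint)))

print(round(sigmoid_decay(30), 3))  # 0.5
print(round(sigmoid_decay(0), 3))   # close to 1: recent data weighs more
```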
Correlation coefficients
Pearson (numeric, normal), Spearman (rank‑based), Kendall (concordance). Example calculations and interpretation are provided.
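All three coefficients are available in `scipy.stats` (the use of SciPy and the sample data here are my assumptions, not the article's example):

```python
from scipy import stats

x = [1, 2, 3, 4, 5]
y = [2, 1, 4, 3, 5]

pearson_r, _ = stats.pearsonr(x, y)    # linear correlation (numeric, ~normal data)
spearman_r, _ = stats.spearmanr(x, y)  # rank-based correlation
kendall_t, _ = stats.kendalltau(x, y)  # pairwise concordance
print(pearson_r, spearman_r, kendall_t)
```

Because both series here are already permutations of ranks, Pearson and Spearman coincide (0.8), while Kendall's tau counts concordant minus discordant pairs (0.6).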
Kullback‑Leibler divergence
Quantifies difference between two probability distributions. Example distributions A and B are given.
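A minimal sketch of discrete KL divergence; the distributions A and B below are illustrative placeholders (the article's original examples are not reproduced here):

```python
import math

def kl_divergence(p, q):
    # D_KL(P || Q) = sum(p_i * log(p_i / q_i)); asymmetric, >= 0,
    # and exactly 0 only when the distributions are identical.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

A = [0.4, 0.4, 0.2]
B = [0.5, 0.3, 0.2]
print(kl_divergence(A, B))
print(kl_divergence(A, A))  # 0.0
```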
KneeLocator (elbow detection)
```python
import numpy as np
from kneed import KneeLocator

x = np.arange(1, 31)
y = [0.492, 0.615, 0.625, 0.665, 0.718, 0.762, 0.800, 0.832, 0.859, 0.880,
     0.899, 0.914, 0.927, 0.939, 0.949, 0.957, 0.964, 0.970, 0.976, 0.980,
     0.984, 0.987, 0.990, 0.993, 0.994, 0.996, 0.997, 0.998, 0.999, 0.999]
kneedle = KneeLocator(x, y, S=1.0, curve='concave', direction='increasing')
print(f'Elbow at x = {kneedle.elbow}')
```

Entropy method for weight determination
Steps: data standardization → compute information entropy → assign weights.
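The three steps can be sketched as follows (a minimal NumPy implementation, assuming larger-is-better indicator values; the sample matrix is illustrative):

```python
import numpy as np

def entropy_weights(X):
    # X: samples x indicators matrix, larger-is-better values assumed.
    X = np.asarray(X, dtype=float)
    # 1. Min-max standardization per indicator column.
    norm = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
    # 2. Column-wise proportions, then information entropy.
    p = norm / norm.sum(axis=0)
    with np.errstate(divide='ignore', invalid='ignore'):
        logs = np.where(p > 0, np.log(p), 0.0)
    k = 1.0 / np.log(X.shape[0])
    entropy = -k * (p * logs).sum(axis=0)
    # 3. Higher entropy means less information, hence a lower weight.
    d = 1.0 - entropy
    return d / d.sum()

X = [[0.8, 120], [0.6, 300], [0.9, 150], [0.4, 500]]
w = entropy_weights(X)
print(w, w.sum())  # weights sum to 1
```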
Principal Component Analysis (PCA)
Derives uncorrelated components that retain most variance. Shows how to obtain feature‑value tables and compute composite scores.
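One common way to carry this out, sketched with scikit-learn (the library choice, the synthetic data, and the variance-weighted composite score are my assumptions, not the article's exact procedure):

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data: four features, two of them strongly correlated.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
X[:, 1] = X[:, 0] * 0.9 + rng.normal(scale=0.1, size=100)

pca = PCA(n_components=2)
scores = pca.fit_transform(X)          # uncorrelated component scores
print(pca.explained_variance_ratio_)   # variance retained by each component

# Composite score: weight each component by its explained-variance share.
composite = scores @ pca.explained_variance_ratio_
print(composite[:5])
```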
Other metrics
CRN (category‑rank‑N), consumption concentration, TF‑IDF for keyword scoring.
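TF‑IDF keyword scoring can be sketched from scratch; the smoothed idf below is one common variant (similar to scikit-learn's), and the documents are illustrative:

```python
import math
from collections import Counter

docs = [
    "growth rate of gmv",
    "wilson score adjusts rate",
    "tf idf scores keywords",
]

def tf_idf(term, doc, docs):
    words = doc.split()
    tf = Counter(words)[term] / len(words)            # term frequency in doc
    df = sum(1 for d in docs if term in d.split())    # document frequency
    idf = math.log(len(docs) / (1 + df)) + 1          # smoothed idf variant
    return tf * idf

print(tf_idf("rate", docs[0], docs))  # 0.25
```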
Conclusion: The “Data Storytelling” series documents the author’s growth in data science and invites readers to share and learn together.
DaTaobao Tech
Official account of DaTaobao Technology