20 Advanced Statistical Techniques Every Data Scientist Must Master
This comprehensive guide introduces twenty essential advanced statistical methods, from Bayesian inference and maximum likelihood estimation to copulas and generalized additive models. For each one it explains the concept, gives a real‑world use case, and provides a concise Python code example so data scientists can confidently apply it to complex analytical problems.
Data science blends mathematics, statistics, computer science, and domain expertise to extract insights from data. While machine‑learning algorithms often dominate discussions, a solid foundation in advanced statistical methods is equally crucial. The following sections present twenty such techniques, each with a brief explanation, a practical example, and a runnable Python snippet.
1. Bayesian Inference
Bayesian inference uses Bayes’ theorem to update the probability of a hypothesis as new evidence arrives, allowing prior beliefs to be combined with observed data.
Example use case: Spam filtering – combine a prior belief that an email is spam with word evidence to compute the updated spam probability.
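The spam-filtering update can be worked through by hand before reaching for a sampler. The sketch below applies Bayes' theorem to a single word; all three probabilities are made-up numbers chosen only for illustration, not estimates from real mail.

```python
# Hypothetical inputs (assumptions for illustration only).
p_spam = 0.2               # prior P(spam)
p_word_given_spam = 0.6    # P(word appears | spam)
p_word_given_ham = 0.05    # P(word appears | not spam)

# Law of total probability: overall chance of seeing the word.
p_word = p_word_given_spam * p_spam + p_word_given_ham * (1 - p_spam)

# Bayes' theorem: P(spam | word) = P(word | spam) * P(spam) / P(word)
p_spam_given_word = p_word_given_spam * p_spam / p_word

print(f"Posterior P(spam | word): {p_spam_given_word:.3f}")  # prints 0.750
```

With these numbers, seeing the word moves the spam probability from the 20% prior to 75%.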
Code snippet:
!pip install pymc3
import pymc3 as pm
import numpy as np
observed_heads = 12
observed_tails = 8
with pm.Model() as model:
    theta = pm.Beta('theta', alpha=1, beta=1)  # Prior for coin bias
    y = pm.Binomial('y', n=observed_heads + observed_tails, p=theta, observed=observed_heads)
    trace = pm.sample(2000, tune=1000, cores=1, chains=2)
    # Summarize inside the model context so the trace can be converted
    print(pm.summary(trace))
2. Maximum Likelihood Estimation (MLE)
MLE finds the parameter values that maximize the likelihood of the observed data under a specified statistical model.
Example use case: Distribution fitting – estimate the mean and variance of a normal distribution that best fits the data.
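To make the "maximize the likelihood" step explicit, here is a minimal sketch that minimizes the negative log-likelihood of a normal model numerically and compares the result with the closed-form answer (sample mean and population standard deviation). The starting point and the log-sigma parameterization are choices made for this illustration.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(0)
data = rng.normal(loc=5, scale=2, size=1000)

def neg_log_likelihood(params, x):
    """Negative log-likelihood of a normal model; sigma is optimized on the log scale."""
    mu, log_sigma = params
    sigma = np.exp(log_sigma)  # keeps sigma strictly positive
    return -np.sum(norm.logpdf(x, loc=mu, scale=sigma))

result = minimize(neg_log_likelihood, x0=[0.0, 0.0], args=(data,))
mu_hat, sigma_hat = result.x[0], np.exp(result.x[1])

print(f"Numerical MLE: mu={mu_hat:.3f}, sigma={sigma_hat:.3f}")
print(f"Closed form:   mu={data.mean():.3f}, sigma={data.std():.3f}")
```

The two lines should agree to several decimals, since for the normal distribution the MLE has a closed form.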
Code snippet:
import numpy as np
from scipy.stats import norm
np.random.seed(42)
# Generate synthetic data
data = np.random.normal(loc=5, scale=2, size=1000)
mu_hat, std_hat = norm.fit(data)
print(f"Estimated mean (mu): {mu_hat:.2f}")
print(f"Estimated std (sigma): {std_hat:.2f}")
3. Hypothesis Testing (t‑test)
A t‑test evaluates whether the means of two groups differ significantly, based on a null hypothesis of no difference.
Example use case: A/B testing – determine whether a new website layout (B) leads to a statistically different average session time compared to the old layout (A).
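When the two layouts may have different variances, Welch's variant of the t-test is a common choice. The sketch below computes the Welch statistic by hand and checks it against `ttest_ind(..., equal_var=False)`; the data are synthetic and the variable names are illustrative.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
layout_a = rng.normal(5.0, 1.0, 50)   # synthetic session times, layout A
layout_b = rng.normal(5.5, 1.2, 50)   # synthetic session times, layout B

# Welch statistic: difference in means over the unpooled standard error.
mean_diff = layout_a.mean() - layout_b.mean()
std_err = np.sqrt(layout_a.var(ddof=1) / layout_a.size
                  + layout_b.var(ddof=1) / layout_b.size)
t_manual = mean_diff / std_err

# Same test via SciPy; equal_var=False selects Welch's t-test.
t_welch, p_welch = ttest_ind(layout_a, layout_b, equal_var=False)

print(f"Manual t: {t_manual:.4f}, SciPy t: {t_welch:.4f}, p-value: {p_welch:.4f}")
```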
Code snippet:
import numpy as np
from scipy.stats import ttest_ind
np.random.seed(42)
# Synthetic data for two groups
group_A = np.random.normal(5, 1, 50)
group_B = np.random.normal(5.5, 1.2, 50)
stat, pvalue = ttest_ind(group_A, group_B)
print(f"T statistic: {stat:.2f}, p‑value: {pvalue:.4f}")
if pvalue < 0.05:
    print("Reject null hypothesis (significant difference).")
else:
    print("Fail to reject null hypothesis (no significant difference).")
4. Analysis of Variance (ANOVA)
ANOVA tests whether there are statistically significant differences among the means of three or more groups.
Example use case: Marketing experiment – evaluate three advertising strategies by measuring sales uplift.
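The F statistic is the ratio of between-group to within-group variability. As a rough sketch on synthetic data, it can be computed from the two sums of squares and compared with `f_oneway`:

```python
import numpy as np
from scipy.stats import f, f_oneway

rng = np.random.default_rng(0)
groups = [rng.normal(mean, 2, 30) for mean in (10, 12, 14)]  # three synthetic strategies

all_values = np.concatenate(groups)
grand_mean = all_values.mean()
k = len(groups)            # number of groups
n_total = all_values.size  # total number of observations

# Between-group and within-group sums of squares.
ss_between = sum(g.size * (g.mean() - grand_mean) ** 2 for g in groups)
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

# F = mean square between / mean square within.
f_manual = (ss_between / (k - 1)) / (ss_within / (n_total - k))
p_manual = f.sf(f_manual, k - 1, n_total - k)

f_scipy, p_scipy = f_oneway(*groups)
print(f"Manual F: {f_manual:.3f}, SciPy F: {f_scipy:.3f}")
```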
Code snippet:
import numpy as np
from scipy.stats import f_oneway
np.random.seed(42)
group1 = np.random.normal(10, 2, 30)
group2 = np.random.normal(12, 2, 30)
group3 = np.random.normal(14, 2, 30)
F, p = f_oneway(group1, group2, group3)
print(f"F statistic: {F:.2f}, p‑value: {p:.4f}")
5. Principal Component Analysis (PCA)
PCA reduces dimensionality by projecting data onto orthogonal axes (principal components) that capture the most variance.
Example use case: Image compression – reduce high‑dimensional pixel data to a few features for faster processing.
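The projection onto orthogonal axes can be sketched directly with a singular value decomposition of the centered data. This is one standard way to compute PCA, shown here on random data purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((100, 10))  # 100 samples, 10 features

# Center each feature, then decompose: rows of Vt are the principal axes.
X_centered = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

# Variance captured by each component and its share of the total.
explained_variance = S**2 / (X.shape[0] - 1)
explained_ratio = explained_variance / explained_variance.sum()

# Project onto the first two principal axes.
X_reduced = X_centered @ Vt[:2].T

print("Explained variance ratio (first 2):", explained_ratio[:2])
print("Reduced shape:", X_reduced.shape)
```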
Code snippet:
import numpy as np
from sklearn.decomposition import PCA
np.random.seed(42)
X = np.random.rand(100, 10) # 100 samples, 10 features
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)
print("Explained variance ratio:", pca.explained_variance_ratio_)
print("Reduced shape:", X_reduced.shape)
6. Factor Analysis
Factor analysis models observed variables as linear combinations of latent (unobserved) factors, useful for dimensionality reduction or uncovering hidden structure.
Example use case: Psychometrics – identify underlying personality traits from questionnaire responses.
Code snippet:
!pip install factor_analyzer
import numpy as np
from factor_analyzer import FactorAnalyzer
np.random.seed(42)
X = np.random.rand(100, 6) # 100 samples, 6 variables
fa = FactorAnalyzer(n_factors=2, rotation='varimax')
fa.fit(X)
print("Factor loadings:\n", fa.loadings_)
7. K‑Means Clustering
K‑means partitions data into homogeneous groups (clusters) based on similarity to cluster centroids.
Example use case: Customer segmentation – group customers by purchasing patterns.
Code snippet:
import numpy as np
from sklearn.cluster import KMeans
np.random.seed(42)
X = np.random.rand(200, 2) # 200 samples, 2‑D
kmeans = KMeans(n_clusters=3, random_state=42)
kmeans.fit(X)
print("Cluster centers:", kmeans.cluster_centers_)
print("First 10 labels:", kmeans.labels_[:10])
8. Bootstrapping
Bootstrapping repeatedly samples with replacement from a dataset to estimate the distribution (and uncertainty) of a statistic such as the mean.
Example use case: Confidence interval – compute a 95 % confidence interval for the mean of a small dataset.
Code snippet:
import numpy as np
np.random.seed(42)
data = np.random.normal(50, 5, size=100)
def bootstrap_mean_ci(data, n_bootstraps=1000, ci=95):
    means = []
    n = len(data)
    for _ in range(n_bootstraps):
        sample = np.random.choice(data, size=n, replace=True)
        means.append(np.mean(sample))
    lower = np.percentile(means, (100 - ci) / 2)
    upper = np.percentile(means, 100 - (100 - ci) / 2)
    return np.mean(means), (lower, upper)
mean_est, (low, high) = bootstrap_mean_ci(data)
print(f"Bootstrap mean estimate: {mean_est:.2f}")
print(f"95% CI: [{low:.2f}, {high:.2f}]")
9. Time‑Series Analysis (ARIMA)
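ARIMA (AutoRegressive Integrated Moving Average) models a univariate time series by combining autoregressive terms (dependence on past values), differencing (to remove trend and make the series stationary), and moving‑average terms (dependence on past forecast errors).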
Example use case: Sales forecasting – predict future sales based on historical performance.
Code snippet:
!pip install statsmodels
import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA
np.random.seed(42)
data = np.random.normal(100, 5, 50)
time_series = pd.Series(data)
model = ARIMA(time_series, order=(1, 1, 1))
model_fit = model.fit()
forecast = model_fit.forecast(steps=5)
print("Forecast values:", forecast.values)
10. Survival Analysis
Survival analysis deals with time‑to‑event data, focusing on the probability that an event (e.g., customer churn) occurs after a certain time.
Example use case: Customer churn – estimate how long a subscriber remains active before cancelling.
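The Kaplan–Meier estimator behind most survival curves is simple enough to sketch in NumPy: at each observed event time, multiply the running survival probability by the fraction of at-risk subjects who did not experience the event. The data below are synthetic, and the 0/1 coding of `events` (1 = churn observed, 0 = censored) is an assumption of this example.

```python
import numpy as np

rng = np.random.default_rng(0)
durations = rng.exponential(scale=10, size=100)   # time until churn or censoring
events = rng.binomial(1, 0.8, size=100)           # 1 = churn observed, 0 = censored

event_times = np.unique(durations[events == 1])   # sorted distinct event times
survival = []
s = 1.0
for t in event_times:
    at_risk = np.sum(durations >= t)                    # still under observation at t
    n_events = np.sum((durations == t) & (events == 1)) # events exactly at t
    s *= 1.0 - n_events / at_risk                       # Kaplan-Meier product step
    survival.append(s)
survival = np.array(survival)

print("Estimated survival at first 5 event times:", np.round(survival[:5], 3))
```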
Code snippet:
!pip install lifelines
import numpy as np
import pandas as pd
from lifelines import KaplanMeierFitter
np.random.seed(42)
durations = np.random.exponential(scale=10, size=100)
events = np.random.binomial(1, 0.8, size=100)
kmf = KaplanMeierFitter()
kmf.fit(durations, event_observed=events, label='Test Group')
kmf.plot_survival_function()
11. Multiple Linear Regression
Multiple linear regression models the relationship between a dependent variable and several independent variables.
Example use case: Pricing model – predict house prices based on area, number of rooms, and location.
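Under the hood, ordinary least squares solves a linear system. A minimal sketch with `np.linalg.lstsq`, using synthetic house data whose true coefficients are known (the numbers are invented for the example):

```python
import numpy as np

rng = np.random.default_rng(0)
rooms = rng.integers(1, 5, 100)
sqft = rng.integers(500, 2500, 100)
price = 100 + 2 * rooms + 0.5 * sqft + rng.normal(0, 50, 100)

# Design matrix with an explicit column of ones for the intercept.
A = np.column_stack([np.ones(rooms.size), rooms, sqft])

# Least-squares solution of A @ beta ~= price.
beta, *_ = np.linalg.lstsq(A, price, rcond=None)

print("Intercept, rooms coefficient, sqft coefficient:", np.round(beta, 3))
```

The `sqft` coefficient should land close to the true value of 0.5; the `rooms` coefficient is much noisier because its effect is small relative to the noise.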
Code snippet:
import numpy as np
from sklearn.linear_model import LinearRegression
np.random.seed(42)
rooms = np.random.randint(1, 5, 100)
sqft = np.random.randint(500, 2500, 100)
price = 100 + 2 * rooms + 0.5 * sqft + np.random.normal(0, 50, 100)
X = np.column_stack([rooms, sqft])
y = price
model = LinearRegression()
model.fit(X, y)
print("Coefficients:", model.coef_)
print("Intercept:", model.intercept_)
12. Ridge & Lasso Regression
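Ridge and Lasso add L2 and L1 regularization, respectively, to linear regression. Both help prevent over‑fitting by shrinking coefficients; Lasso's L1 penalty can additionally drive some coefficients to exactly zero, which performs feature selection.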
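The shrinkage effect of the L2 penalty is easiest to see in the closed-form ridge solution. The sketch below assumes centered data (so the intercept is left unpenalized) and compares the ridge coefficients with plain least squares; `alpha` is an arbitrary illustrative value.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((100, 10))
y = 5 * X[:, 0] + 3 * X[:, 1] + rng.normal(0, 0.1, 100)

# Center features and target so no intercept term is needed.
Xc = X - X.mean(axis=0)
yc = y - y.mean()

alpha = 1.0  # regularization strength (illustrative)
gram = Xc.T @ Xc

beta_ols = np.linalg.solve(gram, Xc.T @ yc)
# Ridge adds alpha to the diagonal of X^T X before solving.
beta_ridge = np.linalg.solve(gram + alpha * np.eye(X.shape[1]), Xc.T @ yc)

print("OLS coefficient norm:  ", np.linalg.norm(beta_ols))
print("Ridge coefficient norm:", np.linalg.norm(beta_ridge))
```

The ridge norm is always smaller than the OLS norm for a positive `alpha`, which is the shrinkage the penalty is designed to produce.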
Code snippet:
import numpy as np
from sklearn.linear_model import Ridge, Lasso
from sklearn.model_selection import train_test_split
np.random.seed(42)
X = np.random.rand(100, 10)
y = X[:,0]*5 + X[:,1]*3 + np.random.normal(0, 0.1, 100)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
ridge = Ridge(alpha=1.0).fit(X_train, y_train)
lasso = Lasso(alpha=0.1).fit(X_train, y_train)
print("Ridge coefficients:", ridge.coef_)
print("Lasso coefficients:", lasso.coef_)
13. Logistic Regression
Logistic regression models the probability of a binary outcome, making it suitable for classification tasks.
Example use case: Credit‑card fraud detection – classify transactions as fraudulent or legitimate.
Code snippet:
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
np.random.seed(42)
X = np.random.rand(100, 5)
y = np.random.randint(0, 2, 100)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
model = LogisticRegression()
model.fit(X_train, y_train)
accuracy = model.score(X_test, y_test)
print("Accuracy:", accuracy)
14. Mixed‑Effects Models
Mixed‑effects (or hierarchical) models combine fixed effects (common to all groups) with random effects (specific to each group), useful for longitudinal or grouped data.
Example use case: Education data – model student scores across schools, allowing each school its own intercept.
Code snippet:
import pandas as pd
import numpy as np
import statsmodels.formula.api as smf
np.random.seed(42)
school_ids = np.repeat(np.arange(10), 20)
scores = 50 + 5 * np.random.rand(200) + 2 * school_ids + np.random.normal(0, 5, 200)
df = pd.DataFrame({"score": scores, "school": school_ids})
model = smf.mixedlm("score ~ 1", df, groups=df["school"])
result = model.fit()
print(result.summary())
15. Non‑Parametric Test (Mann‑Whitney U)
The Mann‑Whitney U test compares two independent samples without assuming a specific distribution.
Example use case: Median comparison – compare median sales of two stores without assuming normality.
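The U statistic has a direct interpretation: it counts the pairs in which a value from the first sample exceeds a value from the second. The sketch below computes it two ways on synthetic data, from rank sums and by brute-force pair counting. Recent SciPy releases report U for the first sample; older releases may report the complementary value, so treat the comparison with `mannwhitneyu` as version-dependent.

```python
import numpy as np
from scipy.stats import mannwhitneyu, rankdata

rng = np.random.default_rng(0)
store_a = rng.exponential(scale=1.0, size=30)   # synthetic sales, store A
store_b = rng.exponential(scale=1.2, size=30)   # synthetic sales, store B

# Rank all observations together, then sum the ranks belonging to store A.
ranks = rankdata(np.concatenate([store_a, store_b]))
rank_sum_a = ranks[:store_a.size].sum()
u_from_ranks = rank_sum_a - store_a.size * (store_a.size + 1) / 2

# Equivalent definition (no ties here): number of (a, b) pairs with a > b.
u_from_pairs = np.sum(store_a[:, None] > store_b[None, :])

u_scipy, p_value = mannwhitneyu(store_a, store_b, alternative="two-sided")
print(f"U (ranks): {u_from_ranks:.1f}, U (pairs): {u_from_pairs}, SciPy U: {u_scipy:.1f}")
```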
Code snippet:
import numpy as np
from scipy.stats import mannwhitneyu
np.random.seed(42)
group_A = np.random.exponential(scale=1.0, size=30)
group_B = np.random.exponential(scale=1.2, size=30)
stat, pvalue = mannwhitneyu(group_A, group_B, alternative='two-sided')
print(f"Statistic: {stat:.2f}, p‑value: {pvalue:.4f}")
16. Monte Carlo Simulation
Monte Carlo simulation estimates the probability of different outcomes through repeated random sampling, which makes it useful for uncertainty quantification.
Example use case: Risk analysis – estimate the probability of project cost overruns given uncertain labor and material costs.
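A rough sketch of the cost-overrun use case: draw labor and material costs from assumed distributions, add them up, and count how often the total exceeds the budget. Every number here (means, spreads, budget, and the choice of normal and log-normal distributions) is a hypothetical input picked for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_sims = 100_000

# Hypothetical cost models (assumptions for illustration only).
labor = rng.normal(loc=500_000, scale=50_000, size=n_sims)
materials = rng.lognormal(mean=np.log(300_000), sigma=0.15, size=n_sims)

total_cost = labor + materials
budget = 900_000  # hypothetical budget

# Fraction of simulated projects that end up over budget.
p_overrun = np.mean(total_cost > budget)

print(f"Estimated probability of overrun: {p_overrun:.3f}")
print(f"95th percentile of total cost: {np.percentile(total_cost, 95):,.0f}")
```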
Code snippet:
import numpy as np
np.random.seed(42)
# Estimate π using random points in a unit square
n_samples = 1_000_000
xs = np.random.rand(n_samples)
ys = np.random.rand(n_samples)
inside = (xs**2 + ys**2) <= 1.0
pi_est = inside.sum() * 4 / n_samples
print("Estimated π:", pi_est)
17. Markov Chain Monte Carlo (MCMC)
MCMC methods (e.g., Metropolis‑Hastings, Gibbs sampling) generate samples from a posterior distribution when direct sampling is infeasible, enabling Bayesian inference for complex models.
Example use case: Parameter estimation in hierarchical Bayesian models.
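To show what a sampler does internally, here is a minimal random-walk Metropolis sketch (a special case of Metropolis‑Hastings with a symmetric proposal) for the mean of normally distributed data. It assumes a known standard deviation of 1 and a Normal(0, 10) prior so the result can be checked against the analytic posterior mean; the step size and burn-in length are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(0.0, 1.0, 100)

def log_posterior(mu):
    """Unnormalized log posterior: Normal(0, 10) prior, Normal(mu, 1) likelihood."""
    log_prior = -0.5 * (mu / 10.0) ** 2
    log_likelihood = -0.5 * np.sum((data - mu) ** 2)
    return log_prior + log_likelihood

n_steps = 5000
samples = np.empty(n_steps)
mu_current = 0.0
logp_current = log_posterior(mu_current)

for i in range(n_steps):
    mu_proposal = mu_current + rng.normal(0.0, 0.3)  # symmetric random-walk proposal
    logp_proposal = log_posterior(mu_proposal)
    # Accept with probability min(1, posterior ratio), evaluated on the log scale.
    if np.log(rng.random()) < logp_proposal - logp_current:
        mu_current, logp_current = mu_proposal, logp_proposal
    samples[i] = mu_current

posterior_draws = samples[1000:]  # discard burn-in
analytic_mean = data.sum() / (data.size + 1 / 100)  # conjugate normal-normal result

print(f"MCMC posterior mean: {posterior_draws.mean():.3f}")
print(f"Analytic posterior mean: {analytic_mean:.3f}")
```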
Code snippet:
import pymc3 as pm
import numpy as np
np.random.seed(42)
data = np.random.normal(0, 1, 100)
with pm.Model() as model:
    mu = pm.Normal('mu', mu=0, sigma=10)
    sigma = pm.HalfNormal('sigma', sigma=1)
    likelihood = pm.Normal('obs', mu=mu, sigma=sigma, observed=data)
    trace = pm.sample(1000, tune=500, chains=2)
    # Summarize inside the model context so the trace can be converted
    print(pm.summary(trace))
18. Robust Regression
Robust regression methods (e.g., RANSAC, Huber) reduce sensitivity to outliers compared with ordinary least squares.
Example use case: Modeling financial data that contains extreme outliers.
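Huber loss is mentioned above but not shown, so here is a small sketch using `scipy.optimize.least_squares` with `loss="huber"`, compared against the default squared loss on the same data. The outlier positions and the `f_scale` value (set near the noise level) are assumptions of this example.

```python
import numpy as np
from scipy.optimize import least_squares

rng = np.random.default_rng(0)
x = rng.random(100) * 10
y = 3 * x + 2 + rng.normal(0, 2, 100)

# Append a few extreme outliers at large x.
x = np.concatenate([x, [8.0, 9.0, 9.5]])
y = np.concatenate([y, [50.0, 55.0, 60.0]])

def residuals(params, x, y):
    """Residuals of a straight-line fit y ~ slope * x + intercept."""
    slope, intercept = params
    return slope * x + intercept - y

# Ordinary least squares (default loss="linear") versus Huber loss.
ols_fit = least_squares(residuals, x0=[1.0, 0.0], args=(x, y))
huber_fit = least_squares(residuals, x0=[1.0, 0.0], loss="huber", f_scale=2.0, args=(x, y))

print("OLS slope, intercept:  ", np.round(ols_fit.x, 3))
print("Huber slope, intercept:", np.round(huber_fit.x, 3))
```

Huber loss grows only linearly for large residuals, so the three outliers pull the fitted line less than they do under squared loss.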
Code snippet:
import numpy as np
from sklearn.linear_model import RANSACRegressor, LinearRegression
np.random.seed(42)
X = np.random.rand(100, 1) * 10
y = 3 * X.squeeze() + 2 + np.random.normal(0, 2, 100)
# Add outliers
X_out = np.array([[8], [9], [9.5]])
y_out = np.array([50, 55, 60])
X = np.vstack((X, X_out))
y = np.concatenate((y, y_out))
# Recent scikit-learn versions use `estimator`; the older `base_estimator` name was removed
ransac = RANSACRegressor(estimator=LinearRegression(), max_trials=100)
ransac.fit(X, y)
print("RANSAC coefficient:", ransac.estimator_.coef_)
print("RANSAC intercept:", ransac.estimator_.intercept_)
19. Copulas
Copulas capture the dependence structure between random variables separately from their marginal distributions and are widely used in finance to model joint asset returns.
Example use case: Portfolio risk – model the joint behavior of multiple stocks that exhibit nonlinear dependence.
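The separation of dependence from marginals can be sketched without a dedicated library. The example below builds a Gaussian copula by hand: sample correlated normals, map them to uniforms with the normal CDF, then push the uniforms through any marginal of choice. The correlation, the two marginals (a normal and a heavy-tailed Student t), and their scales are illustrative assumptions, not a calibrated asset model.

```python
import numpy as np
from scipy.stats import norm, spearmanr, t

rng = np.random.default_rng(0)
rho = 0.8
cov = np.array([[1.0, rho], [rho, 1.0]])

# Step 1: correlated standard normals carry the dependence structure.
z = rng.multivariate_normal(mean=[0.0, 0.0], cov=cov, size=5000)

# Step 2: the normal CDF turns each column into Uniform(0, 1) values.
u = norm.cdf(z)

# Step 3: inverse CDFs impose arbitrary marginals on each column.
returns_a = norm.ppf(u[:, 0], loc=0.0, scale=0.02)   # normal marginal
returns_b = 0.01 * t.ppf(u[:, 1], df=3)              # heavy-tailed marginal

# Rank correlation is unaffected by the monotone marginal transforms.
rank_corr, _ = spearmanr(returns_a, returns_b)
print(f"Spearman rank correlation: {rank_corr:.3f}")
```

For a Gaussian copula with correlation 0.8, the theoretical Spearman correlation is about 0.79, regardless of which marginals are applied.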
Code snippet:
!pip install copulas
import numpy as np
import pandas as pd
from copulas.multivariate import GaussianMultivariate
np.random.seed(42)
# Generate correlated synthetic data
X = np.random.normal(0, 1, (1000, 2))
X[:,1] = 0.8 * X[:,0] + np.random.normal(0, 0.6, 1000)
# Fit on a DataFrame with named columns
df = pd.DataFrame(X, columns=["asset_a", "asset_b"])
model = GaussianMultivariate()
model.fit(df)
# Draw enough samples for a stable correlation estimate, then convert to an array
sample = np.asarray(model.sample(1000))
print("Original correlation:", np.corrcoef(X[:,0], X[:,1])[0,1])
print("Sample correlation:", np.corrcoef(sample[:,0], sample[:,1])[0,1])
20. Generalized Additive Models (GAMs)
GAMs extend linear models by allowing each predictor to have a smooth, possibly nonlinear, function while preserving additivity, offering flexibility and interpretability.
Example use case: Health data – model patient outcomes as smooth functions of age and other covariates.
Code snippet:
!pip install pygam
import numpy as np
from pygam import LinearGAM, s
np.random.seed(42)
X = np.random.rand(200, 1) * 10
y = 2 + 3 * np.sin(X).ravel() + np.random.normal(0, 0.5, 200)
gam = LinearGAM(s(0)).fit(X, y)
gam.summary()  # summary() prints the report itself and returns None
From Bayesian inference and MLE to copulas and GAMs, these twenty advanced statistical methods form a comprehensive toolbox for any data scientist. The provided code snippets are minimal examples; each method can be explored in depth to tackle complex real‑world problems such as prediction, inference, and modeling intricate relationships.
Data Party THU
Official platform of Tsinghua Big Data Research Center, sharing the team's latest research, teaching updates, and big data news.
