How to Generate Realistic Synthetic Data with Histograms and GMMs
This article explains practical techniques for creating large, privacy-preserving synthetic datasets: histogram-based per-column synthesis, a column-by-column variant conditioned on predictive models, and Gaussian Mixture Model (GMM) generation. The goal is to retain the statistical distributions and inter-column relationships of the original data. The article also shows how to evaluate the quality of the result.
Why synthetic data?
Real datasets are often too small or cannot be shared because of privacy regulations, yet they contain valuable statistical patterns needed for model development and testing. Synthetic data aims to reproduce these patterns while ensuring that individual records cannot be traced back to real people.
Method 1: Simple histogram‑based generation
The most straightforward approach creates each cell independently. For each column we first model its marginal distribution and then draw new values from that model.
Steps:
Identify categorical columns and compute the frequency of each unique value.
For numeric columns, build a histogram (e.g., 15 bins) and treat the bin heights as probabilities.
Generate rows by sampling a categorical value according to its frequency and a numeric value by first picking a bin according to its probability and then adding a small random jitter inside the bin.
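As a minimal sketch of the numeric step, assuming a one-dimensional NumPy array as input (the function name `sample_from_histogram` and the toy data are illustrative, not from the original code), bin selection and within-bin jitter might look like this:

```python
import numpy as np

def sample_from_histogram(values, n_samples, bins=15, rng=None):
    """Draw n_samples values from the histogram of `values` (hypothetical helper)."""
    rng = np.random.default_rng() if rng is None else rng
    counts, edges = np.histogram(values, bins=bins)
    probs = counts / counts.sum()  # bin heights as probabilities
    idx = rng.choice(len(counts), size=n_samples, p=probs)  # pick a bin per sample
    # Jitter: draw uniformly between the chosen bin's left and right edge
    return rng.uniform(edges[idx], edges[idx + 1])

rng = np.random.default_rng(0)
real = rng.normal(loc=50, scale=10, size=1000)  # toy stand-in for a numeric column
synthetic = sample_from_histogram(real, n_samples=5000, rng=rng)
```

Drawing uniformly between the bin edges keeps every synthetic value inside the observed range, which a jitter scaled to the column's standard deviation would not guarantee.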
Example code (using the OpenML baseball dataset) demonstrates the full pipeline:
import pandas as pd
import numpy as np
from sklearn.datasets import fetch_openml

# Load data
data = fetch_openml('baseball', version=1, parser='auto')
df = pd.DataFrame(data.data, columns=data.feature_names)
df['Strikeouts'] = df['Strikeouts'].fillna(df['Strikeouts'].median())

# Separate features: columns with few unique values are treated as categorical
cat_features = [c for c in df.columns if df[c].nunique() <= 10]
num_features = [c for c in df.columns if c not in cat_features]
synth_data = []

# Numeric columns: histogram + jitter
for col in num_features:
    counts, edges = np.histogram(df[col], bins=15)
    bin_centers = (edges[:-1] + edges[1:]) / 2
    bin_width = edges[1] - edges[0]
    probs = counts / counts.sum()
    # Pick a bin center per row according to the bin probabilities
    vals = np.random.choice(bin_centers, p=probs, size=len(df))
    # Jitter uniformly within the chosen bin
    vals = vals + (np.random.random(len(df)) - 0.5) * bin_width
    synth_data.append(vals)

# Categorical columns: frequency sampling
for col in cat_features:
    vc = df[col].value_counts(normalize=True)
    vals = np.random.choice(vc.index, p=vc.values, size=len(df))
    synth_data.append(vals)

synth_df = pd.DataFrame(synth_data).T
synth_df.columns = num_features + cat_features

This method preserves marginal distributions but often breaks inter-column relationships, producing unrealistic row combinations (e.g., mismatched department-ID pairs).
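The loss of inter-column structure is easy to check on toy data. The sketch below uses two made-up correlated columns (not the baseball data) and resamples each one independently, a simplified stand-in for per-column histogram sampling. The correlation present in the original largely disappears.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "real" data: y depends strongly on x (illustrative values only)
x = rng.normal(size=2000)
y = 2 * x + rng.normal(scale=0.5, size=2000)

# Per-column synthesis: each column is resampled without looking at the other
synth_x = rng.choice(x, size=2000)
synth_y = rng.choice(y, size=2000)

real_corr = np.corrcoef(x, y)[0, 1]
synth_corr = np.corrcoef(synth_x, synth_y)[0, 1]
print(f"real correlation:      {real_corr:.2f}")
print(f"synthetic correlation: {synth_corr:.2f}")
```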
Method 2: Column‑by‑column generation with predictive models
To keep dependencies between columns, we generate data sequentially from left to right, training a model for each column that predicts its values from the already‑generated columns.
Start with the leftmost column and sample it as in Method 1.
For each subsequent column, fit a RandomForestRegressor (numeric) or RandomForestClassifier (categorical) on the real data using the previously generated columns as features.
Predict values for the synthetic rows, add a small jitter for diversity, and append the new column.
Key code fragment:
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
import matplotlib.pyplot as plt
import seaborn as sns

synth_data = []
feature_0 = df.columns[0]
# Sample first column (histogram + jitter), same as before
# ... (code omitted for brevity) ...
# Note: the omitted step must append the sampled first column to synth_data
synth_cols = [feature_0]

for col in df.columns[1:]:
    # Rebuild the synthetic frame from the columns generated so far
    synth_df = pd.DataFrame(synth_data).T
    synth_df.columns = synth_cols
    # Regressor for numeric targets, classifier for categorical ones
    if col in num_features:
        model = RandomForestRegressor()
    else:
        model = RandomForestClassifier()
    # Train on the real data, then predict for the synthetic rows
    model.fit(df[synth_cols], df[col])
    pred = model.predict(synth_df[synth_cols])
    # Add jitter for numeric columns
    if col in num_features:
        pred = [p + ((np.random.random() - 0.5) * np.std(pred)) for p in pred]
    synth_data.append(pred)
    synth_cols.append(col)

synth_df = pd.DataFrame(synth_data).T
synth_df.columns = synth_cols

# Visual check of joint distribution
sns.scatterplot(data=df, x='At_bats', y='RBIs', color='blue', alpha=0.1)
sns.scatterplot(data=synth_df, x='At_bats', y='RBIs', color='red', marker='*', s=200)
plt.show()

This approach preserves multivariate relationships much better, though the synthetic data may still lack diversity if the model is too strong.
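Because the fragment above omits the first-column step, here is a self-contained sketch of the same sequential idea on toy data. The two-column dataset, the column names `a` and `b`, and the residual-based jitter scale are assumptions for illustration, not part of the original pipeline.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Toy "real" data with a dependency between the columns (hypothetical)
real = pd.DataFrame({"a": rng.uniform(0, 10, size=500)})
real["b"] = 3 * real["a"] + rng.normal(scale=1.0, size=500)

# Step 1: sample the first column from its own histogram
counts, edges = np.histogram(real["a"], bins=15)
idx = rng.choice(len(counts), size=500, p=counts / counts.sum())
synth = pd.DataFrame({"a": rng.uniform(edges[idx], edges[idx + 1])})

# Step 2: predict the next column from the columns generated so far
model = RandomForestRegressor(n_estimators=50, min_samples_leaf=5, random_state=0)
model.fit(real[["a"]], real["b"])
pred = model.predict(synth[["a"]])

# Step 3: add jitter scaled to the model's residual spread on the real data
resid_std = np.std(real["b"] - model.predict(real[["a"]]))
synth["b"] = pred + rng.normal(scale=resid_std, size=len(pred))

print(real.corr().loc["a", "b"], synth.corr().loc["a", "b"])
```

Scaling the jitter to the residual spread, rather than to the spread of the predictions, is one possible design choice: it ties the added noise to how much the model fails to explain.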
Method 3: Gaussian‑Mixture‑Model (GMM) generation
GMM fits a mixture of multivariate Gaussian components to the real data, capturing both marginal distributions and covariances. After fitting, synthetic rows are drawn by sampling a component according to its weight and then sampling from its Gaussian distribution.
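To make the two-stage sampling concrete, the sketch below draws rows by hand from a fitted scikit-learn `GaussianMixture`, using its `weights_`, `means_`, and `covariances_` attributes. The toy two-cluster data and the helper name `sample_gmm_manually` are illustrative assumptions; in practice `gmm.sample()` does the equivalent work.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Toy data: two well-separated 2-D clusters (illustrative only)
X = np.vstack([
    rng.normal(loc=[0, 0], scale=1.0, size=(300, 2)),
    rng.normal(loc=[8, 8], scale=1.0, size=(300, 2)),
])

gmm = GaussianMixture(n_components=2, covariance_type="full", random_state=0).fit(X)

def sample_gmm_manually(gmm, n_samples, rng):
    """Pick a component by weight, then draw from that component's Gaussian."""
    comps = rng.choice(gmm.n_components, size=n_samples, p=gmm.weights_)
    return np.array([
        rng.multivariate_normal(gmm.means_[c], gmm.covariances_[c])
        for c in comps
    ])

samples = sample_gmm_manually(gmm, 1000, rng)
```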
Typical workflow:
Pre‑process: one‑hot encode categorical variables and optionally remove extreme outliers (e.g., with IsolationForest).
Choose the number of components by minimizing BIC.
Fit GaussianMixture on the cleaned data.
Generate samples with gmm.sample(n_samples).
If needed, decode the one‑hot columns back to original categories.
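Decoding is not a direct inverse, because sampled one-hot columns are continuous rather than exactly 0 or 1. One simple option, sketched below, is to pick the column with the largest value within each one-hot group. The helper name `decode_one_hot`, the `Position` prefix, and the toy values are assumptions for illustration; the separator `_` matches the default of `pd.get_dummies`.

```python
import pandas as pd

def decode_one_hot(df, prefix, sep="_"):
    """Collapse columns named '<prefix><sep><category>' into one column (hypothetical helper)."""
    group = [c for c in df.columns if c.startswith(prefix + sep)]
    # idxmax returns the name of the largest column in each row
    labels = df[group].idxmax(axis=1).str[len(prefix + sep):]
    return df.drop(columns=group).assign(**{prefix: labels})

# Toy GMM-style output: continuous values in the one-hot columns
synth = pd.DataFrame({
    "At_bats": [410.2, 95.7],
    "Position_Catcher": [0.91, -0.03],
    "Position_Outfield": [0.07, 1.02],
})
decoded = decode_one_hot(synth, "Position")
print(decoded)
```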
Example code:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import fetch_openml
from sklearn.ensemble import IsolationForest
from sklearn.mixture import GaussianMixture

# Load and clean data
data = fetch_openml('baseball', version=1, parser='auto')
df = pd.DataFrame(data.data, columns=data.feature_names)
df['Strikeouts'] = df['Strikeouts'].fillna(df['Strikeouts'].median())
# One-hot encode categoricals as floats so every column is numeric
df = pd.get_dummies(df, dtype=float)

# Remove strong outliers: lower decision scores are more anomalous,
# so drop the 50 lowest-scoring rows (argsort gives positions, hence iloc)
iso = IsolationForest()
iso.fit(df)
score = iso.decision_function(df)
trimmed = df.iloc[np.argsort(score)[50:]]

# Find best number of components (lowest BIC)
best_n, best_bic = None, np.inf
for k in range(2, 10):
    gmm = GaussianMixture(n_components=k)
    gmm.fit(trimmed)
    bic = gmm.bic(trimmed)
    if bic < best_bic:
        best_bic, best_n = bic, k

# Refit with the selected number of components and sample
gmm = GaussianMixture(n_components=best_n)
gmm.fit(trimmed)
synthetic, _ = gmm.sample(n_samples=500)
synth_df = pd.DataFrame(synthetic, columns=trimmed.columns)

# Plot joint distribution
sns.scatterplot(data=trimmed, x='At_bats', y='RBIs', color='blue', alpha=0.1)
sns.scatterplot(data=synth_df, x='At_bats', y='RBIs', color='red', marker='*', s=200)
plt.show()

GMM-generated data typically retains both marginal and joint statistics well, provided the real data is reasonably approximated by a mixture of Gaussians. By inflating the covariance matrices (e.g., multiplying by a factor) one can deliberately increase variability to produce more outlier-rich test sets.
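As a sketch of the covariance-inflation idea, the snippet below scales the `covariances_` attribute of a fitted scikit-learn `GaussianMixture` before sampling. The factor of 4 and the toy data are arbitrary assumptions. The approach relies on `sample()` drawing from `covariances_`, which is an assumption about scikit-learn's implementation worth verifying in your version; density methods such as `score_samples` would not reflect the change.

```python
import copy
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = rng.normal(loc=0.0, scale=1.0, size=(1000, 2))  # toy data, one cluster

gmm = GaussianMixture(n_components=1, random_state=0).fit(X)
base, _ = gmm.sample(2000)

# Work on a copy so the fitted model stays untouched
wide_gmm = copy.deepcopy(gmm)
wide_gmm.covariances_ = wide_gmm.covariances_ * 4.0  # 4x variance = 2x standard deviation
wide, _ = wide_gmm.sample(2000)

print(base.std(axis=0), wide.std(axis=0))
```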
Evaluating synthetic data quality
Typical checks include:
Exploratory data analysis (visual comparison of marginal and joint distributions).
Training a strong classifier (e.g., CatBoost) to distinguish real from synthetic rows; low accuracy indicates high similarity.
Running anomaly‑detection models on both datasets to see whether synthetic data contains unexpected patterns.
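The real-versus-synthetic check can be sketched as follows. This version substitutes scikit-learn's `RandomForestClassifier` for CatBoost to keep dependencies minimal, and scores with cross-validated ROC AUC rather than accuracy; the helper name `real_vs_synthetic_auc` and the toy arrays are assumptions. An AUC near 0.5 means the classifier cannot tell the two sets apart.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def real_vs_synthetic_auc(real, synth, seed=0):
    """Cross-validated AUC of a classifier separating real rows (0) from synthetic rows (1)."""
    X = np.vstack([real, synth])
    y = np.r_[np.zeros(len(real)), np.ones(len(synth))]
    clf = RandomForestClassifier(n_estimators=100, random_state=seed)
    return cross_val_score(clf, X, y, cv=5, scoring="roc_auc").mean()

rng = np.random.default_rng(0)
real = rng.normal(size=(500, 3))
good_synth = rng.normal(size=(500, 3))          # same distribution as real
bad_synth = rng.normal(loc=2.0, size=(500, 3))  # clearly shifted

auc_good = real_vs_synthetic_auc(real, good_synth)
auc_bad = real_vs_synthetic_auc(real, bad_synth)
print(f"good: {auc_good:.2f}, bad: {auc_bad:.2f}")
```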
Conclusion
The article presented two easy-to-implement families of synthetic-data pipelines: (1) per-column histogram sampling, optionally conditioned on earlier columns with Random Forests, and (2) full-distribution modeling with Gaussian Mixture Models. Both are fast and rely on widely available Python libraries. They produce data that is realistic enough for many development and testing scenarios, while still allowing control over diversity and privacy.
Data Party THU
Official platform of Tsinghua Big Data Research Center, sharing the team's latest research, teaching updates, and big data news.
